9.13. Full Text Search Functions and Operators

This section outlines all the functions and operators that are available for full text searching.

Full text search vectors and queries both use lexemes, but for different purposes. A tsvector represents the lexemes (tokens) parsed out of a document, with an optional position. A tsquery specifies a boolean condition using lexemes.

All of the following functions that accept a configuration argument can use a textual configuration name to select a configuration. If the option is omitted the configuration specified by default_text_search_config is used. For more information on configuration, see Section 12.4.9.

9.13.1. Search

The operator @@ is used to perform full text searches:

      
      TSVECTOR @@ TSQUERY
      TSQUERY @@ TSVECTOR
      

Returns true if TSQUERY is contained in TSVECTOR, and false if not:

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
 ?column?
----------
 t

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'fat & cow'::tsquery;
 ?column?
----------
 f

       text @@ tsquery
      

Returns true if TSQUERY is contained in TEXT, and false if not:

SELECT 'a fat cat sat on a mat and ate a fat rat'::text @@ 'cat & rat'::tsquery;
 ?column?
----------
 t

SELECT 'a fat cat sat on a mat and ate a fat rat'::text @@ 'cat & cow'::tsquery;
 ?column?
----------
 f

       
       text @@ text
      

Returns true if the right argument (the query) is contained in the left argument, and false otherwise:

SELECT 'a fat cat sat on a mat and ate a fat rat' @@ 'cat rat';
 ?column?
----------
 t

SELECT 'a fat cat sat on a mat and ate a fat rat' @@ 'cat cow';
 ?column?
----------
 f

For index support of full text operators consult Section 12.5.

9.13.2. tsvector

       to_tsvector([config_name],  document TEXT) returns TSVECTOR
      

Parses a document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document in lexicographic order.

       strip(vector TSVECTOR) returns TSVECTOR
      

Returns a vector which lists the same lexemes as the given vector, but which lacks any information about where in the document each lexeme appeared. While the returned vector is useless for relevance ranking it will usually be much smaller.

       setweight(vector TSVECTOR, letter) returns TSVECTOR
      

This function returns a copy of the input vector in which every location has been labeled with either the letter A, B, or C, or the default label D (which is the default for new vectors and as such is usually not displayed). These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions.

       vector1 || vector2
       tsvector_concat(vector1 TSVECTOR, vector2 TSVECTOR) returns TSVECTOR
      

Returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positional weight labels (described in the previous paragraph) are retained during the concatenation. This has at least two uses. First, if some sections of your document need to be parsed with different configurations than others, you can parse them separately and then concatenate the resulting vectors. Second, you can weigh words from one section of your document differently than the others by parsing the sections into separate vectors and assigning each vector a different position label with the setweight() function. You can then concatenate them into a single vector and provide a weights argument to the ts_rank() function that assigns different weights to positions with different labels.

       length(vector TSVECTOR) returns INT4
      

Returns the number of lexemes stored in the vector.

       text::TSVECTOR returns TSVECTOR
      

Directly casting text to a tsvector allows you to directly inject lexemes into a vector with whatever positions and positional weights you choose to specify. The text should be formatted to match the way a vector is displayed by SELECT.

       tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
       tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])
      

Two built-in trigger functions are available to automatically update a tsvector column from one or more textual columns. An example of their use is:

CREATE TABLE tblMessages (
    strMessage  text,
    tsv         tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON tblMessages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', strMessage);

Having created this trigger, any change in strMessage will be automatically reflected into tsv.

Both triggers require you to specify the text search configuration to be used to perform the conversion. For tsvector_update_trigger, the configuration name is simply given as the second trigger argument. It must be schema-qualified as shown above, so that the trigger behavior will not change with changes in search_path. For tsvector_update_trigger_column, the second trigger argument is the name of another table column, which must be of type regconfig. This allows a per-row selection of configuration to be made.

       ts_stat(sqlquery text [, weights text ]) returns SETOF statinfo
      

Here statinfo is a type, defined as:

CREATE TYPE statinfo AS (word text, ndoc integer, nentry integer);

and sqlquery is a text value containing a SQL query which returns a single tsvector column. ts_stat executes the query and returns statistics about the resulting tsvector data, i.e., the number of documents, ndoc, and the total number of words in the collection, nentry. It is useful for checking your configuration and to find stop word candidates. For example, to find the ten most frequent words:

SELECT * FROM ts_stat('SELECT vector from apod')
ORDER BY ndoc DESC, nentry DESC, word
LIMIT 10;

Optionally, one can specify weights to obtain statistics about words with a specific weight:

SELECT * FROM ts_stat('SELECT vector FROM apod','a')
ORDER BY ndoc DESC, nentry DESC, word
LIMIT 10;

       TSVECTOR < TSVECTOR
       TSVECTOR <= TSVECTOR
       TSVECTOR = TSVECTOR
       TSVECTOR >= TSVECTOR
       TSVECTOR > TSVECTOR
      

All btree operations are defined for the tsvector type. tsvectors are compared with each other using lexicographical ordering.

9.13.3. tsquery

       to_tsquery([config_name], querytext text) returns TSQUERY
      

Accepts querytext, which should consist of single tokens separated by the boolean operators & (and), | (or) and ! (not), which can be grouped using parentheses. In other words, to_tsquery expects already parsed text. Each token is reduced to a lexeme using the specified or current configuration. A weight class can be assigned to each lexeme entry to restrict the search region (see setweight for an explanation). For example:

'fat:a & rats'

The to_tsquery function can also accept a text string. In this case querytext should be quoted. This may be useful, for example, to use with a thesaurus dictionary. In the example below, a thesaurus contains rule supernovae stars : sn:

SELECT to_tsquery('''supernovae stars'' & !crab');
  to_tsquery
---------------
 'sn' & !'crab'

Without quotes to_tsquery will generate a syntax error.

       plainto_tsquery([config_name],  querytext text) returns TSQUERY
      

Transforms unformatted text querytext to tsquery. It is the same as to_tsquery but accepts text without quotes and will call the parser to break it into tokens. plainto_tsquery assumes the & boolean operator between words and does not recognize weight classes.

       querytree(query TSQUERY) returns TEXT
      

This returns the query used for searching an index. It can be used to test for an empty query. The SELECT below returns NULL, which corresponds to an empty query since GIN indexes do not support queries with negation (a full index scan is inefficient):

SELECT querytree(to_tsquery('!defined'));
 querytree
-----------

       text::TSQUERY returns TSQUERY
      

Directly casting text to a tsquery allows you to directly inject lexemes into a query using whatever positions and positional weight flags you choose to specify. The text should be formatted to match the way a vector is displayed by SELECT.

       numnode(query TSQUERY) returns INTEGER
      

This returns the number of nodes in a query tree. This function can be used to determine if query is meaningful (returns > 0), or contains only stop words (returns 0):

SELECT numnode(plainto_tsquery('the any'));
NOTICE:  query contains only stopword(s) or does not contain lexeme(s), ignored
 numnode
---------
       0

SELECT numnode(plainto_tsquery('the table'));
 numnode
---------
       1

SELECT numnode(plainto_tsquery('long table'));
 numnode
---------
       3

       TSQUERY && TSQUERY returns TSQUERY
      

Returns AND-ed TSQUERY

       TSQUERY || TSQUERY returns TSQUERY
      

Returns OR-ed TSQUERY

       !! TSQUERY returns TSQUERY
      

negation of TSQUERY

       TSQUERY < TSQUERY
       TSQUERY <= TSQUERY
       TSQUERY = TSQUERY
       TSQUERY >= TSQUERY
       TSQUERY > TSQUERY
      

All btree operations are defined for the tsquery type. tsqueries are compared to each other using lexicographical ordering.

9.13.3.1. Query Rewriting

Query rewriting is a set of functions and operators for the tsquery data type. It allows control at search query time without reindexing (the opposite of the thesaurus). For example, you can expand the search using synonyms (new york, big apple, nyc, gotham) or narrow the search to direct the user to some hot topic.

The ts_rewrite() function changes the original query by replacing part of the query with some other string of type tsquery, as defined by the rewrite rule. Arguments to ts_rewrite() can be names of columns of type tsquery.

CREATE TABLE aliases (t TSQUERY PRIMARY KEY, s TSQUERY);
INSERT INTO aliases VALUES('a', 'c');

        ts_rewrite (query TSQUERY, target TSQUERY, sample TSQUERY) returns TSQUERY
       

SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'

        ts_rewrite(ARRAY[query TSQUERY, target TSQUERY, sample TSQUERY]) returns TSQUERY
       

SELECT ts_rewrite(ARRAY['a & b'::tsquery, t,s]) FROM aliases;
 ts_rewrite
------------
 'b' & 'c'

        ts_rewrite (query TSQUERY,'SELECT target ,sample FROM test'::text) returns TSQUERY
       

SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
 ts_rewrite
------------
 'b' & 'c'

What if there are several instances of rewriting? For example, query 'a & b' can be rewritten as 'b & c' and 'cc'.

SELECT * FROM aliases;
     t     |  s
-----------+------
 'a'       | 'c'
 'x'       | 'z'
 'a' & 'b' | 'cc'

This ambiguity can be resolved by specifying a sort order:

SELECT ts_rewrite('a & b', 'SELECT t, s FROM aliases ORDER BY t DESC');
 ts_rewrite
   ---------
 'cc'

SELECT ts_rewrite('a & b', 'SELECT t, s FROM aliases ORDER BY t ASC');
  ts_rewrite
--------------
 'b' & 'c'

Let's consider a real-life astronomical example. We'll expand query supernovae using table-driven rewriting rules:

CREATE TABLE aliases (t tsquery primary key, s tsquery);
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));

SELECT ts_rewrite(to_tsquery('supernovae'),  'SELECT * FROM aliases') && to_tsquery('crab');
           ?column?
-------------------------------
( 'supernova' | 'sn' ) & 'crab'

Notice, that we can change the rewriting rule online:

UPDATE aliases SET s=to_tsquery('supernovae|sn & !nebulae') WHERE t=to_tsquery('supernovae');
SELECT ts_rewrite(to_tsquery('supernovae'),  'SELECT * FROM aliases') && to_tsquery('crab');
                   ?column?
-----------------------------------------------
 'supernova' | 'sn' & !'nebula' ) & 'crab'

9.13.3.2. Operators For tsquery

Rewriting can be slow for many rewriting rules since it checks every rule for a possible hit. To filter out obvious non-candidate rules there are containment operators for the tsquery type. In the example below, we select only those rules which might contain the original query:

SELECT ts_rewrite(ARRAY['a & b'::tsquery, t,s])
FROM aliases
WHERE 'a & b' @> t;
 ts_rewrite
------------
 'b' & 'c'

Two operators are defined for tsquery:

        TSQUERY @> TSQUERY
       

Returns true if the right argument might be contained in left argument.

        TSQUERY <@ TSQUERY
       

Returns true if the left argument might be contained in right argument.

9.13.3.3. Index For tsquery

To speed up operators <@ and @> for tsquery one can use a GiST index with a tsquery_ops opclass:

CREATE INDEX t_idx ON aliases USING gist (t tsquery_ops);