12.3. Additional Controls

To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the tsquery. Full text searching in PostgreSQL provides support for all of these functions.

12.3.1. Parsing

Full text searching in PostgreSQL provides the function to_tsvector, which converts a document to the tsvector data type. More details are available in Section 9.13.2, but for now consider a simple example:

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

In the example above we see that the resulting tsvector does not contain the words a, on, or it; the word rats became rat; and the punctuation sign - was ignored.

The to_tsvector function internally calls a parser which breaks the document (a fat cat sat on a mat - it ate a fat rats) into words and corresponding types. The default parser recognizes 23 types. Each word, depending on its type, passes through a group of dictionaries (Section 12.4). At the end of this step we obtain lexemes. For example, rats became rat because one of the dictionaries recognized that the word rats is a plural form of rat. Some words are treated as "stop words" (Section 12.4.1) and ignored since they occur too frequently and have little informational value. In our example these are a, on, and it. The punctuation sign - was also ignored because its type (Space symbols) is not indexed.

Which parser and dictionaries to use, and what types of lexemes to index, is determined by the full text configuration (Section 12.4.9). It is possible to have several different configurations in the same database, and many predefined system configurations are available for different languages. In our example we used the default configuration english for the English language.
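To see how the choice of configuration affects the result, compare the built-in simple configuration, which lowercases tokens but performs no stemming or stop-word removal (the output shown is indicative):

SELECT to_tsvector('simple', 'a fat  cat sat on a mat - it ate a fat rats');
                                  to_tsvector
--------------------------------------------------------------------------------
 'a':1,6,10 'ate':9 'cat':3 'fat':2,11 'it':8 'mat':7 'on':5 'rats':12 'sat':4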

As another example, below is the output from the ts_debug function (Section 12.8), which shows all details of the full text machinery:

SELECT * FROM ts_debug('english','a fat  cat sat on a mat - it ate a fat rats');
 Alias |  Description  | Token | Dictionaries | Lexized token
-------+---------------+-------+--------------+----------------
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | fat   | {english}    | english: {fat}
 blank | Space symbols |       |              |
 lword | Latin word    | cat   | {english}    | english: {cat}
 blank | Space symbols |       |              |
 lword | Latin word    | sat   | {english}    | english: {sat}
 blank | Space symbols |       |              |
 lword | Latin word    | on    | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | mat   | {english}    | english: {mat}
 blank | Space symbols |       |              |
 blank | Space symbols | -     |              |
 lword | Latin word    | it    | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | ate   | {english}    | english: {ate}
 blank | Space symbols |       |              |
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | fat   | {english}    | english: {fat}
 blank | Space symbols |       |              |
 lword | Latin word    | rats  | {english}    | english: {rat}
(24 rows)

The function setweight() is used to label the entries of a tsvector with a given weight. The typical usage of this is to mark out the different parts of a document, perhaps by importance. Later, this can be used for ranking of search results in addition to positional information (distance between query terms). If no ranking is required, positional information can be removed from a tsvector using the strip() function to save space.
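For example (the output shown is indicative), setweight() attaches a label to every position, and strip() removes the positional information entirely:

SELECT setweight(to_tsvector('english', 'fat cats'), 'A');
     setweight
-------------------
 'cat':2A 'fat':1A

SELECT strip(to_tsvector('english', 'a fat cat sat on a mat'));
          strip
-------------------------
 'cat' 'fat' 'mat' 'sat'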

Because to_tsvector(NULL) returns NULL, it is recommended to use coalesce whenever a column might be null. Here is a safe method for creating a tsvector from a structured document:

UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
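Once populated, the weighted column can be searched like any other tsvector; the table and column names here are the ones from the example above:

SELECT title FROM tt
WHERE ti @@ to_tsquery('english', 'fat & rat');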

The following functions allow manual parsing control:

        ts_parse(parser text, document text, OUT tokid integer, OUT token text) returns SETOF RECORD

Parses the given document and returns a series of records, one for each token produced by parsing. Each record includes a tokid giving its type and a token which gives its content:

SELECT * FROM ts_parse('default','123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number

        ts_token_type(parser text, OUT tokid integer, OUT alias text, OUT description text) returns SETOF RECORD

Returns a table which describes each kind of token the parser might produce as output. For each token type the table gives the tokid which the parser uses to label each token of that type, the alias which names the token type, and a short description:

SELECT * FROM ts_token_type('default');
 tokid |    alias     |            description
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity
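
The two functions can be combined to label each token with its alias instead of a bare tokid; a small sketch (the output shown is indicative):

SELECT t.alias, p.token
FROM ts_parse('default', '123 - a number') AS p
     JOIN ts_token_type('default') AS t ON t.tokid = p.tokid;
 alias | token
-------+--------
 uint  | 123
 blank |
 blank | -
 lword | a
 blank |
 lword | number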

12.3.2. Ranking Search Results

Ranking attempts to measure how relevant documents are to a particular query by inspecting how often each search term appears in the document and whether the different terms occur near each other. Full text searching provides two predefined ranking functions that take lexical, proximity, and structural information into account. However, the concept of relevancy is vague and very application-specific; different applications might require additional information for ranking, e.g., document modification time.

The lexical part of ranking reflects how often the query terms appear in the document, how close together they occur, and in what part of the document they occur. Note that ranking functions that use positional information will only work on unstripped tsvectors, because stripped tsvectors lack positional information.

The two ranking functions currently available are:

        ts_rank([ weights float4[], ] vector TSVECTOR, query TSQUERY [, normalization int4 ]) returns float4

This ranking function offers the ability to weigh word instances more heavily depending on how you have classified them. The weights specify how heavily to weigh each category of word:

{D-weight, C-weight, B-weight, A-weight}

If no weights are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

Often weights are used to mark words from special areas of the document, like the title or an initial abstract, and make them more or less important than words in the document body.
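
As a sketch, ranking matches by title weight using the apod table and textsearch column that appear in the examples below:

SELECT title, ts_rank('{0.1, 0.2, 0.4, 1.0}', textsearch, query) AS rnk
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 5;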

        ts_rank_cd([ weights float4[], ] vector TSVECTOR, query TSQUERY [, normalization int4 ]) returns float4

This function computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries", Information Processing and Management, 1999.

Since a longer document has a greater chance of containing a query term, it is reasonable to take document size into account, i.e., a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer normalization option that specifies whether a document's length should impact its rank. The option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4):

 0 (the default) ignores the document length
 1 divides the rank by 1 + the logarithm of the document length
 2 divides the rank by the document length
 4 divides the rank by the mean harmonic distance between extents (implemented only by ts_rank_cd)
 8 divides the rank by the number of unique words in the document
 16 divides the rank by 1 + the logarithm of the number of unique words in the document
 32 divides the rank by itself + 1

It is important to note that ranking functions do not use any global information so it is impossible to produce a fair normalization to 1% or 100%, as sometimes required. However, a simple technique like rank/(rank+1) can be applied. Of course, this is just a cosmetic change, i.e., the ordering of the search results will not change.

Several examples are shown below; note that the second example uses normalized ranking:

SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query) AS rnk
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;
                     title                     |   rnk
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query)/
(ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query) + 1) AS rnk
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE  query @@ textsearch
ORDER BY rnk DESC LIMIT 10;
                     title                     |        rnk
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481

The first argument in ts_rank_cd ('{0.1, 0.2, 0.4, 1.0}') is an optional parameter specifying the weights for labels D, C, B, and A as used in the setweight function. These default values show that lexemes labeled A are ten times more important than ones labeled D.
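
Assuming the normalization bits listed above, the rank/(rank+1) scaling used in the second example can also be requested directly with normalization option 32 instead of computing it in the SELECT list; a sketch:

SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', textsearch, query, 32) AS rnk
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;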

Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, this is almost impossible to avoid, since full text searching in a database must work even without indexes. Moreover, an index can be lossy (a GiST index, for example), so it must recheck documents to avoid false hits.

Note that the ranking functions above are only examples. You can write your own ranking functions and/or combine additional factors to fit your specific needs.

12.3.3. Highlighting Results

To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL full text searching provides the function ts_headline, which implements this functionality.

       ts_headline([ config_name text, ] document text, query TSQUERY [, options text ]) returns text

The ts_headline function accepts a document along with a query, and returns one or more ellipsis-separated excerpts from the document in which terms from the query are highlighted. The configuration used to parse the document can be specified by its config_name; if none is specified, the current configuration is used.

If an options string is specified it must consist of a comma-separated list of one or more option=value pairs. The available options are:

 StartSel, StopSel: the strings with which query words appearing in the document should be delimited, to distinguish them from other excerpted words.
 MaxWords, MinWords: these numbers determine the longest and shortest headlines to output.
 ShortWord: words of this length or less will be dropped at the start and end of a headline. The default value of 3 eliminates the English articles.
 HighlightAll: boolean flag; if true the whole document will be highlighted.

Any unspecified options receive these defaults:

StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE

For example:

SELECT ts_headline('a b c', 'c'::tsquery);
   headline
--------------
 a b <b>c</b>

SELECT ts_headline('a b c', 'c'::tsquery, 'StartSel=<,StopSel=>');
 ts_headline
-------------
 a b <c>
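
The length options work the same way; for instance, a sketch that limits the excerpt to between five and nine words (the sample sentence is arbitrary):

SELECT ts_headline('english',
  'The most common type of search is to find all documents containing given query terms',
  to_tsquery('english', 'query & terms'),
  'MinWords=5, MaxWords=9');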

ts_headline uses the original document, not a tsvector, so it can be slow and should be used with care. A typical mistake is to call ts_headline for every matching document when only ten documents will be shown. SQL subselects can help here; below is an example:

SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
      FROM apod, to_tsquery('stars') q
      WHERE ti @@ q
      ORDER BY rank DESC
      LIMIT 10) AS foo;

Note that dropping a parser function with CASCADE also drops the headline support of any full text configuration (config_name) that uses that parser.