PostgreSQL 8.3beta1 Documentation | ||||
---|---|---|---|---|
Prev | Fast Backward | Chapter 12. Full Text Search | Fast Forward | Next |
There are two kinds of indexes which can be used to speed up full text operators (Section 12.1.2). Note that indexes are not mandatory for full text searching.
CREATE INDEX name ON table USING gist(column);
Creates a GiST (Generalized Search Tree)-based index. The column can be of tsvector or tsquery type.
CREATE INDEX name ON table USING gin(column);
Creates a GIN (Generalized Inverted Index)-based index. The column must be of tsvector type.
A GiST index is lossy, meaning it is necessary to check the actual table row to eliminate false matches. PostgreSQL does this automatically; for example, in the query plan below, the Filter: line indicates the index output will be rechecked:
EXPLAIN SELECT * FROM apod WHERE textsearch @@ to_tsquery('supernovae'); QUERY PLAN ------------------------------------------------------------------------- Index Scan using textsearch_gidx on apod (cost=0.00..12.29 rows=2 width=1469) Index Cond: (textsearch @@ '''supernova'''::tsquery) Filter: (textsearch @@ '''supernova'''::tsquery)
GiST index lossiness happens because each document is represented by a fixed-length signature. The signature is generated by hashing (crc32) each word into a random bit in an n-bit string and all words combine to produce an n-bit document signature. Because of hashing there is a chance that some words hash to the same position and could result in a false hit. Signatures calculated for each document in a collection are stored in an RD-tree (Russian Doll tree), invented by Hellerstein, which is an adaptation of R-tree for sets. In our case the transitive containment relation is realized by superimposed coding (Knuth, 1973) of signatures, i.e., a parent is the result of 'OR'-ing the bit-strings of all children. This is a second factor of lossiness. It is clear that parents tend to be full of 1s (degenerates) and become quite useless because of the limited selectivity. Searching is performed as a bit comparison of a signature representing the query and an RD-tree entry. If all 1s of both signatures are in the same position we say that this branch probably matches the query, but if there is even one discrepancy we can definitely reject this branch.
Lossiness causes serious performance degradation since random access of heap records is slow and limits the usefulness of GiST indexes. The likelihood of false hits depends on several factors, like the number of unique words, so using dictionaries to reduce this number is recommended.
Actually, this is not the whole story. GiST indexes have an optimization for storing small tsvectors (< TOAST_INDEX_TARGET bytes, 512 bytes). On leaf pages small tsvectors are stored unchanged, while longer ones are represented by their signatures, which introduces some lossiness. Unfortunately, the existing index API does not allow for a return value to say whether it found an exact value (tsvector) or whether the result needs to be checked. This is why the GiST index is currently marked as lossy. We hope to improve this in the future.
GIN indexes are not lossy but their performance depends logarithmically on the number of unique words.
There is one side-effect of the non-lossiness of a GIN index when using query labels/weights, like 'supernovae:a'. A GIN index has all the information necessary to determine a match, so the heap is not accessed. However, label information is not stored in the index, so if the query involves label weights it must access the heap. Therefore, a special full text search operator @@@ was created which forces the use of the heap to get information about labels. GiST indexes are lossy so it always reads the heap and there is no need for a special operator. In the example below, fulltext_idx is a GIN index:
EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); QUERY PLAN ------------------------------------------------------------------------ Index Scan using textsearch_idx on apod (cost=0.00..12.30 rows=2 width=1469) Index Cond: (textsearch @@@ '''supernova'':A'::tsquery) Filter: (textsearch @@@ '''supernova'':A'::tsquery)
In choosing which index type to use, GiST or GIN, consider these differences:
GiN index lookups are three times faster than GiST
GiN indexes take three times longer to build than GiST
GiN is about ten times slower to update than GiST
GiN indexes are two-to-three times larger than GiST
In summary, GIN indexes are best for static data because the indexes are faster for lookups. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN handles +100,000 lexemes better but is slower to update.
Partitioning of big collections and the proper use of GiST and GIN indexes allows the implementation of very fast searches with online update. Partitioning can be done at the database level using table inheritance and constraint_exclusion, or distributing documents over servers and collecting search results using the contrib/dblink extension module. The latter is possible because ranking functions use only local information.