F.15. ltree

ltree is a PostgreSQL module that contains implementation of data types, indexed access methods and queries for data organized as a tree-like structures.

F.15.1. Definitions

A label of a node is a sequence of one or more words separated by blank character '_' and containing letters and digits ( for example, [a-zA-Z0-9] for C locale). The length of a label is limited by 256 bytes.

Example: 'Countries', 'Personal_Services'

A label path of a node is a sequence of one or more dot-separated labels l1.l2...ln, represents path from root to the node. The length of a label path is limited by 65Kb, but size <= 2Kb is preferrable. We consider it's not a strict limitation (maximal size of label path for DMOZ catalogue - http://www.dmoz.org, is about 240 bytes!)

Example: 'Top.Countries.Europe.Russia'

We introduce several datatypes:

F.15.2. Operations

The following operations are defined for type ltree:

Operations for arrays of ltree (ltree[]):

F.15.3. Remark

Operations <@, @>, @ and ~ have analogues - ^<@, ^@>, ^@, ^~, which don't use indices!

F.15.4. Indices

Various indices could be created to speed up execution of operations:

F.15.5. Functions

F.15.6. Installation

  cd contrib/ltree
  make
  make install
  make installcheck
  

F.15.7. Example

 createdb ltreetest
 psql ltreetest < /usr/local/pgsql/share/contrib/ltree.sql
 psql ltreetest < ltreetest.sql
  

Now, we have a database ltreetest populated with a data describing hierarchy shown below:


 
                            TOP
                         /   |  \     
                 Science Hobbies Collections  
                     /       |              \
            Astronomy   Amateurs_Astronomy Pictures
               /  \                            |
    Astrophysics  Cosmology                Astronomy
                                            /  |    \
                                     Galaxies Stars Astronauts
  

Inheritance:

ltreetest=# select path from test where path <@ 'Top.Science';
                path                
------------------------------------
 Top.Science
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology
(4 rows)
  

Matching:

ltreetest=# select path from test where path ~ '*.Astronomy.*';
                     path                      
-----------------------------------------------
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology
 Top.Collections.Pictures.Astronomy
 Top.Collections.Pictures.Astronomy.Stars
 Top.Collections.Pictures.Astronomy.Galaxies
 Top.Collections.Pictures.Astronomy.Astronauts
(7 rows)
ltreetest=# select path from test where path ~ '*.!pictures@.*.Astronomy.*';
                path                
------------------------------------
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology
(3 rows)
  

Full text search:

ltreetest=# select path from test where path @ 'Astro*% & !pictures@';
                path                
------------------------------------
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology
 Top.Hobbies.Amateurs_Astronomy
(4 rows)

ltreetest=# select path from test where path @ 'Astro* & !pictures@';
                path                
------------------------------------
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology
(3 rows)
  

Using Functions:

ltreetest=# select subpath(path,0,2)||'Space'||subpath(path,2) from test where path <@ 'Top.Science.Astronomy';
                 ?column?                 
------------------------------------------
 Top.Science.Space.Astronomy
 Top.Science.Space.Astronomy.Astrophysics
 Top.Science.Space.Astronomy.Cosmology
(3 rows)
We could create SQL-function:
CREATE FUNCTION ins_label(ltree, int4, text) RETURNS ltree 
AS 'select subpath($1,0,$2) || $3 || subpath($1,$2);'
LANGUAGE SQL IMMUTABLE;
  

and previous select could be rewritten as:

ltreetest=# select ins_label(path,2,'Space') from test where path <@ 'Top.Science.Astronomy';
                ins_label                 
------------------------------------------
 Top.Science.Space.Astronomy
 Top.Science.Space.Astronomy.Astrophysics
 Top.Science.Space.Astronomy.Cosmology
(3 rows)
  

Or with another arguments:

CREATE FUNCTION ins_label(ltree, ltree, text) RETURNS ltree
AS 'select subpath($1,0,nlevel($2)) || $3 || subpath($1,nlevel($2));'
LANGUAGE SQL IMMUTABLE;

ltreetest=# select ins_label(path,'Top.Science'::ltree,'Space') from test where path <@ 'Top.Science.Astronomy';
                ins_label                 
------------------------------------------
 Top.Science.Space.Astronomy
 Top.Science.Space.Astronomy.Astrophysics
 Top.Science.Space.Astronomy.Cosmology
(3 rows)
  

F.15.8. Additional data

To get more feeling from our ltree module you could download dmozltree-eng.sql.gz (about 3Mb tar.gz archive containing 300,274 nodes), available from http://www.sai.msu.su/~megera/postgres/gist/ltree/ dmozltree-eng.sql.gz, which is DMOZ catalogue, prepared for use with ltree. Setup your test database (dmoz), load ltree module and issue command:

   zcat dmozltree-eng.sql.gz| psql dmoz
  

Data will be loaded into database dmoz and all indices will be created.

F.15.9. Benchmarks

All runs were performed on my IBM ThinkPad T21 (256 MB RAM, 750Mhz) using DMOZ data, containing 300,274 nodes (see above for download link). We used some basic queries typical for walking through catalog.

F.15.9.1. Queries

  • Q0: Count all rows (sort of base time for comparison)

         select count(*) from dmoz;
          count  
         --------
          300274
         (1 row)
         
  • Q1: Get direct children (without inheritance)

         select path from dmoz where path ~ 'Top.Adult.Arts.Animation.*{1}';
                        path                
         -----------------------------------
          Top.Adult.Arts.Animation.Cartoons
          Top.Adult.Arts.Animation.Anime
         (2 rows)
         
  • Q2: The same as Q1 but with counting of successors

         select path as parentpath , (select count(*)-1 from dmoz where path <@
         p.path) as count from dmoz p where path ~ 'Top.Adult.Arts.Animation.*{1}';
                     parentpath             | count 
         -----------------------------------+-------
          Top.Adult.Arts.Animation.Cartoons |     2
          Top.Adult.Arts.Animation.Anime    |    61
         (2 rows)
         
  • Q3: Get all parents

         select path from dmoz where path @> 'Top.Adult.Arts.Animation' order by
         path asc;
                    path           
         --------------------------
          Top
          Top.Adult
          Top.Adult.Arts
          Top.Adult.Arts.Animation
         (4 rows)
         
  • Q4: Get all parents with counting of children

         select path, (select count(*)-1 from dmoz where path <@ p.path) as count
         from dmoz p where path @> 'Top.Adult.Arts.Animation' order by path asc;
                    path           | count  
         --------------------------+--------
          Top                      | 300273
          Top.Adult                |   4913
          Top.Adult.Arts           |    339
          Top.Adult.Arts.Animation |     65
         (4 rows)
         
  • Q5: Get all children with levels

         select path, nlevel(path) - nlevel('Top.Adult.Arts.Animation') as level
         from dmoz where path ~ 'Top.Adult.Arts.Animation.*{1,2}' order by path asc;
                               path                      | level 
         ------------------------------------------------+-------
          Top.Adult.Arts.Animation.Anime                 |     1
          Top.Adult.Arts.Animation.Anime.Fan_Works       |     2
          Top.Adult.Arts.Animation.Anime.Games           |     2
          Top.Adult.Arts.Animation.Anime.Genres          |     2
          Top.Adult.Arts.Animation.Anime.Image_Galleries |     2
          Top.Adult.Arts.Animation.Anime.Multimedia      |     2
          Top.Adult.Arts.Animation.Anime.Resources       |     2
          Top.Adult.Arts.Animation.Anime.Titles          |     2
          Top.Adult.Arts.Animation.Cartoons              |     1
          Top.Adult.Arts.Animation.Cartoons.AVS          |     2
          Top.Adult.Arts.Animation.Cartoons.Members      |     2
         (11 rows)
         

F.15.9.2. Timings

+---------------------------------------------+
|Query|Rows|Time (ms) index|Time (ms) no index|
|-----+----+---------------+------------------|
|   Q0|   1|             NA|           1453.44|
|-----+----+---------------+------------------|
|   Q1|   2|           0.49|           1001.54|
|-----+----+---------------+------------------|
|   Q2|   2|           1.48|           3009.39|
|-----+----+---------------+------------------|
|   Q3|   4|           0.55|            906.98|
|-----+----+---------------+------------------|
|   Q4|   4|       24385.07|           4951.91|
|-----+----+---------------+------------------|
|   Q5|  11|           0.85|           1003.23|
+---------------------------------------------+
   

Timings without indices were obtained using operations which doesn't use indices (see above)

F.15.9.3. Remarks

We didn't run full-scale tests, also we didn't present (yet) data for operations with arrays of ltree (ltree[]) and full text searching. We'll appreciate your input. So far, below some (rather obvious) results:

  • Indices does help execution of queries

  • Q4 performs bad because one needs to read almost all data from the HDD

F.15.10. Some Backgrounds

The approach we use for ltree is much like one we used in our other GiST based contrib modules (intarray, tsearch, tree, btree_gist, rtree_gist). Theoretical background is available in papers referenced from our GiST development page (http://www.sai.msu.su/~megera/postgres/gist).

A hierarchical data structure (tree) is a set of nodes. Each node has a signature (LPS) of a fixed size, which is a hashed label path of that node. Traversing a tree we could *certainly* prune branches if

   LQS (bitwise AND) LPS != LQS
  

where LQS is a signature of lquery or ltxtquery, obtained in the same way as LPS.

   ltree[]:
  

For array of ltree LPS is a bitwise OR-ed signatures of *ALL* children reachable from that node. Signatures are stored in RD-tree, implemented using GiST, which provides indexed access.

   ltree:
  

For ltree we store LPS in a B-tree, implemented using GiST. Each node entry is represented by (left_bound, signature, right_bound), so that we could speedup operations <, <=, =, >=, > using left_bound, right_bound and prune branches of a tree using signature.

F.15.11. Authors

All work was done by Teodor Sigaev () and Oleg Bartunov (). See http://www.sai.msu.su/~megera/postgres/gist for additional information. Authors would like to thank Eugeny Rodichev for helpful discussions. Comments and bug reports are welcome.