Daycart Searching

3. Searching:

3.1 General Comments

At present, Postgresql does not support a viable interface for adding custom index capabilities. It's index customization capabilities are cumbersome. Fortunately, calling individual custom functions within Postgresql is very fast, on the order of one million calls per second. This allows one to perform reasonable searching operations using very simple row-at-a-time SQL queries. This section discusses how to best implement various types of queries within Postgresql using a combination of indexes and row-at-a-time functions.

3.2 Exact Searching

An 'exact structure' search can be implemented as an index query against a canonical SMILES column. One should store structures and reactions in their canonical form, with isomeric information (eg. use smi2cansmi(smiles, 1)).

The following simple example creates a table with a SMILES column and a simple built-in text index on the SMILES column. The insert normalizes the SMILES to canonical form.

SQL> create table mytable (id number, smiles text);
SQL> create index mytable_smi_i on mytable(smiles);
SQL> insert into mytable (1, smi2cansmi('NCCC', 1));

In order to retrieve exact structure matches, a query of the following type is used:

SQL> select * from mytable where smiles = smi2cansmi(query, 1);

By canonicalizing both the stored SMILES and the query SMILES, a simple text lookup can be used to find exact SMILES matches. That being said, the next section describes the general approach of using a reduced graph as the index lookup key. The reduced graph is more flexible as it allows one to search for either exact matches, tautomers, or graphs with a common index and query idiom.

3.3 Graph and Tautomer Searching

The approach here is to store the canonical graph as an additional column for each table which contains either structures or reactions within the database. A query on the graph column will return the set of molecules or reactions which share the same graph. These can then be filtered with additional query terms to provide fast searching for isomeric and unique exact lookup, tautomer lookup, and graph lookup.

The following simple example creates a table with a SMILES column and an additional graph column. The graph column has a simple built-in text index. The insert normalizes the graph using smi2graph(), which generates a canonical reduced graph from a molecule or reaction SMILES.

SQL> create table mytable (id number, smiles text, graph text);
SQL> create index mytable_grf_i on mytable(graph);
SQL> insert into mytable (1, 'NCCC', smi2graph('NCCC'));

Note that these queries do not require that the SMILES column be canonicalized, nor do they use an index on the SMILES column. All of the following example queries use the index against the graph column plus row-at-a-time operations on the subset of rows which match the graph step.

The following example queries can be used against this pair of database tables:

-- Find same Isomeric SMILES
--
SQL> select * from mytable
      where graph = smi2graph(query)
      and asmiles(smiles, query) = 1;
--
-- Find same Unique SMILES
-- 
SQL> select * from mytable
      where graph = smi2graph(query)
      and usmiles(smiles, query) = 1;
--
-- Find same Tautomers
--
SQL> select * from mytable
      where graph = smi2graph(query)
      and tautomer(smiles, query) = 1;
--
-- Find same Graph
--
SQL> select * from mytable
      where graph = smi2graph(query);

A way to conceptualize these queries is as follows: for any query, one can quickly find the set of rows in the database with the same reduced graph. Once one has the set of rows with the same reduced graph, one can further prune that set to find the tautomers, isomers, etc.

3.4 SMARTS and Similarity Searching

The arguments to the tanimoto(), tversky(), euclid(), similarity(), and fingertest() functions can be either a fingerprint, screenfp, or SMILES. All of these functions actually perform their computations on Daylight fingerprints. When needed, the functions will use the input SMILES to create a normal fingerprint of the appropriate size for comparison. Fingerprint creation, however is expensive, so for efficient queries one should precompute and store the screenfp (which contains the fingerprint) with the SMILES and use the screenfp exclusively in queries.

For substructure search, using the contains() and matches() functions, there are special forms of these two functions which take an extra screenfp argument. This allows the search code to use the fingeprint and additional statistics contained in the screenfp data for filtering of the rows. This dramatically improves the efficiency of substructure searching.

SQL> create table mytable (id number, smiles text, screenfp text);
SQL> insert into mytable (1, 'NCCC', smi2screen('NCCC'));

Note that these queries do not require that the SMILES column be canonicalized, nor do they use an index at all. These are the most resource-intensive queries as they perform a full tablescan for every query (unless other non-chemistry query clauses can prune). Typically, tables of several hundred thousand rows are easily handled using this approach.

The following example queries can be used against this pair of database tables:

-- Find structures by tanimoto similarity:
--
SQL> select * from mytable
      where tanimoto(screenfp, smi2fp(query, 0, 7, 512)) > 0.9;
--
-- Find structures by euclidean distance:
--
SQL> select * from mytable
      where euclid(screenfp, smi2fp(query, 0, 7, 512)) < 0.1;
--
-- Find matching structures of a SMILES query
--
SQL> select * from mytable
      where contains(smiles, screenfp, smiles_query) = 1;
--
-- Find matching structures of a SMARTS query
--
SQL> select * from mytable
      where matches(smiles, screenfp, smarts_query) = 1;

3.5 Best Practices

This section will describe a recommended organization of ones structure tables. It is devised to provide complete searching functionality with the minimum set of columns and indexes. It's a simple combination of the graph and screenfp cases described earlier in this section.

SQL> create table mytable (id number, smiles text,
                           graph text, smiscreen text);
SQL> create index mytable_grf_i on mytable(graph);
SQL> insert into mytable (1, newsmi',
                          smi2graph(newsmi), smi2screen(newsmi));

Optionally, one may wish to canonicalize the smiles column. In particular, if one wishes to use a database uniqueness constraint to prevent duplicate SMILES from being stored, one should always canonicalize the SMILES on insert and then set the uniqueness constraint on the SMILES column.

Under this organization, all of the queries described in the 'Graph and Tautomer Search' section, as well as all of the queries in the 'SMARTS and Similarity Search' section may be used. Note that the queries in the 'Exact Search' section are covered by the graph searches.