Fusion Simliarity Searching in Daycart

Jack Delany

DAYLIGHT Chemical Information Systems, Inc. Aliso Viejo, CA USA


Introduction:

Recent work by Peter Willetts research group has focused on the use of data fusion techniques to enhance similarity searching (Hert, et. al., JCICS 44, 2004, 1177, and Whittle, et. al., JCICS 44, 2004, 1840). Through a combination of relatively simple SQL queries, and an extension to the searching functionality, Daycart can be used to perform several types of data fusion searches with good performance.

Similarity fusion:

In 4.91, Daycart provides the generic similarity() function, which allows the computation of a similarity measure in terms of a,b,c,d, using a simple expression(5) language.

Best Similarity for multiple input queries:

This will be an experimental feature for 4.92, used to support "best similarity" fusion searching. A tanimoto() search with a query containing a comma-separated list of SMILES or ASCII fingerprints will be interpreted as the best similarity.

select smi from nci where tanimoto(smi, 'c1ccccc1,c1ccccn1') > 0.8;

is equivalent to:

select smi from nci where
  MAX(tanomoto(smi, 'c1ccccc1'), tanomoto(smi, 'c1ccccn1')) > 0.8;

In general this isn't identical to:

select smi from nci where
  tanomoto(smi, 'c1ccccc1') > 0.8
   or
  tanomoto(smi, 'c1ccccn1') > 0.8;

A query with a range of tanimoto values, or a "less-than" query doesn't have the same meaning in the two cases. Since the functional and index forms must have the same semantics, the index must necessarily perform the implicit MAX() function internally.


SQL> select smi, tanimoto(smi, 'c1ccccc1') t1, 
                 tanimoto(smi, 'c1ccccn1') t2,
                 tanimoto(smi, 'c1ccccc1,c1ccccn1') t3 from nci
           where tanimoto(smi, 'c1ccccc1,c1ccccn1') > 0.8

SMI                                    T1         T2         T3
------------------------------ ---------- ---------- ----------
c1ccccc1                                1 .269230769          1
c1ccncc1                       .269230769          1          1
Oc1ccncc1                      .222222222 .810344828 .810344828

Elapsed: 00:00:00.15
Turbo Similarity:

In order to perform a turbo similarity, we first need the set of N nearest neighbors from an initial query; this can then be fed into the new "Best similarity" search in a two-pass query.

For this we use the new (with 9i) ability to create user-defined aggregate functions in Oracle. Basically you can create your own aggregate fucnction, which can be called from SQL as output during "group by" operations.

select employee, max(bonus) from monthly_bonuses
  group by employee;

In this case we create an aggregate function called comma_concat() which simply concatenates together it's string arguments into a single comma-separated list. So,

SQL> select smi from nci where rownum < 5;

SMI
----------------------------------------
CC1=CC(=O)C=CC1=O
S(Sc1nc2ccccc2s1)c3nc4ccccc4s3
Oc1c(Cl)cc(cc1N(=O)=O)[N+](=O)[O-]
N=c1[nH]cc(s1)N(=O)=O

SQL> select comma_concat(smi) from nci where rownum < 5;

comma_concat(smi)
------------------------------------------------------------
CC1=CC(=O)C=CC1=O,S(Sc1nc2ccccc2s1)c3nc4ccccc4s3,
Oc1c(Cl)cc(cc1N(=O)=O)[N+](=O)[O-],N=c1[nH]cc(s1)N(=O)=O
Taking this further, we can use the comma_concat() function in the processing of hits from a simliarity search to reformat the hits into a string suitable for input to the new best similarity routine.
SQL> select comma_concat(smi) from nci where
        tanimoto(smi, 'NCCc1ccccc1', 10) > 0.0;

comma_concat(smi)
------------------------------------------------------------
Cc1ccc(CCN)cc1,CC(CN)c1ccccc1,CCc1ccc(CCN)cc1,CNCCc1ccccc1,
CC(C)c1ccc(CCN)cc1,NCCc1ccccc1,NCCc1ccccc1,Cc1cccc(CCN)c1,
Cc1cc(C)cc(CCN)c1,CC(C)(CN)c1ccccc1

SQL> select smi, tanimoto(smi, 'NCCc1ccccc1') from nci
       where tanimoto(smi,
        (select comma_concat(distinct smi) from nci where
         tanimoto(smi, 'NCCc1ccccc1', 10) > 0.0) , 10) > 0.0 
      order by tanimoto(smi, 'NCCc1ccccc1') desc;

SMI                                      TANIMOTO(SMI,'NCCC1CCCCC1')
---------------------------------------- ---------------------------
NCCc1ccccc1                                                        1
CCc1ccc(CCN)cc1                                           .949152542
Cc1ccc(CCN)cc1                                            .949152542
CC(CN)c1ccccc1                                            .933333333
Cc1cccc(CCN)c1                                            .918032787
Cc1cc(C)cc(CCN)c1                                         .918032787
CNCCc1ccccc1                                              .903225806
CC(C)c1ccc(CCN)cc1                                        .903225806
CC(C)(CN)c1ccccc1                                         .903225806
CNCCc1ccccc1                                              .903225806

10 rows selected.

Elapsed: 00:00:00.38

Rank-based fusion for multiple input queries:

We can use the "rownum" feature of SQL to generate rank orders on sorted output (neglecting ties).

SQL> select smi, tanimoto(smi, 'NCCc1ccccc1'), rownum rank from
        (select smi from nci where
        tanimoto(smi, 'NCCc1ccccc1', 10) > 0.8
        order by tanimoto(smi, 'NCCc1ccccc1') desc);

SMI                                 TANIMOTO(SMI,'NCCC1CCCCC1')       RANK
----------------------------------- --------------------------- ----------
NCCc1ccccc1                                                   1          1
Cc1ccc(CCN)cc1                                       .949152542          2
CCc1ccc(CCN)cc1                                      .949152542          3
CC(CN)c1ccccc1                                       .933333333          4
Cc1cccc(CCN)c1                                       .918032787          5
Cc1cc(C)cc(CCN)c1                                    .918032787          6
CNCCc1ccccc1                                         .903225806          7
CC(C)c1ccc(CCN)cc                                    .903225806          8
CC(C)(CN)c1ccccc1                                    .903225806          9
CNCCc1ccccc1                                         .903225806         10

10 rows selected.

Elapsed: 00:00:00.10

SQL> select smi, rownum rank from
        (select smi from nci where
        tanimoto(smi, 'NCCc1ccccc1', 10) > 0.8
        order by tanimoto(smi, 'NCCc1ccccc1') desc)
      union all
      select smi, rownum rank from
        (select smi from nci where
        tanimoto(smi, 'NCCCc1ccccc1', 10) > 0.8
        order by tanimoto(smi, 'NCCCc1ccccc1') desc)

SMI                                            RANK
---------------------------------------- ----------
CC(C)(CN)c1ccccc1                                 9
CC(C)C(N)CCc1ccccc1                               6
CC(C)c1ccc(CCN)cc1                                8
CC(CN)c1ccccc1                                    4
CC(N)CCc1ccccc1                                   3
CC(N)CCc1ccccc1                                   4
CC(N)Cc1ccccc1                                    9
CC(N)Cc1ccccc1                                   10
CCc1ccc(CCN)cc1                                   3
CNCCCc1ccccc1                                     5
CNCCc1ccccc1                                      7

SMI                                            RANK
---------------------------------------- ----------
CNCCc1ccccc1                                     10
Cc1cc(C)cc(CCN)c1                                 6
Cc1ccc(CCN)cc1                                    2
Cc1cccc(CCN)c1                                    5
NC1CC1c2ccccc2                                    7
NC1CC1c2ccccc2                                    8
NCCCc1ccccc1                                      1
NCCCc1ccccc1                                      2
NCCc1ccccc1                                       1

20 rows selected.

Elapsed: 00:00:00.17

The final queries, using 10 targets (either as ten individual targets or as the result of a first nearest-neighbors tanimoto search) are long (see the query 10-neighbor best rank turbo similarity here) but provide reasonable results with decent performance. Note also that Oracle does not provide a built-in median function. The example below uses a function named agg_median(), again, implemented using the user-defined aggregate functionality available starting with Oracle 9i.

SQL> select smi, min(rank), tanimoto(smi, 'NCCc1ccccc1')
            from (-- UNION --)
            group by smi
            order by 2 asc, 3 desc;

SMI                                       MIN(RANK)       TANI
---------------------------------------- ---------- ----------
NCCc1ccccc1                                       1          1
CCc1ccc(CCN)cc1                                   1 .949152542
CC(CN)c1ccccc1                                    1 .933333333
Cc1cc(C)cc(CCN)c1                                 1 .918032787
CC(C)(CN)c1ccccc1                                 1 .903225806
CNCCc1ccccc1                                      1 .903225806
CC(C)c1ccc(CCN)cc1                                1 .903225806
Cc1ccc(CCN)cc1                                    2 .949152542
Cc1cccc(CCN)c1                                    2 .918032787
CN(C)CCc1ccccc1                                   2 .888888889
C[N+](C)(C)CCc1ccccc1                             3 .888888889

...

Elapsed: 00:00:02.75

SQL> select smi, agg_median(rank), tanimoto(smi, 'NCCc1ccccc1') tani
            from (-- UNION ALL --)
            group by smi
            order by 2 asc, 3 desc;

SMI                                      AGG_MEDIAN(RANK)       TANI
---------------------------------------- ---------------- ----------
CC(C)c1ccc(CCN)cc1                                      3 .903225806
NCCc1ccccc1                                             4          1
CC(CN)c1ccccc1                                          8 .933333333
Cc1cc(C)cc(CCN)c1                                       8 .918032787
Cc1ccc(CCN)cc1C                                         8 .848484848
CCc1ccc(CCN)cc1                                         9 .949152542
Cc1cccc(CCN)c1                                          9 .918032787
Cc1ccc(CCN)cc1                                         10 .949152542
Cc1cc(C)c(CCN)c(C)c1                                   11       .875
Cc1cc(C)c(CCN)cc1C                                     11 .835820896
Cc1ccc(CCN)c(C)c1                                      12       .875
 
...

Elapsed: 00:00:02.85