Jack Delany
Introduction:
Recent work by Peter Willetts research group has focused on the use of data fusion techniques to enhance similarity searching (Hert, et. al., JCICS 44, 2004, 1177, and Whittle, et. al., JCICS 44, 2004, 1840). Through a combination of relatively simple SQL queries, and an extension to the searching functionality, Daycart can be used to perform several types of data fusion searches with good performance.
Similarity fusion:
In 4.91, Daycart provides the generic similarity() function, which allows the computation of a similarity measure in terms of a,b,c,d, using a simple expression(5) language.
Best Similarity for multiple input queries:
This will be an experimental feature for 4.92, used to support "best similarity" fusion searching. A tanimoto() search with a query containing a comma-separated list of SMILES or ASCII fingerprints will be interpreted as the best similarity.
select smi from nci where tanimoto(smi, 'c1ccccc1,c1ccccn1') > 0.8;
is equivalent to:
select smi from nci where MAX(tanomoto(smi, 'c1ccccc1'), tanomoto(smi, 'c1ccccn1')) > 0.8;
In general this isn't identical to:
select smi from nci where tanomoto(smi, 'c1ccccc1') > 0.8 or tanomoto(smi, 'c1ccccn1') > 0.8;
A query with a range of tanimoto values, or a "less-than" query doesn't have the same meaning in the two cases. Since the functional and index forms must have the same semantics, the index must necessarily perform the implicit MAX() function internally.
SQL> select smi, tanimoto(smi, 'c1ccccc1') t1, tanimoto(smi, 'c1ccccn1') t2, tanimoto(smi, 'c1ccccc1,c1ccccn1') t3 from nci where tanimoto(smi, 'c1ccccc1,c1ccccn1') > 0.8 SMI T1 T2 T3 ------------------------------ ---------- ---------- ---------- c1ccccc1 1 .269230769 1 c1ccncc1 .269230769 1 1 Oc1ccncc1 .222222222 .810344828 .810344828 Elapsed: 00:00:00.15Turbo Similarity:
In order to perform a turbo similarity, we first need the set of N nearest neighbors from an initial query; this can then be fed into the new "Best similarity" search in a two-pass query.
For this we use the new (with 9i) ability to create user-defined aggregate functions in Oracle. Basically you can create your own aggregate fucnction, which can be called from SQL as output during "group by" operations.
select employee, max(bonus) from monthly_bonuses group by employee;
In this case we create an aggregate function called comma_concat() which simply concatenates together it's string arguments into a single comma-separated list. So,
SQL> select smi from nci where rownum < 5; SMI ---------------------------------------- CC1=CC(=O)C=CC1=O S(Sc1nc2ccccc2s1)c3nc4ccccc4s3 Oc1c(Cl)cc(cc1N(=O)=O)[N+](=O)[O-] N=c1[nH]cc(s1)N(=O)=O SQL> select comma_concat(smi) from nci where rownum < 5; comma_concat(smi) ------------------------------------------------------------ CC1=CC(=O)C=CC1=O,S(Sc1nc2ccccc2s1)c3nc4ccccc4s3, Oc1c(Cl)cc(cc1N(=O)=O)[N+](=O)[O-],N=c1[nH]cc(s1)N(=O)=OTaking this further, we can use the comma_concat() function in the processing of hits from a simliarity search to reformat the hits into a string suitable for input to the new best similarity routine.
SQL> select comma_concat(smi) from nci where tanimoto(smi, 'NCCc1ccccc1', 10) > 0.0; comma_concat(smi) ------------------------------------------------------------ Cc1ccc(CCN)cc1,CC(CN)c1ccccc1,CCc1ccc(CCN)cc1,CNCCc1ccccc1, CC(C)c1ccc(CCN)cc1,NCCc1ccccc1,NCCc1ccccc1,Cc1cccc(CCN)c1, Cc1cc(C)cc(CCN)c1,CC(C)(CN)c1ccccc1 SQL> select smi, tanimoto(smi, 'NCCc1ccccc1') from nci where tanimoto(smi, (select comma_concat(distinct smi) from nci where tanimoto(smi, 'NCCc1ccccc1', 10) > 0.0) , 10) > 0.0 order by tanimoto(smi, 'NCCc1ccccc1') desc; SMI TANIMOTO(SMI,'NCCC1CCCCC1') ---------------------------------------- --------------------------- NCCc1ccccc1 1 CCc1ccc(CCN)cc1 .949152542 Cc1ccc(CCN)cc1 .949152542 CC(CN)c1ccccc1 .933333333 Cc1cccc(CCN)c1 .918032787 Cc1cc(C)cc(CCN)c1 .918032787 CNCCc1ccccc1 .903225806 CC(C)c1ccc(CCN)cc1 .903225806 CC(C)(CN)c1ccccc1 .903225806 CNCCc1ccccc1 .903225806 10 rows selected. Elapsed: 00:00:00.38
Rank-based fusion for multiple input queries:
We can use the "rownum" feature of SQL to generate rank orders on sorted output (neglecting ties).
SQL> select smi, tanimoto(smi, 'NCCc1ccccc1'), rownum rank from (select smi from nci where tanimoto(smi, 'NCCc1ccccc1', 10) > 0.8 order by tanimoto(smi, 'NCCc1ccccc1') desc); SMI TANIMOTO(SMI,'NCCC1CCCCC1') RANK ----------------------------------- --------------------------- ---------- NCCc1ccccc1 1 1 Cc1ccc(CCN)cc1 .949152542 2 CCc1ccc(CCN)cc1 .949152542 3 CC(CN)c1ccccc1 .933333333 4 Cc1cccc(CCN)c1 .918032787 5 Cc1cc(C)cc(CCN)c1 .918032787 6 CNCCc1ccccc1 .903225806 7 CC(C)c1ccc(CCN)cc .903225806 8 CC(C)(CN)c1ccccc1 .903225806 9 CNCCc1ccccc1 .903225806 10 10 rows selected. Elapsed: 00:00:00.10 SQL> select smi, rownum rank from (select smi from nci where tanimoto(smi, 'NCCc1ccccc1', 10) > 0.8 order by tanimoto(smi, 'NCCc1ccccc1') desc) union all select smi, rownum rank from (select smi from nci where tanimoto(smi, 'NCCCc1ccccc1', 10) > 0.8 order by tanimoto(smi, 'NCCCc1ccccc1') desc) SMI RANK ---------------------------------------- ---------- CC(C)(CN)c1ccccc1 9 CC(C)C(N)CCc1ccccc1 6 CC(C)c1ccc(CCN)cc1 8 CC(CN)c1ccccc1 4 CC(N)CCc1ccccc1 3 CC(N)CCc1ccccc1 4 CC(N)Cc1ccccc1 9 CC(N)Cc1ccccc1 10 CCc1ccc(CCN)cc1 3 CNCCCc1ccccc1 5 CNCCc1ccccc1 7 SMI RANK ---------------------------------------- ---------- CNCCc1ccccc1 10 Cc1cc(C)cc(CCN)c1 6 Cc1ccc(CCN)cc1 2 Cc1cccc(CCN)c1 5 NC1CC1c2ccccc2 7 NC1CC1c2ccccc2 8 NCCCc1ccccc1 1 NCCCc1ccccc1 2 NCCc1ccccc1 1 20 rows selected. Elapsed: 00:00:00.17
The final queries, using 10 targets (either as ten individual targets or as the result of a first nearest-neighbors tanimoto search) are long (see the query 10-neighbor best rank turbo similarity here) but provide reasonable results with decent performance. Note also that Oracle does not provide a built-in median function. The example below uses a function named agg_median(), again, implemented using the user-defined aggregate functionality available starting with Oracle 9i.
SQL> select smi, min(rank), tanimoto(smi, 'NCCc1ccccc1') from (-- UNION --) group by smi order by 2 asc, 3 desc; SMI MIN(RANK) TANI ---------------------------------------- ---------- ---------- NCCc1ccccc1 1 1 CCc1ccc(CCN)cc1 1 .949152542 CC(CN)c1ccccc1 1 .933333333 Cc1cc(C)cc(CCN)c1 1 .918032787 CC(C)(CN)c1ccccc1 1 .903225806 CNCCc1ccccc1 1 .903225806 CC(C)c1ccc(CCN)cc1 1 .903225806 Cc1ccc(CCN)cc1 2 .949152542 Cc1cccc(CCN)c1 2 .918032787 CN(C)CCc1ccccc1 2 .888888889 C[N+](C)(C)CCc1ccccc1 3 .888888889 ... Elapsed: 00:00:02.75 SQL> select smi, agg_median(rank), tanimoto(smi, 'NCCc1ccccc1') tani from (-- UNION ALL --) group by smi order by 2 asc, 3 desc; SMI AGG_MEDIAN(RANK) TANI ---------------------------------------- ---------------- ---------- CC(C)c1ccc(CCN)cc1 3 .903225806 NCCc1ccccc1 4 1 CC(CN)c1ccccc1 8 .933333333 Cc1cc(C)cc(CCN)c1 8 .918032787 Cc1ccc(CCN)cc1C 8 .848484848 CCc1ccc(CCN)cc1 9 .949152542 Cc1cccc(CCN)c1 9 .918032787 Cc1ccc(CCN)cc1 10 .949152542 Cc1cc(C)c(CCN)c(C)c1 11 .875 Cc1cc(C)c(CCN)cc1C 11 .835820896 Cc1ccc(CCN)c(C)c1 12 .875 ... Elapsed: 00:00:02.85