Jack Delany
Introduction:
Recent work by Peter Willetts research group has focused on the use of data fusion techniques to enhance similarity searching (Hert, et. al., JCICS 44, 2004, 1177, and Whittle, et. al., JCICS 44, 2004, 1840). Through a combination of relatively simple SQL queries, and an extension to the searching functionality, Daycart can be used to perform several types of data fusion searches with good performance.
Similarity fusion:
In 4.91, Daycart provides the generic similarity() function, which allows the computation of a similarity measure in terms of a,b,c,d, using a simple expression(5) language.
Best Similarity for multiple input queries:
This will be an experimental feature for 4.92, used to support "best similarity" fusion searching. A tanimoto() search with a query containing a comma-separated list of SMILES or ASCII fingerprints will be interpreted as the best similarity.
select smi from nci where tanimoto(smi, 'c1ccccc1,c1ccccn1') > 0.8;
is equivalent to:
select smi from nci where MAX(tanomoto(smi, 'c1ccccc1'), tanomoto(smi, 'c1ccccn1')) > 0.8;
In general this isn't identical to:
select smi from nci where tanomoto(smi, 'c1ccccc1') > 0.8 or tanomoto(smi, 'c1ccccn1') > 0.8;
A query with a range of tanimoto values, or a "less-than" query doesn't have the same meaning in the two cases. Since the functional and index forms must have the same semantics, the index must necessarily perform the implicit MAX() function internally.
SQL> select smi, tanimoto(smi, 'c1ccccc1') t1,
tanimoto(smi, 'c1ccccn1') t2,
tanimoto(smi, 'c1ccccc1,c1ccccn1') t3 from nci
where tanimoto(smi, 'c1ccccc1,c1ccccn1') > 0.8
SMI T1 T2 T3
------------------------------ ---------- ---------- ----------
c1ccccc1 1 .269230769 1
c1ccncc1 .269230769 1 1
Oc1ccncc1 .222222222 .810344828 .810344828
Elapsed: 00:00:00.15
Turbo Similarity:
In order to perform a turbo similarity, we first need the set of N nearest neighbors from an initial query; this can then be fed into the new "Best similarity" search in a two-pass query.
For this we use the new (with 9i) ability to create user-defined aggregate functions in Oracle. Basically you can create your own aggregate fucnction, which can be called from SQL as output during "group by" operations.
select employee, max(bonus) from monthly_bonuses group by employee;
In this case we create an aggregate function called comma_concat() which simply concatenates together it's string arguments into a single comma-separated list. So,
SQL> select smi from nci where rownum < 5; SMI ---------------------------------------- CC1=CC(=O)C=CC1=O S(Sc1nc2ccccc2s1)c3nc4ccccc4s3 Oc1c(Cl)cc(cc1N(=O)=O)[N+](=O)[O-] N=c1[nH]cc(s1)N(=O)=O SQL> select comma_concat(smi) from nci where rownum < 5; comma_concat(smi) ------------------------------------------------------------ CC1=CC(=O)C=CC1=O,S(Sc1nc2ccccc2s1)c3nc4ccccc4s3, Oc1c(Cl)cc(cc1N(=O)=O)[N+](=O)[O-],N=c1[nH]cc(s1)N(=O)=OTaking this further, we can use the comma_concat() function in the processing of hits from a simliarity search to reformat the hits into a string suitable for input to the new best similarity routine.
SQL> select comma_concat(smi) from nci where
tanimoto(smi, 'NCCc1ccccc1', 10) > 0.0;
comma_concat(smi)
------------------------------------------------------------
Cc1ccc(CCN)cc1,CC(CN)c1ccccc1,CCc1ccc(CCN)cc1,CNCCc1ccccc1,
CC(C)c1ccc(CCN)cc1,NCCc1ccccc1,NCCc1ccccc1,Cc1cccc(CCN)c1,
Cc1cc(C)cc(CCN)c1,CC(C)(CN)c1ccccc1
SQL> select smi, tanimoto(smi, 'NCCc1ccccc1') from nci
where tanimoto(smi,
(select comma_concat(distinct smi) from nci where
tanimoto(smi, 'NCCc1ccccc1', 10) > 0.0) , 10) > 0.0
order by tanimoto(smi, 'NCCc1ccccc1') desc;
SMI TANIMOTO(SMI,'NCCC1CCCCC1')
---------------------------------------- ---------------------------
NCCc1ccccc1 1
CCc1ccc(CCN)cc1 .949152542
Cc1ccc(CCN)cc1 .949152542
CC(CN)c1ccccc1 .933333333
Cc1cccc(CCN)c1 .918032787
Cc1cc(C)cc(CCN)c1 .918032787
CNCCc1ccccc1 .903225806
CC(C)c1ccc(CCN)cc1 .903225806
CC(C)(CN)c1ccccc1 .903225806
CNCCc1ccccc1 .903225806
10 rows selected.
Elapsed: 00:00:00.38
Rank-based fusion for multiple input queries:
We can use the "rownum" feature of SQL to generate rank orders on sorted output (neglecting ties).
SQL> select smi, tanimoto(smi, 'NCCc1ccccc1'), rownum rank from
(select smi from nci where
tanimoto(smi, 'NCCc1ccccc1', 10) > 0.8
order by tanimoto(smi, 'NCCc1ccccc1') desc);
SMI TANIMOTO(SMI,'NCCC1CCCCC1') RANK
----------------------------------- --------------------------- ----------
NCCc1ccccc1 1 1
Cc1ccc(CCN)cc1 .949152542 2
CCc1ccc(CCN)cc1 .949152542 3
CC(CN)c1ccccc1 .933333333 4
Cc1cccc(CCN)c1 .918032787 5
Cc1cc(C)cc(CCN)c1 .918032787 6
CNCCc1ccccc1 .903225806 7
CC(C)c1ccc(CCN)cc .903225806 8
CC(C)(CN)c1ccccc1 .903225806 9
CNCCc1ccccc1 .903225806 10
10 rows selected.
Elapsed: 00:00:00.10
SQL> select smi, rownum rank from
(select smi from nci where
tanimoto(smi, 'NCCc1ccccc1', 10) > 0.8
order by tanimoto(smi, 'NCCc1ccccc1') desc)
union all
select smi, rownum rank from
(select smi from nci where
tanimoto(smi, 'NCCCc1ccccc1', 10) > 0.8
order by tanimoto(smi, 'NCCCc1ccccc1') desc)
SMI RANK
---------------------------------------- ----------
CC(C)(CN)c1ccccc1 9
CC(C)C(N)CCc1ccccc1 6
CC(C)c1ccc(CCN)cc1 8
CC(CN)c1ccccc1 4
CC(N)CCc1ccccc1 3
CC(N)CCc1ccccc1 4
CC(N)Cc1ccccc1 9
CC(N)Cc1ccccc1 10
CCc1ccc(CCN)cc1 3
CNCCCc1ccccc1 5
CNCCc1ccccc1 7
SMI RANK
---------------------------------------- ----------
CNCCc1ccccc1 10
Cc1cc(C)cc(CCN)c1 6
Cc1ccc(CCN)cc1 2
Cc1cccc(CCN)c1 5
NC1CC1c2ccccc2 7
NC1CC1c2ccccc2 8
NCCCc1ccccc1 1
NCCCc1ccccc1 2
NCCc1ccccc1 1
20 rows selected.
Elapsed: 00:00:00.17
The final queries, using 10 targets (either as ten individual targets or as the result of a first nearest-neighbors tanimoto search) are long (see the query 10-neighbor best rank turbo similarity here) but provide reasonable results with decent performance. Note also that Oracle does not provide a built-in median function. The example below uses a function named agg_median(), again, implemented using the user-defined aggregate functionality available starting with Oracle 9i.
SQL> select smi, min(rank), tanimoto(smi, 'NCCc1ccccc1')
from (-- UNION --)
group by smi
order by 2 asc, 3 desc;
SMI MIN(RANK) TANI
---------------------------------------- ---------- ----------
NCCc1ccccc1 1 1
CCc1ccc(CCN)cc1 1 .949152542
CC(CN)c1ccccc1 1 .933333333
Cc1cc(C)cc(CCN)c1 1 .918032787
CC(C)(CN)c1ccccc1 1 .903225806
CNCCc1ccccc1 1 .903225806
CC(C)c1ccc(CCN)cc1 1 .903225806
Cc1ccc(CCN)cc1 2 .949152542
Cc1cccc(CCN)c1 2 .918032787
CN(C)CCc1ccccc1 2 .888888889
C[N+](C)(C)CCc1ccccc1 3 .888888889
...
Elapsed: 00:00:02.75
SQL> select smi, agg_median(rank), tanimoto(smi, 'NCCc1ccccc1') tani
from (-- UNION ALL --)
group by smi
order by 2 asc, 3 desc;
SMI AGG_MEDIAN(RANK) TANI
---------------------------------------- ---------------- ----------
CC(C)c1ccc(CCN)cc1 3 .903225806
NCCc1ccccc1 4 1
CC(CN)c1ccccc1 8 .933333333
Cc1cc(C)cc(CCN)c1 8 .918032787
Cc1ccc(CCN)cc1C 8 .848484848
CCc1ccc(CCN)cc1 9 .949152542
Cc1cccc(CCN)c1 9 .918032787
Cc1ccc(CCN)cc1 10 .949152542
Cc1cc(C)c(CCN)c(C)c1 11 .875
Cc1cc(C)c(CCN)cc1C 11 .835820896
Cc1ccc(CCN)c(C)c1 12 .875
...
Elapsed: 00:00:02.85