Jack Delany
Introduction:
This whitepaper/presentation describes the new features and functionality planned for the 4.8 release of DayCart. There are numerous new administrative capabilities that make managing DayCart indexes simpler, several new chemical functions and capabilities, and an entirely new architecture for the SSS/similarity indextype.
Administrative Capabilities:
Renames and index and it's "backing tables".
Rebuilds the internal data structure for the index. Compacts the ddblob index, removing holes from deleted data. A index drop and recreation after lots of deletes (> 20% of base table) is no longer necessary; the rebuild is much faster. It is a no-op for ddexact, ddgraph, and ddrole indexes.
Each of the four DayCart indextypes creates tables and indexes for it's own use. The 4.7x DayCart didn't allow users to control the creation of those tables and indexes. Version 4.8 will.
Indextype | Schema Objects Created |
---|---|
ddexact | table, index |
ddgraph | table, index |
ddrole | table, index |
ddblob | table, 2 LOB columns |
Options are passed as name=value pairs in the "parameters" clause of the index creation. For example:
create index example_indx on mytable(smiles)
indextype is c$dcischem.ddexact
parameters ('table_tablespace=user_ts ' ||
'index_tablespace=index_ts ' ||
'table_storage="initial 20000000"');
There are three classes of options:
table_tablespace, table_logging, table_cache, table_initrans, table_pctfree, table_pctused, table_maxtrans, table_storage
index_tablespace, index_logging, index_initrans, index_pctfree, index_maxtrans, index_storage
lob_tablespace, lob_cache, lob_logging, lob_storage
Applies only to the ddblob indextype.
The value is the column from which the index code can take fingerprints for the initial index creation.
Table | Column | Parameters | Creation time (mm:ss) |
---|---|---|---|
NCI | SMILES | none | 6:59 |
NCI | FP | none | 0:40 |
NCI | SMILES | initfpcolumn=fp | 0:49 |
Spresipreps | SMILES | none | 75:05 |
Spresipreps | FP | none | 8:26 |
Spresipreps | SMILES | initfpcolumn=fp | 11:27 |
1x800Mhz PIII, Linux RH6.2, Oracle 8.1.6 SE, DayCart 4.8_020218
NCI (126705 SMILES), Spresipreps (1097027 SMILES).
IMPORTANT CAVEATS: The column is used as a convenience for index creation; no ongoing relationship between the columns is established.
The fingerprint column must contain regular structural fingerprints for the SMILES in the indexed column. Otherwise searching will be broken using that index.
If fingerprint values are null or the wrong size they are calculated like normal. After index creation, all fingerprints for inserts are calculated normally.
Returns information about DayCart.
extproc_pid - Process id for extproc process toolkit_version - Daylight toolkit runtime version (replaces fversion()) daycart_version - DayCart runtime version (replaces fversion())
New Functionality:
returns:
select smi2cansmi('CN(C)C.CI>>C[N+](C)(C)C.[I-]', 0),
partnorm('CN(C)C.CI>>C[N+](C)(C)C.[I-]',
'59.1,141.9,74.1,126.9', 1, 0) from dual;
CI.CN(C)C>>[I-].C[N+](C)(C)C
141.9,59.1,126.9,74.1
component('C1CCOC1.c1ccccc1', 'C1CCOC1') => TRUE
component('CI.CN(C)C>>[I-].C[N+](C)(C)C', 'CI') => TRUE
usmiles('CC[C@H](Cl)C', 'CCC(Cl)C') => TRUE
asmiles('CC[C@H](Cl)C', 'CC[C@@H](C)Cl') => TRUE
This allows the user more flexibility in designing database schema and application implementation. One choice is that the client can take responsibility for the normalization and use the slightly faster exact() query and ddexact indextype. A second choice, available in 4.8, is to not use the ddexact indextype at all but use the usmiles() and asmiles() operators with the ddgraph indextype. There is a slight searchtime performance penalty, but one less index is required.
isin('c1ccccc1', 'NCCc1cccc1') => TRUE
The above example is identical in behavior to:
contains('NCCc1ccccc1', 'c1ccccc1')
The isin() function is necessary in DayCart solely for fast searching. Because of the way that Oracle dispatches searches to external indexes (based on the column supplied as the first argument to an operator), isin() is needed.
In 4.8, every DayCart function and operator which takes or returns a string (VARCHAR2 in the 4.7x cartridge) will be able to take or return a CLOB.
Note that our actual implementation is different that what I've described at previous meetings. This is for compatability with Oracle's 9i direction with respect to CLOB support.
fsmi2amw(smi IN VARCHAR2) => NUMBER
fsmi2amw(smi IN CLOB) => NUMBER
fcontains(s1 IN VARCHAR2, s2 IN VARCHAR2) => NUMBER
fcontains(s1 IN VARCHAR2, s2 IN CLOB) => NUMBER
fcontains(s1 IN CLOB, s2 IN VARCHAR2) => NUMBER
fcontains(s1 IN CLOB, s2 IN CLOB) => NUMBER
All combinations of input string arguments are supported.
fsmi2cansmi(smi IN VARCHAR2, type IN NUMBER) => VARCHAR2
fsmi2cansmi(smi IN CLOB, type IN NUMBER) => CLOB
fatomnorm( smi IN VARCHAR2, list IN VARCHAR2, ...) => VARCHAR2
fatomnorm( smi IN CLOB, list IN VARCHAR2, ...) => VARCHAR2
fatomnorm( smi IN VARCHAR2, list IN CLOB, ...) => CLOB
fatomnorm( smi IN CLOB, list IN CLOB, ...) => CLOB
The returned type is determined by the type of the "key" input argument.
The Indextypes operate correctly on VARCHAR2 and CLOB columns and support all combinations of input string (query) arguments.
The cool thing about this is that it's handled transparently by Oracle. Operator overloading takes care of calling the "right" function, the API that you see as a user is unchanged.
SQL> desc test
Name Null? Type
------------- -------- -------------
SMI NOT NULL VARCHAR2(428)
NSC NOT NULL NUMBER(6)
CAS_RN NOT NULL VARCHAR2(11)
SQL> desc testc
Name Null? Type
------------- -------- -------------
SMI NOT NULL CLOB
NSC NOT NULL NUMBER(6)
CAS_RN NOT NULL VARCHAR2(11)
SQL> select smi, smi2amw(smi) from test;
SMI SMI2AMW(SMI)
---------------------------------------- ------------
CC1=CC(=O)C=CC1=O 122.12134
S(Sc1nc2ccccc2s1)c3nc4ccccc4s3 332.4908
Nc1ccc2C(=O)c3ccccc3C(=O)c2c1 223.2268
Cc1ccc2C(=O)c3ccccc3C(=O)c2c1N(=O)=O 267.2363
...
SQL> select smi, smi2amw(smi) from testc;
SMI SMI2AMW(SMI)
---------------------------------------- ------------
CC1=CC(=O)C=CC1=O 122.12134
S(Sc1nc2ccccc2s1)c3nc4ccccc4s3 332.4908
Nc1ccc2C(=O)c3ccccc3C(=O)c2c1 223.2268
Cc1ccc2C(=O)c3ccccc3C(=O)c2c1N(=O)=O 267.2363
...
SQL> select count(1) from test where contains(smi, 'NCCc1ccccc1') = 1
COUNT(1)
----------
19
SQL> select count(1) from testc where contains(smi, 'NCCc1ccccc1') = 1
COUNT(1)
----------
19
Now that CLOB can be used to store SMILES and fingerprints, much longer strings can be handled. Theoretical size limits are up to 2^29 bits (536,870,912 bits) for fingerprints and 2 GBytes for SMILES.
The 4.8 cartridge has a new data organization for the data in the DDBLOB indextype. Operationally this is transparent to the users, however the net result is that for most of the searches we can do more prescreening of the database. Performance improvements of 2 - 10X are seen in some searches.
Tanimoto(), tversky(), euclid() and "nearest" searches on all the queries benefit most, contains(), matches() and isin() are impacted less. No searches are slower in 4.8 (so far).
Index creation is somewhat slower; the time spent doing database I/O is somewhat higher (~20%).
The basic idea is that instead of a single data structure containing all the SMILES and fingerprints for searching, the index is organized into multiple data structures based on the fingerprints. Depending on the query, we can eliminate from consideration large percentages of the index data without ever having to look at it.
contains(column, query) = 1
contains(column, query) = 0
tanimoto(column, query) > {0.0 - 1.0}
tanimoto(column, query) < {0.0 - 1.0}
tanimoto(column, query) >= {0.0 - 1.0}
tanimoto(column, query) <= {0.0 - 1.0}
tanimoto(column, query) = {0.0 - 1.0}
Also, as per the above examples, isin(), fingertest(), matches(), tverksy(), and euclid() behave properly.
The basic concept; find hits which match a criteria and then select based on their similarity to the target.
The early cartridge versions has a nearest() function, but that didn't go into the 4.7 release. The idea has resurfaced in a different form.
Using the "operator overloading", we're adding the nearest search to six of the seven search operators available in the ddblob indextype. The functions optionally take an additional numeric argument which is the maximum number of hits to return.
select ... where tanimoto(column, query, 10) > 0.5;
select ... where tversky(column, query, 0.8, 0.8, 20) > 0.8;
select ... where euclid(column, query, 100) < 0.2;
Conceptually, both the search criteria and the pruning of the hitlist based on number of nearest hits occur.
The nearest idea has a directionality; for euclid() lower numbers are better, for tanimoto() and tversky() higher numbers are better.
select ... where contains(column, query, 10) = 1
select ... where contains(column, query, 10) = 0
select ... where isin(column, query, 10) = 1
select ... where fingertest(column, query, 10) = 1
Contains(), isin(), and fingertest() use the tanimoto metric as the criteria for the nearest sort.
Table name | Query | Hits | Red | Xmas | ||
---|---|---|---|---|---|---|
4.73 | 4.80 | 4.73 | 4.80 | |||
nci | contains(smi, 'OC(=O)CS') = 1 | 507 | 1.48 | 1.26 | 0.82 | 0.74 |
nci | matches(smi, '[OH]C(=O)CS') = 1 | 279 | 1.39 | 1.33 | 0.86 | 0.78 |
nci | fingertest(smi, 'OC(=O)CS') = 1 | 1237 | 0.54 | 0.45 | 0.42 | 0.43 |
nci | tanimoto(smi, 'OC(=O)CS') > 0.8 | 7 | 0.46 | 0.03 | 0.49 | 0.05 |
nci | tversky(smi, 'c1ccccc1', 0.5, 0.5) > 0.75 | 44 | 0.50 | 0.12 | 0.54 | 0.13 |
savant_smi | contains(smi, 'O1C(=O)CCS1') = 1 | 2 | 4.0 | 2.9 | nd | 5.9 |
savant_smi | matches(smi, '[OH]C(=O)CS') = 1 | 1575 | 14.0 | 18.6 | nd | 17.0 |
savant_smi | tanimoto(smi, 'OC(=O)CS') > 0.8 | 7 | 5.0 | 0.16 | nd | 0.30 |
savant_smi | tanimoto(fp, 'OC(=O)CCS') > 0.8 | 24 | 4.0 | 0.14 | nd | nd |
savant_smi | fingertest(fp, 'OC(=O)CCS') = 1 | 18620 | 4.0 | 3.5 | nd | 6.8 |
nci | contains(smi, 'OC(=O)CS', 20) = 1 | 20 | n/a | 0.08 | n/a | 0.11 |
nci | contains(smi, 'OC(=O)CS', 20) = 0 | 20 | n/a | 0.09 | n/a | 0.13 |
nci | tanimoto(smi, 'OC(=O)CS', 20) > 0.5 | 20 | n/a | 0.07 | n/a | 0.11 |
savant_smi | contains(smi, 'OC(=O)CS', 20) = 1 | 20 | n/a | 0.22 | n/a | 0.36 |
savant_smi | contains(smi, 'OC(=O)CS', 20) = 0 | 20 | n/a | 0.40 | n/a | 0.78 |
savant_smi | contains(smi, 'NCCc1ccccc1', 20) = 1 | 20 | n/a | 0.23 | n/a | 0.30 |
savant_smi | tanimoto(smi, 'NCCc1ccccc1') > 0.9 | 19 | n/a | 0.20 | n/a | 0.45 |
savant_smi | tanimoto(smi, 'OC(=O)CS', 20) > 0.5 | 20 | n/a | 0.23 | n/a | 0.50 |
chopsuey | contains(smi, 'NCCc1ccccc1', 100) = 1 | 100 | n/a | 2.00 | n/a | nd |
chopsuey | tanimoto(smi, 'NCCc1ccccc1', 100) > 0.8 | 100 | n/a | 4.75 | n/a | nd |
4.8 Upgrade/Compatibility: