MUG'02: DayCart - What's New

DayCart - What's New

Jack Delany

DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA

Introduction:

This whitepaper/presentation describes the new features and functionality planned for the 4.8 release of DayCart. There are numerous new administrative capabilities that make managing DayCart indexes simpler, several new chemical functions and capabilities, and an entirely new architecture for the SSS/similarity indextype.

Administrative Capabilities:

Alter index ... rename ...
Renames and index and it's "backing tables".
Alter index ... rebuild
Rebuilds the internal data structure for the index. Compacts the ddblob index, removing holes from deleted data. A index drop and recreation after lots of deletes (> 20% of base table) is no longer necessary; the rebuild is much faster. It is a no-op for ddexact, ddgraph, and ddrole indexes.
Toolkit error queue now dumped to the log for debugging.
Import and Export of tables with DayCart indexes.
One outstanding issue; parameters to the index in the export file.
Physical Attributes and Storage Parameters on index creation.
Each of the four DayCart indextypes creates tables and indexes for it's own use. The 4.7x DayCart didn't allow users to control the creation of those tables and indexes. Version 4.8 will.

Indextype Schema Objects Created
ddexact table, index
ddgraph table, index
ddrole table, index
ddblob table, 2 LOB columns

Options are passed as name=value pairs in the "parameters" clause of the index creation. For example:
```
 create index example_indx on mytable(smiles) 
   indextype is c$dcischem.ddexact
   parameters ('table_tablespace=user_ts ' ||
               'index_tablespace=index_ts ' ||
               'table_storage="initial 20000000"');
```
There are three classes of options:
- table_### - passed to the table creation.
  table_tablespace, table_logging, table_cache, table_initrans, table_pctfree, table_pctused, table_maxtrans, table_storage
- index_### - passed to the index creation
  index_tablespace, index_logging, index_initrans, index_pctfree, index_maxtrans, index_storage
- lob_### - passed to the table creation as lob parameters
  lob_tablespace, lob_cache, lob_logging, lob_storage

Indextype	Schema Objects Created
ddexact	table, index
ddgraph	table, index
ddrole	table, index
ddblob	table, 2 LOB columns

Index Creation Option: fpinitcolumn=<column>

Applies only to the ddblob indextype.

The value is the column from which the index code can take fingerprints for the initial index creation.

Table	Column	Parameters	Creation time (mm:ss)
NCI	SMILES	none	6:59
NCI	FP	none	0:40
NCI	SMILES	initfpcolumn=fp	0:49
Spresipreps	SMILES	none	75:05
Spresipreps	FP	none	8:26
Spresipreps	SMILES	initfpcolumn=fp	11:27

1x800Mhz PIII, Linux RH6.2, Oracle 8.1.6 SE, DayCart 4.8_020218
NCI (126705 SMILES), Spresipreps (1097027 SMILES).

IMPORTANT CAVEATS: The column is used as a convenience for index creation; no ongoing relationship between the columns is established.

The fingerprint column must contain regular structural fingerprints for the SMILES in the indexed column. Otherwise searching will be broken using that index.

If fingerprint values are null or the wrong size they are calculated like normal. After index creation, all fingerprints for inserts are calculated normally.

New Functionality:

New normalization function & operator - partnorm(). An n-tuple part reordering function for normalization of part data. Works for molecules, mixtures, and reactions. For example:

select smi2cansmi('CN(C)C.CI>>C[N+](C)(C)C.[I-]', 0),
  partnorm('CN(C)C.CI>>C[N+](C)(C)C.[I-]',
           '59.1,141.9,74.1,126.9', 1, 0) from dual;

returns:

CI.CN(C)C>>[I-].C[N+](C)(C)C
141.9,59.1,126.9,74.1

New comparison function & index/search operator - component(). This is the 'part' query. It uses the ddrole index and works for molecules, mixtures, and reactions. For example:
component('C1CCOC1.c1ccccc1, 'C1CCOC1') => TRUE component('CI.CN(C)C>>[I-].C[N+](C)(C)C', 'CI') => TRUE
New comparison function & index/search operator - isin(). This is the equivalent of the SMILES substructure search. That is, find all molecules in the database which can be embedded in the query, assuming opened valences (Hydrogens removed).
isin('c1ccccc1', 'NCCc1cccc1') => TRUE
The above example is identical in behavior to:
contains('NCCc1ccccc1', 'c1ccccc1')
The isin() function is necessary in DayCart solely for fast searching. Because of the way that Oracle dispatches searches to external indexes (based on the column supplied as the first argument to an operator), isin() is needed.

CLOB Support - Long Strings in the cartridge.

In 4.8, every DayCart function and operator which takes or returns a string (VARCHAR2 in the 4.7x cartridge) will be able to take or return a CLOB.

Note that our actual implementation is different that what I've described at previous meetings. This is for compatability with Oracle's 9i direction with respect to CLOB support.

Functions which return a NUMBER:


fsmi2amw(smi IN VARCHAR2) => NUMBER
fsmi2amw(smi IN CLOB) => NUMBER

fcontains(s1 IN VARCHAR2, s2 IN VARCHAR2) => NUMBER
fcontains(s1 IN VARCHAR2, s2 IN CLOB) => NUMBER
fcontains(s1 IN CLOB, s2 IN VARCHAR2) => NUMBER
fcontains(s1 IN CLOB, s2 IN CLOB) => NUMBER

All combinations of input string arguments are supported.

Functions which return a STRING:


fsmi2cansmi(smi IN VARCHAR2, type IN NUMBER) => VARCHAR2
fsmi2cansmi(smi IN CLOB, type IN NUMBER) => CLOB

fatomnorm( smi IN VARCHAR2, list IN VARCHAR2, ...) => VARCHAR2
fatomnorm( smi IN CLOB, list IN VARCHAR2, ...) => VARCHAR2
fatomnorm( smi IN VARCHAR2, list IN CLOB, ...) => CLOB
fatomnorm( smi IN CLOB, list IN CLOB, ...) => CLOB

The returned type is determined by the type of the "key" input argument.

DayCart Indextypes.
The Indextypes operate correctly on VARCHAR2 and CLOB columns and support all combinations of input string (query) arguments.

The cool thing about this is that it's handled transparently by Oracle. Operator overloading takes care of calling the "right" function, the API that you see as a user is unchanged.


SQL> desc test
 Name          Null?    Type
 ------------- -------- -------------
 SMI           NOT NULL VARCHAR2(428)
 NSC           NOT NULL NUMBER(6)
 CAS_RN        NOT NULL VARCHAR2(11)

SQL> desc testc
 Name          Null?    Type
 ------------- -------- -------------
 SMI           NOT NULL CLOB
 NSC           NOT NULL NUMBER(6)
 CAS_RN        NOT NULL VARCHAR2(11)

SQL> select smi, smi2amw(smi) from test;
SMI                                      SMI2AMW(SMI)
---------------------------------------- ------------
CC1=CC(=O)C=CC1=O                           122.12134
S(Sc1nc2ccccc2s1)c3nc4ccccc4s3               332.4908
Nc1ccc2C(=O)c3ccccc3C(=O)c2c1                223.2268
Cc1ccc2C(=O)c3ccccc3C(=O)c2c1N(=O)=O         267.2363
...

SQL> select smi, smi2amw(smi) from testc;
SMI                                      SMI2AMW(SMI)
---------------------------------------- ------------
CC1=CC(=O)C=CC1=O                           122.12134
S(Sc1nc2ccccc2s1)c3nc4ccccc4s3               332.4908
Nc1ccc2C(=O)c3ccccc3C(=O)c2c1                223.2268
Cc1ccc2C(=O)c3ccccc3C(=O)c2c1N(=O)=O         267.2363
...

SQL> select count(1) from test where contains(smi, 'NCCc1ccccc1') = 1

  COUNT(1)
----------
        19

SQL> select count(1) from testc where contains(smi, 'NCCc1ccccc1') = 1

  COUNT(1)
----------
        19

New architecture for ddblob indextype.
The 4.8 cartridge will have an new data organization for the data in the DDBLOB indextype. Operationally this is transparent to the users, however the net result is that for most of the searches we can do more prescreening of the database. Performance improvements of 2 - 10X are seen in some searches.
Tanimoto(), tversky(), euclid() and "nearest" searches on all the queries benefit most, contains(), matches() and isin() are impacted less. No searches are slower in 4.8 (so far).
Index creation is somewhat slower; the time spent doing database I/O is somewhat higher.
The basic idea is that instead of a single data structure containing all the SMILES and fingerprints for searching, the index is organized into multiple data structures based on the fingerprints. Depending on the query, we can eliminate from consideration large percentages of the index data without ever having to look at it.

Better Predicate Handling. Comparison predicates in the indexes are handled more robustly. Screening is used for all cases.


contains(column, query) = 1
contains(column, query) = 0
tanimoto(column, query) > {0.0 - 1.0}
tanimoto(column, query) < {0.0 - 1.0}
tanimoto(column, query) >= {0.0 - 1.0}
tanimoto(column, query) <= {0.0 - 1.0}
tanimoto(column, query) = {0.0 - 1.0}

Also, as per the above examples, isin(), fingertest(), matches(), tverksy(), and euclid() behave properly.

The "Nearest" search.
The basic concept; find hits which match a criteria and then select based on their similarity to the target.
The early cartridge versions has a nearest() function, but that didn't go into the 4.7 release. The idea has resurfaced in a different form.
Using the "operator overloading", we're adding the nearest search to six of the seven search operators available in the ddblob indextype. The functions optionally take an additional numeric argument which is the maximum number of hits to return.
```
select ... where tanimoto(column, query, 10) > 0.5;
select ... where tversky(column, query, 0.8, 0.8, 20) > 0.8;
select ... where euclid(column, query, 100) < 0.2;
```
Conceptually, both the search criteria and the pruning of the hitlist based on number of nearest hits occur.
The nearest idea has a directionality; for euclid() lower numbers are better, for tanimoto() and tversky() higher numbers are better.
```
select ... where contains(column, query, 10) = 1
select ... where contains(column, query, 10) = 0
select ... where isin(column, query, 10) = 1
select ... where fingertest(column, query, 10) = 1
```
Contains(), isin(), and fingertest() use the tanimoto metric as the criteria for the nearest sort.

Table name	Query	Hits	Red		Xmas
			4.73	4.80	4.73	4.80
nci	contains(smi, 'OC(=O)CS') = 1	507	1.48	1.26	0.82	0.74
nci	matches(smi, '[OH]C(=O)CS') = 1	279	1.39	1.33	0.86	0.78
nci	fingertest(smi, 'OC(=O)CS') = 1	1237	0.54	0.45	0.42	0.43
nci	tanimoto(smi, 'OC(=O)CS') > 0.8	7	0.46	0.03	0.49	0.05
nci	tversky(smi, 'c1ccccc1', 0.5, 0.5) > 0.75	44	0.50	0.12	0.54	0.13
savant_smi	contains(smi, 'O1C(=O)CCS1') = 1	2	4.0	2.9	nd	5.9
savant_smi	matches(smi, '[OH]C(=O)CS') = 1	1575	14.0	18.6	nd	17.0
savant_smi	tanimoto(smi, 'OC(=O)CS') > 0.8	7	5.0	0.16	nd	0.30
savant_smi	tanimoto(fp, 'OC(=O)CCS') > 0.8	24	4.0	0.14	nd	nd
savant_smi	fingertest(fp, 'OC(=O)CCS') = 1	18620	4.0	3.5	nd	6.8
nci	contains(smi, 'OC(=O)CS', 20) = 1	20	n/a	0.08	n/a	0.11
nci	contains(smi, 'OC(=O)CS', 20) = 0	20	n/a	0.09	n/a	0.13
nci	tanimoto(smi, 'OC(=O)CS', 20) > 0.5	20	n/a	0.07	n/a	0.11
savant_smi	contains(smi, 'OC(=O)CS', 20) = 1	20	n/a	0.22	n/a	0.36
savant_smi	contains(smi, 'OC(=O)CS', 20) = 0	20	n/a	0.40	n/a	0.78
savant_smi	contains(smi, 'NCCc1ccccc1', 20) = 1	20	n/a	0.23	n/a	0.30
savant_smi	tanimoto(smi, 'NCCc1ccccc1') > 0.9	19	n/a	0.20	n/a	0.45
savant_smi	tanimoto(smi, 'OC(=O)CS', 20) > 0.5	20	n/a	0.23	n/a	0.50
chopsuey	contains(smi, 'NCCc1ccccc1', 100) = 1	100	n/a	2.00	n/a	nd
chopsuey	tanimoto(smi, 'NCCc1ccccc1', 100) > 0.8	100	n/a	4.75	n/a	nd

(nd - no data, n/a - search not available in 4.7x)
RED: 2x360Mhz UltraSparcII, Solaris 7, Oracle 8.1.5, DayCart 4.8_020218
XMAS: 1x800Mhz PIII, Linux RH6.2, Oracle 8.1.6 SE, DayCart 4.8_020218
NCI (126705 SMILES), Spresipreps (1,097,027 SMILES), chopsuey (21,860,000 SMILES)

4.8 Upgrade/Compatibility:

Versions: 8.1.7, 9.0.1+, Standard edition, Enterprise edition.
Platforms: Solaris 7 & 8, Linux RH 7.x, IRIX6.5x for 8.1.7.
Since the 4.8 data structures are different, a full DayCart upgrade will be required. The basic procedure will be:
1. Drop all 4.7 DayCart indexes,
2. Drop 4.7 DayCart (clean.sql),
3. Create 4.8 DayCart (create.sql),
4. Recreate 4.8 indexes.

All 4.7x DayCart entrypoints remain unchanged except the progob() function which supports CLOB. Packages which refer to DayCart functions will not require recreation.