MUG'01: The Daylight Chemistry Cartridge

The Daylight Chemistry Cartridge

Jack Delany

DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA

Introduction:

This whitepaper/presentation will describe Oracle Cartridge technology, the current Daylight/Oracle cartridge, and our current development and future plans for the cartridge.

Oracle Cartridge Technology:

An Oracle Data Cartridge is a bundled set of tools which extends Oracle clients and servers with new capabilities. Typically, a single data cartridge deals with a particular 'information domain'. Some examples of current data cartridges include: image processing, spatial processing, audio processing.

A cartridge consists of a number of different components which work together to provide complete database capabilities within the information domain. The components include:

User-defined Object types,
Packages - PL/SQL utilities/libraries,
Interfaces to store, retrieve, search, and index the new types.

A cartridge extends the server. The new capabilites are tightly integrated with the database engine. The Cartridge interface specification provides interfaces to the query parser, optimizer, indexing engine, etc. Each of these subsystems of the server learns about the cartridge capabilities through these interfaces. The cartridge can use SQL, PL/SQL, Java, C, etc. to implement the functions.

From Oracle's point of view, the cartridge idea allows third-party organizations to expand the capabilities of the Oracle database server in a modular, supportable fashion.

Design Goals:

The key design goals for the cartridge are:

Provide high performance chemical functionality within an RDBMS framework,
Deliver this functionality in a robust, flexible, supportable, expandable fashion,
Deliver the chemical functionality in a seamless, integrated fashion, from the perspective of the RDBMS user.

The main point here is that we recognize that in addition to the differences in the database itself (the servers, databases, tables, indexes), the members of the Oracle user community (the developers, database administrators, users) have very different perspectives than those which we are used to dealing with.

In order for the product to be successful, the Daylight cartridge must behave in a predictable, understandable way. The paradigms we use within our cartridge must match those which Oracle developers and DBAs already use and understand.

So, what is the Daylight Cartridge? First, what it isn't. It isn't a full-functioned toolkit interface / environment. It is a set of chemical utilities (normalizations, data conversions, comparison functions) which provide the missing pieces required to manage chemical data in an Oracle database.

Architecture:

The Daylight toolkit interfaces with the Oracle server via callouts to the "extproc" utility. This utility provides a RPC-like mechanism for performing C-language function calls. Daylight toolkit code is wrapped inside this RPC layer for each of the defined cartridge functions.

The main concern about this architecture is the efficiency of the RPC mechanism, and the potential that extproc will be a performance bottleneck. Simple tests indicate that this isn't a problem.

The three important communication bandwidths between the Oracle Server and Extproc are as follows:

RPC function execution: ~ 5000 - 10000 functions / sec
OCI data retrieval, tabular: ~100,000 rows / sec
OCI data retrieval, raw: 30 - 60 MB / sec

Each of these data throughput and round-trip limits represents a design constraint which must be considered in the overall cartridge design. The design described in this whitepaper is what we consider to be the optimum tradeoff between resource efficiency, search performance, and the ability to maintain the full transaction concurrency model within Oracle for all cartridge data.

Cartridge Specification:

As indicated earlier, an Oracle Cartridge typically consists of three sets of functionalities: object type definitions, packaged functions, and indexing / data access tools. In turn, each of these three areas will be discussed with respect to the Daylight cartridge implementation.

Object Types:

The daylight cartridge does not define any new Oracle object types!!!

To review, within the Daylight Toolkit, each object has an internal object type and an external representation. The external representation is used to store objects in files, Thor databases, to communicate objects between processes, etc.

Daylight Object External Language

molecule SMILES

reaction SMILES

pattern SMARTS

transform SMIRKS

fingerprint Encoded ASCII

depiction 2D coordinate list

conformation 3D coordinate list

Note that our external languages are all very expressive and well-behaved. That is, they are compact, with high information density. They are all printable ASCII strings. They all have well defined syntax and semantics.

Daylight Object	External Language
molecule	SMILES
reaction	SMILES
pattern	SMARTS
transform	SMIRKS
fingerprint	Encoded ASCII
depiction	2D coordinate list
conformation	3D coordinate list

So, our strategy with respect to the Oracle Cartridge is the following:

Oracle clients and servers manipulate Daylight objects exclusively via their external representations.

When needed, the cartridge instantiates internal Daylight objects within the cartridge to perform a specific task (eg. calculate a molecular weight from a SMILES). The interfaces between Oracle and Daylight always pass objects as their external string representations.

Implications of this "Object Model":

Since our objects are represented as strings within the RDBMS, any Oracle, Informix, Sybase, JDBC, ODBC, CORBA, etc., etc., etc., client, server, middleware, application layer, etc., etc., etc. can handle these "objects" in their external form.

Only the endpoints of communication must understand what the objects mean: the Daylight Oracle cartridge provides the server-side endpoint, and a front end user interface provides the client-side endpoint. Otherwise, the objects effectively "tunnel" through the middle layers.

This does not preclude the design of an Oracle-specific object layer on top of the cartridge system (eg. ODBC); we simply don't require one, and don't dictate which model, if any, you use.

PL/SQL Functions/Operators

General:

progob()
hostid()
version()
testlicense()
setdebug()

Molecule/Reaction:

smi2cansmi()
smi2graph()
smi2netch()
smi2hcount()
smi2mf()
smi2amw()
atomnorm()
bondnorm()

Fingerprint:

smi2fp()
smi2xfp()
foldfp()
bitcount()
nbits()

Comparison Operators (optional indexes apply)

exact()
graph()
tautomer()
reactant()
agent()
product()
contains()
matches()
tanimoto()
tversky()
euclid()
fingertest()

For a detailed discussion of the PL/SQL functions, SQL operators, and extensible indexes for the cartridge, see the official cartridge documentation.

Timing Tests:

The following table lists some representative search times for the blob-based indexes. The databases are:

nci - 126,705 SMILES, from NCI,
rxn - 104,200 reaction SMILES, from Chemsynth97.
savant_smi - 1,097,027 SMILES, from spresi95preps, (Chop Suey),

The tests are on three different machines: Red is a Sun Ultra 60 (2x360MHz), with 768 MB of real memory. Xmas is a Dell laptop (700MHz PentiumIII) with 512 MB memory, and green is a Origin 200 (2x270MHz R12000) with 2GB memory.

In all cases the Oracle installation is a vanilla 8.1.5 install, with the only changes being the increase of db_block_buffers and log_buffers parameters from default values in init.ora. All database and program files reside on a single disk partition (no RAID). All transactions use full rollback (size of rollback segments was increased from default values), but not archivelog.

Index Creation Times ( [hh:]mm:ss )

Index type Column red xmas green
nci exact 2:10 3:16 2:36
nci graph 7:10 8:21 10:20
nci blob 12:51* 9:37* 13:17*
rxn exact 1:44 2:46 2:12
rxn role 14:21 34:51 25:05
rxn blob(smi) 20:34* 14:41* 21:57*
rxn blob(fp) 00:47 01:47 01:03
savant_smi exact 22:01 32:30 22:54
savant_smi graph 1:12:59 1:25:08 1:33:38
savant_smi blob(smi) 2:07:00* 1:39:55* 2:15:58*
savant_smi blob(fp) 5:26 12:18 7:07
(* - Includes fingerprint generation time)

Index type	Column	red	xmas	green
nci	exact	2:10	3:16	2:36
nci	graph	7:10	8:21	10:20
nci	blob	12:51*	9:37*	13:17*
rxn	exact	1:44	2:46	2:12
rxn	role	14:21	34:51	25:05
rxn	blob(smi)	20:34*	14:41*	21:57*
rxn	blob(fp)	00:47	01:47	01:03
savant_smi	exact	22:01	32:30	22:54
savant_smi	graph	1:12:59	1:25:08	1:33:38
savant_smi	blob(smi)	2:07:00*	1:39:55*	2:15:58*
savant_smi	blob(fp)	5:26	12:18	7:07

Query Performance (seconds elapsed):

Table name Query Hits Red Xmas Green
nci exact(smi, 'c1ccccc1') = 1 1 0.07 0.12 0.12
nci graph(smi, 'c1ccccc1') = 1 3 0.17 0.17 0.21
nci contains(smi, 'OC(=O)C1') = 1 0* 0.12 0.16 0.21
nci contains(smi, 'OC(=O)CS') = 1 507 1.48 1.83 2.03
nci matches(smi, '[OH]C(=O)CS') = 1 279 1.39 1.81 1.90
nci fingertest(smi, 'OC(=O)CS') = 1 1237 0.54 1.10 0.76
nci tanimoto(smi, 'OC(=O)CS') > 0.8 7 0.81 1.15 0.84
nci tversky(smi, 'OC(=O)CS', 0.5, 0.5) > 0.8 56 0.80 1.26 0.87
rxn reactant(smiles, 'Sc1ccccc1') = 1 282 1.45 4.30 0.83
rxn product(smiles, 'Oc1ccccc1') = 1 13 0.15 1.51 0.14
rxn contains(smiles, 'OC(=O)CS') = 1 1136 9.2 6.8 11.4
rxn contains(smiles, '>>OC(=O)CS') = 1 739 4.3 3.9 5.6
rxn contains(smiles, 'OC(=O)CS>>') = 1 782 8.2 6.6 10.5
rxn matches(smiles, '>>[OH]C(=O)CS') = 1 117 3.8 3.6 5.1
rxn tanimoto(smiles, 'OC(=O)CCl>>OC(=O)CS') > 0.5 68 0.60 1.24 0.92
savant_smi exact(smi, 'Oc1ccccc1') = 1 15 0.10 0.38 0.09
savant_smi graph(smi, 'Oc1ccccc1') = 1 51 0.16 0.61 0.15
savant_smi contains(smi, 'O1C(=O)CCS1') = 1 2 4.0 8.9 8.9
savant_smi matches(smi, '[OH]C(=O)CS') = 1 1575 14.0 19.7 28.3
savant_smi tanimoto(smi, 'OC(=O)CS') > 0.8 7 5.0 9.6 9.6
savant_smi tanimoto(fp, 'OC(=O)CCS') > 0.8 24 4.0 4.3 4.4
savant_smi fingertest(fp, 'OC(=O)CCS') = 1 18620 4.0 4.4 4.9
(* - invalid query, ** - Disk I/O observed)

Table name	Query	Hits	Red	Xmas	Green
nci	exact(smi, 'c1ccccc1') = 1	1	0.07	0.12	0.12
nci	graph(smi, 'c1ccccc1') = 1	3	0.17	0.17	0.21
nci	contains(smi, 'OC(=O)C1') = 1	0*	0.12	0.16	0.21
nci	contains(smi, 'OC(=O)CS') = 1	507	1.48	1.83	2.03
nci	matches(smi, '[OH]C(=O)CS') = 1	279	1.39	1.81	1.90
nci	fingertest(smi, 'OC(=O)CS') = 1	1237	0.54	1.10	0.76
nci	tanimoto(smi, 'OC(=O)CS') > 0.8	7	0.81	1.15	0.84
nci	tversky(smi, 'OC(=O)CS', 0.5, 0.5) > 0.8	56	0.80	1.26	0.87
rxn	reactant(smiles, 'Sc1ccccc1') = 1	282	1.45	4.30	0.83
rxn	product(smiles, 'Oc1ccccc1') = 1	13	0.15	1.51	0.14
rxn	contains(smiles, 'OC(=O)CS') = 1	1136	9.2	6.8	11.4
rxn	contains(smiles, '>>OC(=O)CS') = 1	739	4.3	3.9	5.6
rxn	contains(smiles, 'OC(=O)CS>>') = 1	782	8.2	6.6	10.5
rxn	matches(smiles, '>>[OH]C(=O)CS') = 1	117	3.8	3.6	5.1
rxn	tanimoto(smiles, 'OC(=O)CCl>>OC(=O)CS') > 0.5	68	0.60	1.24	0.92
savant_smi	exact(smi, 'Oc1ccccc1') = 1	15	0.10	0.38	0.09
savant_smi	graph(smi, 'Oc1ccccc1') = 1	51	0.16	0.61	0.15
savant_smi	contains(smi, 'O1C(=O)CCS1') = 1	2	4.0	8.9	8.9
savant_smi	matches(smi, '[OH]C(=O)CS') = 1	1575	14.0	19.7	28.3
savant_smi	tanimoto(smi, 'OC(=O)CS') > 0.8	7	5.0	9.6	9.6
savant_smi	tanimoto(fp, 'OC(=O)CCS') > 0.8	24	4.0	4.3	4.4
savant_smi	fingertest(fp, 'OC(=O)CCS') = 1	18620	4.0	4.4	4.9

Availability:

The cartridge for Solaris, SGI, and Linux is available as part of 4.72 now.

We support Releases 8.1.5, 8.1.6, 8.1.7 under Solaris 7 (Solaris 8 has been tested but not officially supported yet), Oracle 64-bit releases under Irix 6.5.8+ (m-releases only), and Linux under Red Hat 6.2.

Support the standard edition of 8i on all platforms. Longer term, parallelization and partitioned indexes may cause us to revise this position.

Ongoing Efforts:

More screening (see Roger Sayle's EMUG'00 Talk)
- Triage - Find matches lexically without doing dt_smilin(), dt_match(). (eg. that the query 'C' matches C1CCCCC1 is evident from the strings).
- Partial molecule fingerprints - Screen based upon partial SMILES (the ability to fingerprint molecule fragments like 'cc').
- SMARTS enumeration - Screen based upon logical OR of multiple enumerated SMILES queries from a single SMARTS ([O,N]ccC=[O,N] is identical to 'OccC=O or NccC=O or OccC=N or NccC=N').
- SMILES compression - Use compressed SMILES for internal processing of cartridge indexes.
Optimizer support - Continue refining cost and selectivity functions as needed for current indextypes.
New comparison/index operator - component(). This is the 'part' query, arguably it doesn't belong in a normalized database, however it's cheap and easy to add to the ddrole index. For example:
component('c1ccccc1.C1CCCCC1', 'C1CCCCC1') => TRUE

CLOB Support - Support long SMILES in the cartridge.

Oracle's VARCHAR2 datatype definition is somewhat confusing. PL/SQL (the language which Oracle uses for procedural operations) defines the maximum VARCHAR2 as 32767 characters. SQL and the database define the maximum VARCHAR2 as 4000 characters. Since the cartridge indexes are invoked from SQL against a database column, the maximum VARCHAR2 in the indexes is 4000 characters.

In 4.72 all the PL/SQL functions support VARCHAR2 strings up to 32767 characters. You can use them today provided you don't try to store the longer VARCHAR2 strings in a VARCHAR2 datatype in the database. You'll need to use a CLOB or LONG type.

If you want (in preparation for 4.8), you can write wrappers around the cartridge functions to support longer SMILES. An example like below will have the same interface and behavior as the proposed 4.8 interface, it will simply be limited to 32767 characters:


create function csmi2cansmi(smiles IN CLOB, type in NUMBER, 
  cansmi IN OUT NOCOPY CLOB) return number
  as
   smi varchar2(32767);
   rc varchar2(32767);
   slen number;
   rlen number;
  begin
   slen := dbms_lob.getlength(smiles);
   dbms_lob.read(smiles, slen, 1, smi);

   rc := ddpackage.fsmi2cansmi(smi, type);
   if (rc is NULL) then
     return ODCIConst.Error;
   end if;

   rlen := length(rc);
   dbms_lob.write(retlob, rlen, 1, rc);
   dbms_lob.trim(retlob, rlen);

   return ODCIConst.Success;

   exception
      when others then
        raise;
 end;

In 4.8, we'll support CLOB datatypes everywhere we currently use VARCHAR2.

Functions which return VARCHAR2. The return CLOB will be passed as an additional argument.


  fsmi2cansmi(smi IN VARCHAR2, type IN NUMBER) => VARCHAR2
  fsmi2cansmi(smi IN CLOB, type IN NUMBER, can IN OUT CLOB) => NUMBER

Functions which return NUMBER.


  fsmi2amw(smi IN VARCHAR2) => NUMBER
  fsmi2amw(smi IN CLOB) => NUMBER

  fcontains(s1 IN VARCHAR2, s2 IN VARCHAR2) => NUMBER
  fcontains(s1 IN VARCHAR2, s2 IN CLOB) => NUMBER
  fcontains(s1 IN CLOB, s2 IN VARCHAR2) => NUMBER
  fcontains(s1 IN CLOB, s2 IN CLOB) => NUMBER

Indexes
All operator combinations supported (eg. all four contains permutations).

Prototype 4.8 version running CLOB indexes on NCI:


SQL> desc nci;
 Name                       Null?    Type
 -------------------------- -------- ---------------
 SMI                        NOT NULL VARCHAR2(428)
 NSC                        NOT NULL NUMBER(6)
 CAS_RN                     NOT NULL VARCHAR2(11)
 CL_1                                NUMBER(5)
 CL_2                                NUMBER(3)
 CL_3                                NUMBER(5,4)

SQL> desc ncib;
 Name                       Null?    Type
 -------------------------- -------- ---------------
 SMI                                 CLOB
 NSC                        NOT NULL NUMBER(6)
 CAS_RN                     NOT NULL VARCHAR2(11)
 CL_1                                NUMBER(5)
 CL_2                                NUMBER(3)
 CL_3                                NUMBER(5,4)

SQL> create index ncii on nci(smi) 
       indextype is c$dcischem.ddblob;

Index created.
Elapsed: 00:12:56.46

SQL> create index nciib on ncib(smi) 
       indextype is c$dcischem.ddblob;

Index created.
Elapsed: 00:14:13.28

The following timings are on a Sun Ultra 60 (2x360MHz) with 768 MB of real memory. It's running Oracle 8.1.7 Enterprise Edition. The table is the 126705 structures from NCI95.

Index Creation Times ( mm:ss )

Index type VARCHAR2 CLOB
exact 2:10 3:08
graph 7:10 10:54
blob 12:57 14:13

Index type	VARCHAR2	CLOB
exact	2:10	3:08
graph	7:10	10:54
blob	12:57	14:13

Query Performance (seconds elapsed):

Table name Query Hits VARCHAR2 CLOB
nci contains(smi, 'OC(=O)CS') = 1 507 1.27 1.29
nci matches(smi, '[OH]C(=O)CS') = 1 279 1.31 1.34
nci fingertest(smi, 'OC(=O)CS') = 1 1237 0.41 0.43
nci tanimoto(smi, 'OC(=O)CS') > 0.75 16 0.48 0.48

Table name	Query	Hits	VARCHAR2	CLOB
nci	contains(smi, 'OC(=O)CS') = 1	507	1.27	1.29
nci	matches(smi, '[OH]C(=O)CS') = 1	279	1.31	1.34
nci	fingertest(smi, 'OC(=O)CS') = 1	1237	0.41	0.43
nci	tanimoto(smi, 'OC(=O)CS') > 0.75	16	0.48	0.48

New Cartridge Platform - Windows 2000 Professional (NT).

Why?

Important Oracle platform,
Relevant for "project-level" databases,
Requested by customers (both cartridge and toolkits).

The cartridge is in hand on NT. Works really well.

Caveats:

Installation is clunky (no GUI tool). Still UNIX-oriented, works exactly like the UNIX install, but in a DOS shell.
Program Object support not there yet. Not a show-stopper, just a porting issue (eg. fork(), exec() stuff). Program Objects not as relevant on NT as they don't exist now, and there are other ways to do the same thing in Oracle.
Will take some time for us to get more proficient with with NT.

The following tables compare times for index creation and search on identical hardware running either Linux Red Hat 6.2 with Oracle 8.1.6 or Windows 2000 Professional with Oracle 8.1.6. The hardware is a 800 Mhz PentiumIII-based Dell workstation with 256 MB memory. Again, the table is 126705 structures from NCI95.

Index Creation Times ( m:ss )

Index type NT Linux
exact 1:25 3:16
graph 5:12 8:21
blob 3:50 9:37

Index type	NT	Linux
exact	1:25	3:16
graph	5:12	8:21
blob	3:50	9:37

Query Performance:

Table name Query Hits NT Linux
nci exact(smi, 'c1ccccc1') = 1 1 9 ms 7 ms
nci graph(smi, 'c1ccccc1') = 1 3 10 ms 9 ms
nci tautomer(smi, 'c1ccccc1') = 1 1 9 ms 8 ms
nci fingertest(smi, 'C1CC1') = 1 4358 0.40 0.47
nci contains(smi, 'C1CC1') = 1 863 2.96 3.01
nci contains(smi, 'OC(=O)CS') = 1 507 0.65 0.77
nci matches(smi, '[OH]C(=O)CS') = 1 279 0.67 0.80
nci tanimoto(smi, '[OH]C(=O)CS') > 0.75 16 0.38 0.46
nci tversky(smi, 'c1ccccc1', 0.5, 0.5) > 0.5 1465 0.44 0.56
nci contains(smi, 'NCCc1cc(O)c(O)cc1') = 1 829 0.84 0.97
nci tanimoto(smi, 'NCCc1cc(O)c(O)cc1') > 0.75 77 0.42 0.48
(*** ms is milliseconds, *.** is seconds)

Table name	Query	Hits	NT	Linux
nci	exact(smi, 'c1ccccc1') = 1	1	9 ms	7 ms
nci	graph(smi, 'c1ccccc1') = 1	3	10 ms	9 ms
nci	tautomer(smi, 'c1ccccc1') = 1	1	9 ms	8 ms
nci	fingertest(smi, 'C1CC1') = 1	4358	0.40	0.47
nci	contains(smi, 'C1CC1') = 1	863	2.96	3.01
nci	contains(smi, 'OC(=O)CS') = 1	507	0.65	0.77
nci	matches(smi, '[OH]C(=O)CS') = 1	279	0.67	0.80
nci	tanimoto(smi, '[OH]C(=O)CS') > 0.75	16	0.38	0.46
nci	tversky(smi, 'c1ccccc1', 0.5, 0.5) > 0.5	1465	0.44	0.56
nci	contains(smi, 'NCCc1cc(O)c(O)cc1') = 1	829	0.84	0.97
nci	tanimoto(smi, 'NCCc1cc(O)c(O)cc1') > 0.75	77	0.42	0.48