Daylight Chemistry Cartridge, EMUG '99

Daylight Chemistry Cartridge

Jack Delany

DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA

Introduction:

The Daylight Chemistry cartridge is the result of several distinct collaborative development projects within Oracle, Novartis, Ontogen and Daylight. This whitepaper/presentation will describe Oracle Cartridge technology, review the history of the internal and external projects which have contributed to this effort, describe the current Daylight Cartridge implementation, and provide a view of future direction for the cartridge.

Oracle Cartridge Technology:

An Oracle Data Cartridge is a bundled set of tools which extends Oracle clients and servers with new capabilities. Typically, a single data cartridge deals with a particular 'information domain'. Some examples of current data cartridges include: image processing, spatial processing, audio processing.

A cartridge consists of a number of different components which work together to provide complete database capabilities within the information domain. The components include:

User-defined Object types,
Packages - PL/SQL utilities/libraries,
Interfaces to store, retrieve, search, and index the new types.

A cartridge extends the server. The new capabilites are tightly integrated with the database engine. The Cartridge interface specification provides interfaces to the query parser, optimizer, indexing engine, etc. Each of these subsystems of the server learns about the cartridge capabilities through these interfaces. The cartridge can use SQL, PL/SQL, Java, C, etc. to implement the functions.

From Oracle's point of view, the cartridge idea allows third-party organizations to expand the capabilities of the Oracle database server in a modular, supportable fashion.

Project History:

In late 1997 - early 1998, two independent projects were undertaken relating to 'alternate database interfaces' to the Daylight system.

Alberto Gobbi, Paul Gatewood (Novartis CP), Johnny Peterson (Oracle CH) collaborated on a Pilot project for data warehousing with Oracle. Results from this project were discussed and demo'ed at Mug98.
John Cargill (Ontogen) and Norah MacCuish (Daylight) collaborated on a prototype Informix Datablade. This work was also presented in a Mug98 talk.

Interest generated from Mug98 resulted in an ongoing Daylight / Oracle project to generate requirements, develop a cartridge prototype, and refine its functionality. The paradigm for this version of the cartridge was the "dayblob". Dayblob is a complete set of chemical functions and a merlin-pool-like implementation of chemical searching embedded completely within Oracle. This was most recently discussed in a talk by Dave Weininger during Mug99.

In Spring, 1999, Oracle released Oracle 8i (version 8.1.5), the first production version to support full cartridge capabilities. Version 8.1.6 is projected for general release in December of this year. It will have some notable features which will impact the Daylight cartridge implementation for the long term.

In light of Oracle 8.1.5 and 8.1.6 functionalities and performance issues, we've revisited some of the design constraints related to the Dayblob architecture and have been working on "Version 09" of the cartridge.

So, what is the Daylight Cartridge? First, what it isn't. It isn't a full-functioned toolkit interface / environment. It is a set of chemical utilities (normalizations, data conversions, comparison functions) which provide the missing pieces required to manage chemical data in an Oracle database.

Design Goals:

The key design goals for this version haven't changed from those presented in previous discussions:

Provide high performance chemical functionality within an RDBMS framework,
Deliver this functionality in a robust, flexible, supportable, expandable fashion,

The one new design goal added to the equation is the following:

Deliver the chemical functionality in a seamless, integrated fashion, from the perspective of the RDBMS user.

The main point here is that we recognize that in addition to the differences in the database itself (the servers, databases, tables, indexes), the members of the Oracle user community (the developers, database administrators, users) have very different perspectives than those which we are used to dealing with.

In order for the product to be successful, the Daylight cartridge must behave in a predictable, understandable way. The paradigms we use within our cartridge must match those which Oracle developers and DBAs already use and understand.

Architecture:

The Daylight toolkit interfaces with the Oracle server via callouts to the "extproc" utility. This utility provides a RPC-like mechanism for performing C-language function calls. Daylight toolkit code is wrapped inside this RPC layer for each of the defined cartridge functions.

One of the first concerns about this architecture is the efficiency of the RPC mechanism, and the potential that extproc will be a performance bottleneck. Some simple tests indicate that this isn't a problem. Consider the contrib program "cansmi", which takes as input SMILES and outputs canonical SMILES. A simple example, with 1999 SMILES from the medchem demo database runs rapidly:

$ time cansmi < test.smi
CN(C)CCCN1c2ccccc2Sc3ccccc13
COc1ccc(OC)c(c1)C(O)C(C)NC(C)(C)C
Nc1ccc(cc1)S(=O)(=O)Nc2nccs2
CCOC(=O)Nc1ccc(SCC2COC(Cn3ccnc3)(O2)c4ccc(Cl)cc4Cl)cc1
Cn1c(=O)n(C)c2ncn(CC(O)CN3CCN(CCCSc4ccccc4)CC3)c2c1=O
Cn1cnc2n(C)c(=O)[nH]c(=O)c12
...
SMILES in: 1999; SMILES out: 1999; SMILES changed: 0
So long, baby!

real    0m2.82s
user    0m2.78s
sys     0m0.03s

The analogous test can be run using the Oracle Cartridge by creating a table of the same 1999 SMILES, and running a SQL query:

SQL> describe cansmidemo;
 Name                 Null?   Type
 -------------------- ------- ----------------------------
 ID                           NUMBER
 SMILES                       VARCHAR2(4000)

SQL> select count(1) from cansmidemo;

  COUNT(1)
----------
      1999

SQL> select smi2cansmi(smiles, 0) from cansmidemo;
CN(C)CCCN1c2ccccc2Sc3ccccc13
COc1ccc(OC)c(c1)C(O)C(C)NC(C)(C)C
Nc1ccc(cc1)S(=O)(=O)Nc2nccs2
CCOC(=O)Nc1ccc(SCC2COC(Cn3ccnc3)(O2)c4ccc(Cl)cc4Cl)cc1
Cn1c(=O)n(C)c2ncn(CC(O)CN3CCN(CCCSc4ccccc4)CC3)c2c1=O
Cn1cnc2n(C)c(=O)[nH]c(=O)c12
...

1999 rows selected.

Elapsed: 00:00:03.80

As it turns out, a significant portion of the time spent is performing output to the screen. In order to eliminate that time, we'll rerun the queries as follows:

$ time cansmi < test.smi | wc
SMILES in: 1999; SMILES out: 1999; SMILES changed: 0
So long, baby!
    1999    1999   74725

real    0m2.80s
user    0m2.74s
sys     0m0.03s

And with Oracle:

SQL> select sum(length(smi2cansmi(smiles, 0))) from cansmidemo;

SUM(LENGTH(SMI2CANSMI(SMILES,0)))
---------------------------------
                            72725

Elapsed: 00:00:02.95

The overhead of executing functions through the extproc utility is roughly 250 ms in this case. This represents approximately 5000 - 10000 round-trips per second. For all but the most trivial toolkit processing, this will be a minor factor.

Another issue is data bandwidth between the extproc utility and the server. Tests on the blob-based index indicate that we can retrieve upwards of 100,000 rows per second from a tabular data source, and 60 MB per second from a blob-based data source.

Each of these data throughput and round-trip limits represents a design constraint which must be considered in the overall cartridge design. The design described in this whitepaper is what we consider to be the optimum tradeoff between resource efficiency, search performance, and the ability to maintain the full transaction concurrency model within Oracle for all cartridge data.

Cartridge Specification:

As indicated earlier, an Oracle Cartridge typically consists of three sets of functionalities: object type definitions, packaged functions, and indexing / data access tools. In turn, each of these three areas will be discussed with respect to the Daylight cartridge implementation.

Object Types:

The daylight cartridge does not define any new Oracle object types!!!

To review, within the Daylight Toolkit, each object has an internal object type and an external representation. The external representation is used to store objects in files, Thor databases, to communicate objects between processes, etc.

Daylight Object External Language

molecule SMILES

reaction SMILES

pattern SMARTS

transform SMIRKS

fingerprint Encoded ASCII

depiction 2D coordinate list

conformation 3D coordinate list

datatree TDT

monomer Monomer SMILES

multimer CHUCKLES

varimer CHORTLES

varipattern CHARTS

Daylight Object	External Language
molecule	SMILES
reaction	SMILES
pattern	SMARTS
transform	SMIRKS
fingerprint	Encoded ASCII
depiction	2D coordinate list
conformation	3D coordinate list
datatree	TDT
monomer	Monomer SMILES
multimer	CHUCKLES
varimer	CHORTLES
varipattern	CHARTS

Note that our external languages are all very expressive and well-behaved. That is, they are compact, with high information density. They are all printable ASCII strings. They all have well defined syntax and semantics.

So, our strategy with respect to the Oracle Cartridge is the following:

Oracle clients and servers manipulate Daylight objects exclusively via their external representations.

When needed, the cartridge instantiates internal Daylight objects within the cartridge to perform a specific task (eg. calculate a molecular weight from a SMILES). The interfaces between Oracle and Daylight always pass objects as their external string representations.

Implications of this "Object Model":

Since our objects are represented as strings within the RDBMS, any Oracle, Informix, Sybase, JDBC, ODBC, CORBA, etc., etc., etc., client, server, middleware, application layer, etc., etc., etc. can handle these "objects" in their external form.

Only the endpoints of communication must understand what the objects mean: the Daylight Oracle cartridge provides the server-side endpoint, and a front end user interface provides the client-side endpoint. Otherwise, the objects effectively "tunnel" through the middle layers.

This does not preclude the design of an Oracle-specific object layer on top of the cartridge system (eg. ODBC); we simply don't require one, and don't dictate which model, if any, you use.

There is an overhead associated with instantiating objects on demand. Naturally, the functions which are provided in the cartridge perform higher granularity operations than in the native Daylight toolkit.

For example, consider the SQL operator smi2cansmi(). It's data dictionary definitions follow:

create or replace function fsmi2cansmi( smiles in VARCHAR2, type in number )
  return  varchar2 
  is
  language C
  name "oci_smi2cansmi"
  library DDLIB
  with context
  parameters ( context, 
               smiles STRING,
               smiles LENGTH INT,
               smiles INDICATOR,
               type,
               type INDICATOR,
               RETURN );

create or replace operator smi2cansmi binding (varchar2, number)
  return varchar2 using fsmi2cansmi;

The cartridge C code which implements the function looks a lot like regular toolkit code, with the notable addition of a bunch of 'OCI' calls.

char *oci_smi2cansmi(OCIExtProcContext *context,
                     char *smi, int slen, short smi_ind,
		     OCINumber *ocitype, short type_ind)
{
  OCIEnv *envhp;
  OCISvcCtx *svchp;
  OCIError *errhp;
  char *retval;
  dt_Handle ob;
  int lens, type;
  char *str;

  /*** Deal with arguments ***/

  ...
  
  ob = dt_smilin(slen, smi);
  str = dt_cansmiles(&slen, ob, type);
  if ((ob != NULL_OB) && (str != NULL) && (slen > 0))
    {
      retval = (char *)OCIExtProcAllocCallMemory(context, slen);
      strncpy(retval, str, slen);
      retval[slen] = '\0'; 
    }
  else
    {
      retval = (char *)OCIExtProcAllocCallMemory(context, 4);
      *retval = '\0';
    }
  dt_dealloc(ob);

  return retval;
}

The important points to recognize are that the function takes a SMILES string as input, and returns a SMILES string as output. Inside the function, it creates an internal molecule or reaction object from the SMILES, canonicalizes it, and then destroys the internal molecule or reaction object before returning. The internal Daylight object only exists for the duration of the call. The function is stateless with respect to the Daylight toolkit.

Although there is some additional overhead for instantiating objects and throwing them away, this must be weighed against the cost of performing the extproc call, the overhead of performing the IPC for the call, and the overhead of providing state information (if one were to attempt to implement persistant objects within extproc). The additional overhead has minimal impact on overall throughput.

Furthermore, there are very real opportunities for optimizations and shortcuts which we can exploit at the string level. For example, caching the most recent calculation can benefit queries like:

select id, smi2cansmi(smiles, 0) from table order by length(smi2cansmi(smiles, 0)) asc;

Similarly, some operations can be performed without actually interpreting the external representation of the object. For example, it is simple to parse a SMILES lexically and count the net charge.

PL/SQL Functions, SQL Operators:

There are quite a few user-accessible PL/SQL functions and SQL operators implemented for the cartridge. All are stateless.

Molecule / Reaction functions:

function fsmi2cansmi ( smiles IN VARCHAR2, type IN NUMBER ) => VARCHAR2 operator smi2cansmi ( smiles IN VARCHAR2, type IN NUMBER ) => VARCHAR2

Canonicalize an input SMILES. Returns a canonical SMILES string. Type is either 0 or 1, for unique or absolute SMILES, respectively. [This might be better as two functions, eg; smi2asmi(), smi2usmi() without the type argument]

function fsmi2netch ( smiles IN VARCHAR2 ) => NUMBER operator smi2netch ( smiles IN VARCHAR2 ) => NUMBER

Return the net charge of the molecule or reaction.

function fsmi2hcount ( smiles IN VARCHAR2 ) => NUMBER operator smi2hcount ( smiles IN VARCHAR2 ) => NUMBER

Return the total hydrogen count for the molecule or reaction.

function fsmi2mf ( smiles IN VARCHAR2 ) => VARCHAR2 operator smi2mf ( smiles IN VARCHAR2 ) => VARCHAR2

Returns the molecular formula string for the molecule or reaction.

function fsmi2amw ( smiles IN VARCHAR2 ) => NUMBER operator smi2amw ( smiles IN VARCHAR2 ) => NUMBER

Return the average molecular weight for a molecule or reaction.

function fsmi2graph ( smiles IN VARCHAR2 ) => VARCHAR2 operator smi2graph ( smiles IN VARCHAR2 ) => VARCHAR2

Return the hydrogen- and charge-suppressed canonical graph string for a given molecule or reaction.

Fingerprint functions:
function fsmi2fp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2 operator smi2fp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2

Returns the ASCII fingerprint for a given molecule or reaction. Min and max are the minimum and maximum pathlengths, respectively, and size is the number of bits in the fingerprint.

function fsmi2xfp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2 operator smi2xfp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2

Return the ASCII difference fingerprint for a given molecule or reaction. Min and max are the minimum and maximum pathlengths, respectively, and size is the nubmer of bits in the fingerprint.

function ffoldfp ( fpstr IN VARCHAR2, nbits IN NUMBER, dens IN NUMBER ) => VARCHAR2 operator foldfp ( fpstr IN VARCHAR2, nbits IN NUMBER, dens IN NUMBER ) => VARCHAR2

Fold the given fingerprint to the minimum appropriate size or density, whichever is limiting, and returns the new, folded fingerprint.

function fbitcount ( fpstr IN VARCHAR2 ) => NUMBER operator bitcount ( fpstr IN VARCHAR2 ) => NUMBER

Return the number of bits on in the fingerprint.

function fnbits ( fpstr IN VARCHAR2 ) => NUMBER operator nbits ( fpstr IN VARCHAR2 ) => NUMBER

Return the total size of the fingerprint, in bits. In the current Daylight toolkit this will always be a power of two.

function feuclid ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2 ) => NUMBER operator euclid ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2 ) => NUMBER

Return the euclidean distance between two fingerprints. If the fingerprints are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. The returned value is a floating point number between 0.0 and 1.0.

function ftanimoto ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2 ) => NUMBER operator tanimoto ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2 ) => NUMBER

Return the tanimoto distance between two fingerprints. If the fingerprints are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. The returned value is a floating point nubmer between 0.0 and 1.0.

function ftversky ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2, alpha IN NUMBER, beta IN NUMBER ) => NUMBER operator tversky ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2, alpha IN NUMBER, beta IN NUMBER ) => NUMBER

Return the tversky distance between two fingerprints. If the fingerprints are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. The parameters alpha and beta must be in the range of 0.0 - 1.0. The returned value is a floating point nubmer between 0.0 and 1.0.

function ffingertest ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2 ) => NUMBER operator fingertest ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2 ) => NUMBER

Returns 1 if all of the bits in fpstr1 are also present in fpstr2. That is, the fingerprint from fpstr1 represents a possible substructure of fpstr2. If the fingerprints are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. The returned value is either 0 or 1.

function fatomnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2 operator atomnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2 function fbondnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2 operator bondnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2

Returns a potentially reordered N-tuple string for the given list VARCHAR2 input parameter. The list string is reordered based on the canonical atom or bond ordering of the input SMILES. ntuple is the number of comma-separated values per atom or bond, and isotype is 0 for unique SMILES canonicalization and 1 for absolute SMILES canonicalization.

Comparison functions:
function fexact ( a IN VARCHAR2, b IN VARCHAR2) => NUMBER operator exact ( a IN VARCHAR2, b IN VARCHAR2) => NUMBER

Returns 1 if the two input strings are identical, 0 otherwise. The operator is optionally backed by the ddexact indextype.

function fgraph ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER operator graph ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if the two input SMILES share the same canonical graph, 0 otherwise. The operator is optionally backed by the ddgraph indextype.

function ftautomer ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER operator tautomer ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if the two input SMILES share the same canonical graph, net charge, and total hydrogen count, 0 otherwise. The operator is optionally backed by the ddgraph indextype.

function freactant ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER operator reactant ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER function fagent ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER operator agent ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER function fproduct ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER operator product ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if smiles2 (a molecule SMILES) is a component of smiles1 (a reaction SMILES) with the appropriate role, otherwise returns 0. The operators are optionally backed by the ddrole indextype.

function fcontains ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2) => NUMBER operator contains ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2) => NUMBER

Returns 1 if smiles1 contains smiles2; that is, smiles2, assuming opened valences for all hydrogens, is a substructure of smiles1. The operator is optionally backed by the ddblob indextype.

function fmatches ( smiles IN VARCHAR2, smarts IN VARCHAR2) => NUMBER operator matches ( smiles IN VARCHAR2, smarts IN VARCHAR2) => NUMBER

Returns 1 if the smarts expression matches the given SMILES, 0 otherwise. The operator is optionally backed by the ddblob indextype.

function fsimilar ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2, threshold IN NUMBER ) => NUMBER operator similar ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2, threshold IN NUMBER ) => NUMBER operator score ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2, threshold IN NUMBER ) => NUMBER

Returns 1 if the tanimoto similarity between the two SMILES is greater than the threshold. Calculates normal fingerprints. The operators are optionally backed by the ddblob indextype. score() is an ancillary operator to the similar() operator, and allows the convenient selection of the tanimoto score during index-based queries.

function fsmi2hash ( smiles IN VARCHAR2 ) => NUMBER operator smi2hash ( smiles IN VARCHAR2 ) => NUMBER function fsmi2ghash ( smiles IN VARCHAR2 ) => NUMBER operator smi2ghash ( smiles IN VARCHAR2 ) => NUMBER

These functions are part of the interface, but are only needed by the extensible index code. These should not be used; they are only mentioned becuase they appear in the data dictionary. They may be hidden in the production release.

Extensible Indexes:

The the major change in implementation between version 08 and version 09 of the cartridge is the implementation of indexes. In version 08 and before, all index operations were implemented in a single blob-based index. Version 09 splits the indexes up into four distinct indextypes. The reasons for this change are:

Simplify the blob-based index. In order to support all of the search types required, the blob required multiple copies of the SMILES, multiple hash tables, etc. Very complex and resource intensive.
Descrease the size of the blob-based index. As the blob-based index is used for searching and must be sent between the Oracle server and the extproc for a query, minimizing the size of the index increases search performance. Caching of the blob in extproc is possible with version 09, however caching the blob requires the availability of memory in extproc sufficient to hold the entire blob.
Allow user chosen indexes. With the previous monolithic blob, all indexes were implemented all the time. In this version, one can choose any combination of the four indextypes for a given column. One can also use any combination of the indextypes on multiple columns in a single table.
Simplify maintainence of the transaction model. The three new tabular indexes easily obey Oracle's transaction concurrency model. The simplified blob-based index can also now obey the transaction model, whereas the previous version could not.

Exact lookup indextype (ddexact):

This indextype supports the exact() operator. This is the least interesting index, as it implements a simple string comparison between the indexed column and the query. In that sense, the built in BTree index within Oracle has the similar functionality. The BTree index even supports the "greater than" and "less than" operators, which this index doesn't.

The main advantage of the ddexact indextype is that it can index strings up to 4000 bytes (the maximum varchar2 length), while the internal BTree can only index strings up to approximately 40% of a block; for a database with a 2048 byte blocksize, this means that the largest indexable varchar2 string using the BTree is approximately 700 bytes. Similarly, in a future version of the cartridge this index will support the comparison of LOB datatypes; the BTree index does not.

This index requires approximately 22 bytes per row, independent of the length of the SMILES being indexed.

SQL> select * from medium where exact(smiles, 'Oc1ccccc1') = 1;

      ID SMILES
-------- ----------------------------------------
  220688 Oc1ccccc1

Elapsed: 00:00:00.10

Graph and Tautomer indextype (ddgraph):

This indextype supports the graph() and tautomer() operators. It creates a tabular index of the graph and tautomeric information (hydrogen count and net charge) for rapid retrieval. The indexed search only requires an index lookup based on the graph of the query, followed by comparison of the net charge and hydrogen count for a tautomer search.

This index requires approximately 30 bytes per row, independent of the length of the SMILES being indexed.

SQL> select * from medium where graph(smiles, 'Oc1ccccc1') = 1;

      ID SMILES
-------- ----------------------------------------
  153175 O=C1CCC=CC1
  181024 O=C1CCCC=C1
  220688 Oc1ccccc1
  247340 O=C1C=CCC=C1
  320108 OC1=CC=CC[CH]1
  332013 Oc1c[c]ccc1
  332014 Oc1cc[c]cc1
  333710 Oc1[c]cccc1
  347378 O=C1CC[CH]C=C1
  360322 O=C1[CH]C=CC=C1
  443810 [O-]C1CCCCC1
  486637 O=C1CCC[C]=C1
  568426 OC1CC=CC=C1
  740390 [OH2+]C1CCCCC1

14 rows selected.

Elapsed: 00:00:00.36

Reaction role index (ddrole):

This indextype supports component level searches for exact molecules within reactions by role. It creates a tabular index of each component in every reaction, by role. Component level searches are implemented simply by a index lookup of the component, followed by validation of the role.

This index requires approximately 22 bytes per component, independent of the length of the SMILES being indexed.

SQL> select * from rxn where reactant(smiles, 'Oc1ccccc1') = 1;

      ID SMILES
-------- ------------------------------------------------------
 1914882 CC1Sc2ccccc2N(C)C1=O.Oc1ccccc1>>CN1C(=O)C(C)(Sc2ccccc1
         2)c3ccc(O)cc3

  319303 CN(C)P1OCc2ccccc12.Oc1ccccc1>>C1OP(Oc2ccccc2)c3ccccc13

  ...

122 rows selected.

Elapsed: 00:00:00.32

SMARTS and Similarity search index (ddblob):

This index caches the SMILES and fingerprints in a single binary large object for rapid searching. Substructure and similarity searches are implemented through this blob-based image of the structural data. The blob is persistant; it is automatically stored in the Oracle database and retrieved for processing. The blob is kept synchronized with the base table in real time and strictly obeys the Oracle transaction model. In addition, this indextype supports a user-controlled extproc cache of the blob data. On a per-session basis, one can hold the entire blob in memory in extproc; this eliminates the overhead of passing the blob data from the Oracle server to extproc for each query.

SQL> select count(1) from medium where contains(smiles, 'O=c1ccocc1') = 1;

  COUNT(1)
----------
      6126

Elapsed: 00:00:20.87

SQL> select * from rxn where contains(smiles, '>>O=c1c(C)cocc1') = 1;

      ID SMILES
-------- ----------------------------------------
 2009323 CCOC(=O)C.CC(C)(C)[Si](C)(C)OC(=O)C(F)(F
         )F.O=Cc1coc2ccccc2c1=O>>CCOC(=O)CC(O[Si]
         (C)(C)C(C)(C)C)c1coc2ccccc2c1=O

  291327 CCOC(=O)CC(=O)c1ccccc1.Fc1ccccc1C(=O)Cl>
         >CCOC(=O)c1c(oc2ccccc2c1=O)c3ccccc3

...

111 rows selected.

Elapsed: 00:00:01.20

SQL> select count(1) from medium where similar(smiles, 'OC(=O)CS', 0.8) = 1;

  COUNT(1)
----------
        19

Elapsed: 00:00:13.82

SQL>  select smiles, score(1) from medium 
           where similar(smiles, 'OC(=O)CS', 0.8, 1) = 1

SMILES                                     SCORE(1)
---------------------------------------- ----------
OCC(S)C=O                                .800000012
[NH4+].[O-]C(=O)CS                       .918918908
[Na+].[O-]C(=O)CS                        .894736826
COC(=O)CS                                .871794879
O.O.O.[Ca+2].[O-]C(=O)C[S-]              .894736826
[Ca+2].[O-]C(=O)C[S-]                    .894736826
NCCO.OC(=O)CS                            .809523821
[K+].[O-]C(=O)CS                         .944444418
CCSCC(=O)O                               .809523821
[Ca+2].[O-]C(=O)CS                       .894736826
[Na+].[Bi+2].[O-]C(=O)C[S-]              .809523821

SMILES                                     SCORE(1)
---------------------------------------- ----------
[Sr+2].[O-]C(=O)CS                       .894736826
CCOC(=O)CS                               .809523821
CSCC(=O)O                                .871794879
C[S+](C)CC(=O)O                          .850000024
CCOC(=O)C[S]                             .809523821
[As].COC(=O)C[S].COC(=O)C[S].COC(=O)C[S] .809523821
CC(S)C(=O)[O-]                           .871794879
[Ba+2].[O-]C(=O)CS.[O-]C(=O)CS           .894736826

19 rows selected.

Elapsed: 00:00:13.70

Additional Search Times:

The following table lists some additional comparative search times for the blob-based indexes. The databases are:

small - 118,611 SMILES, from NCI95,
medium - 999,900 SMILES, subset from internal Daylight databases,
large - 5,464,800 SMILES, combined Daylight internal databases (Chop Suey),
rxn - 104,200 reaction SMILES, from Chemsynth97.

The timings are on a Sun Ultra 60, with 768 MB of real memory. The Oracle installation is a vanilla 8.1.5 install, with the only changes being the increase of db_block_buffers, shared_pool_size, log_buffers parameters from default values in init.ora.

Index Creation Times

Table name Index type Creation time (hh:mm:ss)
small exact 2:20
small graph 7:40
small blob 16:00*
medium exact 21:00
medium graph 1:15:00
medium blob 3:27:00*
large exact 2:46:00
large graph 7:24:00
rxn exact 2:01
rxn role 19:30
rxn blob 30:19*
(* - Includes fingerprint generation time)

Table name	Index type	Creation time (hh:mm:ss)
small	exact	2:20
small	graph	7:40
small	blob	16:00*
medium	exact	21:00
medium	graph	1:15:00
medium	blob	3:27:00*
large	exact	2:46:00
large	graph	7:24:00
rxn	exact	2:01
rxn	role	19:30
rxn	blob	30:19*

Query Performance:

Table name Query Hits Time (mm:ss) no cache / cache
small contains(smiles, 'OC(=O)C1') 0 00:00.35 00:00.02
small contains(smiles, 'OC(=O)CS') 492 00:01.45 00:01.13
small matches(smiles, '[OH]C(=O)CS') 268 00:01.27 00:00.95
small similar(smiles, 'OC(=O)CS', 0.8) 7 00:01.43 00:01.10
medium exact(smiles, 'Oc1ccccc1') 1 00:00.11
medium graph(smiles, 'Oc1ccccc1') 14 00:00.06
medium contains(smiles, 'OC(=O)C1') 0 00:09.50 00:00.03
medium contains(smiles, 'OC(=O)CS') 5597 00:47 00:33
medium matches(smiles, '[OH]C(=O)CS') 1592 00:35 00:28
medium similar(smiles, 'OC(=O)CS', 0.8) 19 00:18 00:09.10
large exact(smiles, 'NCCc1ccccc1') 1 00:00.12
large graph(smiles, 'c1ccccc1') 28 00:00.11
rxn reactant(smiles, 'Sc1ccccc1') 181 00:09.48
rxn product(smiles, 'Oc1ccccc1F') 1 00:00.11
rxn contains(smiles, 'OC(=O)C1') 0 00:00.50 00:00.03
rxn contains(smiles, '>>OC(=O)CS') 754 00:05.70 00:04.90
rxn matches(smiles, '[OH]C(=O)CS>>') 167 00:09.80 00:09.50
rxn similar(smiles, 'OC(=O)CCl>>OC(=O)CS', 0.5) 68 00:01.45 00:00.98

Table name	Query	Hits	Time (mm:ss) no cache / cache
small	contains(smiles, 'OC(=O)C1')	0	00:00.35	00:00.02
small	contains(smiles, 'OC(=O)CS')	492	00:01.45	00:01.13
small	matches(smiles, '[OH]C(=O)CS')	268	00:01.27	00:00.95
small	similar(smiles, 'OC(=O)CS', 0.8)	7	00:01.43	00:01.10
medium	exact(smiles, 'Oc1ccccc1')	1	00:00.11
medium	graph(smiles, 'Oc1ccccc1')	14	00:00.06
medium	contains(smiles, 'OC(=O)C1')	0	00:09.50	00:00.03
medium	contains(smiles, 'OC(=O)CS')	5597	00:47	00:33
medium	matches(smiles, '[OH]C(=O)CS')	1592	00:35	00:28
medium	similar(smiles, 'OC(=O)CS', 0.8)	19	00:18	00:09.10
large	exact(smiles, 'NCCc1ccccc1')	1	00:00.12
large	graph(smiles, 'c1ccccc1')	28	00:00.11
rxn	reactant(smiles, 'Sc1ccccc1')	181	00:09.48
rxn	product(smiles, 'Oc1ccccc1F')	1	00:00.11
rxn	contains(smiles, 'OC(=O)C1')	0	00:00.50	00:00.03
rxn	contains(smiles, '>>OC(=O)CS')	754	00:05.70	00:04.90
rxn	matches(smiles, '[OH]C(=O)CS>>')	167	00:09.80	00:09.50
rxn	similar(smiles, 'OC(=O)CCl>>OC(=O)CS', 0.5)	68	00:01.45	00:00.98

Summary:

To summarize, we can deliver a set of Oracle utilities via a cartridge which will provide high performance chemical information processing within the Oracle system. These tools allows the handling of molecules and reactions, complete with coordinates for display and fingerprints for searching efficiency. Four chemistry-specific indextypes are provided, which allow the database creator to tailor the database for specific searches and applications.

To Do:

These are several outstanding issues which I haven't covered in the above:

SMI->MDL and MDL->SMI functions.
"Packaging" of the cartridge. Oracle has a specific process whereby a new third-party cartridge is registered. This includes the assignment of unique naming conventions specific to the third-party developer. We haven't gone through this registry process yet. Hence, function and index names may change.
Fingerprint handling in the current blob-based index. The index currently computes its own fingerprints. We will provide the option for the blob to get its fingerprints from a user column. The blob-based index can and will support tversky, and fingertest operators.
We will provide the ability to create a "fingerprint-only" index, which will support fingerprint searching and screening based on user-supplied fingerprints.

Timetable:

Our target date for a limited release of Version 09 is the end of November, 1999. Our target date for full release of Version 09 is February 1, 2000. This release will be supported on Solaris 2.7. Other platforms will be made available as dictated by demand.

Futures:

Oracle indicated that version 8.1.6 will be available in production by December, 1999. One of the important new features in 8.1.6 is the ability to run Java in the server. This will impact the design for future versions of the Daylight cartridge, specifically the design of the blob-based indexes.

Informix has been "back-burnered". Note that all of the various interfaces and tools described here have direct parallels within the Informix Datablade model. It is possible to provide an Informix datablade; this will be dictated by demand.

How about:

select clogp('c1ccccc1') from dual;

CLOGP('C1CCCCC1')
-----------------
            2.142

Elapsed: 00:00:00.02

Program objects can be called from extproc. It is feasible to provide an extensible interface through program objects which doesn't require any user extproc programming.

One last possibility down the road. Consider the implications of the following PL/SQL code, which runs under 8.1.5.. The first section is a simple PL/SQL program. The second is a user-defined function which takes a molecule or reaction object handle as an argument.

declare
  subtype dt_Handle is number;
  mol dt_Handle;
  dummy number;
  ismiles varchar2(4000);
begin
  ismiles := 'Oc1ccccc1';
  mol := dt_smilin( ismiles );
  dbms_output.put_line('Proton count for ' ||
                       ismiles || ' is ' || proton_count(mol));
  dummy := dt_dealloc(mol);
  ismiles := 'OC1CCCCC1';
  mol := dt_smilin( ismiles );
  dbms_output.put_line('Proton count for ' ||
                       ismiles || ' is ' || proton_count(mol));
  dummy := dt_dealloc(mol);
end;

create or replace function proton_count (ob in number)
 return number as
  subtype dt_Handle is number;
  atoms dt_Handle;
  atom dt_Handle;
  hcount number := 0;
begin
  atoms := dt_stream(ob, 19);
  loop
    atom := dt_next(atoms);
    if (atom = 0) then
      exit;
    end if;
    if (dt_number(atom) = 1) then
      hcount := hcount + 1;
    end if;
    hcount := hcount + dt_imp_hcount(atom);
  end loop;
  return hcount;
end;

OUTPUT:

Proton count for Oc1ccccc1 is 6
Proton count for OC1CCCCC1 is 12

It is possible for Daylight to provide a general PL/SQL toolkit interface within Oracle, including Smiles, Smarts, Fingerprint, Depict, Reaction, Monomer, and Program object but maybe not Thor or Merlin. Issues to be resolved include memory management, memory persistance and multiprocessing within extproc. We must understand Oracles long-term plans for extproc before proceeding. Of course, more than just a novelty, work on this product will be dictated by customer demand.