Daylight Chemistry Cartridge, Version 4.71

Jack Delany

DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA


TABLE OF CONTENTS:

  1. Quick Start.

  2. Installation.

  3. Installation Troubleshooting.

  4. Verification.

  5. De-Installation.

  6. PL/SQL Functions, SQL Operators.

  7. Extensible Indexes.

  8. Using Program Objects.

  9. Optimizer Support.

  10. Error Codes.

  11. Tuning Hints.


Quick Start:

This section gives a quick overview of the installation procedures for the cartridge. Please consult the detailed installation section for more information on each of these items.

  1. Locate the directory 'dcischem' within your Daylight distribution. This directory contains the shared object library, and a number of SQL and wrapped SQL scripts (.PLB).

  2. Create the Oracle user 'c$dcischem', identified by 'secret'. This user should have DBA privileges, and reasonable default tablespaces.

  3. Edit the file create.sql. Make sure that the value for the path to 'c$dcischemlib' is correct. Run the script with: 'svrmgrl < create.sql'.

  4. Grant the role 'daycart' to any non-privileged user who requires cartridge functionality.

  5. Revoke privileges from user 'c$dcischem'. Change the password for 'c$dcischem'.

  6. Configure and start the network listener. A listener must be configured to listen for IPC connections using key EXTPROC0. A network Service Name for 'EXTPROC_CONNECTION_DATA' must also be configured.

  7. See the SQL scripts in the 'TEST' subdirectory for examples of use of the cartridge.

Installation:

This section gives detailed installation instructions.

Installation Troubleshooting:

There are a number of details which must be correct for the cartridge to work properly. For a new Oracle installation, these are typically already set correctly, however the list of issues is included here for diagnostic purposes:

Verification

The subdirectory 'dcischem/TEST' includes a number of stand-alone SQL scripts which test the cartridge installation. The shell script 'run_tests' will execute all the SQL scripts in turn and compare the output to reference output files. Any discrepencies will be reported. The script 'run_tests' does attempt to log in as 'c$dcischem/secret', so if you've changed the password and privileges on user 'c$dcischem' you'll need to edit run_tests to reflect this change.

The scripts do use the 'EXPLAIN PLAN' facility, which requires that the plan table be present for the test user. The script $ORACLE_HOME/rdbms/admin/utlxplan.sql creates the plan table; this should be executed before running the 'run_tests' script.

  $ run_tests
  Testing dd_basic_test ...
  Comparing output to reference file ...
  dd_basic_test test OK.

  Testing dd_exact_test ...
  Comparing output to reference file ...
  dd_exact_test test OK.
  < ... >
  All tests passed.

Also, note that the SQL scripts give useful examples of using the cartridge functionality in each category.

De-Installation

In order to cleanly de-install the cartridge, all dependent indexes must be dropped from the system first. That is, any indexes created using one of the Daylight indextypes (ddexact, ddrole, ddgraph, ddblob) should be dropped. If one does not drop all the indexes first, then they will be marked as invalid. If this occurs, the indexes and their associated index tables (<indexname> || '_DDT') must be dropped manually.

After any dependent indexes have been dropped, one can run the script 'clean.sql'. This script will remove all cartridge index definitions, functions, operators, packages, roles, and synonyms. The script 'clean.sql' will remove everything which is installed by the script 'create.sql' with the exception of the license and program object tables. These tables are preserved since it is convenient to not need to recreate and populate these tables. The user c$dcischem will require DBA privilege to be granted in order to run the script. Example output follows:

  $ svrmgrl < clean.sql  

  Oracle Server Manager Release 3.1.6.0.0 - Production

  Copyright (c) 1997, 1999, Oracle Corporation.  All Rights Reserved.

  Oracle8i Enterprise Edition Release 8.1.6.0.0 - Production
  With the Partitioning option
  JServer Release 8.1.6.0.0 - Production

  SVRMGR> Connected.
  SVRMGR> Echo                            OFF
  SVRMGR> SVRMGR> Statement processed.
  SVRMGR> SVRMGR> Connected.
  Echo                            OFF
  Statement processed.
  ...
  Server Manager complete.
  $ 

It is safe to repeatedly run the scripts create.sql and clean.sql. The create.sql script creates a bunch of objects, types, etc. The clean.sql script removes them. Neither script will modify any other objects, types, or tables owned by c$dcischem or any other user on the system. The only components created by the create.sql script which are not removed by the clean.sql script are the license and progob tables. These two tables are preserved by the clean.sql script for convenience.

When one runs create.sql after clean.sql, one will see warnings that the tables license and progob already exist, however these warnings don't impact the outcome of the cartridge re-installation.

After running clean.sql, the database will be left in its original pre-cartridge state. One can then delete the license and progob tables, if desired, to eliminate all remnants of the cartridge.

Note that the user c$dcischem will continue to exist after running clean.sql and any non-cartridge tables, objects, etc. owned by c$dcischem will be unaffected by either clean.sql or create.sql.

The other way to remove the cartridge is to drop the user c$dcischem. Note that this does not remove public synonyms or roles, so these should be dropped manually (or via clean.sql).

  SQL> drop user c$dcischem cascade;

This will remove the user c$dcischem and all dependent objects. It will be necessary to repeat the entire installation procedure if this method is chosen.

PL/SQL Functions, SQL Operators:

There are quite a few user-accessible PL/SQL functions and SQL operators implemented for the cartridge. All are stateless. All of the PL/SQL functions are contained in a PL/SQL package called 'ddpackage'. This package is owned by 'c$dcischem'. Privileges required to access these functions are granted by the 'daycart' role.

The default installation (create.sql) also creates public synonyms for the package 'ddpackage' and all of the operators described herein. Hence, from any Oracle user, one can access the functions and operators by the signature documented below (eg: select tanimoto(...) from table rather than select c$dcischem.tanimoto(...) from table).

The one place where synonyms aren't allowed is in indextype specifications; hence you must refer to indextypes by their full names when creating indexes as a user other than c$dcischem (eg. create index <name> on small(smi) indextype is c$dcischem.ddblob).

General Purpose Functions:

The general purpose functions are typically used for debugging only. Operators and public synonyms are not created for these functions during the installation. Hence, the functions fsetdebug(), ftestlicense(), fhostid(), and fversion() must be referenced by their fully-qualified names (eg. c$dcischem.ddpackage.ftestlicense()).

function ddpackage.fsetdebug ( value IN NUMBER ) => NUMBER

Controls the level of detail of messages logged to the file: "/tmp/extproc.log". Sets the new value of the logging level to 'value'. Returns the prior value. Valid values are integers the range of 0 - 9.

  0 - No logging at all.
  1 - Error messages only (the default value).
  5 - Warnings and errors.
  9 - Notes, warnings, and errors.

This value is set on a per-session basis. Other values are reserved for future use.

function ddpackage.ftestlicense ( product IN VARCHAR2 ) => NUMBER

Checks the cartridge license. Takes an input parameter, which is the license type to check. [ In the current release, the only recognized value is 'daycart' ]. Returns 1 if the Daylight cartridge has a valid license, and 0 otherwise.

The license is contained in a special table which must be created and populated at cartridge install time.

function ddpackage.fhostid ( ) => VARCHAR2

Returns the hardware hostid which is used for generation of the hardware- specific license key. You'll need to provide this value to Daylight for us to generate a license.

function ddpackage.fversion ( which in VARCHAR2 ) => VARCHAR2

Returns the string version number of the Daylight toolkit or cartridge shared-object library currently running. Valid input parameters are:
   'toolkit'  -   Returns the current Daylight Toolkit version.
   'daycart'  -   Returns the cartridge executable version.

Molecule / Reaction Functions:

function ddpackage.fsmi2cansmi ( smiles IN VARCHAR2, type IN NUMBER ) => VARCHAR2
operator smi2cansmi ( smiles IN VARCHAR2, type IN NUMBER ) => VARCHAR2

Canonicalize an input SMILES. Returns a canonical SMILES string. Type is either 0 or 1, for unique or absolute SMILES, respectively.

function ddpackage.fsmi2netch ( smiles IN VARCHAR2 ) => NUMBER
operator smi2netch ( smiles IN VARCHAR2 ) => NUMBER

Return the net charge of the molecule or reaction.

function ddpackage.fsmi2hcount ( smiles IN VARCHAR2 ) => NUMBER
operator smi2hcount ( smiles IN VARCHAR2 ) => NUMBER

Return the total hydrogen count for the molecule or reaction.

function ddpackage.fsmi2mf ( smiles IN VARCHAR2 ) => VARCHAR2
operator smi2mf ( smiles IN VARCHAR2 ) => VARCHAR2

Returns the molecular formula string for the molecule or reaction.

function ddpackage.fsmi2amw ( smiles IN VARCHAR2 ) => NUMBER
operator smi2amw ( smiles IN VARCHAR2 ) => NUMBER

Return the average molecular weight for a molecule or reaction. The weight used for any atoms which do not have specified isotopes is the average atomic weight. The weight used for atoms with a specified isotope is the high precision molecular weight for that atom. For example, "c1ccccc1" returns 78.1184, the average molecular weight, while "[1H][12c]1[12c]([1H])[12c]([1H])[12c]([1H])[12c]([1H])[12c]1[1H]" returns 78.0469502, the high precision molecular weight for that particular isotope.

function ddpackage.fsmi2graph ( smiles IN VARCHAR2 ) => VARCHAR2
operator smi2graph ( smiles IN VARCHAR2 ) => VARCHAR2

Return the hydrogen- and charge-suppressed canonical graph string for a given molecule or reaction.

Fingerprint Functions:

function ddpackage.fsmi2fp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2
operator smi2fp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2

Returns the ASCII fingerprint for a given molecule or reaction. Min and max are the minimum and maximum pathlengths, respectively, and size is the number of bits in the fingerprint.

function ddpackage.fsmi2xfp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2
operator smi2xfp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2

Return the ASCII difference fingerprint for a given molecule or reaction. Min and max are the minimum and maximum pathlengths, respectively, and size is the nubmer of bits in the fingerprint.

function ddpackage.ffoldfp ( fpstr IN VARCHAR2, nbits IN NUMBER, dens IN NUMBER ) => VARCHAR2
operator foldfp ( fpstr IN VARCHAR2, nbits IN NUMBER, dens IN NUMBER ) => VARCHAR2

Fold the given fingerprint to the minimum appropriate size or density, whichever is limiting, and returns the new, folded fingerprint.

function ddpackage.fbitcount ( fpstr IN VARCHAR2 ) => NUMBER
operator bitcount ( fpstr IN VARCHAR2 ) => NUMBER

Return the number of bits on in the fingerprint.

function ddpackage.fnbits ( fpstr IN VARCHAR2 ) => NUMBER
operator nbits ( fpstr IN VARCHAR2 ) => NUMBER

Return the total size of the fingerprint, in bits. In the current Daylight toolkit this will always be a power of two.

function ddpackage.fisfp ( fpstr IN VARCHAR2 ) => NUMBER
operator isfp ( fpstr IN VARCHAR2 ) => NUMBER

Return 1 if the string is a fingerprint, 0 otherwise. The syntax of a fingerprint can never be confused with a valid SMILES. This is the bit of cleverness which allows us to overload the searching functions.

function ddpackage.fatomnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2
operator atomnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2
function ddpackage.fbondnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2
operator bondnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2

Returns a potentially reordered N-tuple string for the given list VARCHAR2 input parameter. The list string is reordered based on the canonical atom or bond ordering of the input SMILES. ntuple is the number of comma-separated values per atom or bond, and isotype is 0 for unique SMILES canonicalization and 1 for absolute SMILES canonicalization.

Comparison Functions:

function ddpackage.fexact ( a IN VARCHAR2, b IN VARCHAR2) => NUMBER
operator exact ( a IN VARCHAR2, b IN VARCHAR2) => NUMBER

Returns 1 if the two input strings are identical, 0 otherwise. The operator is optionally backed by the ddexact indextype.

function ddpackage.fgraph ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator graph ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if the two input SMILES share the same canonical graph, 0 otherwise. The operator is optionally backed by the ddgraph indextype.

function ddpackage.ftautomer ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator tautomer ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if the two input SMILES share the same canonical graph, net charge, and total hydrogen count, 0 otherwise. The operator is optionally backed by the ddgraph indextype.

function ddpackage.freactant ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator reactant ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
function ddpackage.fagent ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator agent ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
function ddpackage.fproduct ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator product ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if smiles2 (a molecule SMILES) is a component of smiles1 (a reaction SMILES) with the appropriate role, otherwise returns 0. The operators are optionally backed by the ddrole indextype.

function ddpackage.fcontains ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2) => NUMBER
operator contains ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2) => NUMBER

Returns 1 if smiles1 contains smiles2; that is, smiles2, assuming opened valences for all hydrogens, is a substructure of smiles1. The operator is optionally backed by the ddblob indextype.

function ddpackage.fmatches ( smiles IN VARCHAR2, smarts IN VARCHAR2) => NUMBER
operator matches ( smiles IN VARCHAR2, smarts IN VARCHAR2) => NUMBER

Returns 1 if the smarts expression matches the given SMILES, 0 otherwise. The operator is optionally backed by the ddblob indextype.

function ddpackage.feuclid ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER
operator euclid ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER

Return the euclidean distance between two fingerprints or SMILES. If both parameters are fingerprints and are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. If one parameter is a SMILES, its fingerprint is generated automatically at the size of the other parameter. If both parameters are SMILES, then a fingerprint size of 512 bits is used. The returned value is a floating point number between 0.0 and 1.0. This is optionally backed by the ddblob indextype.

function ddpackage.ftanimoto ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER
operator tanimoto ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER

Return the tanimoto distance between two fingerprints or SMILES. If both parameters are fingerprints and are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. If one parameter is a SMILES, its fingerprint is generated automatically at the size of the other parameter. If both parameters are SMILES, then a fingerprint size of 512 bits is used. The returned value is a floating point number between 0.0 and 1.0. This is optionally backed by the ddblob indextype.

function ddpackage.ftversky ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2, alpha IN NUMBER, beta IN NUMBER ) => NUMBER
operator tversky ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2, alpha IN NUMBER, beta IN NUMBER ) => NUMBER

Return the tversky distance between two fingerprints or SMILES. If both parameters are fingerprints and are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. If one parameter is a SMILES, its fingerprint is generated automatically at the size of the other parameter. If both parameters are SMILES, then a fingerprint size of 512 bits is used. The returned value is a floating point number between 0.0 and 1.0. This is optionally backed by the ddblob indextype.

function ddpackage.ffingertest ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER
operator fingertest ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER

Returns 1 if all of the bits in fp_or_smi2 are also present in fp_or_smi1. That is, the fingerprint from fp_or_smi2 represents a possible substructure of fp_or_smi1. If both parameters are fingerprints and are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. If one parameter is a SMILES, its fingerprint is generated automatically at the size of the other parameter. If both parameters are SMILES, then a fingerprint size of 512 bits is used. The returned value is a floating point number between 0.0 and 1.0. This is optionally backed by the ddblob indextype.

Program Object Functions:

There are two lower-level functions which invoke program objects. It is expected that they will almost always be called from a PL/SQL wrapper layer, which would be responsible for packaging the communication between Oracle and the program object in a meaningful way. Hence, the default cartridge installation does not create operators for these two functions.

function ddpackage.fprogob ( name IN VARCHAR2, message IN VARCHAR2) => VARCHAR2

Communicates with a program object running on the Oracle server. The parameter 'name' is the symbolic name of the program object. The table c$dcischem.progob contains the mappings of symbolic names to actual executable programs. The function fprogob() proceeds as follows:

  1. Looks up the symbolic name in the table c$dcischem.progob,
  2. If the executable is not already running, starts the program object,
  3. Parses the 'message' parameter into a sequence of strings,
  4. Sends the sequence of strings to the program object,
  5. Takes the returned sequence of strings and converts it into a newline delimited VARCHAR2,
  6. Returns the VARCHAR2 string.

Valid delimiters for the lines in the 'message' parameter include the normal line-termination characters for UNIX, Mac, and PC: '\n', '\r', and '\r\n'. The function properly handles all three termination cases. The returned VARCHAR2 string is delimited by UNIX line-termination '\n'.

function ddpackage.fprogob ( name IN VARCHAR2, message IN OUT CLOB, answer IN OUT CLOB) => NUMBER

Identical to the previous function in operation; the message to the program object and the response from the program object are passed as CLOB, rather than VARCHAR2. The parameter 'message' is a CLOB containing the message to be sent to the program object. The parameter 'answer' must be an empty CLOB, into which the response from the program object is copied.

Valid delimiters for the lines in the 'message' parameter include the normal line-termination characters for UNIX, Mac, and PC: '\n', '\r', and '\r\n'. The function properly handles all three termination cases. The returned VARCHAR2 string is delimited by UNIX line-termination '\n'.

The function returns success or failure.

Extensible Indexes:

The default installation (create.sql) creates public synonyms for the package 'ddpackage' and all of the operators. The one place where synonyms aren't allowed is in indextype specifications; indextypes must be refered to by their full names when creating indexes as a user other than c$dcischem (eg. create index <namegt; on small(smi) indextype is c$dcischem.ddblob).

All four of the extensible indexes create Oracle tables to maintain their required internal data. The ddexact, ddgraph, and ddrole indextypes use regular Oracle tables to store the data; the ddblob indextype uses an Oracle table with a single row containing BLOB data. These internal tables are maintained automatically by the index code; there is normally no need for a user or application developer to know about these auxillary tables.

There are two situations where the developer or DBA must pay attention to these auxillary tables. First, on index creation, the auxillary data tables are created in the users default tablespace. Second, if an index gets corrupted, the developer or DBA may need to delete the auxillary data table manually.

Exact lookup indextype (ddexact):

This indextype supports the exact() operator. This is the least interesting index, as it implements a simple string comparison between the indexed column and the query. In that sense, the built in BTree index within Oracle has the similar functionality. The BTree index even supports the "greater than" and "less than" operators, which this index doesn't.

The main advantage of the ddexact indextype is that it can index strings up to 4000 bytes (the maximum varchar2 length), while the internal BTree can only index strings up to approximately 40% of a block; for a database with a 2048 byte blocksize, this means that the largest indexable varchar2 string using the BTree is approximately 700 bytes. Similarly, in a future version of the cartridge this index will support the comparison of LOB datatypes; the BTree index does not.

This indextype uses an auxillary data table within Oracle to maintain its index data. This table is kept in sync with the base table in real time and strictly obeys the Oracle transaction model for all DML operations. The name of the auxillary data table is: <index name> || '_DDT'.

The index can be created on a VARCHAR2 column with the following general statement:

  SQL> create index <index name> on <table> ( <column> )
         indextype is c$dcischem.ddexact;

The valid form for the operator which uses the index is:

  SQL> select * from <table> where exact(<column>, <SMILES>) = 1;

This index makes no assumptions about the content of the indexed column. The applications are responsible for normalizing both the data in the indexed column and and queries to the exact() operator. This allows the ddexact indextype to properly handle both Unique and Absolute SMILES without complication. In the following example, the USMILES column is normalized using smi2cansmi(smiles, 0) and the ASMILES column is normalized using smi2cansmi(smiles, 1). Note also that when performing the exact() query on a column, the query is normalized using the same method as the column data.

  SQL> create table test (id number, usmiles varchar2(4000),
         asmiles varchar2(4000));

  SQL> desc test;
  Name                               Null?    Type
  --------------------------------- -------- ----------------------------
  ID                                         NUMBER
  USMILES                                    VARCHAR2(4000)
  ASMILES                                    VARCHAR2(4000)

  SQL> insert into test values (1234, smi2cansmi('Cl/C=C/Cl', 0),
                                      smi2cansmi('Cl/C=C/Cl', 1));

  SQL> select * from test;
  
        ID  USMILES   ASMILES
  --------  -------   -------
    1234    ClC=CCl   Cl/C=C/Cl

  SQL> create index usmi_index on test(usmiles) indextype 
         is c$dcischem.ddexact;
  SQL> create index asmi_index on test(asmiles) indextype 
         is c$dcischem.ddexact;

  SQL> select * from test 
         where exact(usmiles, smi2cansmi('Cl/C=C/Cl', 0)) = 1;

        ID  USMILES   ASMILES
  --------  -------   -------
    1234    ClC=CCl   Cl/C=C/Cl

  SQL> select * from test 
         where exact(asmiles, smi2cansmi('Cl/C=C/Cl', 1)) = 1;

        ID  USMILES   ASMILES
  --------  -------   -------
    1234    ClC=CCl   Cl/C=C/Cl

NOTE: Because this indextype accesses the base table during a index scan, operations which modify the base table based on the results of the index scan aren't allowed. This is similar to the case where a trigger on a column attempts to modify the column data. The message which one receives is the following:

  SQL> delete from <table> where exact(smiles, 'CCN') = 1;

  ERROR at line 1:
  ORA-29903: error in executing ODCIIndexFetch() routine
  ORA-29400: data cartridge error
  ERROR ORA-04091: table C$DCISCHEM.T1 is mutating, trigger/function may
  not see it
  C$DCISCHEM-101: CARTRIDGE ERROR
  ORA-06512: at "C$DCISCHEM.DDEXACT_IM", line 159
  ORA-06512: at line 1

In order to perform this function, one must use two steps. One option would be:

  SQL> create table <temptable> (zapcol) as
          select rowid from <table> where exact(smiles, 'CCN') = 1;
  SQL> delete from <table> where rowid in 
          (select zapcol from <temptable>)
Graph and Tautomer indextype (ddgraph):

This indextype supports the graph() and tautomer() operators. It creates a tabular index of the graph and tautomeric information (hydrogen count and net charge) for rapid retrieval. The indexed search only requires an index lookup based on the graph of the query, followed by comparison of the net charge and hydrogen count for a tautomer search.

This indextype uses a special data table within Oracle to maintain its index data. This table is kept in sync with the base table in real time and strictly obeys the Oracle transaction model for DML operations. The name of the auxillary data table is: <index name> || '_DDT'.

The index can be created on a VARCHAR2 column with the following general statement:

  SQL> create index <index name> on <table> ( <SMILES column> )
         indextype is c$dcischem.ddgraph;

The valid forms for the operators to use the index are:

  SQL> select * from <table> where graph(<SMILES column>, <SMILES>) = 1;
  SQL> select * from <table> where tautomer(<SMILES column>, <SMILES>) = 1;

Unlike the ddexact index, the ddgraph indextype always interprets the column as SMILES. Invalid SMILES are rejected. There is no concern with the ddgraph indextype as to whether the SMILES are Unique or Absolute, as stereochemical and isomeric information are discarded for the graph() and tautomer() lookups.

NOTE: As with the ddexact indextype, this indextype accesses the base table during a index scan. See the previous section on the limitations to performing modifications of the base table based on the results of an index scan.

Reaction role index (ddrole):

This indextype supports component level searches for exact molecules within reactions by role. It creates a tabular index of each component in every reaction, by role. Component level searches are implemented simply by a index lookup of the component, followed by validation of the role.

This indextype uses a special data table within Oracle to maintain its index data. This table is kept in sync with the base table in real time and strictly obeys the Oracle transaction model for all DML operations. The name of the auxillary data table is: <index name> || '_DDT'.

The index can be created on a VARCHAR2 column with the following general statement:

  SQL> create index <index name> on <table> ( <SMILES column> )
         indextype is c$dcischem.ddrole;

The valid forms for the operators to use the index are:

  SQL> select * from <table> where reactant(<SMILES column>, <SMILES>) = 1;
  SQL> select * from <table> where agent(<SMILES column>, <SMILES>) = 1;
  SQL> select * from <table> where product(<SMILES column>, <SMILES>) = 1;

If a row value contains a molecule SMILES, that row is ignored for the ddrole index. Like the ddexact index, the ddrole index makes no assumptions about the canonicalization of the SMILES in the index column. Hence, the column must be normalized by the application to the appropriate SMILES and the query on the column must match that normalization (see the explanation re. ddexact above).

NOTE: As with the ddexact indextype, this indextype accesses the base table during a index scan. See the previous section on the limitations to performing modifications of the base table based on the results of an index scan.

SMARTS and Similarity search index (ddblob):

This index caches the SMILES and fingerprints in a single binary large object for rapid searching. Substructure and similarity searches are implemented through this blob-based image of the structural data. The blob is persistant; it is automatically stored in the Oracle database and retrieved for processing. The blob is kept synchronized with the base table in real time and strictly obeys the Oracle transaction model for all DML operations. The name of the auxillary data table which contains the BLOBs is: <index name> || '_DDT'.

The index can be created on either a SMILES or fingerprint column, provided that the fingerprints are all the same size. If the index is created on a SMILES column, all six searches are implemented on that column: contains(), matches(), tanimoto(), tversky(), euclid(), and fingertest(). If the index is created on a fingerprint column, the four fingerprint-specific searches (tanimoto(), tversky(), euclid(), and fingertest()) are supported for that column.

The index creation command takes an optional parameter 'fpsize'. This value is the number of bits that the blob index will use for fingerprint creation.

When indexing a SMILES column, the blob will use the 'fpsize' parameter for the creation size for fingerprints within the internal blob index or a default value of 512 bits if the 'fpsize' is not specified.

When indexing a fingerprint column, all fingerprints within that column must match the 'fpsize' parameter. If the 'fpsize' parameter is not specified, the size of the first fingerprint encountered by the index code is used. Any fingerprints which do not match this size are rejected by the index.

The query for tanimoto(), tversky(), euclid(), and fingertest() can be either a fingerprint or a SMILES. If a SMILES, it will be used to create a normal fingerprint of the appropriate size for the index which will be used for comparison. If the query is a fingerprint, the size of the fingerprint must match the size of the fingerprint in the column (or in the blob for a SMILES index). The indexes and operators never perform automatic folding of fingerprints during searches.

The index can be created on a VARCHAR2 column with the following general statements:

  -- Default internal fingerprint size of 512 bits

  SQL> create index <index name> on 
         <table> ( <SMILES column> ) indextype is c$dcischem.ddblob;

  -- Specify internal fingerprint size of 64 bits

  SQL> create index <index name> on 
         <table> ( <SMILES column> ) indextype is c$dcischem.ddblob
         parameters ('fpsize=64');

  -- Fpsize taken from size of data in FP column (fpsize parameter, if it were
     specified, must match the size of data in the column)

  SQL> create index <index name> on 
         <table> ( <FP column> ) indextype is c$dcischem.ddblob;

The valid forms for the operators to use the index are:

  SQL> select * from <table> 
         where contains(<SMILES column>, <SMILES query>) = 1;
  SQL> select * from <table>
         where matches(<SMILES column>, <SMARTS query>) = 1;
  SQL> select * from <table>
         where tanimoto(<SMILES or FP column>, <SMILES or FP>) > 0.8;
  SQL> select * from <table>
         where tversky(<SMILES or FP column>, <SMILES or FP>, 0.5, 0.5) > 0.8;
  SQL> select * from <table>
         where euclid(<SMILES or FP column>, <SMILES or FP>) < 0.2;
  SQL> select * from <table>
         where fingertest(<SMILES or FP column>, <SMILES or FP>) = 1;

For a simple table, one need not create a fingerprint column or use the fingerprint indexes. One only needs to create an index on a SMILES column to enable all structural searching functions.

  -- We specified a preferred size for the internally generated fingerprints
  -- as 1024 bits, overriding the default of 512 bits.  These fingerprints
  -- are internal to the blob and aren't visible.

  SQL> create index asmi_blob_index on test(asmiles) indextype is
         c$dcischem.ddblob parameters ('fpsize=1024');
  
  SQL> select * from test where matches(asmiles, '[N;H1,H2]ccC=O') = 1;
  SQL> select * from test where tanimoto(asmiles, 'c1ccccc1') > 0.8;

  -- A fingerprint can be used directly as a query for tanimoto(),
  -- euclid(), fingertest() and tversky() instead of a SMILES.  The 
  -- fingerprint size must match the size of the fingerprints used in
  -- index creation.  In this case, 1024 bits.

  SQL> select * from test where 
         tanimoto(asmiles, smi2fp('OC(=O)CS', 0, 7, 1024)) > 0.8;

If one wants structural searches on multiple SMILES columns within a single table, then an index must be created on each SMILES column in the table. Furthermore, we can create and index fingerprint columns also. The index automatically will adjust fingerprint operations based on the size of the data in the fingerprint column. So, for example, if a column of fingerprints is added to our test table and populated:

  SQL> alter table test add (fp varchar (15));
  SQL> update test set fp = smi2fp(usmiles, 0, 7, 64);

  -- This index, on the FP column, uses the size of the fingerprints in 
  -- the column (64 bits, in this case) in index creation.

  SQL> create index fp_blob_index on test(fp) indextype is
         c$dcischem.ddblob;

  -- As with an index on a SMILES column, a SMILES query or a fingerprint
  -- can be used as a query for tanimoto(), euclid(), fingertest() and
  -- tversky().

  SQL> select smi2fp('OC(=O)CS', 0, 7, 64) from dual;

  SMI2FP('OC(=O)CS',0,7,64)
  --------------------------
  IM9vA4w127g.2

  -- The following two queries give identical results.

  SQL> select * from test where tanimoto(fp, 'IM9vA4w127g.2') > 0.8;
  SQL> select * from test where tanimoto(fp, 'OC(=O)CS') > 0.8;

There are some additional sublties explained using the following example queries:

  -- This screens and searches using the index; internally it uses 1024 bit
  -- fingerprints.  The output is generated using the functional form, at 
  -- its default of 512 bits.  Remember that the functional forms will use a
  -- fingerprint size of 512 bits unless one or both of the parameters is a 
  -- fingerprint.  The tanimoto values output by the functional form may not
  -- match our search criteria based on the folding-related differences
  -- between 1024-bit and 512-bit fingerprints.

  SQL> select smiles, tanimoto(asmiles, 'NCCc1ccccc1') from test
         where tanimoto(asmiles, 'NCCc1ccccc1') > 0.8;

  -- This screens and searches using the index; internally using 1024 bit
  -- fingerprints.  The output is generated using the functional form; this
  -- time calculated at 1024 bits.  By passing a fingerprint of 1024 bits
  -- to the functional form of tanimoto(), we force its use of 1024 bits.
  -- In this case, the tanimoto values output by the functional form will
  -- exactly match those used by the index.

  SQL> select smiles, tanimoto(asmiles, smi2fp('NCCc1ccccc1', 0, 7, 1024))
         from test where tanimoto(asmiles, 'NCCc1ccccc1') > 0.8;

  -- This searches using the index f1, using the 128 bit fingerprints from
  -- the actual column for comparison.  Again, the tanimoto values computed
  -- by the functional form are calculated at 512 bits and will likely not
  -- match those used by the index for the query.

  SQL> select smiles, tanimoto(fp, 'NCCc1ccccc1') from test
         where tanimoto(fp, 'NCCc1ccccc1') > 0.8;

Using Program Objects:

This section describes the configuration and use of program objects from within Oracle.

Program objects are stand-alone executable programs which communicate via standard input and standard output using the "Pipetalk" protocol (See the Daylight Theory Manual for more details). The cartridge supports the ability to execute external program objects on the same machine as the Oracle server, and to communicate with the program objects in a robust, general way.

The table c$dcischem.progob is the dictionary of program objects available to the Oracle server. In order to use a program object, it must be defined in this dictionary table.

  SQL>  desc progob
   Name                                      Null?    Type
   --------------------------------------- -------- -------------------------
   NAME                                               VARCHAR2(60)
   PATH                                               VARCHAR2(4000)
   ARGS                                               VARCHAR2(4000)

For example, consider the clogp program object. The code for the clogp example is included in the CONTRIB subdirectory. It can be defined to the system with:

  SQL> insert into progob values
     ('clogp', '/oracle/o815/progob/clogptalk.sh', NULL);

This defines a program object with the symbolic name 'clogp'. It corresponds to the absolute path: '/oracle/o815/progob/clogptalk.sh', which must be an executable program object. In this case, the arguments to the program object (whitespace delimited) are NULL.

The file '/oracle/o815/progob/clogptalk.sh' looks like the following:

  #!/bin/sh

  DY_ROOT=/usr3/thor/day462/v462
  export DY_ROOT

  DY_LICENSEDATA=/usr/local/daylight/dy_license.dat
  export DY_LICENSEDATA

  LD_LIBRARY_PATH=/opt/SUNWspro/SC2.0:/usr/openwin/lib:/usr/lib:$DY_ROOT/lib
  export LD_LIBRARY_PATH

  $DY_ROOT/bin/clogptalk

When Oracle executes the program object, the environment is empty. Hence, all required environment variables must be defined in a shell wrapper before calling the actual program object.

Since program objects are initially executed under the Oracle userid, a strict security policy must be obeyed to prevent access to unauthorized Oracle privileges. The policy, checked and enforced by the cartridge code before executing any external program object, is the following:

  1. If the program object is executable and SUID, then it is OK. (SUID requires privileges of the destination UID, so this means that whoever turned on the SUID bit has either root or owner privilege on the file.
  2. If the program object is executable by and owned by Oracle, and the directory is not group or world writeable, then OK.
  3. Otherwise, fail.

Running program objects are persistant on the server, and are executed on a per-user-session basis. The first time a program object is accessed within a user session it is started, and the program object will continue to run until the user session is terminated. If the program object crashes during the user session, the cartridge will restart it and retry the transaction. Because of the ability to restart program objects, program objects ideally should be stateless. All data required for a transaction should be passed to the program object within that transaction.

Program object communication through the ddpackage.fprogob() function is via VARCHAR2 or CLOB datatypes. The string types can contain one or more delimited lines of data. Because of this line-oriented data communication, most program objects will require a PL/SQL wrapper function to convert from Oracle datatypes to line-oriented data and to parse returned results. Continuing the clogp example, consider a simple function which will calculate clogp:

  create function fclogp (sosdata in varchar2) return number
   as
    v1 varchar2(4000);
    rc number;
    off1 number;
    off2 number;
   begin
    v1 := c$dcischem.ddpackage.fprogob('clogp', sosdata);
    off1 := instr(v1, ' ', 1, 1);
    off2 := instr(v1, ' ', 1, 2);
    rc := to_number(substr(v1, off1, off2 - off1));
    return rc;
   end;

Executing the fprogob() function directly using the clogptalk program object results in a full line of data being returned from the program object for each SMILES:

  SQL> select ddpackage.fprogob('clogp', 'c1ccccc1') from dual;

  DDPACKAGE.FPROGOB('CLOGP','C1CCCCC1')
  -----------------------------------------------------------------------------
  c1ccccc1 2.142 0 LogPstar: 2.13

The wrapper function fclogp() simply parses out the second field, converts it to a number, and returns it. The wrapper function hides the details from the user and allows the function to return the desired numeric value.

  SQL> select fclogp('c1ccccc1') from dual;

  FCLOGP('C1CCCCC1')
  ------------------
               2.142

More complex examples (eg. MOLFILE -> SMILES conversion) are included in the CONTRIB directory.

Optimizer Support.

The cartridge will provide estimates of the costs of extensible index queries to the Oracle optimizer. The Oracle optimizer uses these cost estimates for its decision making with respect to the execution plan for a given SQL query. The execution plan chosen by Oracle can have a dramatic effect on the time and resources required to perform a particular query. The cartridge has the ability to provide selectivity and resource estimates of varying precision to Oracle, depending upon the configuration of the database index.

The default cartridge installation includes the code for performing cost estimates, but does not enable the cartridge optimizer code. All costs for index queries default to zero. Any Oracle query will preferrentially execute the extensible index query first.

In order to activate the optimizer code for the Daylight cartridge, one must execute the following commands:

  SQL> connect c$dcischem/secret
  SQL> associate statistics with packages ddpackage using ddoptimizer;
  SQL> associate statistics with indextypes ddexact using ddoptimizer;
  SQL> associate statistics with indextypes ddgraph using ddoptimizer;
  SQL> associate statistics with indextypes ddrole using ddoptimizer;
  SQL> associate statistics with indextypes ddblob using ddoptimizer;

Once the optimizer code is enabled, one can view the cost estimates provided to Oracle and the execution plan chosen using the "explain plan" command in SQL.

The command "analyze index <indexname> compute statistics" will collect and store information about the size of the index table for the ddgraph, ddrole, and ddexact indextypes. This size information will then be used to refine the cost estimates provided to the Oracle optimizer. The ddblob indextype does not require a separate "analyze index" command; the blob index always knows its exact size.

Note that the cost and selectivity estimates are not going to be reasonable in all cases; developers may find that forcing a particular execution order will result in better query performance. Oracle SQL supports a number of hints which allow the developer to control the execution order. See the section on "Optimizer Hints" in the Oracle SQL Reference manual for more details.

It can also be useful to specify the functional form of a query in order to eliminate consideration of the index path for the execution plan. Consider the following query:

  SQL> select * from <table> where
          graph(smiles, 'c1ccccc1') = 1 and
          contains(smiles, 'C') = 1;

By default, (assuming that both required indexes exist) this query will use indexes for both the graph and contains search, and then will merge the results. In most cases, it would be much more efficient to perform the graph index search, then perform the functional version of contains() on the handful of hits returned from the graph search. The following query will force the contains() operation to occur functionally:

  SQL> select * from <table> where
          graph(smiles, 'c1ccccc1') = 1 and
          ddpackage.fcontains(smiles, 'C') = 1;

In order to disable the cartridge optimizer code, any computed statistics must be dropped with "analyze index <indexname> delete statistics". Then, The cartridge optimizer code can be disabled with the following commands:

  SQL> connect c$dcischem/secret
  SQL> disassociate statistics with indextypes ddblob using ddoptimizer;
  SQL> disassociate statistics with indextypes ddgraph using ddoptimizer;
  SQL> disassociate statistics with indextypes ddrole using ddoptimizer;
  SQL> disassociate statistics with indextypes ddexact using ddoptimizer;
  SQL> disassociate statistics with packages ddpackage using ddoptimizer;

Once the cartridge optimizer code is disabled, cost estimates revert to their default values of zero cost.

Error Codes:

This section lists the possible error codes returned from the cartridge and diagnositic actions:

Tuning Hints:

This section contains notes on Cartridge Tuning. Currently they aren't particularly well organized.