Daylight Chemistry Cartridge, Version 4.62.10d (beta)

Jack Delany

DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA


PL/SQL Functions, SQL Operators:

There are quite a few user-accessible PL/SQL functions and SQL operators implemented for the cartridge. All are stateless. All of the PL/SQL functions are contained in a PL/SQL package called 'DDPACKAGE'. This package is owned by 'DDUSER' by default and has public execute privilege.

General Purpose Functions:

function ddpackage.fsetdebug ( value IN NUMBER ) => NUMBER
operator setdebug ( value IN NUMBER ) => NUMBER

Controls the level of detail of messages logged to the file: "/tmp/extproc.log". Sets the new value of the logging level to 'value'. Returns the prior value. Valid values are integers the range of 0 - 9.

  0 - No logging at all.
  1 - Error messages only (the default value).
  5 - Warnings and errors.
  9 - Notes, warnings, and errors.

This value is set on a per-session basis.

function ddpackage.ftestlicense ( product IN VARCHAR2 ) => NUMBER
operator testlicense ( product IN VARCHAR2 ) => NUMBER

Checks the cartridge license. Takes an input parameter, which is the license type to check. [ In the current release, the only possible value is 'daycart' ]. Returns 1 if the Daylight cartridge has a valid license, and 0 otherwise.

The license is contained in a special table which must be created and populated at cartridge install time:

SQL> desc dduser.license;

 Name                            Null?    Type
 ------------------------------- -------- ----------------------------
 PRODUCT                                  VARCHAR2(40)
 KEY                                      VARCHAR2(32)
 EXPIRATION                               DATE

SQL> Select * from dduser.license;

 PRODUCT       KEY                               EXPIRATION
 ------------- --------------------------------- ----------
 daycart       '8befd3d06296e9b89abfb158b431d278 01-JAN-00

The normal cartridge installation script will create this table. Upon optaining a license for the cartridge, you will populate the table with one or more rows of data, depending on the licensed products which you purchase.

Molecule / Reaction functions:

function ddpackage.fsmi2cansmi ( smiles IN VARCHAR2, type IN NUMBER ) => VARCHAR2
operator smi2cansmi ( smiles IN VARCHAR2, type IN NUMBER ) => VARCHAR2

Canonicalize an input SMILES. Returns a canonical SMILES string. Type is either 0 or 1, for unique or absolute SMILES, respectively. [This might be better as two functions, eg; smi2asmi(), smi2usmi() without the type argument]

function ddpackage.fsmi2netch ( smiles IN VARCHAR2 ) => NUMBER
operator smi2netch ( smiles IN VARCHAR2 ) => NUMBER

Return the net charge of the molecule or reaction.

function ddpackage.fsmi2hcount ( smiles IN VARCHAR2 ) => NUMBER
operator smi2hcount ( smiles IN VARCHAR2 ) => NUMBER

Return the total hydrogen count for the molecule or reaction.

function ddpackage.fsmi2mf ( smiles IN VARCHAR2 ) => VARCHAR2
operator smi2mf ( smiles IN VARCHAR2 ) => VARCHAR2

Returns the molecular formula string for the molecule or reaction.

function ddpackage.fsmi2amw ( smiles IN VARCHAR2 ) => NUMBER
operator smi2amw ( smiles IN VARCHAR2 ) => NUMBER

Return the average molecular weight for a molecule or reaction.

function ddpackage.fsmi2graph ( smiles IN VARCHAR2 ) => VARCHAR2
operator smi2graph ( smiles IN VARCHAR2 ) => VARCHAR2

Return the hydrogen- and charge-suppressed canonical graph string for a given molecule or reaction.

Fingerprint functions:

function ddpackage.fsmi2fp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2
operator smi2fp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2

Returns the ASCII fingerprint for a given molecule or reaction. Min and max are the minimum and maximum pathlengths, respectively, and size is the number of bits in the fingerprint.

function ddpackage.fsmi2xfp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2
operator smi2xfp ( smiles IN VARCHAR2, min IN NUMBER, max IN NUMBER, nbits IN NUMBER ) => VARCHAR2

Return the ASCII difference fingerprint for a given molecule or reaction. Min and max are the minimum and maximum pathlengths, respectively, and size is the nubmer of bits in the fingerprint.

function ddpackage.ffoldfp ( fpstr IN VARCHAR2, nbits IN NUMBER, dens IN NUMBER ) => VARCHAR2
operator foldfp ( fpstr IN VARCHAR2, nbits IN NUMBER, dens IN NUMBER ) => VARCHAR2

Fold the given fingerprint to the minimum appropriate size or density, whichever is limiting, and returns the new, folded fingerprint.

function ddpackage.fbitcount ( fpstr IN VARCHAR2 ) => NUMBER
operator bitcount ( fpstr IN VARCHAR2 ) => NUMBER

Return the number of bits on in the fingerprint.

function ddpackage.fnbits ( fpstr IN VARCHAR2 ) => NUMBER
operator nbits ( fpstr IN VARCHAR2 ) => NUMBER

Return the total size of the fingerprint, in bits. In the current Daylight toolkit this will always be a power of two.

function ddpackage.fisfp ( fpstr IN VARCHAR2 ) => NUMBER
operator isfp ( fpstr IN VARCHAR2 ) => NUMBER

Return 1 if the string is a fingerprint, 0 otherwise. The syntax of a fingerprint can never be confused with a valid SMILES. This is the bit of cleverness which allows us to overload the searching functions.

function ddpackage.fatomnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2
operator atomnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2
function ddpackage.fbondnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2
operator bondnorm ( smiles IN VARCHAR2, list IN VARCHAR2, ntuple IN NUMBER, isotype IN NUMBER) => VARCHAR2

Returns a potentially reordered N-tuple string for the given list VARCHAR2 input parameter. The list string is reordered based on the canonical atom or bond ordering of the input SMILES. ntuple is the number of comma-separated values per atom or bond, and isotype is 0 for unique SMILES canonicalization and 1 for absolute SMILES canonicalization.

Comparison functions:

function ddpackage.fexact ( a IN VARCHAR2, b IN VARCHAR2) => NUMBER
operator exact ( a IN VARCHAR2, b IN VARCHAR2) => NUMBER

Returns 1 if the two input strings are identical, 0 otherwise. The operator is optionally backed by the ddexact indextype.

function ddpackage.fgraph ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator graph ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if the two input SMILES share the same canonical graph, 0 otherwise. The operator is optionally backed by the ddgraph indextype.

function ddpackage.ftautomer ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator tautomer ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if the two input SMILES share the same canonical graph, net charge, and total hydrogen count, 0 otherwise. The operator is optionally backed by the ddgraph indextype.

function ddpackage.freactant ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator reactant ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
function ddpackage.fagent ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator agent ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
function ddpackage.fproduct ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER
operator product ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2 ) => NUMBER

Returns 1 if smiles2 (a molecule SMILES) is a component of smiles1 (a reaction SMILES) with the appropriate role, otherwise returns 0. The operators are optionally backed by the ddrole indextype.

function ddpackage.fcontains ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2) => NUMBER
operator contains ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2) => NUMBER

Returns 1 if smiles1 contains smiles2; that is, smiles2, assuming opened valences for all hydrogens, is a substructure of smiles1. The operator is optionally backed by the ddblob indextype.

function ddpackage.fmatches ( smiles IN VARCHAR2, smarts IN VARCHAR2) => NUMBER
operator matches ( smiles IN VARCHAR2, smarts IN VARCHAR2) => NUMBER

Returns 1 if the smarts expression matches the given SMILES, 0 otherwise. The operator is optionally backed by the ddblob indextype.

function ddpackage.feuclid ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER
operator euclid ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER

Return the euclidean distance between two fingerprints or SMILES. If the fingerprints are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. If either parameter is a SMILES, the fingerprint is generated automatically. The returned value is a floating point number between 0.0 and 1.0. This is optionally backed by the ddblob indextype.

function ddpackage.ftanimoto ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER
operator tanimoto ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER

Return the tanimoto distance between two fingerprints or SMILES. If the fingerprints are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. If either parameter is a SMILES, the fingerprint is generated automatically. The returned value is a floating point nubmer between 0.0 and 1.0. This is optionally backed by the ddblob indextype.

function ddpackage.ftversky ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2, alpha IN NUMBER, beta IN NUMBER ) => NUMBER
operator tversky ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2, alpha IN NUMBER, beta IN NUMBER ) => NUMBER

Return the tversky distance between two fingerprints or SMILES. If the fingerprints are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. If either parameter is a SMILES, the fingerprint is generated automatically. The parameters alpha and beta must be in the range of 0.0 - 1.0. The returned value is a floating point nubmer between 0.0 and 1.0. This is optionally backed by the ddblob indextype.

function ddpackage.ffingertest ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER
operator fingertest ( fp_or_smi1 IN VARCHAR2, fp_or_smi2 IN VARCHAR2 ) => NUMBER

Returns 1 if all of the bits in fp_or_smi2 are also present in fp_or_smi1. That is, the fingerprint from fp_or_smi2 represents a possible substructure of fp_or_smi1. If the fingerprints are not the same size (nbits()), then the larger will be folded automatically to match the size of the smaller before comparison. If either parameter is a SMILES, the fingerprint is generated automatically. The returned value is either 0 or 1. NOTE: the order of fingerprints for the comparison in this function is reversed from the order in the Daylight Toolkit. This is optionally backed by the ddblob indextype.

Extensible Indexes:

Exact lookup indextype (ddexact):

This indextype supports the exact() operator. This is the least interesting index, as it implements a simple string comparison between the indexed column and the query. In that sense, the built in BTree index within Oracle has the similar functionality. The BTree index even supports the "greater than" and "less than" operators, which this index doesn't.

The main advantage of the ddexact indextype is that it can index strings up to 4000 bytes (the maximum varchar2 length), while the internal BTree can only index strings up to approximately 40% of a block; for a database with a 2048 byte blocksize, this means that the largest indexable varchar2 string using the BTree is approximately 700 bytes. Similarly, in a future version of the cartridge this index will support the comparison of LOB datatypes; the BTree index does not.

The valid form for this operator to use the index is:

SQL> select * from table where exact(smiles, 'Oc1ccccc1') = 1;

Graph and Tautomer indextype (ddgraph):

This indextype supports the graph() and tautomer() operators. It creates a tabular index of the graph and tautomeric information (hydrogen count and net charge) for rapid retrieval. The indexed search only requires an index lookup based on the graph of the query, followed by comparison of the net charge and hydrogen count for a tautomer search.

The valid forms for the operators to use the index are:

SQL> select * from table where graph(smiles, 'Oc1ccccc1') = 1;
SQL> select * from table where tautomer(smiles, 'Oc1ccccc1') = 1;

Reaction role index (ddrole):

This indextype supports component level searches for exact molecules within reactions by role. It creates a tabular index of each component in every reaction, by role. Component level searches are implemented simply by a index lookup of the component, followed by validation of the role.

The valid forms for the operators to use the index are:

SQL> select * from rxn where reactant(smiles, 'Oc1ccccc1') = 1;
SQL> select * from rxn where agent(smiles, 'Oc1ccccc1') = 1;
SQL> select * from rxn where product(smiles, 'Oc1ccccc1') = 1;
SMARTS and Similarity search index (ddblob):

This index caches the SMILES and fingerprints in a single binary large object for rapid searching. Substructure and similarity searches are implemented through this blob-based image of the structural data. The blob is persistant; it is automatically stored in the Oracle database and retrieved for processing. The blob is kept synchronized with the base table in real time and strictly obeys the Oracle transaction model.

The index can be created on either a SMILES or fingerprint column, provided that the fingerprints are all the same size. If the index is created on a SMILES column, all six searches are implemented on that column: contains(), matches(), tanimoto(), tversky(), euclid(), and fingertest(). If the index is created on a fingerprint column, then only the four fingerprint-specific searches (tanimoto(), tversky(), euclid(), and fingertest()) are supported for that column.

The index creation command takes an optional parameter 'fpsize'. This value is the number of bits that the blob index will use for fingerprint creation.

When indexing a SMILES column, the blob will use the 'fpsize' parameter for the creation size for fingerprints within the internal blob index or a default value of 512 bits if the 'fpsize' is not specified.

When indexing a fingerprint column, all fingerprints within that column must match the 'fpsize' parameter. If the 'fpsize' parameter is not specified, the size of the first fingerprint encountered by the index code is used. Any fingerprints which do not match this size are rejected by the index.

The query for tanimoto(), tversky(), euclid(), and fingertest() can be either a fingerprint or a SMILES. If a SMILES, it will be used to create a normal fingerprint of the appropriate size for the index which will be used for comparison. If the query is a fingerprint, the size of the fingerprint must match the size of the fingerprint in the column (or in the blob for a SMILES index).

The valid forms for the operators to use the index are:

select count(1) from table where contains(smiles, 'O=c1ccocc1') = 1;
select count(1) from table where matches(smiles, 'O=c1ccocc1') = 1;

select count(1) from medium where tanimoto(smiles, 'OC(=O)CS') > 0.8;
select count(1) from medium where tversky(smiles, 'OC(=O)CS', 0.5, 0.5) > 0.8;
select count(1) from medium where euclid(smiles, 'OC(=O)CS') < 0.2;
select count(1) from medium where fingertest(smiles, 'OC(=O)CS') = 1;

select count(1) from medium where tanimoto(fp, 'OC(=O)CS') > 0.8;
select count(1) from medium where tversky(fp, 'OC(=O)CS', 0.5, 0.5) > 0.8;
select count(1) from medium where euclid(fp, 'OC(=O)CS') < 0.2;
select count(1) from medium where fingertest(fp, 'OC(=O)CS') = 1;

To further illustrate the blob index, consider the following sequence:


create table test (smiles varchar2(4000), fp varchar2(4000));

insert into test values (smi2cansmi('CCN', 0), NULL);

... load bunches of SMILES...

commit;

update test
  set fp = smi2fp(smiles, 0, 7, 128);

commit;

-- We specified a preferred size for the internally generated fingerprints
-- as 1024 bits.  These are internal to the blob and aren't visible.

create index s1 on test(smiles) indextype is ddblob parameters ('fpsize=1024');

create index f1 on test(fp) indextype is ddblob;

commit;



select count(1) from test where contains(smiles, 'NCCc1ccccc1') = 1;

-- This screens and searches using the index s1; internally using 1024 bit
-- fingerprints.  The output is generated using the functional form, at 
-- its default of 512 bits.

select smiles, tanimoto(smiles, 'NCCc1ccccc1') from test
  where tanimoto(smiles, 'NCCc1ccccc1') > 0.8;



-- This screens and searches using the index s1; internally using 1024 bit
-- fingerprints.  The output is generated using the functional form; this
-- time calculated at 1024 bits.

select smiles, tanimoto(smiles, smi2fp('NCCc1ccccc1', 0, 7, 1024)) from test
  where tanimoto(smiles, 'NCCc1ccccc1') > 0.8;

-- This searches using the index f1, using the 128 bit fingerprints from
-- the actual column for comparison.

select smiles, tanimoto(fp, 'NCCc1ccccc1') from test
  where tanimoto(fp, 'NCCc1ccccc1') > 0.8;

-- This does two similarity searches at once, using the two different
-- indexes.

select * from test where contains(smiles, 'Oc1ccccc1') = 1 and
  tanimoto(fp, 'Oc1ccccc1F') > 0.5;