Virtual Chemical Stores

The idea of a Virtual Chemical Stores, VCS, was introduced to describe a chemical data warehouse in which data, from a whole collection of chemical structure containing sources, were merged into a common format for ease of searching. The thor/merlin version has been available for some time, but the release of Daylight 4.82 sees the introduction of tools to run within DayCart to help users build such a data warehouse within the Oracle environment.

Whilst the original concept was aimed at storing data from external compound suppliers, it equally can be applied to any database containing structures and data about those structures. Such an example database would be the NCI Aids database. An example of how to build a version of this database within DayCart is described below.

In DayCart 4.82 two new supported sql functions for SMILES manipulation are supplied:-

vcs_normalize ( smiles, iso, class ), which works in conjunction with an in-built table, transform, to convert the input SMILES into a SMILES of a normalized valence bond representation using business rules represented as SMIRKS in the table transform. Such a SMIRKS would be

[O-:1][*+;#7;v4:2]>>[O+0:1]=[*+0;#7;v5:2]

representing the normalization of nitro groups and N-oxides either to the charged 4-valent nitrogen form or the 5-valent neutral form. The business rules can be grouped by class so they can be applied selectively. All of the transforms in the class are applied sequentially. Users should attempt to make the SMIRKS independent of other SMIRKS in the class, otherwise the result will be order dependent. This is the only vcs_*() function which is critical to the implimentation of this model.

vcs_desalt (smiles, iso, class ), which works in conjunction with an in-built table, salt, to convert the input multicomponent SMILES into a SMILES in which the components defined as salts in the salt table have been removed. Again these removable components, salts, solvents etc can be classified to allow users to be selective about removal, according to business rules.

When users upgrade DayCart to version 4.82, empty copies of these tables are generated. We provide an example sql file, normal_vcs.sql, to load business rules into the transform table and salts into the salt table. The key contents of this file for the transform ( business rules) section are

insert into transform values ('Deprotonate',
                '[*+;$([!#6;!$([+]~[-])]):1][H]>>[*+0:1]',
                'FORWARD',
                0,
                'Deprotonation of cations to neutral species where possible');

insert into transform values ('Protonate',
                '[*-;$([!#6;!$([-]~[+]);!$([v4;B,Fe]);!$([v5;S,Se,Te,Sn]);!$([v6;P,As,Sb])]):1]>>[*+0:1][H]',
                'FORWARD',
                0,
                'Protonation of anions to neutral species where valence allows' );

insert into transform values ('Nitros',
                '[O-:1][*+;#7;v4:2]>>[O+0:1]=[*+0;#7;v5:2]',
                'REVERSE',
                0,
                'Nitro groups and N-oxides including aromatics' ||
                'Business rules decide whether to allow 5-valent' ||
                'uncharged nitrogen');

insert into transform values ('P-oxides',
                '[O-:1][P+;v4:2]>>[O+0:1]=[P+0;v5:2]',
                'FORWARD',
                0,
                'P-oxides excluding aromatics' ||
                'Business rules decide whether to prefer 5-valent' ||
                'uncharged phosphorous. It would be unlikely' ||
                'this transform would be run in reverse.');

insert into transform values ('Azides',
                '[N-;X1:1]=[N+;v4:2]>>[N+0:1]#[N+0;v5:2]',
                'REVERSE',
                0,
                'Diazo alkanes and ketones, azides' ||
                'Business rules decide whether to allow 5-valent' ||
                'uncharged nitrogen');


insert into transform values ( 'S-Oxides',
                '[O-:1][*+;X3;$([#16]([#6])[#6]):2]>>[O+0:1]=[*+0;X3;$([#16]([#6])[#6]):2]',
                'FORWARD',
                0,
                'S-oxides including aromatics, but excluding oxyacids of sulphur' ||
                'Business rules decide whether to prefer 4-valent' ||
                'uncharged sulphur');


insert into transform values ('Ionic',
                '[*;$([Li,Na,K,Rb]):1][*;!#1:2]>>[*+:1].[*:2][H]',
                'FORWARD',
                0,
                'Ensure that there are no covalent bonds to Group I metals'||
                'MH is left alone');

commit;

The salts are loaded from the same file, only common examples are given. Users almost certainly will wish to extend this table. Note the use of the normalizing functions to ensure that
vcs_normalize(vcs_desalt(smiles,iso, class), iso , class )

returns the same SMILES as
vcs_desalt(vcs_normalize( smiles, iso, class), iso, class).

The SMILES returned from sequential application of vcs_desalt() and vcs_normalize() has been referred to as the ( isomeric ) parent in earlier documents. Note this parent structure may still be multicomponent, a mixture of xylenes, for instance, Cc1ccc(C)cc1.Cc1cccc(C)c1.Cc1ccccc1C. Whilst the parent is algorithmically defined as the non-salt component(s) of a SMILES, its value will be instance dependent. Care needs to be exercised in searching for components in mixtures, see below.
-- Add normalized version too. vcs_normalize ( smiles, iso, class ) 
-- Use standard class 0
--
-- These will not normalize
--
insert into salt values ('Lithium',   '[Li+]', 0, NULL);
insert into salt values ('Sodium',    '[Na+]', 0, NULL);
insert into salt values ('Potassium', '[K+]',  0, NULL);
insert into salt values ('Rubidium',  '[Rb+]', 0, NULL);
insert into salt values ('Cesium',    '[Cs+]', 0, NULL);
--
-- These will normalize
--
insert into salt values ('Ammonium',   '[NH4+]',
                                0, 'Cationic');
insert into salt values ('Ammonium', vcs_normalize( '[NH4+]',1 ,0 ),
                                0, 'Normalized neutral');
--
insert into salt values ('Fluoride',    '[F-]',
                                0, 'Anionic');
insert into salt values ('Fluoride', vcs_normalize( '[F-]',1 ,0 ),
                                0, 'Normalized neutral' );
insert into salt values ('Chloride',    '[Cl-]',
                                0, 'Anionic');
insert into salt values ('Chloride', vcs_normalize( '[Cl-]',1 ,0 ),
                                0, 'Normalized neutral' );
insert into salt values ('Bromide',     '[Br-]',
                                0, 'Anionic');
insert into salt values ('Bromide', vcs_normalize('[Br-]',1 ,0 ),
                                0, 'Normalized neutral' );
insert into salt values ('Iodide',      '[I-]',
                                0, 'Anionic');
insert into salt values ('Iodide', vcs_normalize('[I-]',1 ,0 ),
                                0, 'Normalized neutral');
--
insert into salt values ('Perchlorate', 'O=Cl(=O)(=O)[O-]',
                                0, 'Anionic');
insert into salt values ('Perchlorate', vcs_normalize( 'O=Cl(=O)(=O)[O-]',1 ,0 ),
                                0, 'Normalized neutral' );
--
-- Add additional states for poly oxyanions to allow the functions 
-- vcs_desalt() and vcs_normalize() to be commutative
-- These extra rows are unnecessary if
-- vcs_desalt ( vcs_normalize(smiles, 1, 0), 1, 0)
-- is used.
--

insert into salt values ('Sulphate', 'O=S(=O)([O-])[O-]',
                                0, 'Anionic');
insert into salt values ('Sulphate', vcs_normalize( 'O=S(=O)([O-])[O-]',1 ,0 ),
                                0, 'Normalized neutral' );
insert into salt values ('Sulphate', 'O=S(=O)(O)[O-]',
                                0, 'Anionic' );
--
insert into salt values ('Phosphate', 'O=P([O-])([O-])[O-]',
                                0, 'Anionic');
insert into salt values ('Phosphate', vcs_normalize( 'O=P([O-])([O-])[O-]',1 ,0 ),
                                0, 'Normalized neutral' );
insert into salt values ('Phosphate', 'O=P(O)([O-])[O-]',
                                0, 'Anionic' );
insert into salt values ('Phosphate', 'O=P(O)(O)[O-]',
                                0, 'Anionic' );
--
commit;

These functions can clearly be used independently of any data warehouse. An obvious place for them to be used is passing normalized parent structures to the physical property generator dayprop, which is also part of the 4.82 release.

Building a data warehouse

In addition to the script normal_vcs.sql described above, users can download an archive of scripts which can be used to build a vcs model within Oracle. Whilst this is a fully working model, users are encouraged to modify this model or create their own to more accurately reflect the way their data are searched. The archive is supplied in a directory ./vcs_scripts

$ > cd vcs_scripts
$ > ls -X
INSTALLER*              du_mol2smi.c        du_mol2smi.h
makefile                mol2smi.c           clean_vcs.sql
tdt2vcs_sql*            vcs_coords.ctl      create_vcs_indexes.sql
vcs_builder*            vcs_data.ctl        create_vcs.sql
vcs_env                 vcs_extra_data.ctl  drop_vcs_indexes.sql
vcs_extra_data_loader*  vcs_main.ctl        normal_vcs.sql
vcs_loader*             vcs_supplier.ctl

This archive contains

Once the schema have been created and the salt and transform tables populated, only the vcs_builder and vcs_loader programs need to be run to convert and load sd files. Both these scripts write and read from the local directory so it is essential each source sd file is in a separate directory.

VCS - Quick Start

For more information about each of these steps, please read on. Remember this is a regular Oracle installation, all the usual criterea about dropping indexes to load large data sets apply. A sql script drop_vcs_indexes.sql is provided to do this. Users are reminded to keep the create_vcs_indexes and drop_vcs_indexes scripts synchronized. The data model is built so that the table constraints can be checked for violations by loading the files into a holding area. With the exception of the VCS_SUPP value all uniqueness constraints are within a data set. This allows the constraints on the production tables to be turned off during a load, to further speed the process.

VCS - The Details

This is a more detailed explanation than the Quick Start above for those who wish more background or to tailor the defaults to their own environments ( recommended )

  1. Ensure that you have at least a 4.82 version of DayCart running.
    Within a SQL*Plus session do the following:
    SQL> select  ddpackage.fversion('daycart') from dual;
    Should return something like
    DDPACKAGE.FVERSION('DAYCART')
    --------------------------------------------------------------------------------
    4.82
    
    
    If the version returned is not 4.82 or greater, install the appropriate upgrade. If you have the correct version is also worth checking that the extra commands have been installed as there is a shortcut to the upgrade procedure which does not install these commands
    SQL> select vcs_normalize('c1ccccc1N(=O)=O', 0, 0 ) from dual;
    If it returns
    VCS_NORMALIZE('C1CCCCC1N(=O)=O',0,0)
    --------------------------------------------------------------------------------
    
    
    then the commands are installed, but not the default business rules, this can be rectified later.
    If it returns
    VCS_NORMALIZE('C1CCCCC1N(=O)=O',0,0)
    --------------------------------------------------------------------------------
    [O-][N+](=O)c1ccccc1
    
    
    then both the commands and the default set of business rules are in place.
    If it returns an unhelpful Oracle error message like
    select vcs_normalize('c1ccccc1N(=O)=O', 0, 0 ) from dual
           *
    ERROR at line 1:
    ORA-00904: invalid column name
    
    then the shortcut installation has been performed and will need to be rectified.
     
  2. Having checked that you are running the correct version of DayCart, download the vcs scripts from the Daylight web site. In an appropriate directory unzip and untar the download file.
    $ > gzip -c -d ./vcs_scripts.tgz | tar xvf -
    
    The contents of this directory have been described above. These scripts have been designed to be run by a user with no special privileges, therefore it is necessary to edit the vcs_env file to reflect the local environment for both a Daylight and an Oracle user. There are three directories.
    $VCS_BIN for the executables,
    $VCS_CTL for the SQL*Loader control files
    $VCS_MAN for the SQL*Plus scripts
    which control the environment of the warehouse.  
    The VCS_BIN, VCS_CTL and VCS_MAN variables can be set to any directory or set of directories, including $PWD i.e. don't move them.
    $ > cd vcs_scripts
    $ > vi vcs_env 
    $ > ./INSTALLER
    
  3. If the business rules were not in place in item 1 above then you will need to install them. If you wish to add your own rules then edit this file first. It is recommended that you read the comments in this file as to the use of the class variable. Almost certainly you will wish to add to the salts table. Again note the use of the vcs_normalize() function. To install the rules, in a SQL*Plus session
    SQL> @$VCS_MAN/normal_vcs.sql
  4. Now install the vcs schema. Whilst in a SQL*Plus session
     SQL> @$VCS_MAN/create_vcs.sql
    The contents of the file can be seen here, but are worth going through in detail as this is the file which creates the data model and is most likely the one to be customized. There are four tables
    VCS_MAIN
    VCS_DATA
    VCS_COORDS
    VCS_SUPPLIER
    -- VCS_MAIN contains
    --
    --      VSMILES which is the isomeric SMILES of the sample without the 
    --      application of business rules, usually a direct conversion from
    --      an SD file with mol2smi().
    --
    --      NSMILES which is the normalized SMILES according to the business
    --      rules held in the table TRANSFORM. This may be populated by 
    --      updating the column by applying the function vcs_normalize() to the 
    --      VSMILES column. If the user chooses to update by applying the 
    --      vcs_normalize() function to the vcs_desalt(VSMILES, , , ) this 
    --      column will contain the parent SMILES and will be appropriate as 
    --      structure source to populate a table of physical properties using
    --      the dayprop() function.
    --      Structure searching would normally be carried out on this column.
    --
    --      VCS_ID which is the code used by the supplier to identify that 
    --      particular compound
    -- 
    --      VCS_SUPP which is a unique integer used to identify the supplier
    --      or source of the compound this is maintained in the VCS_SUPPLIER 
    --      table.
    --
    --      VCS_SUPP and VCS_ID are the primary key on this table and are 
    --      required to form a unique identifier for a sample.
    --
    --      If a sample is deleted from this table the associated data 
    --      are deleted from the VCS_DATA and VCS_COORDS tables.
    --
    
    It is created with the following characteristics
    create table VCS_MAIN (
    VSMILES                 VARCHAR2(4000),
    NSMILES                 VARCHAR2(4000),
    VCS_ID                  VARCHAR2(100),
    VCS_SUPP                NUMBER (6),
    constraint VCS_MAIN_PK PRIMARY KEY ( VCS_SUPP, VCS_ID ),
    foreign key (VCS_SUPP)
            REFERENCES VCS_SUPPLIER ( VCS_SUPP )
            ON DELETE CASCADE
    );
    
    -- VCS_DATA contains
    --
    --      VCS_ID and VCS_SUPP defined as for VCS_MAIN and form the foreign key 
    --      link to the VCS_MAIN table
    --
    --      VCS_DATA_NAME which is the name chosen by the supplier for
    --      the data type.
    --
    --      VCS_DATA_VALUE which is the value as VARCHAR2 for that datatype 
    --      for that sample identified by the VCS_ID and VCS_SUPP.
    --
    --      There are no further constraints on this table. Note when the 
    --      sample is deleted from the VCS_MAIN table the associated 
    --      data in this table are also deleted.
    --
    
    It is created with the following characteristics.
    create table VCS_DATA (
    VCS_ID                  VARCHAR2(100),
    VCS_SUPP                NUMBER (6),
    VCS_DATA_NAME           VARCHAR2(100),
    VCS_DATA_VALUE          VARCHAR2(4000),
    foreign key ( VCS_SUPP, VCS_ID )
            REFERENCES VCS_MAIN ( VCS_SUPP, VCS_ID )
            ON DELETE CASCADE
    );
    
    The VCS_COORDS table is virtually identical except the data are stored as CLOBs. This table is used to store the coordinates for depictions and conformations. Equally well it can store the connection table from a modelling program or the original file for a publication quality graphic. Remember if you choose to store the connection table as an inline CLOB you will need to alter the control script, vcs_coords.ctl, as the CLOB now contains end-of line characters. This example uses inline CLOBs, to illustrate the value of using Oracle security controls. Many companies will have structures in their collections which are privileged in some way and not for general viewing. It is much easier to manage row level security within the vcs schema than to control access to external files on a potentially remote system.
    -- VCS_COORDS contains
    --
    --      VCS_ID and VCS_SUPP defined as for VCS_MAIN and form the foreign key 
    --      link to the VCS_MAIN table
    --
    --      VCS_DATA_NAME which is either 2D or 3D for a depiction or a           
    --      conformation.
    --
    --      VCS_DATA_VALUE which is the value as CLOB for that datatype 
    --      for that sample identified by the VCS_ID and VCS_SUPP.
    --
    --      In general it is not forseen that these data will be searched only
    --      retrieved. You will need to look at the DBMS_LOB package to 
    --      manipulate or select these data outside at the SQL level
    --
    --      There are no further constraints on this table. Note when the 
    --      sample is deleted from the VCS_MAIN table the associated 
    --      data in this table are also deleted.
    --
    
    It is created with the following characteristics
    create table VCS_COORDS (
    VCS_ID                  VARCHAR2(100),
    VCS_SUPP                NUMBER (6),
    VCS_DATA_NAME           VARCHAR2(100),
    VCS_DATA_VALUE          CLOB,
    foreign key ( VCS_SUPP, VCS_ID )
            REFERENCES VCS_MAIN ( VCS_SUPP, VCS_ID )
            ON DELETE CASCADE
    );
    
    The supplier table VCS_SUPPLIER, is the only one which needs to be actively managed. In the model here the VCS_SUPP value is supplied externally. It could be autogenerated from a sequence but then a mechanism would need to be put in place to pass the value to the other tables. In addition there is a class column, VCS_SUPPLIER_CLASS, which can be used to exclude or include certain groups of sources in a search by including the string in this column as part of the query. For example what was 'CURRENT' or 'PREFERRED' could change with time, the SQL query, perhaps embedded in an application, would not. Multiple class membership can be added with space separated lists. Deletion of a row in this table deletes all data in all the other tables associated with this source. Users are warned that this process can be extremely slow. It would be much quicker to update the class to 'ARCHIVE' say.
    -- VCS_SUPPLIER contains
    --
    --      VCS_SUPP which is the primary key on this table defined as for
    --      the VCS_MAIN table. Normally this would be a sequential integer
    --      supplied by the user.
    --      Removal of a VCS_SUPP value from this table removes all the 
    --      associated data in the VCS_MAIN, VCS_DATA and VCS_COORDS tables. 
    --
    --      VCS_SUPPLIER_NAME which is the name of the compound supplier
    --      or the compound source, chosen by the user. 
    --      Users are encouraged to develop local systematic naming rules
    --      to allow related sources to be linked.
    --
    --      VCS_SUPPLIER_CLASS which is a string which allows types of 
    --      suppliers to be grouped. It could be used to 'hide' earlier 
    --      copies of databases without removing them or identify preferred 
    --      suppliers. Default value is 'UNKNOWN' when data is loaded
    --      using vcs_loader.
    --
    --      VCS_SUPPLIER_TIMESTAMP is the system time when the supplier
    --      record was created
    --
    
    It is created with the following characteristics.
    create table VCS_SUPPLIER (
    VCS_SUPP                NUMBER (6),
    VCS_SUPPLIER_NAME       VARCHAR2(100),
    VCS_SUPPLIER_CLASS      VARCHAR2(25),
    VCS_SUPPLIER_TIMESTAMP  DATE,
    constraint VCS_SUPPLIER_PK PRIMARY KEY ( VCS_SUPP )
    );
    

These steps need only be done once. With the environment in place you are ready to try the worked example. The data conversion and loading sessions need to be carried out for each data source.

VCS - Worked Example

A worked example of setting up a database within the VCS model follows. Users should download the NCI database and the associated AIDS activity file from here. In this example we create a directory ./nci_aids in which to work. It is assumed that the salt and transform table have been populated and that the vcs schema has been installed.

$ > mkdir ./nci_aids
$ > cd ./nci_aids 

From this website we see that

The DTP AIDS Antiviral Screen has checked tens of thousands of compounds for evidence of anti-HIV activity. Screening results and chemical structural data on compounds that are not covered by a confidentiality agreement are available.
Screening Results (March 2002 Release)
The results of the screening tests are evaluated and placed in one of three categories:

The AIDS Screening Results - March 2002 release is an ASCII comma delimited file (424Kb) containing the screening results for 43,905 compounds. The followin g data fields are provided for each compound: 1.NSC number - the NCI's internal ID number 2.Screening Result - one of the categories listed above.

This file should be downloaded into ./nci_aids/aids_conc_mar02.txt and will be used to load extra data into the warehouse. Note there are no structural data in this file.

Further we find

The 2D structure (connection table) for each of the 42,390 compounds was retrieved from the DTP's Drug Information System. Conversion to a 3D structure was accomplished using the program Corina, created by Prof. Gasteiger's group.
Note: No stereochemical information was used in the build, nor was any geometry optimization performed. Please take these facts into account when deciding if these 3D coordinates are suitable for your particular use.
The data are presented in a compressed ASCII file in MDL's SDFile format. There are 2 data fields included in the SD file:

Download this file into ./nci_aids/aidso99.sdf.gz. ( Note change in prompt from '$' to emphasize the separate directory for each compound collection )

nci_aids > ls -X
aidso99.sdf.gz  aids_conc_mar02.txt

To convert to SMILES and to create the input files for the SQL*Loader program

nci_aids >$VCS_BIN/vcs_builder NSC 0 aidso99.sdf.gz

Note the arguments to the program. NSC is the name used in the SD file for the NCI's internal ID number. It becomes the VCS_ID within the warehouse, which itself is required to be unique within a compound source. '0' is a number, VCS_SUPP, arbitrarily assigned by the warehouse manager as a unique identifier of the compound source. Normally the values of VCS_SUPP would be sequential integers, but it is possible to assign meaning to them, so that for instance, all numbers over 1000 were reliable compound suppliers. In this demo version, no constraints or special meanings are placed on the assigned values. See also VCS_SUPPLIER_CLASS described above. Users may wish also copy the vcs_env file into this directory. This will be read by each of the scripts to check for an alternative systematic name for the compound collection ( $VCS_SUPPLIER) , rather than being derived from the filename.

Care needs to be taken with the choice of the VCS_ID. In this case it is straight forward, the NSC number acts like a registry number, and there is a one to one relationship between structure and number. If one takes a collection of collections such as the Available Chemicals Directory a decision needs to be taken as to whether the MFCD number i.e.the sample classifier used by MDL is the appropriate level and each sample is treated as data, or the alternative scenario where the supplier's ID is used as the VCS_ID and the MFCD numbers are data about the sample identifier.

Running the above builder program creates several *.dat files for SQL*Loader. Inspection of the aidso99_mol2smi.err file shows that the file converted cleanly with no errors.

nci_aids > ls -X
aidso99_coords.dat  aidso99_main.dat      aidso99_mol2smi.err  aids_conc_mar02.txt
aidso99_data.dat    aidso99_supplier.dat  aidso99.sdf.gz
nci_aids > tail -5 aidso99_mol2smi.err
MOL2SMI results:  MOLs in:    42687
MOL2SMI results:  TDTs out:   42687
MOL2SMI results:  SMILES out: 42687
MOL2SMI results:  Errors:     0
Adios

These data can now be loader into the warehouse.

nci_aids > $VCS_BIN/vcs_loader aidso99

A reassurring stream of messages comes from SQL*Loader to indicate that commit points are being reached and the data are being loaded. Various files are produced by SQL*Loader which are systematically named.

nci_aids > ls -X
aidso99_coords.dat  aidso99_supplier.dat  aidso99_coords.log  aidso99_supplier.log
aidso99_data.dat    aidso99_mol2smi.err   aidso99_data.log    aids_conc_mar02.txt
aidso99_main.dat    aidso99.sdf.gz        aidso99_main.log

If any bad records are found they are written to *.bad. In this case we do not have such files. Inspection of the *.log file shows that all the structures have been loaded.

nci_aids > tail -20 aidso99_main.log
Table VCS_MAIN:
  42687 Rows successfully loaded.
  0 Rows not loaded due to data errors.
  0 Rows not loaded because all WHEN clauses were failed.
  0 Rows not loaded because all fields were null.


Space allocated for bind array:                  62088 bytes(13 rows)
Space allocated for memory besides bind array:        0 bytes

Total logical records skipped:          0
Total logical records read:         42687
Total logical records rejected:         0
Total logical records discarded:        0

Run began on Mon Aug 11 12:13:01 2003
Run ended on Mon Aug 11 12:19:35 2003

Elapsed time was:     00:06:33.65
CPU time was:         00:00:01.93

Your local Oracle guru will be able to show you how to decrease these load times significantly by adjusting the BIND and READ array sizes for your hardware and Oracle instance.

In a typical case not all the data will be contained in the *.sdf file. In this example we have also a set of data in a *.csv file containing the biological activity. This file can be loaded in a similar way. Note the example control file provided needs to be edited for other data collections.

nci_aids > $VCS_BIN/vcs_extra_data_loader aids_conc_mar02.txt
nci_aids > ls -X
aids_conc_mar02_extra_data.bad  aidso99_mol2smi.err             aidso99_main.log
aidso99_coords.dat              aidso99.sdf.gz                  aidso99_supplier.log
aidso99_data.dat                aids_conc_mar02_extra_data.log  aids_conc_mar02.txt
aidso99_main.dat                aidso99_coords.log
aidso99_supplier.dat            aidso99_data.log

This time we do have some 'bad' data, in ./nci_aids/aids_conc_mar02_extra_data.bad. The number of bad rows is recorded in the corresponding log file.

nci_aids > tail -20 aids_conc_mar02_extra_data.log
Table VCS_DATA:
  42687 Rows successfully loaded.
  1218 Rows not loaded due to data errors.
  0 Rows not loaded because all WHEN clauses were failed.
  0 Rows not loaded because all fields were null.


Space allocated for bind array:                  64155 bytes(15 rows)
Space allocated for memory besides bind array:        0 bytes

Total logical records skipped:          1
Total logical records read:         43905
Total logical records rejected:      1218
Total logical records discarded:        0

Run began on Mon Aug 11 13:07:17 2003
Run ended on Mon Aug 11 13:07:49 2003

Elapsed time was:     00:00:32.00
CPU time was:         00:00:01.54

The 1218 bad records arise because of NCI numbers which do not exist in the main file. Note that it it not essential that a structure is known in this model. However if data exists it must be about a known identifier, ( This is true both in the thor and Oracle models) so the identifier must have an entry in the main table. The log file records the fact that the load failed because of a violation of an integrity constraint on the table.

nci_aids > head  -35 aids_conc_mar02_extra_data.log

SQL*Loader: Release 8.1.7.0.0 - Production on Mon Aug 11 13:07:17 2003

(c) Copyright 2000 Oracle Corporation.  All rights reserved.

Control File:   /usr/local/daylight/v482/contrib/src/oracle/vcs/vcs_extra_data.ctl
Data File:      aids_conc_mar02.txt
  Bad File:     ./aids_conc_mar02_extra_data.bad
  Discard File:  none specified

 (Allow all discards)

Number to load: ALL
Number to skip: 1
Errors allowed: 999999
Bind array:     64 rows, maximum of 65536 bytes
Continuation:    none specified
Path used:      Conventional

Table VCS_DATA, loaded from every logical record.
Insert option in effect for this table: APPEND

   Column Name                  Position   Len  Term Encl Datatype
------------------------------ ---------- ----- ---- ---- ---------------------
VCS_ID                              FIRST     *   ,  O(") CHARACTER
VCS_SUPP                                                  CONSTANT
    Value is '0'
VCS_DATA_NAME                                             CONSTANT
    Value is 'NCI_AIDS_CLASS'
VCS_DATA_VALUE                       NEXT  4000   ,  O(") CHARACTER

Record 212: Rejected - Error on table VCS_DATA.
ORA-02291: integrity constraint (C$DCISCHEM.SYS_C001672) violated - parent key not found

Record 380: Rejected - Error on table VCS_DATA.

Almost certainly you will now wish to add the indexes, as described earlier.

$> sqlplus -s $USER_PASS @$VCS_MAN/create_vcs_indexes.sql

Now you can search the data. It is recommended that the chemistry searches are carried out on the vcs_main.nsmiles column.

so

SQL> select vsmiles from vcs_main where matches ( nsmiles, 'nnn' ) = 1;
.
.
.
397 rows selected.

Elapsed: 00:00:02.11

You can use the business rules on your query. So

SQL> select vcs_id, vsmiles from vcs_main 
     where 
         contains ( nsmiles,
                    'O=N(=O)c1ccccc1' 
                   ) = 1;

no rows selected

Elapsed: 00:00:00.30

whereas

SQL> select vcs_id, vsmiles from vcs_main 
     where 
         contains ( nsmiles, 
                    vcs_normalize( 'O=N(=O)c1ccccc1',0, 0 ) 
                   ) =1;
.
.
.
2447 rows selected.

Elapsed: 00:00:02.56

And the usual joins of tables work correctly.

SQL> select vcs_main.vsmiles, vcs_data_value from vcs_data, vcs_main
          where
              vcs_data.vcs_id in (
                                  select vcs_id from vcs_main where matches ( nsmiles, 'nnn' ) = 1
                                  )
              and vcs_data_name = 'NCI_AIDS_CLASS'
              and vcs_data.vcs_id = vcs_main.vcs_id;
.
.
.
397 rows selected.

Elapsed: 00:00:13.55 

VCS - Beyond the basics

As more data and compound sets are added, as with all Oracle installations, care needs to be taken in managing the data and constructing queries. If you are frequently loading new data and dropping the indexes, you may wish to create a fingerprint column to be used in creating the indexes on the VCS_MAIN table. That way you only need to fingerprint the molecules which are being added. Be sure to read, or get your DBA to read, the index creation section of the manual, to ensure the indexes you create are appropriate. The scripts presented here form the basis for an entry level model which can be modified to fit local requirements.

For instance, you can search a column of normalised salts for the parent structure using the component search. The database used here had ~ 1M samples from a mixture of suppliers and a commercial database World Drug Index from Derwent.

SQL> select vsmiles from vcs_main where component ( nsmiles, 'NCCN' ) = 1;

VSMILES
--------------------------------------------------------------------------------
NCC[NH3+].Nc1ccc(cc1)S(=O)(=O)[N-]c2ncccn2
Cn1c(=O)n(C)c2nc[nH]c2c1=O.NCCN
Cl.Cn1c(=O)n(C)c2nc[nH]c2c1=O.NCCN
NCCN.OC(=O)CCC(=O)O

Elapsed: 00:00:00.15

Compare the SMARTS match

SQL> select vsmiles from vcs_main where matches ( nsmiles, '[NH2][CH2][CH2][NH2]
' ) = 1;

VSMILES
--------------------------------------------------------------------------------
NCCN.OC(=O)CCC(=O)O
NCC[NH3+].Nc1ccc(cc1)S(=O)(=O)[N-]c2ncccn2
Cn1c(=O)n(C)c2nc[nH]c2c1=O.NCCN
Cl.Cn1c(=O)n(C)c2nc[nH]c2c1=O.NCCN

Elapsed: 00:00:25.10

Similarly

SQL> select vcs_main.vsmiles, vcs_data_value from vcs_data, vcs_main
where
     vcs_data.vcs_id in
        ( select vcs_id from vcs_main where component(nsmiles, 'NCCN' ) = 1)
and
     vcs_data.vcs_supp in
        (select vcs_supp from vcs_main where component(nsmiles, 'NCCN' ) = 1)
and 
     vcs_data_name like 'M%W%T'
and 
     vcs_data.vcs_id = vcs_main.vcs_id
and  
     vcs_data.vcs_supp = vcs_main.vcs_supp;

VSMILES                                       VCS_DATA_VALUE
--------------------------------------------------------------------------------
NCC[NH3+].Nc1ccc(cc1)S(=O)(=O)[N-]c2ncccn2    310.380

Cn1c(=O)n(C)c2nc[nH]c2c1=O.NCCN               240.265

Cl.Cn1c(=O)n(C)c2nc[nH]c2c1=O.NCCN            276.726

NCCN.OC(=O)CCC(=O)O                           178.186


Elapsed: 00:00:01.05

The equivalent of the thesaurus retrieval of names in thor can be carried out with

SQL> select vcs_main.vsmiles, vcs_data_value from vcs_data, vcs_main
where
    vcs_data.vcs_id in
        ( select vcs_id from vcs_main where component(nsmiles, 'NCCN' ) = 1)
and
    vcs_data.vcs_supp in
        (select vcs_supp from vcs_main where component(nsmiles, 'NCCN' ) = 1)
and 
    vcs_data_name = 'NAME'
and 
    vcs_data.vcs_id = vcs_main.vcs_id
and 
    vcs_data.vcs_supp = vcs_main.vcs_supp;
.
.
.
.
180 rows selected.

Elapsed: 00:00:00.30

With the more complicated searches, ensure that the cartridge has access to the optimizer support.

As we have the component() function in DayCart, it is not necessary to build the full parent-version-sample hierarchy in the Oracle data model which was required in the thor model.

Care does need to be taken with mixtures however.
Consider the following, where the vcs schema contain data from the Available Chemicals Database.

SQL> select count(1) from vcs_main where exact ( nsmiles, 'Cc1ccccc1C' ) = 1;

  COUNT(1)
----------
       182

Compare

SQL> select count(1) from vcs_main where component ( nsmiles, 'Cc1ccccc1C' ) = 1;

  COUNT(1)
----------
       644

In other words o-xylene is also available in mixtures. We can find these with the query

SQL> select count(1) from vcs_main 
        where component ( nsmiles, 'Cc1ccccc1C' ) = 1
        and   component ( nsmiles, 'Cc1ccc(C)cc1' ) = 1;

  COUNT(1)
----------
       462

Note the component() function does not allow the query smiles to be multicomponent.
So

SQL> select count(1) from vcs_main where component ( nsmiles, 'Cc1ccccc1C.Cc1ccc(C)cc1' ) = 1;

does not return the mixtures of o- and p- xylene. The correct syntax to do this is shown above with each component of the mixture explicitly stated.