John Bradshaw
Daylight CIS Inc., Sheraton House, Castle Park, Cambridge, CB3 0AX, UK
Following a talk at MUG '99 meeting at which the concept of a Virtual Chemical Stores/Stockroom was introduced, Daylight agreed to explore the possibility of producing a commercial product that would allow the information, present in electronic compound catalogues, to be made available in consistent format to end-users. The data model drew strongly on the Thor datatree structure and the compound hierarchy used in the Modgraph Compound Registry system. The original concept was to build a large single Thor database, however the data model allows alternatives such as Virtual Databases, in which the constituent databases are not merged, or using the Oracle cartridge retaining the compound hierarchy by judicious use of tables. A further option is to merge, the results of the query across multiple consistent databases, with a client.
The major sources of data are still the compound vendors, in particular those providing compounds for screening. Data are generally provided in the form of MDL sd files. However as the concept is not limited to commercially available compounds, other sources, such as the World Drug Index, or the NCI database would benefit from this approach.
A decision has been made to provide a product which will allow customers to build their own database by providing tools which will either produce Thor datatrees for loading into a Thor-Merlin system or SQL*Plus scripts to produce an Oracle database. There is also an option to produce a flat source which allows users to incorporate into an existing system or their own proprietary system. The earlier model has been improved so that now there is a predefined set of datatypes (Thor) or columns and tables (Oracle) into which any new data source can be added. We can therefore provide the datatypes database (Thor) or the schema (Oracle). In addition there is no longer a need for an external Thor database when building the input. It is hoped this product will be released in version 4.8.
The aim of most chemical suppliers' information handling is to produce a printed catalogue. So the fundamental data unit in this model is the catalogue entry from a particular supplier, which will be identified by some arbitrary string such as 0800, "BAS 00123" etc. Associated with this identifier will be items such as a chemical structure, a chemical name, and possibly other data. The suppliers will also use this identifier for their inventory and supply systems.
It is these records which forms the basis of the MDLI sd file which suppliers distribute to potential customers usually on CD.
For example, if we take a sample page of the Tocris catalogue, we find
0774 identifying a sample with a name 4-[3-(Benzotriazol-1-yl)propyl]-1-(2-methoxyphenyl)-piperazine maleate with a molecular weight of 467.52. We also learn that this compound is "A potent pre- and postsynaptic 5-HT 1A receptor antagonist." and a reference to this property, Mokrosz et al (1994) Structure-activity relationship studies of central nervous system agents. 13. 4-[3-(Benzotriazol-1-yl)propyl]-1-(2- methoxy)piperazine, a new putative 5-HT 1A receptor antagonist, and its analogues. J.Med.Chem. 37 ,2754. In addition there is a chemical structure corresponding to the systematic name.
Note that the compilers of the catalogue imply that the property of "A potent pre- and postsynaptic 5-HT 1A receptor antagonist." belongs to the sample they offer for sale. It is unlikely they have tested the particular sample they send out to you. The authors of the paper ascribe the property to what we will be referring to as the parent molecule, the maleic acid plays no part in this. It is implied that the property of activity at 5-HT 1A receptors is inherited by the version molecule and will be exhibited by any subsequent sample you buy of this version. On the other hand the molecular weight of 467.52 is data about the version molecule and will also be exhibited by any subsequent sample you buy of this version. As the data are so confused in the supply source, it is unlikely that any automatic system will unravel it. It needs to be dealt with, without prejudice.
The point to remember is that these identifiers are the link to the only items which have a real existence, everything else in this model is information and data i.e. they are the names which the customer uses to communicate with the supplier when placing an order for samples of chemicals such as these.
In general, all constituents of a multi-component compound are not equally important for the process in hand. If we restrict ourselves to some sort of biological assay or property measure such as logD or pKa, then these are easily understood as reflecting the presence of a parent molecule. Data such as aD reflect the presence of isomeric parent molecule. Note that neither of these parent molecules have an independent existence as a physical entity, in the Daylight sense they are identifiers or names, about which, data exist.
Data such as melting point are clearly about a particular sample, as are the experimental results of both physicochemical and biological assays. A version molecule structure may also have been assigned to this sample perhaps a salt or solvate or even a impure isomer.
Calculated data such as cLogP are about the parent molecule, whereas a rubicon structure is a datum about the isomeric parent. An x-ray structure on the other hand is a datum about the sample, and is related to the version structure. Molecular formulae or weights can refer to version or parent and thus must be clearly defined. Note that the sample is identified by some arbitrary name, in the current case by the supplier. The version structure is a datum about the sample identifier, it does not differ in principle from any other data assigned to the (hopefully ) white crystalline powder.
As we are interested in a chemical information system we can use this hierarchy of structures to group the data together in a dendrite model. The sample identifiers can be grouped by the isomeric version structure. As there are many suppliers of a particular compound, there is a many to one relationship between sample identifier and isomeric version structure. As the valence bond model for representing a compound is not unambiguous, the isomeric version structures can be grouped by normalized isomeric version structures. If, as implied above, not all the components of a compound are equally important, we can group the normalized isomeric version structures by the isomeric parent structure. Above this there is a many to one relationship between these isomeric parent structures and a parent structure which contains no stereochemical or isomeric information
It is also important to note that whilst the structure assigned to the sample may be the same as either the parent or isomeric parent, that is purely coincidental. This strict hierarchy ensures that the data are associated with the correct identifier in the tree, or key. This can be illustrated using the depict algorithm.
The above picture is generated from the parent smiles using smi2gif(), much like clogP would be using clogp(). Whereas
is generated from the version smiles and the associated 2D coordinates, still using smi2gif().
A benefit of this model is that data, which come from a flat file such as an sd file, can be restrained within a Thor subtree, or in a row in an ORACLE table, associating the supplier's data only with the supplier's identifier, not with a particular structure. In fact it is necessary to use two tables in ORACLE as, a priori we have no knowledge of the number of data items about a particular sample vide ultra.
A further benefit of this model is that we now have total control over the relationships between structure and data and can make sure only appropriate values are stored.
A set of routines have been written to calculate useful properties of molecules directly from the structures. These values can then be stored along with the structure from which they were derived. These functions are available via a program_object interface which allows them to be called from within DayCart®. The ones, which are not grayed out, are included in the VCS building routines, calculated from the parent smiles. All values are returned as part_tuples except for PART_COUNT which is an integer. If the SQL output is chosen numeric part_tuples are set to NULL
The following are included in the VCS building routines, calculated from the parent smiles. All values are returned as a part_tuple. The depiction is only returned if there are no 2D data in the input file.
The structure representation is normalized by a few simple rules. Note that this only affects the grouping of compounds, it does not affect the original representation which is maintained or affect any in-house display business rules.
Parent isomeric structures are created from the normalized version isomeric structure by the following steps.
If there are no components left, i.e. all components are in the salt table there is a roll-back of the last step to give a structure which is treated like a mixture. If there is more than one component this structure is treated like a mixture.
In the case of a mixture, in the current version, all possible single parents are generated plus the multicomponent parent. I.e. there is a one to many relationship between the child and the parent. In a registration system, the registrar may take on the Solomon role, and keeps the simple tree structure to the data model. In the absence of other information, all potential parents are treated equally. This means duplication of data in a Thor model and spawns yet another table in Oracle.
Questions have been raised why we need such a complex model. What is wrong with simply grouping on matching version valence model as in ACD? Aside from the value of classification and the fact that most chemical structure searches are at what we have described as the parent level, there are maintenance issues. Below are the fixes to the Tocris 2001 catalogue, in this model changes are constrained to the subtree in Thor or the row in Oracle as they are changes to the data about the catalogue number.
As was described in the introduction, this model fits well into the Thor paradigm. The model is well tested. There are limitations as to the level of nesting but a typical tree showing the relationships in the data is shown here. Note that the data about the sample is stored in a sub-tree rooted in the normalized primary supplier identifier. There is a user-controlled list, per catalogue, of secondary identifiers. All data from the supplier is stored in a two field datatype.
A model has been developed in Oracle® which makes use of DayCart® to handle the chemical information aspects. The overall model is shown in the plan
This can been simplified by looking at the arrangement of the five structure linked tables and the relationships between them.
As can be seen the is a one-to-many relationship between entries in the parent table and the related entries in all the other tables. The structures in the parent table are unique, constrained by a PL/SQL function at load time currently. This is entirely analogous to the root SMILES in the thor model.
Each isomer has only one parent, ( the ASMILES <--> USMILES relationship). The structures in the isomer table are unique, constrained by a PL/SQL function at load time currently. Related entries in the other tables are many-to-one to those in the isomer table.
In order to deal with mixtures properly, a normalised structure can have more than one parent, vide supra. Related entries in the other tables are many-to-one to those in the isomer table.
There is a one-to-one relationship between the version structures and the normalised structure, controlled by the rules. Again to handle mixtures there is a one-to-many relationship to corresponding entries in other tables.
Each sample has only one version structure and one normalised structure and, by definition only one supplier. To handle the problem of mixtures a sample is regarded as identical iff it has the same supplier, supplier name and isomeric parent. There is a one-to-many relationship to the corresponding entries in all the other tables, including the sample_data, which is only linked to this table. Note too, that the suppliers table is only linked to the sample table. This makes it very easy to remove a particular supplier/catalogue.
A shell wrapper for the model building program is provided to allow direct conversion of an optionally compressed sd file on a cd. So
dell$ ../../bin/build_vcs_model -help
ERROR: missing .sdf file: -help.sdf[.[Z|gz]]
USAGE : ../../bin/build_vcs_model [SQL|TDT|RAW] myfile.sdf[.[Z|gz]] [myoutput_file]
myfile.sdf[.[Z|gz]] is the input data from the ISIS base or the supplied sd file. It requires the extension .sdf currently. It may be zipped .Z or compressed .gz
EXAMPLE: ./build_vcs_model TDT /cdrom/ChemStar/chemstar.sdf.gz chemstar.tdt
reads input from the compressed sd file /cdrom/ChemStar/chemstar.sdf.gz
Output is to ./chemstar.tdt in tdt format.
Environment variables control the data set specific values.
Default output is RAW for users who wish to incorporate these data into other schema or database systems.
select supp_id, vcs_name from vcs_sample
where par_id in
(
select par_id from vcs_parent
where contains ( smiles, 'Nc1ccccc1CCCN' )= 1
and rot_bonds < 5
and h_donor < 5
)
order by supp_id
select vcs_name from vcs_sample
where par_id in
(
select par_id from vcs_parent
where contains ( smiles, 'Nc1ccccc1CCCN' )= 1
and rot_bonds < 5
and h_donor < 5
)
and supp_id in
(
select supp_id from vcs_supplier
where name like 'Asinex%'
)
select vcs_supplier.name, vcs_parent.smiles
from vcs_parent, vcs_sample, vcs_supplier
where vcs_parent.par_id = vcs_sample.par_id
and vcs_sample.supp_id = vcs_supplier.supp_id
and vcs_name='0800'
select vcs_supplier.name, vcs_sample.vcs_name, vcs_parent.smiles
from vcs_parent, vcs_sample, vcs_supplier
where vcs_parent.par_id = vcs_sample.par_id
and vcs_sample.supp_id = vcs_supplier.supp_id
and contains ( smiles, 'Nc1ccccc1CCCCN' ) = 1
select vcs_supplier.name, vcs_sample.vcs_name, vcs_data.data_value, vcs_parent.smiles
from vcs_parent, vcs_sample, vcs_supplier, vcs_data
where vcs_parent.par_id = vcs_sample.par_id
and vcs_sample.supp_id = vcs_supplier.supp_id
and vcs_sample.vcs_id = vcs_data.vcs_id
and data_value like '%toxin%'
select vcs_supplier.name, vcs_sample.vcs_name, vcs_data.data_value, vcs_parent.smiles
from vcs_parent, vcs_sample, vcs_supplier, vcs_data
where vcs_parent.par_id = vcs_sample.par_id
and vcs_sample.supp_id = vcs_supplier.supp_id
and vcs_sample.vcs_id = vcs_data.vcs_id
and data_value like '%toxin%'
and h_donor < 5
Daylight Chemical Information
Systems, Inc.
info@daylight.com