Michael A. Kappler
Daylight CIS, Inc.
A description of the Conversion Tools project and its initial release state is presented. The tools are scheduled to be included in Daylight Software v4.9.
Approach. The Conversion Toolkit interface will accept a conversion object and input data, and will output data in the requested format.
Issues. Dialects increase the complexity of converting a format. Do we take a hard or soft line on violations?
dt_Handle <= dt_alloc_conversion(dt_Integer ifmt, dt_Integer ofmt)
The return value from the dt_alloc_conversion routine will be a conversion object that is passed with data into the second routine, which converts data from one format to another.
dt_Handle <= dt_convert(dt_Handle conv, dt_Handle seq)
The parameter conv will be the conversion object from above and seq will be a sequence of strings or objects. Named properties associated with the conversion object will control the behavior of the conversion process. Depending on the output format, the return value will be a sequence of either strings or objects. For example, when converting from an SDfile to SMILES, the input data will be a sequence of string objects representing the SDfile (one line per string) and the output data will be a sequence of string objects representing SMILES (one structure per string). In the case of conversion to objects, the output data will be a sequence of dt_Handle types (molecules, reactions, etc.).
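The "one line per string" layout of the input can be sketched in plain C without the toolkit. The StringSeq type and the split_lines and free_seq helpers below are hypothetical stand-ins for a toolkit sequence of string objects, not part of the Daylight API. Blank lines are kept, because a molfile header may contain empty lines that still count as lines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a toolkit sequence of string objects. */
typedef struct {
    char **items;  /* one heap-allocated string per input line */
    size_t count;  /* number of lines stored */
} StringSeq;

/* Split text into lines, one string per line, keeping blank lines.
   Line terminators ('\n') are not stored. */
StringSeq split_lines(const char *text)
{
    StringSeq seq = { NULL, 0 };
    size_t n = 0;
    size_t len;
    const char *start;

    assert(text != NULL);

    /* Every '\n' ends one line; a non-empty tail is one more line. */
    for (const char *p = text; *p; ++p)
        if (*p == '\n')
            n++;
    len = strlen(text);
    if (len > 0 && text[len - 1] != '\n')
        n++;
    if (n == 0)
        return seq;

    seq.items = calloc(n, sizeof *seq.items);
    if (seq.items == NULL)
        return seq;

    start = text;
    while (*start) {
        const char *end = strchr(start, '\n');
        size_t linelen = end ? (size_t)(end - start) : strlen(start);
        char *copy = malloc(linelen + 1);
        if (copy == NULL)
            break;  /* out of memory: return the lines copied so far */
        memcpy(copy, start, linelen);
        copy[linelen] = '\0';
        seq.items[seq.count++] = copy;
        start = end ? end + 1 : start + linelen;
    }
    return seq;
}

/* Release every line and the array itself. */
void free_seq(StringSeq *seq)
{
    for (size_t i = 0; i < seq->count; ++i)
        free(seq->items[i]);
    free(seq->items);
    seq->items = NULL;
    seq->count = 0;
}
```

In a real program each line would be placed into the input sequence handed to dt_convert. How that is done depends on the toolkit's sequence routines and is not shown here.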
The list of supported formats will be as follows: MDL CTfile formats except XDfile (molfile, RGfile, rxnfile, SDfile, RDfile, extended molfile and extended rxnfile), SMILES, SMARTS, SMIRKS, Daylight Toolkit objects (molecule, reaction, pattern, and transform), and TDTs. Any CTfile format may be converted to the corresponding Daylight language or object, and vice versa.
Applications. Two powerful generic converters, mdl2daylight and daylight2mdl, will be provided for conversion to and from all of the supported MDL and Daylight formats. These converters perceive the input type and generate the most appropriate output type. The output type produced for each input type will be predetermined. For example, one input SMILES results in a molfile and multiple input SMILES result in an SDfile. The ability to recognize the MDL input type will be limited. When the input or output type is known, the format may be specified.
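The predetermined choice described above (one SMILES gives a molfile, several give an SDfile) can be sketched as a small decision function. The OutputKind tags and both function names are hypothetical and are not the converters' actual internals. The sketch also assumes one SMILES per non-blank line of input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical output tags for this sketch; not toolkit constants. */
typedef enum { OUT_NONE, OUT_MOLFILE, OUT_SDFILE } OutputKind;

/* Count non-blank lines, assuming one SMILES per line. */
size_t count_records(const char *text)
{
    size_t count = 0;
    int in_record = 0;

    assert(text != NULL);
    for (const char *p = text; *p; ++p) {
        if (*p == '\n') {
            in_record = 0;            /* the next line starts fresh */
        } else if (!in_record && !isspace((unsigned char)*p)) {
            in_record = 1;            /* first visible character on this line */
            count++;
        }
    }
    return count;
}

/* One structure maps to a molfile, several map to an SDfile. */
OutputKind choose_mdl_output(size_t n_smiles)
{
    if (n_smiles == 0)
        return OUT_NONE;
    return n_smiles == 1 ? OUT_MOLFILE : OUT_SDFILE;
}
```

For example, choose_mdl_output(count_records("CCO\nc1ccccc1\n")) selects OUT_SDFILE, while a single-line input selects OUT_MOLFILE.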
All applications will be built on an internal conversion utility and will support streaming of data, also known as feeder mode. Data will be converted in parts, and the size of each part may be set by the user. Feeder mode will be necessary for processing large input files that would otherwise exceed hardware capabilities.
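Reading the input in parts can be illustrated with a short sketch that pulls a user-chosen number of SDfile records from a stream at a time. It relies only on the fact that each SDfile record ends with a line beginning with "$$$$". The Part type and the read_part function are hypothetical and do not describe the internal conversion utility.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One part of the input: raw text plus the number of records in it. */
typedef struct {
    char  *data;     /* NUL-terminated text of the part, NULL before first use */
    size_t len;      /* bytes used, excluding the terminator */
    size_t cap;      /* bytes allocated */
    size_t records;  /* complete records ("$$$$" lines) in this part */
} Part;

/* Append one line to the part, growing the buffer as needed. */
static int part_append(Part *p, const char *line)
{
    size_t n = strlen(line);

    if (p->len + n + 1 > p->cap) {
        size_t cap = p->cap ? p->cap : 256;
        char *grown;
        while (cap < p->len + n + 1)
            cap *= 2;
        grown = realloc(p->data, cap);
        if (grown == NULL)
            return -1;
        p->data = grown;
        p->cap = cap;
    }
    memcpy(p->data + p->len, line, n + 1);  /* copies the terminator too */
    p->len += n;
    return 0;
}

/* Read up to max_records records from in into p, replacing its previous
   contents. Returns the number of complete records read (0 at end of file).
   Trailing text without a "$$$$" line is kept in p->data but not counted.
   Lines longer than the local buffer are read in pieces. */
size_t read_part(FILE *in, size_t max_records, Part *p)
{
    char line[1024];

    assert(in != NULL && p != NULL && max_records > 0);
    p->len = 0;
    p->records = 0;
    if (p->data != NULL)
        p->data[0] = '\0';

    while (p->records < max_records && fgets(line, sizeof line, in)) {
        if (part_append(p, line) != 0)
            break;                       /* out of memory: stop early */
        if (strncmp(line, "$$$$", 4) == 0)
            p->records++;                /* record delimiter seen */
    }
    return p->records;
}
```

Calling read_part repeatedly with the same Part reuses one buffer, so memory use is bounded by the largest part rather than by the size of the whole file. The caller frees p->data when finished.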
The main programming loop condition is:
while (NULL_OB != (oseq = dt_convert(conv, iseq)))
The parameter iseq will be filled with data prior to calling the dt_convert routine. Within the loop, converted data will be output and additional data will be placed in the input sequence.
Toolkit programmers wishing to convert to Daylight Toolkit objects (molecule, reaction, etc.) need to write their own programming loop to capture the objects as they are returned from dt_convert. For example, pseudo code for conversion of an SDfile to molecules is as follows:
#include "dt_smiles.h" /* SMILES Toolkit */
#include "dt_conv.h" /* Conversion Toolkit */
int main(void) {
    dt_Handle conv, iseq, oseq, obj;

    /* create conversion object */
    conv = dt_alloc_conversion(DX_CONV_FMT_SDF, DX_CONV_FMT_OBJ);
    /* create input object */
    iseq = dt_alloc_seq();
    /* open and read part of an SDfile */
    ...
    /* convert data */
    while (NULL_OB != (oseq = dt_convert(conv, iseq))) {
        /* loop over molecules */
        while (NULL_OB != (obj = dt_next(oseq))) {
            /* do something with the molecules */
        }
        /* destroy output */
        dt_dealloc(oseq);
        /* read more from SDfile */
        ...
    }
    /* release the input sequence and the conversion object */
    dt_dealloc(iseq);
    dt_dealloc(conv);
    return 0;
}
| Feature | Daylight string | Remark |
|---|---|---|
| Atom List | SMARTS: [#7,#6,#8]c1ccccc1 | |
| Any Atom | SMARTS: [#0]c1ccccc1 | |
| Not Aliphatic | SMARTS: [!#6]c1ccccc1 | |
| Wild Card Atom | SMILES: *c1ccccc1 | |
| Lone Pair | SMILES: c1ccncc1 | |
| R-groups | SMARTS: [CO,N]c1cccc([CO,N])c1[C] | |
| Isotope | SMILES: [13CH3]c1ccccc1 | |
| Charge | SMILES: [O-]c1ccccc1 | |
| Implicit Hydrogens | SMARTS: [C&!h0&!h1&!h2&!h3] | |
| Stereo Care (none) | SMILES: C/C=C/C=C/C=C/C | |
| Stereo Care (middle) | SMILES: CC=C/C=C/C=CC | |
| Stereo Care (ends) | SMILES: C/C=C/C=C\C=C\C | note: implied middle |
| Valence (connections) | SMARTS: [C&X4] | |
| Aromatic Bond | SMARTS: [#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]1 | |
| Single or Double Bond | SMARTS: [#6]-,=1[#6]-,=[#6][#6]-,=[#6][#6]1 | |
| Double or Aromatic Bond | SMARTS: [#6]=,:1-,:[#6]=,:[#6]-,:[#6]=,:[#6]-,:[#6]1 | |
| Wild Card Bond | SMARTS: [#6]~1~[#6]~[#6]~[#6]~[#6]~[#6]1 | |
| Chirality | SMILES: C[C@H](N)C(=O)O | |
| Chirality (racemic) | SMARTS: C[C&@,@@](N)C(=O)O | |
| Stereo (racemic, care all) | SMARTS: C/,\C=C/C | |
| Stereo (racemic ends, care all) | SMARTS: C/,\C=C/C=C/C=C/,\C | |
| Stereo (racemic ends, care ends) | SMARTS: C/,\C=C/C=C/,\C=C/C | |
| Stereo (racemic middle, care all) | SMARTS: C/C=C/C=C\C=C\C | |
| Stereo (racemic middle, care middle and one end) | SMARTS: CC=C/,\C=C/C=C/C | |
| Stereo (racemic middle, care middle and both one end) | | warning: unsupported features |
| | SMARTS: CC=C/,\C=C(/C=C/C)/C=C/C | warning: inconsistent bond directions |
| | SMARTS: CC=C/,\C=C(/C=C/C)\C=C/C | warning: too much symmetry for directional bonds |
| | SMARTS: CC=C/,\C=C(/C)\C=C/C | note: okay |
| Chain Bond | SMARTS: CC-&!@O-&!@CC | |
| Ring Bond | SMARTS: CC-&@O-&@CC | |
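The Atom List entry above writes a list of allowed elements as a SMARTS atom expression built from atomic numbers, for example [#7,#6,#8] for nitrogen, carbon, or oxygen. A minimal sketch of generating such an expression follows. The function atom_list_smarts is hypothetical and only illustrates the notation; it is not the toolkit's conversion code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write a SMARTS atom list such as "[#7,#6,#8]" for the given atomic
   numbers into out. Returns the string length, or -1 if the list is
   empty or the buffer is too small. */
int atom_list_smarts(const int *atomic_numbers, size_t n,
                     char *out, size_t out_size)
{
    size_t used = 0;

    assert(atomic_numbers != NULL && out != NULL);
    if (n == 0 || out_size == 0)
        return -1;

    for (size_t i = 0; i < n; ++i) {
        /* The first entry opens the bracket; later entries are OR-ed with ','. */
        int w = snprintf(out + used, out_size - used, "%s#%d",
                         i == 0 ? "[" : ",", atomic_numbers[i]);
        if (w < 0 || (size_t)w >= out_size - used)
            return -1;                   /* would not fit */
        used += (size_t)w;
    }
    if (out_size - used < 2)
        return -1;                       /* no room for ']' and the terminator */
    out[used++] = ']';
    out[used] = '\0';
    return (int)used;
}
```

Appending c1ccccc1 to the result for the atomic numbers 7, 6, and 8 reproduces the string shown in the Atom List entry.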