MUG '04 -- 24 - 27 Feb, 2004
Conversion Toolkit
Michael A. Kappler
Daylight, CIS
ABSTRACT
The Conversion Toolkit is aimed at facilitating migration of data into and out of Daylight software. The following describes the project goals, requirements, approach, design, issues, and current status of this new product.
Goals
- Facilitate migration to DayCart®. Mitigate architectural barriers by supporting third party data.
- Promote integration of Daylight Toolkits. The ability to handle third party data would give customers the opportunity to use our tools in other environments.
- Support creation of Daylight databases. Offer more content in Daylight native form.
The aim is to support structural information, data, queries, and reactions to and from SMILES, SMARTS, and SMIRKS languages.
Requirements
- Convert popular chemical information file formats to Daylight native form.
- Generate these popular formats from Daylight native form.
- Preserve information to the maximum extent possible.
Widely used file formats for structures, reactions, and data in the chemical information software industry have been published[1] by Molecular Design Limited (MDL) and are known as the chemical table file (CTfile) Formats.[2] There are an estimated two dozen file formats in use and another dozen or so in the making. The current position on the various formats is categorized as follows:
Group I: "must-have", supported in first release
MDL CTfile Formats[2] (input sketcher, storage)
- Ctab (structure)
- Ctab Atom List (query)
- Ctab Stext (data)
- Ctab Properties (query)
- Ctab Properties (Rgroup)
- Ctab Properties (Sgroup)
- Ctab Properties (3D)
- Ctab Enhancements (atom types, large numbers, data)
- molfile (one structure, a.k.a. V2000)
- RGfile (Rgroup query)
- SDfile (data, multiple structures)
- rxnfile (one reaction)
- RDfile (data, multiple reactions)
- XDfile (XML)
- extended molfile (structure, a.k.a. V3000 and mol3)
- extended Ctab Properties (query)
- extended Ctab Properties (Rgroup)
- extended Ctab Properties (Sgroup)
- extended Ctab Properties (3D)
- extended Ctab Properties (Collections)
- extended rxnfile (one reaction)
CambridgeSoft® ChemDraw (input sketcher)
Group II: "valuable", supported in subsequent release
- Marvin[4]
- JME[5]
- XML/CML[6]
Group III: "potentially valuable", no plans
- Accelrys Common File Formats[7]
- CSD FDAT/FCON[8]
- GAMESS XYZ[9]
- Gaussian Z-matrix[10]
- ISISDraw TGF[11]
- JCAMP-CS[12]
- PDB[13]
- Sybyl MOL2[14]
Group IV: "noted", not worthy of support
Approach
Successful conversion will be determined by comparison of input from CambridgeSoft ChemDraw[3] and MDL® ISIS/Draw[16] against output generated by the Conversion Toolkit.
The implementation will be a Daylight Toolkit, integrated with existing tools. The Toolkit approach will maximize robustness: the toolkit itself should never fail; it should only decline to convert malformed input. Robustness and efficiency will be maximized using purification and quantification tools.
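As an illustration of "never fail, only reject malformed input", the following is a minimal sketch of a strict parser for the counts line of a molfile (the fourth line of the file, whose first two 3-character fields hold the atom and bond counts and whose columns 35-39 hold the version tag in the published CTfile layout). The names `parse_counts_line`, `counts_line`, and `counts_status` are hypothetical and are not part of any Daylight Toolkit; the strictness shown (right-justified digits only) is one possible policy, not a decision made by the project.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes: malformed input is reported, never fatal. */
typedef enum {
    COUNTS_OK = 0,
    COUNTS_BAD_ARGUMENT,
    COUNTS_TOO_SHORT,
    COUNTS_BAD_NUMBER,
    COUNTS_BAD_VERSION
} counts_status;

typedef struct {
    int natoms;
    int nbonds;
    int version;   /* 2000, 3000, or 0 when the line carries no version tag */
} counts_line;

/* Parse one right-justified, blank-padded, 3-character integer field. */
static int parse_field3(const char *p, int *value)
{
    int v = 0, digits = 0, i;
    for (i = 0; i < 3; i++) {
        if (isdigit((unsigned char)p[i])) {
            v = v * 10 + (p[i] - '0');
            digits++;
        } else if (p[i] != ' ' || digits > 0) {
            return 0;   /* stray character, or a blank after a digit */
        }
    }
    if (digits == 0) return 0;   /* strict policy: a blank field is rejected */
    *value = v;
    return 1;
}

/* Fill *out only when COUNTS_OK is returned. */
counts_status parse_counts_line(const char *line, counts_line *out)
{
    size_t len;
    int natoms, nbonds, version = 0;

    if (line == NULL || out == NULL) return COUNTS_BAD_ARGUMENT;
    len = strcspn(line, "\r\n");             /* ignore any line terminator */
    if (len < 6) return COUNTS_TOO_SHORT;
    if (!parse_field3(line, &natoms) || !parse_field3(line + 3, &nbonds))
        return COUNTS_BAD_NUMBER;
    assert(natoms >= 0 && natoms <= 999 && nbonds >= 0 && nbonds <= 999);

    if (len >= 39) {                          /* version tag in columns 35-39 */
        if (strncmp(line + 34, "V2000", 5) == 0) version = 2000;
        else if (strncmp(line + 34, "V3000", 5) == 0) version = 3000;
        else return COUNTS_BAD_VERSION;
    }
    out->natoms = natoms;
    out->nbonds = nbonds;
    out->version = version;
    return COUNTS_OK;
}
```

Every outcome is a return code, so a caller can skip or log a bad record and continue with the next one. A more tolerant dialect policy (for example, accepting left-justified counts) would change only `parse_field3`.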
A compatibility assessment will be performed with other chemical drawing tools, i.e. CACTVS,[17] ChemSymphony,[18] JME,[5] Marvin,[4] and MDL® Draw Enterprise.[19]
Design
One interface concept is a sequence of string objects. Control will be described with "named properties" on a "Conversion Object". The interface will be extensible so additional or not-yet-developed formats can be supported in the future. Since no Daylight Toolkit performs direct I/O, a "Contrib" program would be offered for reading and writing data. The following is one possible interface design.
Toolkit
C Prototype
dt_Handle <= dt_alloc_conversion(dt_Integer request)
Description
Allocate an object for a specific conversion request, e.g., DX_CONV_MOL2SMILES.
C Prototype
dt_Handle <= dt_convert(dt_Handle conversion, dt_Handle sequence)
Description
Convert data based on properties. Input is a conversion object and a sequence of strings. Returns a sequence of strings.
Library in $DY_ROOT/lib
libdt_conv.a & libdt_conv.so
Applications in $DY_ROOT/bin
MOL2SMILES, MOL2SMARTS, & MOL2SMIRKS
SMILES2MOL, SMARTS2MOL, & SMIRKS2MOL
DayCart
CLOB <= function ddpackage.fsmiles2mol(data IN CLOB)
Contrib
C Prototype
dt_Handle <= du_file2seq(FILE *file, char *delimiter)
Description
Read from a file into a sequence of strings.
C Prototype
dt_Boolean <= du_seq2file(dt_Handle sequence, FILE *file)
Description
Write from a sequence of strings to a file.
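To make the delimiter-based reading and writing concrete, here is a minimal, self-contained sketch in plain C. It does not use the Daylight Toolkit: `record_list` is a hypothetical stand-in for a sequence of string objects, and `file_to_records` and `records_to_file` are hypothetical names that only mirror the roles of the two Contrib prototypes. The assumption that a record ends at a line consisting solely of the delimiter (as "$$$$" does in an SDfile) is an illustration, not the specified behavior.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical plain-C stand-in for a sequence of string objects. */
typedef struct {
    char **items;
    size_t count;
} record_list;

void free_records(record_list *list)
{
    size_t i;
    for (i = 0; i < list->count; i++) free(list->items[i]);
    free(list->items);
    list->items = NULL;
    list->count = 0;
}

static int append_record(record_list *list, const char *text, size_t len)
{
    char **grown = realloc(list->items, (list->count + 1) * sizeof *grown);
    char *copy;
    if (grown == NULL) return 0;
    list->items = grown;
    copy = malloc(len + 1);
    if (copy == NULL) return 0;
    memcpy(copy, text, len);
    copy[len] = '\0';
    list->items[list->count++] = copy;
    return 1;
}

/* Split `file` into records. A record ends at a line that consists only of
   `delimiter`; the delimiter line stays in the record. Trailing text with no
   delimiter becomes a final record. Returns 1 on success, 0 on failure. */
int file_to_records(FILE *file, const char *delimiter, record_list *out)
{
    char chunk[1024];
    char *buf = NULL;
    size_t used = 0, cap = 0, dlen;
    int at_line_start = 1, ok = 1;

    if (file == NULL || delimiter == NULL || out == NULL) return 0;
    dlen = strlen(delimiter);
    out->items = NULL;
    out->count = 0;

    while (ok && fgets(chunk, sizeof chunk, file) != NULL) {
        size_t n = strlen(chunk);
        /* Only a chunk that begins a line can be a delimiter line. */
        int is_delim = at_line_start
                       && strncmp(chunk, delimiter, dlen) == 0
                       && strcspn(chunk, "\r\n") == dlen;
        if (used + n + 1 > cap) {
            size_t new_cap = 2 * (used + n + 1);
            char *grown = realloc(buf, new_cap);
            if (grown == NULL) { ok = 0; break; }
            buf = grown;
            cap = new_cap;
        }
        memcpy(buf + used, chunk, n);
        used += n;
        assert(used <= cap);
        at_line_start = (n > 0 && chunk[n - 1] == '\n');
        if (is_delim) {
            ok = append_record(out, buf, used);
            used = 0;
        }
    }
    if (ok && used > 0) ok = append_record(out, buf, used);
    free(buf);
    if (!ok) free_records(out);
    return ok;
}

/* Write every record back out unchanged. Returns 1 on success. */
int records_to_file(const record_list *list, FILE *file)
{
    size_t i;
    if (list == NULL || file == NULL) return 0;
    for (i = 0; i < list->count; i++)
        if (fputs(list->items[i], file) == EOF) return 0;
    return 1;
}
```

Because `fgets` splits only on LF, this sketch handles LF and CR/LF files but not CR-only files.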
Issues
The extent to which we allow dialects is an issue. Do we take a hard line on violations? A soft approach to format interpretation is problematic - there is no "right" answer. A list of allowed line delimiters (LF, CR, CR/LF, etc.) is needed.
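One way to contain the line-delimiter question is to normalize input before any format parsing. The sketch below is an assumption about how that could be done, not part of the toolkit design: the hypothetical function `normalize_line_endings` rewrites a buffer in place so that LF, lone CR, and CR/LF each become a single LF.

```c
#include <assert.h>
#include <stddef.h>

/* Rewrite `text` in place so that every CR/LF pair, lone CR, or lone LF
   becomes a single LF. Returns the number of line terminators seen.
   The output is never longer than the input, so no allocation is needed. */
size_t normalize_line_endings(char *text)
{
    char *src = text;
    char *dst = text;
    size_t terminators = 0;

    assert(text != NULL);
    while (*src != '\0') {
        if (*src == '\r') {
            if (src[1] == '\n') src++;   /* CR/LF counts as one terminator */
            *dst++ = '\n';
            terminators++;
        } else {
            if (*src == '\n') terminators++;
            *dst++ = *src;
        }
        src++;
    }
    *dst = '\0';
    return terminators;
}
```

Any delimiter outside the three handled here (for example LF followed by CR) would be read as two line ends, which is exactly the kind of dialect decision that needs an explicit policy.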
Status
An Alpha version with basic functionality has been completed. A Conversion Toolkit was built; it depends on the SMILES, SMARTS, and DEPICT Toolkits. An initial interface has been implemented and conversion between MDL and SMILES has been done. A non-functional interface between MDL and SMARTS and SMIRKS is in place. The estimated timeline for this project is as follows:
Table 1: Conversion Toolkit Project Timeline
| December '03 | identify goals and requirements, design approaches |
| January '04 | begin development |
| February '04 | MUG '04, Alpha version, basic functionality |
| Summer '04 | Beta version, required features |
| September '04 | EuroMUG '04, compatibility assessment, extended progress |
| sometime in '04 | Software Release v4.91, stable interface |
| sometime thereafter | Software Release v4.92, extended capabilities |
Data from the CTfile Formats document is being used as test input. Send us your "problematic" data so we can ensure we cover known issues that affect you. Below, the new toolkit successfully converts data from molfile format as reported in the CTfile Formats document. Support for extended molfile format is planned.
% foreach CTFILE (ctfile* )
foreach? echo $CTFILE `cat $CTFILE | mol2smiles`
foreach? end
ctfile009.mol CC([NH3+])C(=O)[O-]
ctfile018.mol *CCC(Cl)C*
ctfile022.mol CN.c1ccccc1
ctfile033.mol CC([NH3+])C(=O)[O-]
ctfile036.rgf *c1cccc(*)c1*
ctfile040.sdf OC(=O)C1CCCCC1C(=O)O
ctfile044.rxn CC(=O)Cl
ctfile049.rdf CC(=O)Cl
ctfile057.xdf
ctfile077.mol
ctfile085.mol
ctfile092.mol
ctfile095.mol
References
- [1] Dalby, A. et al. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci., 32 (1992), 244-255.
- [2] CTfile Formats (October 2003), MDL Website. http://www.mdl.com/downloads/public/ctfile/ctfile.jsp.
- [3] CambridgeSoft Website. http://www.cambridgesoft.com/.
- [4] ChemAxon Website. http://www.chemaxon.com/marvin/.
- [5] MolInspiration Website. http://www.molinspiration.com/jme/.
- [6] Chemical Markup Language Website. http://www.xml-cml.org/.
- [7] Accelrys Website. http://www.msg.ucsf.edu/local/programs/insightII/doc/life/insight2000.1/formats980/Files980TOC.doc.html.
- [8] Cambridge Structural Database Website. http://www.ccdc.cam.ac.uk/ and http://www.lrz-muenchen.de/services/software/chemie/unichem_doku/5500/5500_245.html#HEADING244.
- [9] Trinity University Website. http://hackberry.chem.trinity.edu/IJC/Text/xmolxyz.html.
- [10] UniChem User's Guide. http://www.lrz-muenchen.de/services/software/chemie/unichem_doku/5500/5500_248.html#HEADING247.
- [11] MDL Website. http://www.mdl.com/.
- [12] Atomic and Molecular Physical Data Website. http://www.jcamp.org/protocols.html.
- [13] Protein Data Bank Website. http://www.rcsb.org/pdb/.
- [14] Tripos Website. http://www.tripos.com/custResources/mol2Files/.
- [15] Thomson Dialog Website. http://library.dialog.com/bluesheets/html/bl0390.html.
- [16] MDL Website. http://www.mdl.com/products/framework/isis_draw/index.jsp.
- [17] University of Erlangen Website. http://www2.ccc.uni-erlangen.de/software/cactvs/.
- [18] Cherwell Scientific Website. http://www.chm.bris.ac.uk/chemsymphony/start_here.html.
- [19] MDL Website. http://www.mdl.com/products/framework/draw_enterprise/index.jsp.
Daylight Chemical Information Systems, Inc.
info@daylight.com