Michael A. Kappler
Daylight CIS, Inc.
Dayconvert is a new tool for conversion of chemical information to and from Daylight native form. Dayconvert is available as an application or packaged as a DayCart operator. Our aim in the initial release is conversion of chemical information stored in the most widely used file formats in chemistry. Example conversions to and from SMILES, SMARTS, SMIRKS and ThorDataTrees (TDTs) will be shown.
The following structure is L-Alanine (13C) from "CTfile Formats", page 9:
$ cat ctfile009.mol 6 5 0 0 1 0 3 V2000 -0.6622 0.5342 0.0000 C 0 0 2 0 0 0 0.6622 -0.3000 0.0000 C 0 0 0 0 0 0 -0.7207 2.0817 0.0000 C 1 0 0 0 0 0 -1.8622 -0.3695 0.0000 N 0 3 0 0 0 0 0.6220 -1.8037 0.0000 O 0 0 0 0 0 0 1.9464 0.4244 0.0000 O 0 5 0 0 0 0 1 2 1 0 0 0 1 3 1 1 0 0 1 4 1 0 0 0 2 5 2 0 0 0 2 6 1 0 0 0 M CHG 2 4 1 6 -1 M ISO 1 3 13 M END
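The charge and isotope information in the molfile above lives in its `M  CHG` and `M  ISO` property lines, which is what the isomeric option later turns into `[NH3+]`, `[O-]` and `[13CH3]`. The following is a minimal sketch, not part of Dayconvert, that lists those entries; the function name `mol_properties` is hypothetical. It splits fields on whitespace for brevity, whereas the CTfile format is fixed-column.

```shell
#!/bin/sh
# Hypothetical helper: list the (atom, value) pairs declared on
# "M  CHG" and "M  ISO" property lines of a V2000 molfile read from stdin.
# Whitespace splitting is an assumption made for brevity; the real
# format is fixed-column.
mol_properties() {
    awk '
    $1 == "M" && ($2 == "CHG" || $2 == "ISO") {
        label = ($2 == "CHG") ? "charge" : "isotope"
        # $3 is the number of entries; (atom index, value) pairs follow.
        for (i = 4; i + 1 <= NF; i += 2)
            printf "atom %s %s %s\n", $i, label, $(i + 1)
    }'
}

# Property block of the L-Alanine (13C) molfile shown above.
mol_properties <<'EOF'
M  CHG  2   4   1   6  -1
M  ISO  1   3  13
M  END
EOF
```

For this input the sketch reports atom 4 with charge 1, atom 6 with charge -1, and atom 3 with isotope 13.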
Method A: Application
The following converts the L-Alanine (13C) structure from molfile format to SMILES, given the above molfile in a file named ctfile009.mol:
$ cat ctfile009.mol | dayconvert -ifmt mol -ofmt smiles CC([NH3+])C(=O)[O-]
To include isomeric information (isotopes and stereochemistry), use the isomeric option:
$ cat ctfile009.mol | dayconvert -ifmt mol -ofmt smiles -isomeric [13CH3][C@H]([NH3+])C(=O)[O-]
Method B: DayCart
The following converts the L-Alanine (13C) structure from molfile format to SMILES, given the above molfile in an Oracle table named ddtable and a column named molfile:
SQL> select dayconvert(molfile,'mol','smiles',0) from ddtable; DAYCONVERT(MOLFILE,'MOL','SMILES',0) ----------------------------------- CC([NH3+])C(=O)[O-]
To include isomeric information (isotopes and stereochemistry), use the isomeric option:
SQL> select dayconvert(molfile,'mol','smiles',1) from ddtable; DAYCONVERT(MOLFILE,'MOL','SMILES',1) ----------------------------------- [13CH3][C@H]([NH3+])C(=O)[O-]
The following query represents aniline, phenol, or toluene:
$ cat phenylNCO.mol 7 7 1 0 0 0 2 V2000 -4.1223 1.6500 0.0000 C 0 0 0 0 0 0 -4.1234 0.8227 0.0000 C 0 0 0 0 0 0 -3.4086 0.4098 0.0000 C 0 0 0 0 0 0 -2.6922 0.8231 0.0000 C 0 0 0 0 0 0 -2.6950 1.6537 0.0000 C 0 0 0 0 0 0 -3.4104 2.0628 0.0000 C 0 0 0 0 0 0 -3.4167 2.8917 0.0000 L 0 0 0 0 0 0 3 4 2 0 0 0 0 4 5 1 0 0 0 0 2 3 1 0 0 0 0 5 6 2 0 0 0 0 6 1 1 0 0 0 0 1 2 2 0 0 0 0 6 7 1 0 0 0 0 7 F 3 7 6 8 M ALS 7 3 F N C O M END
Method A: Application
The following converts the substituted-benzene query from molfile format to SMARTS, given the above molfile in a file named phenylNCO.mol:
$ cat phenylNCO.mol | dayconvert -ifmt mol -ofmt smarts [#7,#6,#8]c1ccccc1
Method B: DayCart
The following converts the substituted-benzene query from molfile format to SMARTS, given the above molfile in an Oracle table named ddtable and a column named molfile:
SQL> select dayconvert(molfile,'mol','smarts',1) from ddtable; DAYCONVERT(MOLFILE,'MOL','SMARTS',1) ----------------------------------- [#7,#6,#8]c1ccccc1
The following reaction is acylation of benzene from "CTfile Formats", page 44:
$ cat ctfile044.rxn $RXN REACCS81 1017911041 7439 2 1 $MOL REACCS8110179110412D 1 0.00380 0.00000 315 4 3 0 0 0 0 1 V2000 0.3929 -0.2577 0.0000 C 0 0 0 0 0 0 0 0 0 1 0 0 -1.0590 -0.7710 0.0000 C 0 0 0 0 0 0 0 0 0 2 0 0 0.3929 1.2823 0.0000 O 0 0 0 0 0 0 0 0 0 3 0 0 1.6503 -1.1468 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 2 1 3 2 0 0 0 2 1 4 1 0 0 0 4 M END $MOL REACCS8110179110412D 1 0.00371 0.00000 8 6 6 0 0 0 0 1 V2000 1.3335 -0.7689 0.0000 C 0 0 0 0 0 0 0 0 0 5 0 0 1.3335 0.7689 0.0000 C 0 0 0 0 0 0 0 0 0 6 0 0 0.0000 -1.5415 0.0000 C 0 0 0 0 0 0 0 0 0 7 0 0 0.0000 1.5415 0.0000 C 0 0 0 0 0 0 0 0 0 8 0 0 -1.3335 -0.7689 0.0000 C 0 0 0 0 0 0 0 0 0 9 0 0 -1.3335 0.7689 0.0000 C 0 0 0 0 0 0 0 0 0 10 0 0 1 2 1 0 0 0 2 1 3 2 0 0 0 2 2 4 2 0 0 0 2 3 5 1 0 0 0 2 4 6 1 0 0 0 2 5 6 2 0 0 0 2 M END $MOL REACCS8110179110412D 1 0.00374 0.00000 255 9 9 0 0 0 0 1 V2000 -0.5311 -0.1384 0.0000 C 0 0 0 0 0 0 0 0 0 5 0 0 -1.8626 0.6321 0.0000 C 0 0 0 0 0 0 0 0 0 6 0 0 -0.5311 -1.6943 0.0000 C 0 0 0 0 0 0 0 0 0 7 0 0 0.8191 0.6284 0.0000 C 0 0 0 0 0 0 0 0 0 1 0 0 -3.2278 -0.1346 0.0000 C 0 0 0 0 0 0 0 0 0 8 0 0 -1.8813 -2.4723 0.0000 C 0 0 0 0 0 0 0 0 0 9 0 0 2.1282 -0.1085 0.0000 C 0 0 0 0 0 0 0 0 0 2 0 0 0.8191 2.2292 0.0000 O 0 0 0 0 0 0 0 0 0 3 0 0 -3.2278 -1.6831 0.0000 C 0 0 0 0 0 0 0 0 0 10 0 0 1 2 1 0 0 0 2 1 3 2 0 0 0 2 1 4 1 0 0 0 2 2 5 2 0 0 0 2 3 6 1 0 0 0 2 4 7 1 0 0 0 2 4 8 2 0 0 0 2 5 9 1 0 0 0 2 6 9 2 0 0 0 2 M END
Method A: Application
The following converts the acylation reaction from rxnfile format to Reaction SMILES, given the above rxnfile in a file named ctfile044.rxn:
$ cat ctfile044.rxn | dayconvert -ifmt rxn -ofmt smiles -isomeric CC(=O)Cl.c1ccccc1>>CC(=O)c1ccccc1
Method B: DayCart
The following converts the acylation reaction from rxnfile format to Reaction SMILES, given the above rxnfile in an Oracle table named ddtable and a column named rxnfile:
SQL> select dayconvert(rxnfile,'rxn','smiles',1) from ddtable; DAYCONVERT(RXNFILE,'RXN','SMILES',1) ----------------------------------- CC(=O)Cl.c1ccccc1>>CC(=O)c1ccccc1
[C:1][C:2](=[O:3])[F,Cl,Br,I:4].[c:5]1[c:6][c:7][c:8][c:9][c:10]1>>[C:1][C:2](=[O:3])[c:5]1[c:6][c:7][c:8][c:9][c:10]1.[F,Cl,Br,I:4]
The following is a file list showing an SDfile from Derwent (World Drug Index) and an RDfile from Sunset Molecular, LLC (WomBat):
-rw-r--r-- 1 mick daystaff 35276963 Apr 2 2004 wdi041001.sdf -rw-r--r-- 1 mick daystaff 377205105 Apr 16 2004 wombat041.rdf
Method A: Application
The following converts the WDI database from SDfile format to SMILES:
$ cat wdi041001.sdf | dayconvert -ifmt sdf -ofmt smiles -isomeric > wdi041001.smi $ cat wdi041001.smi C/C=C\1/CN2CCc3c([nH]c4ccccc34)[C@@H]2C[C@@H]1Cc5nccc6c7cc(O)ccc7[nH]c56 ON1C(=O)CCC(N2C(=O)c3ccc(O)cc3C2=O)C1=O COc1cc(ccc1O)C2Oc3c(OC)cc(cc3O[C@H]2CO)c4cc(=O)c5c(O)cc(O)cc5o4 ... Nc1nc(N)c(nc1Cl)C(=O)NC(=N)NCc2ccccc2
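After a bulk conversion like the one above, a quick sanity check is to compare the number of input records with the number of output lines. The sketch below relies on two facts: each SDfile record ends with a `$$$$` line, and a `.smi` file holds one SMILES per line. That Dayconvert writes exactly one SMILES per input record is an assumption here, and the function names are hypothetical.

```shell
#!/bin/sh
# Count records in an SDfile: every record ends with a "$$$$" line.
sd_record_count() {
    grep -c '^\$\$\$\$' "$1"
}

# Count non-empty lines (one SMILES per line) in a .smi file.
smi_count() {
    grep -c . "$1"
}

# Hypothetical check: succeed when both counts agree.
# Assumes one output SMILES per input record.
check_conversion() {
    [ "$(sd_record_count "$1")" -eq "$(smi_count "$2")" ]
}

# Demonstration on skeleton files (not chemically meaningful records).
tmp=$(mktemp -d)
printf 'rec1\nM  END\n$$$$\nrec2\nM  END\n$$$$\n' > "$tmp/demo.sdf"
printf 'CCO\nCCN\n' > "$tmp/demo.smi"
check_conversion "$tmp/demo.sdf" "$tmp/demo.smi" && echo "record counts agree"
rm -rf "$tmp"
```

On the real data this would be invoked as `check_conversion wdi041001.sdf wdi041001.smi`; a mismatch suggests that some records were skipped or rejected.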
The following converts the WomBat database from RDfile format to SMILES:
$ cat wombat041.rdf | dayconvert -ifmt rdf -ofmt smiles -isomeric > wombat041.smi $ cat wombat041.smi CC(C)c1c(O)c(O)c(C=O)c2c(O)c(c(C)cc12)c3c(C)cc4c(C(C)C)c(O)c(O)c(C=O)c4c3O CC(C)c1c(O)c(O)c(C#N)c2c(OC(=O)C)c(c(C)cc12)c3c(C)cc4c(C(C)C)c(O)c(O)c(C#N)c4c3OC(=O)C CC(C)c1c(O)c(O)c2C(=N)Oc3c(c(C)cc1c32)c4c5OC(=N)c6c(O)c(O)c(C(C)C)c(cc4C)c56 ... CC(=CCc1c(O)c(Cl)c(C)c(C=O)c1O)CCC=C(C)C2=CC(=O)C(C)(C)O2
Method B: DayCart
A SQL control file named import_sdf.ctl is provided to import an SDfile into Oracle:
$ sqlldr userid='c$dcischem/secret' control=import_sdf.ctl SQL*Loader: Release 10.1.0.2.0 - Production on Tue Mar 8 17:00:15 2005 Copyright (c) 1982, 2004, Oracle. All rights reserved. Commit point reached - logical record count 100 Commit point reached - logical record count 200 Commit point reached - logical record count 300 ... Commit point reached - logical record count 10000
The following converts the WDI database from SDfile format to SMILES, given the above SDfile in an Oracle table named ddtable and a column named sdfile:
SQL> select dayconvert(sdfile,'sdf','smiles',1) from ddtable; DAYCONVERT(SDFILE,'SDF','SMILES',1) ---------------------------------- C/C=C\1/CN2CCc3c([nH]c4ccccc34)[C@@H]2C[C@@H]1Cc5nccc6c7cc(O)ccc7[nH]c56 ON1C(=O)CCC(N2C(=O)c3ccc(O)cc3C2=O)C1=O COc1cc(ccc1O)C2Oc3c(OC)cc(cc3O[C@H]2CO)c4cc(=O)c5c(O)cc(O)cc5o4 ... Nc1nc(N)c(nc1Cl)C(=O)NC(=N)NCc2ccccc2
A SQL control file named import_rdf.ctl is provided to import an RDfile into Oracle:
$ sqlldr userid='c$dcischem/secret' control=import_rdf.ctl SQL*Loader: Release 10.1.0.2.0 - Production on Tue Mar 8 17:01:15 2005 Copyright (c) 1982, 2004, Oracle. All rights reserved. Commit point reached - logical record count 100 Commit point reached - logical record count 200 Commit point reached - logical record count 300 ... Commit point reached - logical record count 76165
The following converts the WomBat database from RDfile format to SMILES, given the above RDfile in an Oracle table named ddtable and a column named rdfile:
SQL> select dayconvert(rdfile,'rdf','smiles',1) from ddtable; DAYCONVERT(RDFILE,'RDF','SMILES',1) ---------------------------------- CC(C)c1c(O)c(O)c(C=O)c2c(O)c(c(C)cc12)c3c(C)cc4c(C(C)C)c(O)c(O)c(C=O)c4c3O CC(C)c1c(O)c(O)c(C#N)c2c(OC(=O)C)c(c(C)cc12)c3c(C)cc4c(C(C)C)c(O)c(O)c(C#N)c4c3OC(=O)C CC(C)c1c(O)c(O)c2C(=N)Oc3c(c(C)cc1c32)c4c5OC(=N)c6c(O)c(O)c(C(C)C)c(cc4C)c56 ... CC(=CCc1c(O)c(Cl)c(C)c(C=O)c1O)CCC=C(C)C2=CC(=O)C(C)(C)O2
Method A: Application
The following converts the WDI and WomBat databases from SMILES to SDfile and RDfile formats, respectively:
$ dayconvert wdi041001.smi -ifmt smiles -ofmt sdf -isomeric > wdi041001.sdf $ dayconvert wombat041.smi -ifmt smiles -ofmt rdf -isomeric > wombat041.rdf
Method B: DayCart
A PL/SQL procedure named export_file is provided to export a file from Oracle. The following converts the WDI and WomBat databases from SMILES to SDfile and RDfile formats, respectively, given the SMILES in an Oracle table named ddtable and columns named sdfile and rdfile:
SQL> exec ddpackage.export_file('smi','sdf','wdi041001.sdf') SQL> exec ddpackage.export_file('smi','rdf','wombat041.rdf') PL/SQL procedure successfully completed.
A PL/SQL procedure named oracle2rdf is provided to export an RDfile from Oracle:
SQL> exec ddpackage.oracle2rdf('wombat041.rdf') PL/SQL procedure successfully completed.
Method A: Application
The following converts the L-Alanine (13C) structure from SMILES to molfile:
$ echo '[13CH3][C@H]([NH3+])C(=O)[O-]' | dayconvert -isomeric -ifmt smiles -ofmt mol Daylight03090518412D 7 6 0 1 0 999 V2000 -0.1000 1.5300 0.0000 C 1 0 0 0 0 0 0 0 0 -0.3600 2.5100 0.0000 C 0 0 1 0 0 0 0 0 0 -0.6300 3.4900 0.0000 N 0 3 0 0 0 0 0 0 0 0.6200 2.7700 0.0000 C 0 0 0 0 0 0 0 0 0 0.8800 3.7500 0.0000 O 0 0 0 0 0 0 0 0 0 1.3400 2.0500 0.0000 O 0 5 0 0 0 0 0 0 0 -1.3400 2.2500 0.0000 H 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 2 3 1 6 0 0 2 4 1 0 0 0 4 5 2 0 0 0 4 6 1 0 0 0 2 7 1 1 0 0 M CHG 2 3 1 6 -1 M ISO 1 1 13 M END
Method B: DayCart
SQL> select dayconvert('[13CH3][C@H]([NH3+])C(=O)[O-]','smiles','mol',1) from dual; DAYCONVERT('[13CH3][C@H]([NH3+])C(=O)[O-]','SMILES','MOL',1) ----------------------------------------------------------- Daylight03090518412D 7 6 0 1 0 999 V2000 -0.1000 1.5300 0.0000 C 1 0 0 0 0 0 0 0 0 -0.3600 2.5100 0.0000 C 0 0 1 0 0 0 0 0 0 -0.6300 3.4900 0.0000 N 0 3 0 0 0 0 0 0 0 0.6200 2.7700 0.0000 C 0 0 0 0 0 0 0 0 0 0.8800 3.7500 0.0000 O 0 0 0 0 0 0 0 0 0 1.3400 2.0500 0.0000 O 0 5 0 0 0 0 0 0 0 -1.3400 2.2500 0.0000 H 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 2 3 1 6 0 0 2 4 1 0 0 0 4 5 2 0 0 0 4 6 1 0 0 0 2 7 1 1 0 0 M CHG 2 3 1 6 -1 M ISO 1 1 13 M END
Note: Beginning in v4.91, stereochemical hydrogens are represented as atoms on molecules with the dt_smilin_addh entry point in the SMILES Toolkit.
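The effect of the explicit stereochemical hydrogen is visible in the counts line: the original molfile declares 6 atoms and 5 bonds, while the molfile generated from SMILES declares 7 atoms and 6 bonds. The following sketch reads those two numbers from line 4 of a V2000 molfile, where columns 1-3 hold the atom count and columns 4-6 the bond count. The function name `mol_counts` is hypothetical, and the header lines in the sample are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print atom and bond counts from a V2000 molfile.
# Line 4 is the counts line; columns 1-3 = atoms, columns 4-6 = bonds.
mol_counts() {
    awk 'NR == 4 {
        printf "%d atoms, %d bonds\n", substr($0, 1, 3) + 0, substr($0, 4, 3) + 0
        exit
    }' "$@"
}

# Header block and counts line modeled on the generated molfile above
# (the three header lines are placeholders).
mol_counts <<'EOF'

  Daylight03090518412D

  7  6  0  0  1  0            999 V2000
EOF
```

Using fixed columns rather than whitespace splitting keeps the sketch correct for large structures, where the two counts can run together without a separating blank.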
The ThorDataTree (TDT) is a Daylight format for structure and data.
Method A: Application
The following converts the L-Alanine (13C) structure from molfile format to TDT format, given the above molfile in a file named ctfile009.mol:
$ cat ctfile009.mol | dayconvert -ifmt mol -ofmt tdt -isomeric > ctfile009.tdt $ cat ctfile009.tdt $SMI<CC([NH3+])C(=O)[O-]> ISM<[13CH3][C@H]([NH3+])C(=O)[O-]> 2D<-0.720700,2.081700,-0.662200,0.534200,-1.862200,-0.369500,0.662200,-0.300000,0.622000,-1.803700,1.946400,0.424400,,,,,,,,,,,,,,> BST<-1,,,,,,,,,,,> VIS<,,,,,,0,0,0,0,0,0,0> ATI<3,1,4,2,5,6,,,,,,,> BDI<2,3,1,4,5,,,,,,,> BDR<,,,,,,,,,,,> |
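Each item in the TDT output above has the form `TAG<data>`, so individual datatypes can be pulled out with standard text tools. The following is a minimal sketch, not a Daylight utility: the function name `tdt_get` is hypothetical, and it assumes the data contain no `>` character, so it does not handle quoted data or Reaction SMILES.

```shell
#!/bin/sh
# Hypothetical helper: print the data of every TAG<data> item whose tag
# matches $1 (for example '$SMI', 'ISM' or '2D'), reading a TDT on stdin.
# Assumption: the data never contain '>'.
tdt_get() {
    awk -v tag="$1" '
    {
        rest = $0
        while (match(rest, /[$A-Za-z0-9_]+<[^>]*>/)) {
            item = substr(rest, RSTART, RLENGTH)
            lt = index(item, "<")
            if (substr(item, 1, lt - 1) == tag)
                print substr(item, lt + 1, length(item) - lt - 1)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

# Abbreviated copy of the TDT shown above.
tdt='$SMI<CC([NH3+])C(=O)[O-]> ISM<[13CH3][C@H]([NH3+])C(=O)[O-]> BST<-1,,,,,,,,,,,> |'
printf '%s\n' "$tdt" | tdt_get ISM
```

Run against the real file, `tdt_get ISM < ctfile009.tdt` would print the isomeric SMILES stored in the ISM datatype.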
Method B: DayCart
The following converts the L-Alanine (13C) structure from molfile format to TDT format, given the above molfile in an Oracle table named ddtable and a column named molfile:
SQL> select dayconvert(molfile,'mol','tdt',1) from ddtable; DAYCONVERT(MOLFILE,'MOL','TDT',1) -------------------------------- $SMI<CC([NH3+])C(=O)[O-]> ISM<[13CH3][C@H]([NH3+])C(=O)[O-]> 2D<-0.720700,2.081700,-0.662200,0.534200,-1.862200,-0.369500,0.662200,-0.300000,0.622000,-1.803700,1.946400,0.424400,,,,,,,,,,,,,,> BST<-1,,,,,,,,,,,> VIS<,,,,,,0,0,0,0,0,0,0> ATI<3,1,4,2,5,6,,,,,,,> BDI<2,3,1,4,5,,,,,,,> BDR<,,,,,,,,,,,> |
The 2D-coordinates (2D), bond styles (BST) and visibility (VIS) datatypes are used by the HTTP-based SMI2GIF application to preserve the original depiction.
Table 1: Newer Depictions are Better
v4.81 | v4.82 | v4.83 | v4.91 |
---|---|---|---|
what the computer "thinks" | use of 2D-coordinates | aesthetics improvements | total control of layout |
Note: The DEPICT Toolkit entry points dt_setcoord, dt_setbondstyle and dt_setvisible are used to observe 2D, BST, and VIS datatypes.
In addition to the depiction datatypes in the TDT, the atom index (ATI), bond index (BDI), and bond reversal (BDR) datatypes are used to reproduce the original molfile.
Method A: Application
The following converts the L-Alanine (13C) structure from TDT format to molfile format, reproducing the original input:
$ cat ctfile009.tdt | dayconvert -ifmt tdt -ofmt mol -isomeric Daylight03100507422D 6 5 0 1 0 999 V2000 -0.6622 0.5342 0.0000 C 0 0 2 0 0 0 0 0 0 0.6622 -0.3000 0.0000 C 0 0 0 0 0 0 0 0 0 -0.7207 2.0817 0.0000 C 1 0 0 0 0 0 0 0 0 -1.8622 -0.3695 0.0000 N 0 3 0 0 0 0 0 0 0 0.6220 -1.8037 0.0000 O 0 0 0 0 0 0 0 0 0 1.9464 0.4244 0.0000 O 0 5 0 0 0 0 0 0 0 1 2 1 0 0 0 1 3 1 1 0 0 1 4 1 0 0 0 2 5 2 0 0 0 2 6 1 0 0 0 M CHG 2 4 1 6 -1 M ISO 1 3 13 M END
Method B: DayCart
The following converts the L-Alanine (13C) structure from TDT format to molfile format, reproducing the original input, given the TDT in an Oracle table named ddtable and a column named tdtfile:
SQL> select dayconvert(tdtfile,'tdt','mol',1) from ddtable; DAYCONVERT(TDTFILE,'TDT','MOL',1) -------------------------------- Daylight03100507422D 6 5 0 1 0 999 V2000 -0.6622 0.5342 0.0000 C 0 0 2 0 0 0 0 0 0 0.6622 -0.3000 0.0000 C 0 0 0 0 0 0 0 0 0 -0.7207 2.0817 0.0000 C 1 0 0 0 0 0 0 0 0 -1.8622 -0.3695 0.0000 N 0 3 0 0 0 0 0 0 0 0.6220 -1.8037 0.0000 O 0 0 0 0 0 0 0 0 0 1.9464 0.4244 0.0000 O 0 5 0 0 0 0 0 0 0 1 2 1 0 0 0 1 3 1 1 0 0 1 4 1 0 0 0 2 5 2 0 0 0 2 6 1 0 0 0 M CHG 2 4 1 6 -1 M ISO 1 3 13 M END
SDfile and RDfile formats carry additional chemical information: structural properties and data associated with each structure or reaction. The solution for preserving structures or reactions together with their data is a TDT. The following structures are L-trans-1,2 cyclohexane-dicarboxylic acid and 2-methyl furan from "CTfile Formats" (October 2003), page 40 (SDfile format), along with their conversion to a TDT. The same solution is used for RDfile formats.
The Oracle types used in the DayCart operators may be either a variable character array (VARCHAR2) or a character large object (CLOB).
SQL> describe ddtable
Name      Null?    Type
--------- -------- --------------
ID        NOT NULL NUMBER
MOLFILE            VARCHAR2(4000)
RXNFILE            VARCHAR2(4000)
SDFILE             CLOB
RDFILE             CLOB
TDTFILE            CLOB
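When deciding which column type suits a given file, its size is the deciding factor: the VARCHAR2 columns in the ddtable description above are 4000 wide, and anything larger needs a CLOB. The sketch below makes that choice from the byte size of a file. The function name `column_type_for` is hypothetical, and it assumes the VARCHAR2 width is measured in bytes, which depends on the database's length semantics.

```shell
#!/bin/sh
# Column width taken from the ddtable description above.
VARCHAR2_LIMIT=4000

# Hypothetical helper: suggest an Oracle column type for a file.
# Assumes VARCHAR2(4000) is measured in bytes.
column_type_for() {
    size=$(( $(wc -c < "$1") ))
    if [ "$size" -le "$VARCHAR2_LIMIT" ]; then
        echo VARCHAR2
    else
        echo CLOB
    fi
}

# Demonstration with a small and a large temporary file.
tmp=$(mktemp -d)
printf 'M  END\n' > "$tmp/small.mol"
awk 'BEGIN { for (i = 0; i < 4001; i++) printf "x" }' > "$tmp/large.sdf"
column_type_for "$tmp/small.mol"
column_type_for "$tmp/large.sdf"
rm -rf "$tmp"
```

A single molfile or rxnfile usually fits in a VARCHAR2 column, while a whole SDfile or RDfile generally does not, which matches the column types in the table description.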
The initial release of the Conversion Toolkit Application Programmer's Interface consists of four entry points and can be used to input and output strings and objects, such as molecules, reactions, patterns, transformations, depictions, and conformations.
A generic input type, such as 'mdl' or 'daylight', may be specified, which invokes a perception algorithm that attempts to detect the specific format type. Additionally, each input type is associated with an output type. So, input may be specified as "Generic MDL" and output may be specified as "Generic Daylight". Then, for example, if the input is perceived to be an SDfile it will be read as such, and output will be either SMILES or SMARTS depending on whether the input contains query features. An informational program, FILEFORMAT, is provided to show perceived types of CTfiles.
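To give a feel for what format perception involves, the following sketch distinguishes the four CTfile types by their characteristic markers: an RDfile begins with `$RDFILE`, an rxnfile begins with `$RXN`, SDfile records end with `$$$$`, and a molfile ends with `M  END`. This is an illustration only, not the algorithm used by Dayconvert or FILEFORMAT, and the function name `guess_ctfile_type` is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of CTfile type perception based on marker lines.
# This is NOT the Dayconvert/FILEFORMAT algorithm.
guess_ctfile_type() {
    first=$(head -n 1 "$1")
    case $first in
        '$RDFILE'*) echo rdf ;;      # RDfile header line
        '$RXN'*)    echo rxn ;;      # rxnfile header line
        *)
            if grep -q '^\$\$\$\$' "$1"; then
                echo sdf             # record delimiter found
            elif grep -q '^M  END' "$1"; then
                echo mol             # single connection table
            else
                echo unknown
            fi
            ;;
    esac
}

# Demonstration on skeleton files (not complete CTfiles).
tmp=$(mktemp -d)
printf '$RXN\n\n\n\n  2  1\n' > "$tmp/a"
printf 'title\n\n\nM  END\n$$$$\n' > "$tmp/b"
for f in "$tmp/a" "$tmp/b"; do
    guess_ctfile_type "$f"
done
rm -rf "$tmp"
```

The first line is checked before the body because rxnfiles and RDfiles also contain `M  END` lines from their embedded molfiles. The returned names follow the values accepted by the -ifmt option in the examples above.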
Another informational program, MDL2INFO, is provided to show the datatypes of an SDfile or an RDfile and can be used to select the datatype to be used for naming a structure for storage in a TDT. The default value is the first line of the header in a molfile, rxnfile, SDfile, or RDfile. The following is information for the WomBat 2004.1 database and shows that several of the top datatypes are unique and present for all records, and are therefore reasonable selections for naming the structure sample to be stored in the TDT.
For complete information on the Conversion Toolkit, Dayconvert Application and DayCart operators, see the Conversion Toolkit Programmer's Guide.