Conversion of MDL Query Formats to SMARTS and SMIRKS

John Barnard and Annette von Scholley-Pfab

Barnard Chemical Information Ltd., Sheffield UK

This presentation discusses the problems of interconverting not-entirely-compatible structure query representations. Specifically, the conversion of MDL Molfiles and RGfiles to SMARTS strings is described, and issues such as implicit/explicit hydrogens, ring-embedment, aromaticity, link atoms and R-groups, and the compromises needed in their conversion, are covered. In addition, the conversion of MDL RxnFiles to Reaction SMARTS and SMIRKS strings is described, and algorithms for the automatic assignment of atom maps are outlined.


MDL's ISIS/Draw program contains various features for expressing substructure and reaction search queries. Query structures can be exported as a Molfile (simple query), RGfile (R-group query) or RxnFile (reaction query).

Descriptions of these file formats are published on the web, and other vendors have software products which can write them.


Daylight's SMARTS language includes similar features (SMARTS atomic and bond primitives) for similar purposes.

Automatic conversion of the MDL file formats to SMARTS would allow the convenience and familiarity of MDL's (and other vendors') drawing programs to be combined with the speed of Daylight's searching.

However, subtle differences in the atom and bond properties that can be expressed in each language cause some problems for interconversion programs.

MOLSMART program:


Hydrogen Counts and Further Substitution

In MDL files you can specify:

In SMARTS you can specify:

At present the MOLSMART program makes the following conversions:

This is appropriate where explicit hydrogens are used for "good" reasons:

Because some people use explicit hydrogens in ISIS/Draw to prevent further substitution, it might be better to convert such explicit hydrogens in the MDL file to the SMARTS "H" primitive. Comments on this would be welcome.


Ring and Ring Bond Counts

It is often desirable to specify whether or not a particular substructure may occur as part of a ring, or a particular ring as part of a larger ring system. If the MDL file specifies "no ring bonds", this can be exactly converted to the SMARTS "R0".

If the MDL file specifies an exact number of ring bonds, the MOLSMART program converts this to a recursive SMARTS.

e.g. for a carbon atom specified to have a ring bond count of 3:
[ C; $ (* (@*) (@*) @* ) ]
This is, however, not fully accurate: it requires only a minimum of 3 ring bonds, whereas the MDL file specifies exactly 3. Any suggestions on how to provide a more faithful translation would be welcome. For 4 ring bonds, the translation is accurate.
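The ring bond count translation described above can be sketched in a few lines of Python. This is only an illustration of the string construction, not MOLSMART's own code; the function name ring_bond_count_smarts is hypothetical, and the readability spaces used in the example above are omitted from the generated string.

```python
def ring_bond_count_smarts(symbol, ring_bonds):
    """Return a bracketed SMARTS atom requiring `ring_bonds` ring bonds.

    Hypothetical helper: for counts above zero the result expresses a
    MINIMUM number of ring bonds, as noted in the text.
    """
    if ring_bonds < 0:
        raise ValueError("ring bond count cannot be negative")
    if ring_bonds == 0:
        # "No ring bonds" has an exact SMARTS equivalent.
        return "[%s;R0]" % symbol
    # One "@*" neighbour per ring bond; all but the last are branches.
    branches = "(@*)" * (ring_bonds - 1)
    return "[%s;$(*%s@*)]" % (symbol, branches)


print(ring_bond_count_smarts("C", 0))  # [C;R0]
print(ring_bond_count_smarts("C", 3))  # [C;$(*(@*)(@*)@*)]
```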

Aromaticity

During conversion, the program has to deduce from the bonding pattern whether or not an atom is aromatic.

Aromatic Atoms:

Aliphatic Atoms:

Atoms which do not fall into any of the above classes could be either aromatic or aliphatic. They are shown in SMARTS expressions of the form "#nn", which effectively encompasses both upper and lower case forms, e.g. "#6" for aromatic or aliphatic carbon.
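As a minimal sketch of the three output forms, the following Python function picks a SMARTS atom once the aromaticity of an MDL atom has been classified. The classification step itself is not shown; the function name, the three state labels and the small element table are assumptions made for illustration only.

```python
# Illustrative subset of elements; not a complete table.
ATOMIC_NUMBERS = {"C": 6, "N": 7, "O": 8, "S": 16}


def atom_smarts(symbol, aromaticity):
    """Return a bracketed SMARTS atom for a classified MDL atom.

    `aromaticity` is one of "aromatic", "aliphatic" or "unknown"
    (hypothetical labels for the three cases described in the text).
    """
    if symbol not in ATOMIC_NUMBERS:
        raise ValueError("element not in the illustrative table: " + symbol)
    if aromaticity == "aromatic":
        return "[%s]" % symbol.lower()   # lower case: aromatic
    if aromaticity == "aliphatic":
        return "[%s]" % symbol           # upper case: aliphatic
    if aromaticity == "unknown":
        # Atomic number form matches both aromatic and aliphatic atoms.
        return "[#%d]" % ATOMIC_NUMBERS[symbol]
    raise ValueError("unknown aromaticity label: " + aromaticity)


print(atom_smarts("C", "unknown"))  # [#6]
```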

Remaining Problems:

Comments are welcome.


Link Atoms

Link atoms indicate nose-to-tail repetition of parts of the structure. Different conversion strategies are needed depending on whether they occur in rings or in chains.

In chains, we can use recursive SMARTS:

[#7] [ $(C ([#7])  [#7] ),  
      $(C ([#7]) C[#7] ), 
      $(C ([#7])CC[#7] ) ]

Note that [#7] is used for the nitrogens, since either of them might be aromatic or aliphatic. Note also that
     [#7] [ $(C[#7]), $(CC[#7]), $(CCC[#7]) ]
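cannot be used, as it would match any structure which just has an NC in it: the [#7] inside each recursive expression is free to match the same nitrogen as the one outside it.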

In rings, separate SMARTS (to be searched in OR relationship) are needed, because we cannot have matching ring closure symbols inside and outside the recursive SMARTS:

N1 C N1
N1 CC N1
N1 CCC N1

Here the nitrogens have two single bonds, and so must be aliphatic.
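Both strategies amount to enumerating the allowed repeat counts of the link atom. The Python sketch below reproduces the two examples above; it assumes the link atom and its two neighbours are each given as a simple SMARTS atom string, and the function names are hypothetical rather than taken from MOLSMART.

```python
def chain_link_smarts(left, link, right, min_rep, max_rep):
    """One recursive SMARTS covering min_rep..max_rep repeats in a chain."""
    if min_rep < 1 or max_rep < min_rep:
        raise ValueError("need 1 <= min_rep <= max_rep")
    alternatives = []
    for count in range(min_rep, max_rep + 1):
        # The first link atom carries the left neighbour as a branch, so
        # the environment cannot be satisfied by the left neighbour alone.
        env = link + "(" + left + ")" + link * (count - 1) + right
        alternatives.append("$(" + env + ")")
    return left + "[" + ",".join(alternatives) + "]"


def ring_link_smarts(left, link, right, min_rep, max_rep, closure="1"):
    """Separate SMARTS (to be ORed at search time) for a link atom in a ring."""
    if min_rep < 1 or max_rep < min_rep:
        raise ValueError("need 1 <= min_rep <= max_rep")
    return [left + closure + link * count + right + closure
            for count in range(min_rep, max_rep + 1)]


print(chain_link_smarts("[#7]", "C", "[#7]", 1, 3))
print(ring_link_smarts("N", "C", "N", 1, 3))
```

The ring variant returns a list because, as noted above, the ring closure digits cannot be shared between the inside and outside of a recursive SMARTS.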


R-Groups

ISIS/Draw permits sophisticated R-group queries to be constructed, which can be searched against databases with the ISIS Power Searching Module. Complex "logic" can be used to define the relative frequency of occurrence of different R-groups at different positions.

In simple cases, a single recursive SMARTS can be produced:
 
 

R1-C-C-R2
R1 = Cl or NO2
R2 = ethyl

 C ( [ $(N(CC)(=O)=O), Cl ] ) C [C;D2] [C;D1]


 


In more complex cases involving multiple occurrences of R-groups and involved "logic", it may be necessary to produce a large number of separate SMARTS (effectively, one for each possible combination of R-group members). e.g. no fewer than 80 different SMARTS are needed for this query:
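The expansion into one SMARTS per combination of R-group members can be sketched as a Cartesian product. The Python below is an illustration under simplifying assumptions: the scaffold is a SMARTS template with {R1}-style placeholders (a convention invented here, not an MDL or Daylight one), each member is a SMARTS fragment written from its attachment atom, and R-group "logic" such as occurrence constraints is not handled.

```python
from itertools import product


def expand_rgroups(scaffold, rgroups):
    """Return one SMARTS per combination of R-group members.

    `scaffold` uses hypothetical placeholders such as "{R1}";
    `rgroups` maps each R-group name to its list of member fragments.
    """
    names = sorted(rgroups)
    expanded = []
    for combo in product(*(rgroups[name] for name in names)):
        smarts = scaffold
        for name, fragment in zip(names, combo):
            smarts = smarts.replace("{" + name + "}", fragment)
        expanded.append(smarts)
    return expanded


# The simple R1-C-C-R2 example: R1 = Cl or NO2, R2 = ethyl.
queries = expand_rgroups(
    "C({R1})C{R2}",
    {"R1": ["Cl", "N(=O)=O"], "R2": ["[C;D2][C;D1]"]},
)
for q in queries:
    print(q)
```

The number of SMARTS produced is the product of the member counts over all R-group positions, which is why the total grows quickly for queries with several multi-member R-groups.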


Reaction SMARTS

ISIS/Draw can be used to specify reaction queries; these can be exported to RxnFiles.

With the use of the reaction arrow (>>) and appropriate top-level parentheses, these can be converted to reaction SMARTS queries.

There is some simplification because some of the MDL features (such as R-groups) are not permitted in reaction queries.

([#6]CC(=O)O[H]).(O([#6])[H])
>> ([#6]CC(=O)O[#6]).(O([H])[H])
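Assembling such a query from per-molecule SMARTS is a simple string operation. The Python sketch below (the function name is hypothetical) wraps each reactant and product in top-level parentheses, joins the components on each side with ".", and places ">>" between the two sides; it assumes no agents are present.

```python
def reaction_smarts(reactants, products):
    """Join component SMARTS into a reaction SMARTS with no agents."""
    def side(components):
        # Top-level parentheses keep each molecule as its own component.
        return ".".join("(" + c + ")" for c in components)

    return side(reactants) + ">>" + side(products)


# The esterification query shown above.
print(reaction_smarts(
    ["[#6]CC(=O)O[H]", "O([#6])[H]"],
    ["[#6]CC(=O)O[#6]", "O([H])[H]"],
))
```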

Atom map classes can be specified in ISIS/Draw.
If they are provided, they are copied to the reaction SMARTS, though for internal reasons the actual numbers used are usually changed.


SMIRKS


Automatic atom mapping

One SMIRKS requirement is that atoms appearing on both sides must be mapped (unmapped atoms are assumed to disappear or appear).
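A rough consistency check for this requirement is to compare the map classes found on each side of the arrow. The Python sketch below does this with a regular expression rather than a real SMARTS parser, so it is only an approximation: it assumes map classes appear as ":n]" at the end of bracket atoms, and the function name is hypothetical.

```python
import re

# Matches the ":n" map class at the end of a bracket atom, e.g. "[C:1]".
MAP_CLASS = re.compile(r":(\d+)\]")


def unpaired_map_classes(smirks):
    """Return map classes that occur on only one side of the reaction."""
    parts = smirks.split(">")
    if len(parts) != 3:
        raise ValueError("expected reactants>agents>products")
    reactant_maps = set(MAP_CLASS.findall(parts[0]))
    product_maps = set(MAP_CLASS.findall(parts[2]))
    # Symmetric difference: classes missing a partner on the other side.
    return sorted(reactant_maps ^ product_maps, key=int)


print(unpaired_map_classes("[C:1][O:2]>>[C:1]=[O:2]"))    # []
print(unpaired_map_classes("[C:1][Cl:2]>>[C:1][O:3]"))    # ['2', '3']
```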

To avoid the need for the user to map the reaction manually in ISIS/Draw, MOLSMART includes an algorithm to assign these maps automatically.
This is based on the Principle of Minimum Chemical Distance (MCD):

The current version requires a fully balanced reaction for mapping. We have been working to extend the mapping algorithm to handle unbalanced reactions.
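To illustrate the idea of minimum chemical distance, the Python sketch below tries every element-preserving atom mapping for a small balanced reaction and keeps the one with the smallest total change in bond orders. This brute-force search is not MOLSMART's algorithm; it uses one common formulation of chemical distance (the sum of absolute bond-order differences over atom pairs), and all names and the data layout are assumptions made for this example.

```python
from itertools import permutations


def chemical_distance(r_bonds, p_bonds, mapping):
    """Sum of absolute bond-order changes under an atom mapping."""
    # Re-express the reactant bonds in product atom indices.
    mapped = {frozenset((mapping[i], mapping[j])): order
              for (i, j), order in r_bonds.items()}
    target = {frozenset(pair): order for pair, order in p_bonds.items()}
    pairs = set(mapped) | set(target)
    return sum(abs(mapped.get(p, 0) - target.get(p, 0)) for p in pairs)


def best_mapping(r_atoms, r_bonds, p_atoms, p_bonds):
    """Return (distance, mapping); mapping[i] is the product index of
    reactant atom i. Exhaustive, so only usable for very small systems."""
    if sorted(r_atoms) != sorted(p_atoms):
        raise ValueError("reaction is not balanced")
    best = None
    n = len(r_atoms)
    for perm in permutations(range(n)):
        # Only mappings that preserve the element of every atom are valid.
        if any(r_atoms[i] != p_atoms[perm[i]] for i in range(n)):
            continue
        d = chemical_distance(r_bonds, p_bonds, perm)
        if best is None or d < best[0]:
            best = (d, perm)
    return best


# Heavy atoms of an enol/keto pair: C=C-O  ->  O=C-C
result = best_mapping(
    ["C", "C", "O"], {(0, 1): 2, (1, 2): 1},
    ["O", "C", "C"], {(0, 1): 2, (1, 2): 1},
)
print(result)  # (2, (2, 1, 0))
```

The exhaustive search grows factorially with the number of atoms, so a practical implementation would need a much more selective strategy.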

Applications using MOLSMART are being discussed at MUG99 by Pat Walters and Meixiao Liu.

A scaled-down version of MOLSMART (for SMIRKS generation only) has been incorporated into the latest release of MSI's WebLab Diversity Explorer.


Acknowledgements: