Daylight Programmer's Guide: SMARTS

9. SMARTS Toolkit: Structural Searching

9.1 Introduction

The Daylight SMARTS Toolkit provides a powerful set of substructure searching algorithms. The SMARTS Toolkit can parse a SMARTS string and produce a pattern object, then test the pattern against one or more molecule objects to see if the molecule contains the pattern. Several types of pattern searches are available, including "yes/no" tests and exhaustive enumeration of all occurances of a pattern in a molecule.

There are three objects specific to the Daylight SMARTS Toolkit:

pattern: The result of "compiling" a SMARTS string. Each SMARTS string is converted to a pattern before use; the pattern-generation algorithm parses the SMARTS, checks for errors, and pre-computes certain information that improves substructure-search speed.
pathset: A collection of path objects all of which have the same base molecule. A pathset is the result of matching a pattern against a molecule. (For background information, see the chapter on Substructures and Paths.)
vbind: A "vector binding". A binding of a name to a pattern or pathset. This is explained in more detail below.

9.2 Optimizing SMARTS

dt_smarts_opt(string smarts, boolean vmatch) => string optsmarts

Returns a new SMARTS string that is equivalent in meaning to the given SMARTS, but that has been reordered to permit matches to be done more quickly on typical molecules. (See the chapter on SMARTS in the section entitled "Efficiency Considerations" for more information.)

The optimized SMARTS is called "equivalent in meaning" to the original because the optimized string will match the exact same sets of objects as the original (though the pathsets returned may have their atoms and bonds in different orders), no matter what molecules they are matched against.

The exact meaning of a SMARTS depends upon the type of match to be performed, as described below in the functions dt_match(), and dt_vmatch(). The parameter vmatch, if TRUE, indicates that the SMARTS string should be optimized for use by dt_vmatch(); otherwise indicates that it should be optimized for dt_match() or dt_umatch(). (In practical terms, optimizing for dt_match() or dt_umatch() means that the head atom of the optimized string may not be the same as that of the original string, whereas optimizing for dt_vmatch() means that the first atom of the optimized string will be the same as that of the original string.)

The optimized SMARTS returned by this function are expected to be matched reasonably quickly on the average. The actual matching speed, however, depends on the molecules to be matched against, as well as on the SMARTS string itself. It is not possible to guarantee that the returned string will actually match faster than to original SMARTS. It is often possible to generate faster SMARTS "by hand" using particular knowledge of the molecules to which it will be applied. The algorithm employed by this function is based on rules generated from a database of "typical" organic molecules.

9.3 Allocating Patterns and Pathsets

dt_smartin(string smarts) => Handle pattern: Interprets the SMARTS string and creates a pattern object; returns the object's handle.
dt_alloc_pathset(Handle molecule) => Handle pathset: Allocates a new pathset object and returns its handle. (Note: this function is typically not needed; pathsets are normally generated by dt_match(), described below.

9.4 Vector Bindings and Vbind Functions

Vector bindings are a mechanism that allow faster evaluation of a match, and allows complex patterns to be constructed out of simpler patterns. Those familiar with earlier Daylight software (versions 3.6x and earlier) will recognize vector bindings as closely related to GCL's "define" functionality.

The term "binding" comes from mathematics. When one thing is bound to another, the latter can be used to represent the former. For example, in high-school algebra, the expression "x=3" binds the value three to the symbolic name "x"; thereafter, wherever "x" appears we substitute 3.

9.4.1 Pattern Bindings

The simplest use of vector bindings is when a pattern is bound to a name. Once the pattern and name are bound, the name can be used in another SMARTS string in place of an atomic symbol. For example, if we bind the pattern for C[Cl,Br,I] to the name "HALO", the string [$HALO] becomes a valid atomic symbol; wherever it is used it is equivalent to the atomic expression [$(C[Cl,Br,I])]. (This is an example of a recursive SMARTS.) Binding patterns to names doesn't extend the expressive power of SMARTS (unnamed vectors such as [$(C[Cl,Br,I])] are valid in any SMARTS), but it tends to make SMARTS more readable. This is especially true where atomic expressions such as [$(C[Cl,Br,I])] occur multiple times. For example, a 1,2,4-substituted cyclohexane could be written two ways; the second is easier to write and easier to understand:

     [$(C[Cl,Br,I])]1[$(C[Cl,Br,I])][CH2][$(C[Cl,Br,I])][CH2][CH2]1

     [$HALO]1[$HALO][CH2][$HALO][CH2][CH2]1

9.4.2 Pathset Bindings

When a pathset is bound to a name, that name represents all the places in the molecule where a search has already succeeded. If, for example, one has a number of SMARTS to test against a single molecule, and many of the SMARTS contain an common expression such as [$HALO], the search speed of each can be improved by pre-computing the pathsets for [$HALO] (i.e. doing dt_vmatch() on the pattern for C[Cl,Br,I]), then binding the resulting pathset to the name "HALO". When the [$HALO] atomic expression is encountered in a SMARTS, the vector binding already contains the pathset with all atoms that match; no additional searching is required for that atomic expression.

Binding pathsets in this fashion can greatly improve the performance of "chemical knowledge" systems in which a large number of rules are applied to each molecule. By carefully choosing common subexpressions, matching them to the molecule, and binding the resulting pathsets to names, a great deal of redundant searching can be eliminated.

9.4.3 Functions

dt_alloc_vbind(string name) => Handle vbind

If there is already a vector binding with the specified name, that existing vector binding is returned. Otherwise, a new vector binding is allocated and given the specified name.

The string name must begin with a letter from the alphabet; subsequent characters in the name must be alphabetic, a digit, or the underscore '_'. In UNIX parlance, the name must match the regular expression /^[a-zA-Z][a-zA-Z0-9_]*$/.

If the object bound to a vector-binding is deallocated (see dt_setval(), below), the vector binding's value is automatically set to NULL_OB. (Note that the vector-binding object is not deallocated in this case since its binding is not its parent or base object.)

dt_name(Handle vbind) => string name

Returns the vector binding's name.

dt_setval(Handle vbind, Handle pp) => boolean ok

Sets the value of the vector binding to the value of pp, which must be NULL_OB, a pattern, or a pathset.

dt_getval(Handle vbind) => Handle pp

Return the value of the vector binding vbind. It will be NULL_OB, a pattern, or a pathset.

9.5 Pattern Matching

All of the matching functions require the target to be a valid, consistant object. This means that molecules and reactions must be in mod-off state before calling any of the following functions.

dt_match(Handle pat, Handle mol, boolean exist) => pathset

Matches the pattern against the molecule; returns a pathset indicating the results of the match, or NULL_OB if no match is found or an error is detected.

The boolean parameter exist indicates whether an exhaustive search or a first-only search is to be performed. If exist is FALSE, an exhaustive search takes place; pathset will contain all places in the molecule where pattern matches. If exist is TRUE, the match stops as soon as the first match is found; pathset will contain just the first path that an exhaustive search would have found.

dt_vmatch(Handle pat, Handle mol, boolean exist) => pathset

Matches the pattern pat against the molecule mol as with dt_match(), above, except that each of the paths in the returned pathset contains just a single atom, the one that matched the "head" atom of the SMARTS from which pat is derived. Further, no two paths in pathset will contain the same atom.

dt_umatch(Handle pat, Handle mol, boolean exist) => pathset

A unique set of atoms match. Matches the given pattern against the molecule as with dt_match(), above, except that each of the paths in the returned pathset is guaranteed to contain a unique set of atoms (relative to all other paths in the pathset).

Conceptually one can think of this as though dt_match() were performed, then all paths compared to one another. If two paths contain the same atoms, one of the two (the choice is arbitrary) is removed from the pathset. Notice that two paths thus compared may not be equivalent; in particular paths that include cycles may contain the same atoms but not the same bonds. The resulting pathset is guaranteed to contain (as determined by dt_member()) the exact same number of atoms as a match performed with dt_match(), but it may not contain the same number of bonds.

The purpose of this function is to eliminate the "uninteresting" redundancy in paths that are typically returned by exhaustive searches using dt_match(). An exhaustive search, by definition, will find a path for each symmetry in the pattern. For example, dt_match() will return 12 paths when the SMARTS c1ccccc1 is applied to the molecule c1ccccc1; once for each of 6 possible starting atoms, times two for the clockwise and anticlockwise paths. When dt_umatch() is applied, only one path is returned.

The parameter exist is included for consistency; note, however, that if exist is TRUE, dt_match() and dt_umatch() are equivalent.

dt_xmatch(Handle pat, Handle mol, integer limit) => pathset

This is a further restriction of the unique set of atoms match used by dt_umatch(3). In this case, only non-overlapping paths are returned. That is, no two paths in the result set will contain the same atom. The parameter 'limit' works as follows: if 'limit' is zero, returns all possible paths (exhaustive search). If 'limit' is a positive integer, returns up to 'limit' answers. Hence, 'limit' can be used to restrict the number of matches found.

NOTE: the results from dt_xmatch(3) depend on the processing order of the atoms in the target molecule or reaction. For example, matching the pattern 'CC' against the molecule 'CCCC' will return two pathsets, Matching the same pattern against 'C(CC)C' gives one pathset. In the first case, Carbons #1 and #2 are matched first, leaving #3 and #4 as the second match. In the second case, Carbons #2 and #3 are matched first, leaving #1 and #4, which don't match further.

Logically, dt_xmatch(3) performs an exhaustive match and then picks one of potentially multiple non-overlapping sets of paths for the answer. Fortunately, although one can think of dt_xmatch(3) working this way, the actual implementation is much faster.

Back to Table of Contents
Go to previous chapter Error Handling
Go to next chapter Fingerprint Toolkit.