9. SMARTS Toolkit: Structural Searching
Back to Table of Contents
9.1 Introduction
The Daylight SMARTS Toolkit provides a powerful set of substructure
searching algorithms. The SMARTS Toolkit can parse a
SMARTS string
and produce a
pattern object,
then test the pattern against one or more
molecule objects
to see if the molecule contains the pattern.
Several types of pattern searches are available, including "yes/no"
tests and exhaustive enumeration of all occurances of a pattern in a
molecule.
There are three objects specific to the Daylight SMARTS Toolkit:
-
pattern
-
The result of "compiling" a
SMARTS string.
Each SMARTS string is converted to a pattern before use; the
pattern-generation algorithm parses the SMARTS, checks for errors, and
pre-computes certain information that improves substructure-search speed.
-
pathset
-
A collection of
path objects all of which have the same
base molecule. A pathset is the result of matching a pattern against
a molecule. (For background information, see the
chapter on Substructures and Paths.)
-
vbind
-
A "vector binding". A binding of a name to a pattern or
pathset. This is explained in more detail below.
9.2 Optimizing SMARTS
-
dt_smarts_opt(string smarts, boolean vmatch) => string optsmarts
-
Returns a new SMARTS string that is equivalent in meaning to the
given SMARTS, but that has been reordered to permit matches to be done
more quickly on typical molecules. (See the
chapter on SMARTS
in the section entitled "Efficiency Considerations" for
more information.)
The optimized SMARTS is called "equivalent in meaning" to the
original because the optimized string will match the exact same sets
of objects as the original (though the pathsets returned may have
their atoms and bonds in different orders), no matter what molecules
they are matched against.
The exact meaning of a SMARTS depends upon the type of match to be
performed, as described below in the functions dt_match(), and
dt_vmatch(). The parameter
vmatch , if TRUE, indicates that the
SMARTS string should be optimized for use by
dt_vmatch(); otherwise
indicates that it should be optimized for dt_match() or dt_umatch().
(In practical terms, optimizing for dt_match() or dt_umatch() means
that the head atom of the optimized string may not be the same as that
of the original string, whereas optimizing for dt_vmatch() means that
the first atom of the optimized string will be the same as that of the
original string.)
The optimized SMARTS returned by this function are expected to be
matched reasonably quickly on the average. The actual matching
speed, however, depends on the molecules to be matched against, as
well as on the SMARTS string itself. It is not possible to guarantee
that the returned string will actually match faster than to original
SMARTS. It is often possible to generate faster SMARTS "by hand"
using particular knowledge of the molecules to which it will be
applied. The algorithm employed by this function is based on rules
generated from a database of "typical" organic molecules.
9.3 Allocating Patterns and Pathsets
-
dt_smartin(string smarts) => Handle pattern
-
Interprets the SMARTS string and creates a pattern object; returns
the object's handle.
-
dt_alloc_pathset(Handle molecule) => Handle pathset
-
Allocates a new pathset object and returns its handle. (Note: this function
is typically not needed; pathsets are normally generated by
dt_match(), described
below.
9.4 Vector Bindings and Vbind Functions
Vector bindings are a mechanism that allow faster evaluation of a
match, and allows complex patterns to be constructed out of simpler
patterns. Those familiar with earlier Daylight software (versions
3.6x and earlier) will recognize vector bindings as closely related
to GCL's "define" functionality.
The term "binding" comes from mathematics. When one thing is bound
to another, the latter can be used to represent the former. For
example, in high-school algebra, the expression "x=3" binds the value
three to the symbolic name "x"; thereafter, wherever "x" appears we
substitute 3.
9.4.1 Pattern Bindings
The simplest use of vector bindings is when a pattern is bound to a
name. Once the pattern and name are bound, the name can be used in
another SMARTS string in place of an atomic symbol. For example, if
we bind the pattern for C[Cl,Br,I] to the name "HALO", the string
[$HALO] becomes a valid atomic symbol; wherever it is used it is
equivalent to the atomic expression [$(C[Cl,Br,I])].
(This is an example of a
recursive SMARTS.)
Binding patterns to names doesn't extend the expressive power of SMARTS
(unnamed vectors such as [$(C[Cl,Br,I])] are valid in any SMARTS), but it
tends to make SMARTS more readable. This is especially true where atomic
expressions such as [$(C[Cl,Br,I])] occur multiple times. For example, a
1,2,4-substituted cyclohexane could be written two ways; the second is easier
to write and easier to understand:
[$(C[Cl,Br,I])]1[$(C[Cl,Br,I])][CH2][$(C[Cl,Br,I])][CH2][CH2]1
[$HALO]1[$HALO][CH2][$HALO][CH2][CH2]1
9.4.2 Pathset Bindings
When a pathset is bound to a name, that name represents all the
places in the molecule where a search has already succeeded. If, for
example, one has a number of SMARTS to test against a single
molecule, and many of the SMARTS contain an common expression such as
[$HALO], the search speed of each can be improved by pre-computing
the pathsets for [$HALO] (i.e. doing dt_vmatch() on the pattern for
C[Cl,Br,I]), then binding the resulting pathset to the name "HALO".
When the [$HALO] atomic expression is encountered in a SMARTS, the
vector binding already contains the pathset with all atoms that
match; no additional searching is required for that atomic
expression.
Binding pathsets in this fashion can greatly improve the performance
of "chemical knowledge" systems in which a large number of rules are
applied to each molecule. By carefully choosing common
subexpressions, matching them to the molecule, and binding the
resulting pathsets to names, a great deal of redundant searching can
be eliminated.
9.4.3 Functions
-
dt_alloc_vbind(string name) => Handle vbind
-
If there is already a vector binding with the specified name, that existing
vector binding is returned. Otherwise, a new vector
binding is allocated and given the specified name.
The string name must begin with a letter from the alphabet;
subsequent characters in the name must be alphabetic, a digit, or the
underscore '_'. In UNIX parlance, the name must match the regular
expression /^[a-zA-Z][a-zA-Z0-9_]*$/ .
If the object bound to a vector-binding is deallocated (see
dt_setval(), below), the vector binding's value is automatically set
to NULL_OB. (Note that the vector-binding object is not deallocated
in this case since its binding is not its parent or base object.)
-
dt_name(Handle vbind) => string name
-
Returns the vector binding's name.
-
dt_setval(Handle vbind, Handle pp) => boolean ok
-
Sets the value of the vector binding to the value of
pp ,
which must be NULL_OB, a pattern, or a pathset.
-
dt_getval(Handle vbind) => Handle pp
-
Return the value of the vector binding vbind. It will be NULL_OB, a
pattern, or a pathset.
9.5 Pattern Matching
All of the matching functions require the target to be a valid, consistant
object. This means that molecules and reactions must be in mod-off state
before calling any of the following functions.
-
dt_match(Handle pat,
Handle mol, boolean exist) => pathset
-
Matches the pattern against the molecule; returns a pathset
indicating the results of the match, or NULL_OB if no match is found or
an error is detected.
The boolean parameter exist indicates whether an
exhaustive search or a first-only search is to be performed. If exist
is FALSE, an exhaustive search takes place; pathset will contain
all places in the molecule where pattern matches. If
exist is TRUE, the match stops as soon as the first match
is found; pathset will contain just the first path that an exhaustive
search would have found.
-
dt_vmatch(Handle pat,
Handle mol, boolean exist) => pathset
-
Matches the pattern pat against the molecule mol as
with dt_match(),
above, except that each of the paths in the returned pathset contains
just a single atom, the one that matched the "head" atom of the
SMARTS from which pat is derived. Further, no two paths in pathset
will contain the same atom.
-
dt_umatch(Handle pat,
Handle mol, boolean exist) => pathset
-
A unique set of atoms match. Matches the given pattern against
the molecule as with
dt_match(),
above, except that each of the paths in the returned pathset is
guaranteed to contain a unique set of atoms (relative to all other
paths in the pathset).
Conceptually one can think of this as
though dt_match() were
performed, then all paths compared to one another. If two paths
contain the same atoms, one of the two (the choice is arbitrary) is
removed from the pathset. Notice that two paths thus compared may
not be equivalent; in particular paths that include cycles may
contain the same atoms but not the same bonds. The resulting pathset
is guaranteed to contain (as determined
by dt_member()) the exact
same number of atoms as a match performed
with dt_match(), but it may
not contain the same number of bonds.
The purpose of this function is to eliminate the "uninteresting"
redundancy in paths that are typically returned by exhaustive
searches using dt_match().
An exhaustive search, by definition, will find a path for each
symmetry in the pattern. For example,
dt_match() will return 12 paths
when the SMARTS c1ccccc1 is applied to the molecule c1ccccc1; once
for each of 6 possible starting atoms, times two for the clockwise
and anticlockwise paths. When
dt_umatch() is applied, only
one path is returned.
The parameter exist is included for consistency; note, however, that
if exist is TRUE, dt_match()
and dt_umatch() are equivalent.
-
dt_xmatch(Handle pat,
Handle mol, integer limit) => pathset
-
This is a further restriction of the unique set of atoms match used
by dt_umatch(3). In this case, only non-overlapping paths are returned.
That is, no two paths in the result set will contain the same atom. The
parameter 'limit' works as follows: if 'limit' is zero, returns all
possible paths (exhaustive search). If 'limit' is a positive integer,
returns up to 'limit' answers. Hence, 'limit' can be used to restrict
the number of matches found.
NOTE: the results from dt_xmatch(3) depend on the processing order of
the atoms in the target molecule or reaction. For example, matching
the pattern 'CC' against the molecule 'CCCC' will return two pathsets,
Matching the same pattern against 'C(CC)C' gives one pathset.
In the first case, Carbons #1 and #2 are matched first,
leaving #3 and #4 as the second match. In the second case,
Carbons #2 and #3 are matched first, leaving #1 and #4, which don't
match further.
Logically, dt_xmatch(3) performs an exhaustive match and then picks one
of potentially multiple non-overlapping sets of paths for the answer.
Fortunately, although one can think of dt_xmatch(3) working this way,
the actual implementation is much faster.
Back to Table of Contents
Go to previous chapter Error Handling
Go to next chapter Fingerprint Toolkit.
|