Improved SMILES Substructure Searching

Roger Sayle
Bioinformatics Group,
Metaphorics LLC,
Santa Fe, New Mexico.

1. Abstract

This talk describes several recent improvements in substructure searching. First, Daylight's "Merlin" substructure search engine is introduced, to provide the framework in which to describe the modifications. This is then followed by descriptions of four significant enhancements to that model. The improvements are (1) fragment fingerprinting, (2) pattern enumeration, (3) on-the-fly code generation and (4) triage finite state machines. Each of these methods dramatically improves the performance of common classes of structural queries. Most of these algorithms have been implemented within the new Daylight Oracle cartridge, DayCart.

2. The Merlin substructure search engine

A flowchart of the substructure search engine in Merlin is given below. There are two major paths depending upon whether the query is a SMILES molecule or a SMARTS pattern.

If the query is a complete valid molecule, its fingerprint can be determined and rapidly used to screen the database. One or more bits that are set in the query fingerprint that are not set in the database fingerprint indicates that the query cannot be a substructure of that database entry. Similarly, a SMILES query can be used in a character frequency screen to also prune entries that cannot match. For example, if the query is bromine, and database entry that whose SMILES does not contain atleast one "B" and one "r" is rejected. A database entry that passes both of these screens then requires an explicit pattern match to confirm the existance of the substructure. This requires a dt_smilin and dt_match of each molecule. Although this final stage is relatively quick, thousands of molecules per second, parsing the molecule, SSSR ring detection, aromaticity perception and pattern matching are much slower than fingerprint screening, which can process millions of molecules per second.

On the other hand, generic pattern matching using SMARTS queries is much slower. Unlike SMILES queries, most SMARTS patterns cannot be fingerprinted or character frequency screened, hence every entry in a database must be processed and pattern matched.

To salvage what it can of the situation, Merlin performs a simple check to determine whether a SMARTS query can be transformed into a syntactically valid SMILES, and if so treats it like a SMILES query. This involves deleting all atom and bond expressions that are not SMILES primitives and interpreting the resulting string as a SMILES.

3. Fragment Fingerprints

The first of the improvements to the merlin model is to enable the fingerprinting of molecule fragments. Traditionally, fingerprinting could only be applied to valid molecules. However, the implementation can be modified to allow the correct fingerprinting of molecule fragments that are not themselves valid molecules.

Examples of fragment fingerprints are shown below:

Fragment  256-bit fingerprint [ASCII representation]
c         .........U...............0..........U....E..2
c:c       ...0.....U...............0..........U....F2.2
c-c       .........U..........E....8..........U....G..2
c1ccccc1  .0.02....U..........6..3.G....2...U.U....F2.2
C=C       .....U......U.............+..........UE...k.2
The examples above show that 'c' and 'c:c' result in fingerprints that are substructures of benzene, even though they are not valid molecules. The canonical representation of the SMILES 'c:c' and 'c-c' is 'C=C', which as shown above is not a substructure of benzene.

To demonstrate the improvement, the time to query 126705 structures in NCI95 with the substructure 'c-c' drops from 1:39.47s to 8.73s using the DayCart Oracle cartridge on a 2 CPU 360MHz SUN Ultra60.

4. SMARTS fingerprint enumeration

The second of the optimizations is based upon SMARTS pattern enumeration. The use of disjunction (logical or) operators in SMARTS patterns greatly restricts the ability to fingerprint a query. For example, the query 'C=[N,P]' originally took 1m13s to search NCI95 in DayCart on a 2 CPU 360MHz SUN Ultra60, because the merlin engine is only able to fingerprint 'C'. However the two individual searches 'C=N' and 'C=P' take 12.48s and 0.77s respectively. Hence, SMARTS enumeration optimization works by determining a set of fingerprintable queries rather than just a single fingerprint. In practice, this performance of the above query drops to 13.35s.

To describe the steps of the algorithm, consider the case study pattern 'O=[C,N]aa[N,O;!H0]'. This pattern for intra-molecular hydrogen bonds was included in Dave's Daylight cartridge benchmarks presented in his EuroMug98 talk (here). This was consider a pathological case, where only a single 'O' could be fingerprinted and character frequency screened.

Step #1: SMARTS Optimization The first step is to apply SMARTS optimization techniques as first described in my EuroMug97 talk (here). This converts the SMARTS to 'O=[C,N]-a:a-[N!H0,O!H0]', by recognizing that default bonds to atom expressions that must be aliphatic must be single, and that only a single non-aromatic bond is permitted to an aromatic atom.

Step #2: Enumeration Normal Form The next stage is to strip any predicates unsuitable for enumeration or not applicable to fingerprint screening. The internal representation then contains only atomic number and aromaticity atom predicates, and bond type bond predicates. This results in 'O=[C,N]-[c,n,o,s,p,as,se]:[c,n,o,s,p,as,se]-[N,O]'.

Step #3: Enumeration Refinement To prevent a combinatorial explosion, the next stage iteratively removes high degree atoms or bonds until the user-specified limit is reached. For example, this could reduce the above SMARTS to 'O=[C,N].[N,O]'.

Step #4: Pattern Enumeration The next stage is the actual enumeration and fragment fingerprinting (see above). For this case study, this results in four fingerprints 'O=C.N', 'O=C.O', 'O=N.N' and 'O=N.O'.

Step #5: Fingerprint Reduction The final stage reduces the number of fingerprints by eliminating identical and redundant fingerprints. For our example, the fragment fingerprints for 'O=N.O' and 'O=N.N' are identical; the same as the fingerprint for 'O=N'. Additionally, 'O=C.O' is a subset of the fingerprint for 'O=C.N'. Hence, for our case study, only two fingerprints need be screened 'O=C' and 'O=N'.

In the explanation above the enumeration threshold was set artificially low. In practice, the DayCart Oracle cartridge uses a limit of 1024 fingerprints, and the SMARTS query above generates 196 fingerprints to screen. On a 2 CPU 360MHz SUN Ultra/60, the resulting search time drops from 1m09.29s to only 16.96s.

5. SMILES vs SMARTS semantics

It is interesting to note the fingerprint enumeration fixes a very rare bug in the merlin search engine, caused by the difference in SMILES and SMARTS semantics. Consider the SMARTS query 'n1ccncc1' which matches a six membered ring containing aromatic atoms. Note the SMARTS semantics for a default bond is to match a single or an aromatic bond. However if interpreted as a SMILES, it describes pyrazine, and its fingerprint only matches substructures where all six bonds are aromatic. Which in version 4.62 would fail to match the molecule below.

In version 4.62 of the Daylight smiles toolkit the above molecule contains three aromatic rings, and a central non-aromatic ring. The central ring contains two pyrole-like (not pyridine-like) aromatic nitrogens and hence doesn't contain Huckel's 4n+2 electrons. This is a rare example of a ring containing only aromatic atoms that isn't itself aromatic.

Fortunately, in version 4.71, this problem disappears. The new and improved aromaticity perception algorithm now recognizes that the circuit around the perimeter of the molecule contains 18 (4n+2) electrons, and so now every bond in this particular example is aromatic.

6. On-the-fly Code Generation

Some platforms benefit from on-the-fly code generation to speed up the actual fingerprint test. Typically, no information about a particular query fingerprint is known at compile time, and so the resulting generic fingerprint code looks like the following source code.
int fingertest( unsigned int *qry, unsigned int *db, int len )
{
    while( len-- ) {
        if( *qry != (*qry & *db) )
            return False;
        qry++;
        db++;
    }
    return True;
}
However, if the query fingerprint was known at compile-time, a faster implementation would be to hard code fingerprint test with inline values. The order words are tested could also be modified to test those with the most bits set first, and those with the least bits set last. Words in the query with no bits set don't have to be tested at all, with a large performance improvement for sparse fingerprints.

int fingertest( unsigned int *db )
{
    if( (db[ 0]&1074348120) != 1074348120 ) return False;
    if( (db[31]& 168839168) !=  168839168 ) return False;
    if( (db[47]&      1072) !=       1072 ) return False;
    if( (db[19]& 536936452) !=  536936452 ) return False;
    if( (db[ 2]&       514) !=        514 ) return False;
    if( (db[45]&     16384) !=      16384 ) return False;
    if( (db[21]&      8192) !=       8192 ) return False;
    return True;
}
On-the-fly code generation can take advantage of the above optimizations at run-time. The query is converted into a block of executable code that implements a function returning a boolean value if the fingerprint matches. On UNIX machines, this is done by placing the binary machine code at a page-aligned memory address, and calling "mprotect(2)" to mark the code as executable. Although "Just In Time" compilation requires intimate knowledge of the processor's instruction set, the regular structure of fingerprint testing require only a few different assembler instructions. Implementations have be developed for Sparc, MIPS, Alpha and Intel architectures.

The following assembly language shows the binary machine code generated to test the above fingerprint on the Intel 386 architecture.

8b542404        movl 4(%esp),%edx
8b02            movl (%edx),%eax
2558400940      andl $1074348120,%eax
3d58400940      cmpl $1074348120,%eax
7403            je +3
31c0            xorl %eax,%eax
c3              ret
8b427c          movl 124(%edx),%eax
250048100a      andl $168839168,%eax
3d0048100a      cmpl $168839168,%eax
7403            je +3
31c0            xorl %eax,%eax
c3              ret
...
f642b540        testb $64,181(%edx)
7503            jne +3
31c0            xorl %eax,%eax
c3              ret
f6425520        testb $32,85(%edx)
7503            jne +3
31c0            xorl %eax,%eax
c3              ret
b801000000      movl $0x1,%eax
c3              ret

On an 195MHz R10000 SGI Origin, the time taken to screen the first 2000 molecules of medchem99 against the same database (33046 structures) using 512bit fingerprints drops from 25.7s to 16.7s, and on 2048bit fingerprints from 38.7s to 23.3s.

7. "Triage" finite state machine matching

The final optimization takes a completely different approach to improve the speed of substructure screening. The use of fingerprint screening and character frequency screening prune molecules that couldn't possibly match from performing the computationally expensive molecule perception and SMARTS pattern match. The "triage" optimization works from the other end to determine which molecules must match without requiring dt_smilin or dt_match.

For example, consider a substructure search of benzene. Any canonical SMILES in the database that contains the substring 'c1ccccc1' must contain a benzene ring as a substructure. And of the 126705 structures in NCI95, UNIX grep command can rapidly identify 14377 such molecules. The advantage is that using a technique called finite state machines (FSMs), or more technically deterministic finite automata (DFAs), searching for substrings can be done incredibly rapidly. DFAs don't perform any backtracking, and can process each character using only five or six machine instructions. In theory, a 360MHz processor can examine 60 million characters per second, or approximately 2 million structures a second, or NCI95 in under a tenth of a second. For the benzene example above, this 0.1s avoids processing 14377 molecules!

The implementation used in DayCart uses a Aho & Corasick's algorithm that converts a trie of words to match into a DFA. This O(n) method inserts the necessary back edges to avoid backtracking much like the Knuth-Morris-Pratt string matching algorithm does for a single string.

In the actual implementation the DFA is represented as a 2-dimensional matrix. The first ordinate being the current state, the second being the next input character and the value in the matrix the sucessor state. The actual matching loop is shown below.

int FSMMatch( FSMType *fsm, char *str )
{
    register char *ptr;
    register int state;
    register int ch;

    ptr = str;
    state = 2;
    while( *ptr ) {
        ch = fsm->map[*ptr++];
        state = fsm->tran[state][ch];
    }
    /* return state == 1; */
    return fsm->tran[state][0] == 1;
}

Although the details of FSM construction are omitted, the set of substrings to match and modifications to the classic construction are detailed below.

  1. The triage algorithm is currently only applied to single component (connected) molecules without stereochemistry. Hence if the canonical SMILES of the query contains a '.' (indicating multiple parts), a '>' (indicating a reaction), or '@', '/' or '\' (indicating stereo chemistry) the triage is skipped. Isotopes are correctly handled. Future implementations may support stereochemistry [with more advanced SMILES generation] and searches of reactant/product sides of the reaction [by simple manipulation of the DFA].

  2. Instead of searching for just the canonical SMILES, triage currently attempts to generate all possible SMILES of a query molecule. However, to once again avoid combinatorial explosions, the number of SMILES is currently capped at 100. For example, methylbenzene, currently generates the 12 permutations below:
    Cc1ccccc1       c1(C)ccccc1     c1(ccccc1)C
    c1c(C)cccc1     c1c(cccc1)C     c1ccccc1C
    c1cc(C)ccc1     c1cc(ccc1)C     c1cccc(C)c1
    c1cccc(c1)C     c1ccc(C)cc1     c1ccc(cc1)C
    

  3. To reduce the number of SMILES permutations, the algorithm takes advantage of a property of canonical SMILES that higher bond orders are prefered over lower. This reduces the number of permutations of a carboxy query to 3, 'O=CO', 'C(=O)O', 'O=CO'.

  4. To avoid semantic differences, single bonds between aromatic atoms, such as in biphenyl, are explicitly written as '-'. This preserves the semantics, but for v471 only matches where the bridge bond is in a ring. Hence, a biphenyl search does not triage itself, but does find 146 superstructures in NCI95.

  5. Multiple versions of each SMILES permutation are inserted into the trie by sequentially incrementing ring closure digits, whilst the highest ring closure index is less then %10. The nine benzene cases, increase the number of matches in NCI95 from 14377 to 40029.

  6. Additional states are required at the end of SMILES ending in the characters 'C' and 'B' to avoid confusion with 'Cl' and 'Br', i.e. 'CC' is not a substructure of 'CCl'! These transitions are easy enough to add to the DFA, as an extra state where any input character other than the appropriate 'l' or 'r' leads to the sucess state. [And for the very observant explains the commented termination condition in the code above].

  7. SMILES that are only a single character long, i.e. 'C', 'N', 'O', 'S', 'P', 'F', 'I' and 'B', require special attention to avoid finding a match inside square brackets, i.e. 'P' is not a substructure of '[Pt]'. This is handled by adding an extra state such that a '[' at the initial state goes to the new state, a ']' at the new state returns to the initial state, and every other character loops in the new state. Note that the query 'Cl' matches '[Cl-]', but the query 'I' does not match '[I-]'.

  8. An easy to implement optimization to the DFA is to silently skip open parenthesis at states where they are not otherwise accepted. This doesn't increase the number of states, but greatly increases coverage. So that the query 'CC' will additionally match 'C(C'. With the benzene example, the number of NCI95 matches increases from 40029 to 50893.

  9. The DFA is modified to enter the fail state upon encountering either a space, a tab or the string '%99'. The space and tab delimters are only useful in processing files, whilst the '%99' avoids the strange behaviour of wrap-around ring closure indices.

Performance results for a set of queries are shown in the table below:

NameSMILESCorrectFPTriageBeforeAfterLatest
PropaneCCC65337663524241142.5917.9914.34
Selenium[Se]2469952250.800.830.52
Benzenec1ccccc179426794865089372.6927.5620.29
MethaneC11851911852411851161.295.474.25
AmidoNC=O25695269751470218.899.848.16
MethylbenzeneCc1ccccc154529568692049054.7635.5825.90
CarboxyOC=O33009343691780923.8612.4810.24
ChlorineCl19424233181942411.231.381.12
CyclopropaneC1CC186343584848.247.785.02
Biphenylc1ccccc1c2ccccc22967514214621.9421.6511.44
DopamineNCCc1ccc(O)c(O)c1829913231.852.091.47
Sulfisoxazole7830.500.880.51
BetaCarotene21610.480.680.58
Nitrofurantoin0000.420.580.52

8. Bibliography

  1. A.V. Aho and M.J. Corasick, "Efficient String Matching: An Aid to Bibliographic Search", Communications of the ACM, Vol. 18, No. 6, pp. 333-340, 1975.
  2. D. Keppel, S.J. Eggers and R.R. Henry, "Evaluating Runtime-Compiled Value Specific Optimizations", Technical Report UWCSE 93-11-02, Department of Computer Science and Engineering, University of Washington, 1993.
  3. D.E. Knuth, "The Art of Computer Programming: Vol. 3, Sorting and Searching", 2nd Edition, Addison-Welsey Publishers, 1998.
  4. D.E. Knuth, J.H. Morris and V.R. Pratt, "Fast Pattern Matching in Strings", Siam Journal of Computing, Vol. 6, No. 2, pp. 323-350, 1977.
roger@metaphorics.com