If the query is a complete valid molecule, its fingerprint can be determined and rapidly used to screen the database. One or more bits that are set in the query fingerprint that are not set in the database fingerprint indicates that the query cannot be a substructure of that database entry. Similarly, a SMILES query can be used in a character frequency screen to also prune entries that cannot match. For example, if the query is bromine, and database entry that whose SMILES does not contain atleast one "B" and one "r" is rejected. A database entry that passes both of these screens then requires an explicit pattern match to confirm the existance of the substructure. This requires a dt_smilin and dt_match of each molecule. Although this final stage is relatively quick, thousands of molecules per second, parsing the molecule, SSSR ring detection, aromaticity perception and pattern matching are much slower than fingerprint screening, which can process millions of molecules per second.
On the other hand, generic pattern matching using SMARTS queries is much slower. Unlike SMILES queries, most SMARTS patterns cannot be fingerprinted or character frequency screened, hence every entry in a database must be processed and pattern matched.
To salvage what it can of the situation, Merlin performs a simple check to determine whether a SMARTS query can be transformed into a syntactically valid SMILES, and if so treats it like a SMILES query. This involves deleting all atom and bond expressions that are not SMILES primitives and interpreting the resulting string as a SMILES.
Examples of fragment fingerprints are shown below:
Fragment 256-bit fingerprint [ASCII representation] c .........U...............0..........U....E..2 c:c ...0.....U...............0..........U....F2.2 c-c .........U..........E....8..........U....G..2 c1ccccc1 .0.02....U..........6..3.G....2...U.U....F2.2 C=C .....U......U.............+..........UE...k.2The examples above show that 'c' and 'c:c' result in fingerprints that are substructures of benzene, even though they are not valid molecules. The canonical representation of the SMILES 'c:c' and 'c-c' is 'C=C', which as shown above is not a substructure of benzene.
To demonstrate the improvement, the time to query 126705 structures in NCI95 with the substructure 'c-c' drops from 1:39.47s to 8.73s using the DayCart Oracle cartridge on a 2 CPU 360MHz SUN Ultra60.
To describe the steps of the algorithm, consider the case study pattern 'O=[C,N]aa[N,O;!H0]'. This pattern for intra-molecular hydrogen bonds was included in Dave's Daylight cartridge benchmarks presented in his EuroMug98 talk (here). This was consider a pathological case, where only a single 'O' could be fingerprinted and character frequency screened.
Step #1: SMARTS Optimization The first step is to apply SMARTS optimization techniques as first described in my EuroMug97 talk (here). This converts the SMARTS to 'O=[C,N]-a:a-[N!H0,O!H0]', by recognizing that default bonds to atom expressions that must be aliphatic must be single, and that only a single non-aromatic bond is permitted to an aromatic atom.
Step #2: Enumeration Normal Form The next stage is to strip any predicates unsuitable for enumeration or not applicable to fingerprint screening. The internal representation then contains only atomic number and aromaticity atom predicates, and bond type bond predicates. This results in 'O=[C,N]-[c,n,o,s,p,as,se]:[c,n,o,s,p,as,se]-[N,O]'.
Step #3: Enumeration Refinement To prevent a combinatorial explosion, the next stage iteratively removes high degree atoms or bonds until the user-specified limit is reached. For example, this could reduce the above SMARTS to 'O=[C,N].[N,O]'.
Step #4: Pattern Enumeration The next stage is the actual enumeration and fragment fingerprinting (see above). For this case study, this results in four fingerprints 'O=C.N', 'O=C.O', 'O=N.N' and 'O=N.O'.
Step #5: Fingerprint Reduction The final stage reduces the number of fingerprints by eliminating identical and redundant fingerprints. For our example, the fragment fingerprints for 'O=N.O' and 'O=N.N' are identical; the same as the fingerprint for 'O=N'. Additionally, 'O=C.O' is a subset of the fingerprint for 'O=C.N'. Hence, for our case study, only two fingerprints need be screened 'O=C' and 'O=N'.
In the explanation above the enumeration threshold was set artificially low. In practice, the DayCart Oracle cartridge uses a limit of 1024 fingerprints, and the SMARTS query above generates 196 fingerprints to screen. On a 2 CPU 360MHz SUN Ultra/60, the resulting search time drops from 1m09.29s to only 16.96s.
In version 4.62 of the Daylight smiles toolkit the above molecule contains three aromatic rings, and a central non-aromatic ring. The central ring contains two pyrole-like (not pyridine-like) aromatic nitrogens and hence doesn't contain Huckel's 4n+2 electrons. This is a rare example of a ring containing only aromatic atoms that isn't itself aromatic.
Fortunately, in version 4.71, this problem disappears. The new and improved aromaticity perception algorithm now recognizes that the circuit around the perimeter of the molecule contains 18 (4n+2) electrons, and so now every bond in this particular example is aromatic.
int fingertest( unsigned int *qry, unsigned int *db, int len ) { while( len-- ) { if( *qry != (*qry & *db) ) return False; qry++; db++; } return True; }However, if the query fingerprint was known at compile-time, a faster implementation would be to hard code fingerprint test with inline values. The order words are tested could also be modified to test those with the most bits set first, and those with the least bits set last. Words in the query with no bits set don't have to be tested at all, with a large performance improvement for sparse fingerprints.
int fingertest( unsigned int *db ) { if( (db[ 0]&1074348120) != 1074348120 ) return False; if( (db[31]& 168839168) != 168839168 ) return False; if( (db[47]& 1072) != 1072 ) return False; if( (db[19]& 536936452) != 536936452 ) return False; if( (db[ 2]& 514) != 514 ) return False; if( (db[45]& 16384) != 16384 ) return False; if( (db[21]& 8192) != 8192 ) return False; return True; }On-the-fly code generation can take advantage of the above optimizations at run-time. The query is converted into a block of executable code that implements a function returning a boolean value if the fingerprint matches. On UNIX machines, this is done by placing the binary machine code at a page-aligned memory address, and calling "mprotect(2)" to mark the code as executable. Although "Just In Time" compilation requires intimate knowledge of the processor's instruction set, the regular structure of fingerprint testing require only a few different assembler instructions. Implementations have be developed for Sparc, MIPS, Alpha and Intel architectures.
The following assembly language shows the binary machine code generated to test the above fingerprint on the Intel 386 architecture.
8b542404 movl 4(%esp),%edx 8b02 movl (%edx),%eax 2558400940 andl $1074348120,%eax 3d58400940 cmpl $1074348120,%eax 7403 je +3 31c0 xorl %eax,%eax c3 ret 8b427c movl 124(%edx),%eax 250048100a andl $168839168,%eax 3d0048100a cmpl $168839168,%eax 7403 je +3 31c0 xorl %eax,%eax c3 ret ... f642b540 testb $64,181(%edx) 7503 jne +3 31c0 xorl %eax,%eax c3 ret f6425520 testb $32,85(%edx) 7503 jne +3 31c0 xorl %eax,%eax c3 ret b801000000 movl $0x1,%eax c3 ret
On an 195MHz R10000 SGI Origin, the time taken to screen the first 2000 molecules of medchem99 against the same database (33046 structures) using 512bit fingerprints drops from 25.7s to 16.7s, and on 2048bit fingerprints from 38.7s to 23.3s.
For example, consider a substructure search of benzene. Any canonical SMILES in the database that contains the substring 'c1ccccc1' must contain a benzene ring as a substructure. And of the 126705 structures in NCI95, UNIX grep command can rapidly identify 14377 such molecules. The advantage is that using a technique called finite state machines (FSMs), or more technically deterministic finite automata (DFAs), searching for substrings can be done incredibly rapidly. DFAs don't perform any backtracking, and can process each character using only five or six machine instructions. In theory, a 360MHz processor can examine 60 million characters per second, or approximately 2 million structures a second, or NCI95 in under a tenth of a second. For the benzene example above, this 0.1s avoids processing 14377 molecules!
The implementation used in DayCart uses a Aho & Corasick's algorithm that converts a trie of words to match into a DFA. This O(n) method inserts the necessary back edges to avoid backtracking much like the Knuth-Morris-Pratt string matching algorithm does for a single string.
In the actual implementation the DFA is represented as a 2-dimensional matrix. The first ordinate being the current state, the second being the next input character and the value in the matrix the sucessor state. The actual matching loop is shown below.
int FSMMatch( FSMType *fsm, char *str ) { register char *ptr; register int state; register int ch; ptr = str; state = 2; while( *ptr ) { ch = fsm->map[*ptr++]; state = fsm->tran[state][ch]; } /* return state == 1; */ return fsm->tran[state][0] == 1; }
Although the details of FSM construction are omitted, the set of substrings to match and modifications to the classic construction are detailed below.
Cc1ccccc1 c1(C)ccccc1 c1(ccccc1)C c1c(C)cccc1 c1c(cccc1)C c1ccccc1C c1cc(C)ccc1 c1cc(ccc1)C c1cccc(C)c1 c1cccc(c1)C c1ccc(C)cc1 c1ccc(cc1)C
Performance results for a set of queries are shown in the table below:
Name | SMILES | Correct | FP | Triage | Before | After | Latest |
Propane | CCC | 65337 | 66352 | 42411 | 42.59 | 17.99 | 14.34 |
Selenium | [Se] | 246 | 995 | 225 | 0.80 | 0.83 | 0.52 |
Benzene | c1ccccc1 | 79426 | 79486 | 50893 | 72.69 | 27.56 | 20.29 |
Methane | C | 118519 | 118524 | 118511 | 61.29 | 5.47 | 4.25 |
Amido | NC=O | 25695 | 26975 | 14702 | 18.89 | 9.84 | 8.16 |
Methylbenzene | Cc1ccccc1 | 54529 | 56869 | 20490 | 54.76 | 35.58 | 25.90 |
Carboxy | OC=O | 33009 | 34369 | 17809 | 23.86 | 12.48 | 10.24 |
Chlorine | Cl | 19424 | 23318 | 19424 | 11.23 | 1.38 | 1.12 |
Cyclopropane | C1CC1 | 863 | 4358 | 484 | 8.24 | 7.78 | 5.02 |
Biphenyl | c1ccccc1c2ccccc2 | 2967 | 5142 | 146 | 21.94 | 21.65 | 11.44 |
Dopamine | NCCc1ccc(O)c(O)c1 | 829 | 913 | 23 | 1.85 | 2.09 | 1.47 |
Sulfisoxazole | 7 | 8 | 3 | 0.50 | 0.88 | 0.51 | |
BetaCarotene | 2 | 16 | 1 | 0.48 | 0.68 | 0.58 | |
Nitrofurantoin | 0 | 0 | 0 | 0.42 | 0.58 | 0.52 |