======================================================================
Table 1: Acyclic Saturated Isomers (C7H16 to C20H42)
Constraints: nodes=1 edges=1 edges/node=1-4 cycle/graph=0
----------------------------------------------------------------------
Nodes     Unique      Checks (Pct)       Total (Pct)      Time  Uniq/s
----------------------------------------------------------------------
    7          9           9 (100)           9 (100)      0.01    n.m.
    8         18          18 (100)          18 (100)      0.01    n.m.
    9         35          36  (97)          36  (97)      0.01    n.m.
   10         75          79  (95)          81  (93)      0.02    n.m.
   11        159         169  (94)         183  (84)      0.04    n.m.
   12        355         378  (94)         441  (80)      0.07    n.m.
   13        802         856  (94)       1,079  (74)      0.16    n.m.
   14      1,858       1,976  (94)       2,715  (68)      0.37    n.m.
   15      4,347       4,615  (94)       6,893  (63)      0.93    n.m.
   16     10,359      10,957  (95)      17,779  (58)      2.34    4446
   17     24,894      26,230  (95)      46,105  (54)      5.89    4234
   18     60,523      63,539  (95)     120,558  (50)     14.92    4059
   19    148,284     155,106  (96)     316,629  (47)     38.90    3813
   20    366,319     381,883  (96)     835,793  (44)    107.19    3418
======================================================================
Table 2: Monocyclic Saturated Isomers (C7H14 to C20H40)
Constraints: nodes=1 edges=1 edges/node=1-4 cycle/graph=1
----------------------------------------------------------------------
Nodes     Unique      Checks (Pct)       Total (Pct)      Time  Uniq/s
----------------------------------------------------------------------
    7         29          69  (42)          69 (100)      0.01    n.m.
    8         73         187  (39)         187  (39)      0.02    n.m.
    9        185         532  (35)         544  (34)      0.05    n.m.
   10        474       1,450  (33)       1,551  (31)      0.13    n.m.
   11      1,230       4,058  (30)       4,643  (26)      0.42    n.m.
   12      3,231      11,148  (29)      13,807  (23)      1.16    2810
   13      8,501      30,960  (27)      41,684  (20)      3.31    2576
   14     22,556      85,403  (26)     125,126  (18)      9.66    2337
   15     60,064     236,432  (25)     376,321  (16)     27.86    2157
   16    160,592     653,256  (25)   1,126,433  (14)     80.31    2000
   17    430,656   1,807,533  (24)   3,364,619  (13)    240.02    1794
   18  1,158,357   4,999,172  (23)  10,012,916  (12)    730.11    1587
   19  3,122,614  13,836,530  (23)  29,713,552  (11)   2535.06    1232
   20  8,436,528  38,300,823  (22)  87,903,509  (10)  10951.90     770
======================================================================
Key: Unique - number of non-isomorphic graphs
     Checks - number of graphs checked for isomorphism
     Total  - number of graphs without isomorphic pruning
     Time   - in seconds, includes overhead of 0.01 sec
     Uniq/s - number of unique graphs per second
System:   Pentium IV 2.4 GHz CPU
          Red Hat ES 2.1 (Panama)
Compiler: gcc version 2.96 20000731
          optimization level: -O3
Tables 1 and 2 show the results of the two control experiments. In both experiments, the number of unique isomers agrees exactly with the literature. This indicates that the algorithm completely covers ``graph space'', at least under these constraints, and suggests that it does so in general. In Table 1, the percentage of checked acyclic graphs that are unique reaches a low of 94% for C11-C15, then gradually rises to about 96% at C20. The algorithm is believed to perform at or near the theoretical maximum number of unique structures per isomorphism check; some redundancy has been shown to be unavoidable. In Table 2, the percentage of checked monocyclic graphs that are unique gradually decreases from 42% at C7 to 22% at C20. This downward trend arises because the number of ways to form a cycle grows rapidly with the number of nodes.
Both tables show that the total number of graphs generated without isomorphic pruning increasingly outpaces the number of unique graphs from C7 to C20. In Table 1, the percentage of unique graphs among all graphs decreases from 93% at C10 to 63% at C15 and 44% at C20. In Table 2, the percentage decreases from 31% at C10 to 16% at C15 and 10% at C20. This shows the importance of dt_cansmiles(), which significantly reduces redundancy, especially with more nodes and with cycle formation. Its value is expected to be higher still when multiple cycles are formed.
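The percentages quoted above can be recomputed from the raw counts in Tables 1 and 2. The following Python sketch (Python is used here only for illustration and is not the language of the algorithm itself) assumes that each Pct value is the unique count divided by the count in the same column, rounded to the nearest integer; the helper name pct_unique is illustrative.

```python
# Counts copied from Tables 1 and 2: nodes -> (unique, checks, total).
ACYCLIC = {
    10: (75, 79, 81),
    15: (4347, 4615, 6893),
    20: (366319, 381883, 835793),
}
MONOCYCLIC = {
    10: (474, 1450, 1551),
    15: (60064, 236432, 376321),
    20: (8436528, 38300823, 87903509),
}


def pct_unique(unique: int, count: int) -> int:
    """Percentage of `count` graphs that are unique (assumed definition of Pct)."""
    return round(100 * unique / count)


for name, table in (("acyclic", ACYCLIC), ("monocyclic", MONOCYCLIC)):
    for nodes, (unique, checks, total) in table.items():
        # Columns: series, nodes, Pct of checks, Pct of total.
        print(name, nodes, pct_unique(unique, checks), pct_unique(unique, total))
```

For these six rows, the recomputed values agree with the tabulated percentages, which supports the assumed definition.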
The rate at which unique structures are produced at C20 ranges from over 3K per second for acyclic graphs to just under 1K per second for monocyclic graphs. The production rate decreases with more nodes, owing to the cost of fingerprinting longer paths and identifying rings when dt_mod_off() is called.
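The Uniq/s column appears to be the unique count divided by the run time after subtracting the 0.01 sec overhead noted in the key. This relationship is inferred from the tabulated numbers rather than stated in the text, so the Python sketch below should be read as an assumption that happens to reproduce the listed rates; the function name is illustrative.

```python
OVERHEAD_S = 0.01  # fixed overhead stated in the key of Tables 1 and 2


def unique_per_second(unique: int, time_s: float, overhead_s: float = OVERHEAD_S) -> int:
    """Unique graphs per second, assuming the overhead is excluded from the time."""
    return round(unique / (time_s - overhead_s))


# (unique, time in seconds) taken from the C16 and C20 rows of Table 1
# and the C12 and C20 rows of Table 2.
for unique, time_s in ((10359, 2.34), (366319, 107.19), (3231, 1.16), (8436528, 10951.90)):
    print(unique, time_s, unique_per_second(unique, time_s))
```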
Next, the algorithm was seeded with methane, and 100 structures were generated and visualized. No constraints were applied other than normal valences. Table 3 shows some of the generated structures. It was immediately apparent that the extent of branching and cyclization needed to be constrained in order to avoid producing unstable and uninteresting compounds.
==============================================================
Table 4: Database Statistics
--------------------------------------------------------------
Database         Compounds  Avg MW  Brnch  Rings  Cycli  Chira
--------------------------------------------------------------
acd021             329,985  344.86   2.43   2.36   1.06
acd033             167,769  336.81   2.45   1.95   1.07
aquire94             2,436  193.80   2.33   1.00   1.05
asinex00           148,850  372.24   2.42   3.14   1.08
bioscr99np          13,905  355.66   2.48   3.25   1.18   1.14
bioscr99sc          39,975  363.83   2.44   3.25   1.12   0.04
chembridge99        75,344  340.51   2.39   2.85   1.06
chemreact97        389,372  345.71   2.43   2.20   1.08
chemsynth97demo        433  338.43   2.43   2.24   1.10
maybridge001        63,648  316.35   2.44   2.36   1.05
medchem02           43,849  279.37   2.43   1.92   1.08   0.00
nci00              162,148  310.67   2.44   2.37   1.11   0.00
spresi95         2,462,790  349.07   2.45   2.49   1.10
spresi95preps      981,936  327.94   2.44   2.49   1.10
spresi98rxn      2,035,698  350.46   2.42   2.26   1.08
tcm01                5,695  395.30   2.54   3.51   1.22   3.68
tsca93              38,816  235.58   2.40   1.31   1.07
wdi033              62,518  494.57   2.47   3.17   1.14   2.83
==============================================================
Key: Avg MW - Dayprop property AVERAGE_MOL_WT
     Brnch  - branching, number of bonds per non-terminal atom
     Rings  - number of cycles per molecule
     Cycli  - cyclization, number of cycles per ring atom
     Chira  - chirality, number of chiral centers per molecule
A number of databases were therefore profiled, as shown in Table 4. The range of values is narrow for some metrics: the averages are centered about 350 for molecular weight, 2.43 for branching, 2.5 for number of rings, and 1.09 for cyclization. Additionally, the average number of chiral centers for drug databases varies from 1 to 4, with some databases not containing any stereocenters. These average values, plus or minus two standard deviations, were used to bound the GENSMI algorithm.
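Bounding a metric by its average and two standard deviations can be sketched as follows. This is a minimal Python illustration, not the GENSMI implementation: the function name is hypothetical, the use of the sample standard deviation is an assumption, and the input values are made up for the example rather than taken from any database.

```python
from statistics import mean, stdev


def two_sigma_bounds(values, k=2.0):
    """Return (lower, upper) as mean -/+ k sample standard deviations."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation; an assumption
    return center - k * spread, center + k * spread


# Hypothetical per-molecule branching values, for illustration only.
branching_values = [2.0, 2.5, 3.0]
lower, upper = two_sigma_bounds(branching_values)
print(lower, upper)
```

Whether the bounds were derived from per-molecule values or from database averages is not stated, so the choice of input is left to the reader.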
Based on these database profiles, the algorithm was constrained by limiting branching to 3.15, cyclization to 1.65, rings per molecule to 5, and rings per atom to 2, and by allowing only 5- and 6-membered rings to be formed. Table 5 shows some of the resulting structures. The algorithm was then rerun with branching limited to 2.75 and cyclization limited to 1.40. Table 6 shows some of those structures. With these constraints, the resulting structures appear more stable and desirable.
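A branching limit of this kind might be tested on a candidate graph as sketched below in Python. The sketch reads ``number of bonds per non-terminal atom'' as the mean number of neighbors over atoms with more than one neighbor in a hydrogen-suppressed graph; that reading, the adjacency-list representation, and the function names are all assumptions and may differ from the actual metric.

```python
def branching(adjacency):
    """Mean neighbor count over non-terminal atoms (assumed reading of the metric)."""
    degrees = [len(nbrs) for nbrs in adjacency.values() if len(nbrs) > 1]
    return sum(degrees) / len(degrees) if degrees else 0.0


def passes_branching(adjacency, limit):
    """True if the graph's branching does not exceed the limit."""
    return branching(adjacency) <= limit


# Hydrogen-suppressed graphs as adjacency lists keyed by atom index.
ISOBUTANE = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
NEOPENTANE = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

print(passes_branching(ISOBUTANE, 3.15))   # branching is 3.0
print(passes_branching(ISOBUTANE, 2.75))   # fails the tighter limit
print(passes_branching(NEOPENTANE, 3.15))  # branching is 4.0
```

Under this reading, tightening the limit from 3.15 to 2.75 rejects small structures dominated by a single tertiary center, while larger molecules can still contain such centers as long as the average stays under the limit.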
For constructing ``drug-like'' structures, drug databases were profiled and the following constraints were devised: Atoms: C 17-99%, O 1-50%, N 1-45%, F 0-17%, Cl 0-16%, S 0-13%, P 0-5%, Br 0-3%, I 0-2%; Bonds: single 42-100%, double 0-30%, triple not allowed. Additionally, attachments to carbon were limited to three.
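The atom constraints can be expressed as a simple range check. The Python sketch below treats each percentage range as a bound on the fraction of heavy atoms of that element within a single structure; this per-structure interpretation is an assumption, since the ranges could equally describe how often each element is selected during generation. The function and table names are illustrative.

```python
# Element -> (minimum %, maximum %) as listed in the text.
ATOM_RANGES = {
    "C": (17, 99), "O": (1, 50), "N": (1, 45),
    "F": (0, 17), "Cl": (0, 16), "S": (0, 13),
    "P": (0, 5), "Br": (0, 3), "I": (0, 2),
}


def composition_ok(counts):
    """Check heavy-atom percentages against ATOM_RANGES (assumed per-structure)."""
    total = sum(counts.values())
    if total == 0:
        return False
    if any(element not in ATOM_RANGES for element in counts):
        return False  # element outside the allowed set
    for element, (low, high) in ATOM_RANGES.items():
        percent = 100 * counts.get(element, 0) / total
        if not low <= percent <= high:
            return False
    return True


# Heavy atoms of Nc1ccc(O)cc1: six C, one N, one O.
print(composition_ok({"C": 6, "N": 1, "O": 1}))
# A pure hydrocarbon fails the 1% minimum for O and N under this reading.
print(composition_ok({"C": 6}))
```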
Tables 7 through 9 show structures generated with GENSMI for approximate mass (AM) ranges of 150-250, 250-400, and 400-500. Each table is sorted by best similarity to a WDI 03.4 structure, and lists several of the highest-ranked structures and the single lowest-ranked structure.
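Sorting by best similarity to a reference database can be sketched as follows. The similarity measure is not named in the text; the Python example below assumes a Tanimoto coefficient over fingerprints and represents each fingerprint as a set of bit positions. The fingerprints, structure names, and function names are toy values chosen for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of set bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_similarity(candidate, references):
    """Highest similarity of a candidate to any reference fingerprint."""
    return max(tanimoto(candidate, ref) for ref in references)


def rank_by_best_similarity(candidates, references):
    """Candidate names sorted from most to least similar to the references."""
    return sorted(
        candidates,
        key=lambda name: best_similarity(candidates[name], references),
        reverse=True,
    )


# Toy fingerprints standing in for generated and reference structures.
references = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7})]
candidates = {
    "gen_a": frozenset({1, 2, 3}),     # 3 of 4 bits shared with the first reference
    "gen_b": frozenset({5, 9}),        # 1 of 4 bits shared with the second reference
    "gen_c": frozenset({1, 2, 3, 4}),  # identical to the first reference
}
print(rank_by_best_similarity(candidates, references))
```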
In Table 7, several similarities are over 0.8 and the lowest is under 0.4. The branching metric is around 2.5 and reflects the nature of n-methyl, n-butyl, and n-propyl straight chain groups. The complexity metric ranges from about 28 to 43, except for the least similar structure, which has a complexity of almost 60. The cyclization metric varies from 1.2 to 1.4 and parallels the number of rings in the structures.
In Table 8, several similarities are over 0.8, with an overall high of 0.84, and the lowest is under 0.4. The branching metric is around 2.7 and reflects the isopropyl groups present. The complexity metric ranges from about 27 to 48, and that of the least similar structure is nearly 80. Again, the cyclization metric parallels the number of rings.
In Table 9, the similarities are around 0.74 and the lowest is under 0.4. Again, the branching metric is around 2.7 and reflects the isopropyl groups present. The complexity metric ranges from about 32 to 36, and that of the least similar structure is just over 70. Again, the cyclization metric parallels the number of rings.
Each table shows the history of assembling the structures. The syntax is a dot-separated string that starts with the letter ``S'' for the seed SMILES ``Nc1ccc(O)cc1'', followed by any mix of ``N'' and ``E'' entries for nodes and edges added to the seed. The node listing includes the atomic index to which the node is added, the atom number of the new atom, and the bond type between the atoms. The edge listing includes the two atom indexes in the bond separated by a dash ``-'', equal sign ``='', or pound sign ``#'', indicating a single, double, or triple bond. A string in this form can be used as input to GENSMI, and effectively bounds the algorithm at an arbitrary place in ``chemical space''. This may be useful for running multiple GENSMI sessions in parallel.
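A history string of this kind could be split into its parts as sketched below in Python. The exact token layout is not given in the text, so the sketch makes several assumptions: each entry begins with its letter, an edge entry has the form E<index><bond><index>, and the seed SMILES contains no dot. Node entries are kept as raw text because the order and delimiters of their fields are not specified. All names and the example string are hypothetical.

```python
import re

# Assumed edge layout: "E", first atom index, bond symbol, second atom index.
EDGE_RE = re.compile(r"^E(\d+)([-=#])(\d+)$")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_history(history):
    """Split a history string into the seed SMILES and a list of steps."""
    tokens = history.split(".")  # assumes the seed SMILES has no '.'
    if not tokens[0].startswith("S"):
        raise ValueError("history must start with an 'S' seed entry")
    seed = tokens[0][1:]
    steps = []
    for token in tokens[1:]:
        if token.startswith("N"):
            # Node fields are left unparsed; their layout is not specified.
            steps.append(("node", token[1:]))
            continue
        match = EDGE_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognized entry: {token!r}")
        first, bond, second = match.groups()
        steps.append(("edge", (int(first), int(second), BOND_ORDER[bond])))
    return seed, steps


# Hypothetical history: seed, one node entry (raw), one double-bond edge.
seed, steps = parse_history("SNc1ccc(O)cc1.N3-6.E2=5")
print(seed)
print(steps)
```

Keeping the seed and steps separate makes it straightforward to hand different prefixes of a history to independent sessions.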