Results

============================================================
Table 1: Acyclic Saturated Isomers (C7H16 to C20H42)
Constraints: nodes=1 edges=1 edges/node=1-4 cycle/graph=0
------------------------------------------------------------
Nodes  Unique    Checks (Pct)     Total (Pct)   Time  Uniq/s
------------------------------------------------------------
  7         9          9(100)          9(100)     0.01  n.m.
  8        18         18(100)         18(100)     0.01  n.m.
  9        35         36 (97)         36 (97)     0.01  n.m.
 10        75         79 (95)         81 (93)     0.02  n.m.
 11       159        169 (94)        183 (84)     0.04  n.m.
 12       355        378 (94)        441 (80)     0.07  n.m.
 13       802        856 (94)      1,079 (74)     0.16  n.m.
 14     1,858      1,976 (94)      2,715 (68)     0.37  n.m.
 15     4,347      4,615 (94)      6,893 (63)     0.93  n.m.
 16    10,359     10,957 (95)     17,779 (58)     2.34  4446
 17    24,894     26,230 (95)     46,105 (54)     5.89  4234
 18    60,523     63,539 (95)    120,558 (50)    14.92  4059
 19   148,284    155,106 (96)    316,629 (47)    38.90  3813
 20   366,319    381,883 (96)    835,793 (44)   107.19  3418
============================================================
Table 2: Monocyclic Saturated Isomers (C7H14 to C20H40)
Constraints: nodes=1 edges=1 edges/node=1-4 cycle/graph=1
------------------------------------------------------------
Nodes  Unique    Checks (Pct)     Total (Pct)   Time  Uniq/s
------------------------------------------------------------
  7        29         69 (42)         69 (42)     0.01  n.m.
  8        73        187 (39)        187 (39)     0.02  n.m.
  9       185        532 (35)        544 (34)     0.05  n.m.
 10       474      1,450 (33)      1,551 (31)     0.13  n.m.
 11     1,230      4,058 (30)      4,643 (26)     0.42  n.m.
 12     3,231     11,148 (29)     13,807 (23)     1.16  2810
 13     8,501     30,960 (27)     41,684 (20)     3.31  2576
 14    22,556     85,403 (26)    125,126 (18)     9.66  2337 
 15    60,064    236,432 (25)    376,321 (16)    27.86  2157
 16   160,592    653,256 (25)  1,126,433 (14)    80.31  2000
 17   430,656  1,807,533 (24)  3,364,619 (13)   240.02  1794
 18 1,158,357  4,999,172 (23) 10,012,916 (12)   730.11  1587
 19 3,122,614 13,836,530 (23) 29,713,552 (11)  2535.06  1232
 20 8,436,528 38,300,823 (22) 87,903,509 (10) 10951.90   770
============================================================
Key: Unique - number of non-isomorphic graphs
     Checks - number of graphs checked for isomorphism
     Total  - number of graphs without isomorphic pruning
     Time   - in seconds, includes overhead of 0.01 sec
     Uniq/s - number of unique graphs per second
System:   Pentium IV 2.4 GHz CPU Red Hat ES 2.1 (Panama)
Compiler: gcc version 2.96 20000731 optimization level: -O3

Tables 1 and 2 show the results of the two control experiments. For both experiments, the number of unique isomers agrees exactly with the literature. This indicates that the algorithm completely covers ``graph space'', at least under these constraints, and offers promise that it does so in general. In Table 1, the percentage of checked acyclic graphs that are unique falls from 100% at C7-C8 to a low of 94% for C11-C15, then gradually rises to about 96% at C20. The algorithm is believed to be performing at or near the theoretical maximum of unique structures per isomorphism check; some redundancy has been shown to be unavoidable. In Table 2, the percentage of checked monocyclic graphs that are unique gradually decreases from 42% at C7 to 22% at C20. This downward trend arises because the number of ways to form a cycle grows rapidly with the number of nodes.

Both tables show that the total number of graphs generated without isomorphic pruning grows faster than the number of unique graphs from C7 to C20. In Table 1, the unique graphs as a percentage of the total decrease from 93% at C10 to 63% at C15 and 44% at C20. In Table 2, the percentage decreases from 31% at C10 to 16% at C15 and 10% at C20. This shows the importance of dt_cansmiles(), which significantly reduces redundancy, especially as the number of nodes grows and cycles are formed. Its value is expected to be higher still when forming multiple cycles.
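The percentages quoted above can be recomputed from the table columns. The following is a minimal sketch, not part of the original program: the row values are transcribed from Tables 1 and 2, and the names ACYCLIC, MONOCYCLIC, pct, and summarize are illustrative only.

```python
# Selected rows transcribed from Tables 1 and 2: nodes -> (unique, checks, total).
ACYCLIC = {
    10: (75, 79, 81),
    15: (4347, 4615, 6893),
    20: (366319, 381883, 835793),
}
MONOCYCLIC = {
    10: (474, 1450, 1551),
    15: (60064, 236432, 376321),
    20: (8436528, 38300823, 87903509),
}


def pct(part, whole):
    """Percentage rounded to the nearest integer, as printed in the tables."""
    return round(100 * part / whole)


def summarize(rows):
    """Map nodes -> (unique as % of checks, unique as % of total)."""
    return {n: (pct(u, c), pct(u, t)) for n, (u, c, t) in rows.items()}


if __name__ == "__main__":
    for name, rows in (("acyclic", ACYCLIC), ("monocyclic", MONOCYCLIC)):
        for n, (of_checks, of_total) in summarize(rows).items():
            print(f"{name:10s} C{n}: {of_checks}% of checks, {of_total}% of total")
```

The gap between the two percentages in each row is the share of redundant graphs that pruning removes before an isomorphism check is made.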

The rate at which unique structures are produced at C20 ranges from over 3K per second for acyclic graphs to just under 1K per second for monocyclic graphs. The production rate decreases as nodes are added, owing to the cost of fingerprinting longer path lengths and identifying rings when dt_mod_off() is called.
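The Uniq/s column appears to be the number of unique graphs divided by the run time less the 0.01 second overhead noted in the table key. This formula is an inference from the tabulated values, not something stated in the text. A short sketch, with the hypothetical names OVERHEAD_S and unique_per_second:

```python
OVERHEAD_S = 0.01  # fixed overhead in seconds, taken from the table key


def unique_per_second(unique, time_s, overhead_s=OVERHEAD_S):
    """Unique graphs per second after subtracting the fixed overhead.

    Assumption: this is how the Uniq/s column was derived; the formula
    reproduces the tabulated values but is not given in the source.
    """
    return round(unique / (time_s - overhead_s))


if __name__ == "__main__":
    # (unique, time in seconds) for C16 and C20 in Table 1 and C20 in Table 2.
    for unique, time_s in ((10359, 2.34), (366319, 107.19), (8436528, 10951.90)):
        print(unique, time_s, unique_per_second(unique, time_s))
```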

Next, the algorithm was seeded with methane and 100 structures were generated and visualized. No constraints were used except for observing normal valences. Table 3 shows some of the generated structures. It became immediately apparent that the extent of branching and cyclization needed to be constrained in order to avoid producing unstable and uninteresting compounds.

============================================================
Table 4: Database Statistics
------------------------------------------------------------
 Database         Compounds  Avg MW Brnch Rings Cycli Chira
------------------------------------------------------------
 acd021             329,985  344.86  2.43  2.36  1.06
 acd033             167,769  336.81  2.45  1.95  1.07
 aquire94             2,436  193.80  2.33  1.00  1.05
 asinex00           148,850  372.24  2.42  3.14  1.08
 bioscr99np          13,905  355.66  2.48  3.25  1.18  1.14
 bioscr99sc          39,975  363.83  2.44  3.25  1.12  0.04
 chembridge99        75,344  340.51  2.39  2.85  1.06
 chemreact97        389,372  345.71  2.43  2.20  1.08
 chemsynth97demo        433  338.43  2.43  2.24  1.10
 maybridge001        63,648  316.35  2.44  2.36  1.05
 medchem02           43,849  279.37  2.43  1.92  1.08  0.00
 nci00              162,148  310.67  2.44  2.37  1.11  0.00
 spresi95         2,462,790  349.07  2.45  2.49  1.10
 spresi95preps      981,936  327.94  2.44  2.49  1.10
 spresi98rxn      2,035,698  350.46  2.42  2.26  1.08
 tcm01                5,695  395.30  2.54  3.51  1.22  3.68
 tsca93              38,816  235.58  2.40  1.31  1.07
 wdi033              62,518  494.57  2.47  3.17  1.14  2.83
============================================================
Key:
  Avg MW - Dayprop property AVERAGE_MOL_WT
  Brnch  - branching: number of bonds per non-terminal atom
  Rings  - number of cycles per molecule
  Cycli  - cyclization: number of cycles per ring atom
  Chira  - chirality: number of chiral centers per molecule

Many databases were profiled, as shown in Table 4. The range of values is narrow for some metrics, and is centered about 350 for average molecular weight, 2.43 for branching, 2.5 for number of rings, and 1.09 for cyclization. Additionally, the average number of chiral centers for drug databases varies from 1 to 4, while some other databases contain no stereocenters. These average values and two standard deviations were used to bound the GENSMI algorithm.

Based on the database profiles, the algorithm was constrained by limiting branching to 3.15, cyclization to 1.65, rings per molecule to 5, and rings per atom to 2, and by allowing only 5- and 6-membered rings to be formed. Table 5 shows some of the resulting structures. The algorithm was then rerun with branching set to 2.75 and cyclization set to 1.40. Table 6 shows some of those structures. Under these constraints, the resulting structures seem more stable and desirable.
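To make these limits concrete, the sketch below evaluates them on a heavy-atom graph. It is an illustration under stated assumptions rather than the GENSMI implementation: branching is read as the mean number of neighbors of atoms with more than one neighbor, cyclization as the mean number of rings containing each ring atom, and the ring list is supplied by the caller instead of being perceived. All function names are hypothetical.

```python
from collections import Counter


def adjacency(bonds):
    """Build an undirected neighbor map from (atom, atom) index pairs."""
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def branching(adj):
    """Mean neighbor count over non-terminal atoms (assumed reading of the
    metric); bond order is ignored, so a double bond counts once."""
    degrees = [len(nbrs) for nbrs in adj.values() if len(nbrs) > 1]
    return sum(degrees) / len(degrees) if degrees else 0.0


def rings_per_atom(rings):
    """Count, for each ring atom, how many of the given rings contain it."""
    return Counter(atom for ring in rings for atom in set(ring))


def cyclization(rings):
    """Mean ring membership over ring atoms (assumed reading of the metric)."""
    counts = rings_per_atom(rings)
    return sum(counts.values()) / len(counts) if counts else 0.0


def satisfies(bonds, rings, max_branching=3.15, max_cyclization=1.65,
              max_rings=5, max_rings_per_atom=2, ring_sizes=(5, 6)):
    """Check a structure against the limits quoted in the text."""
    counts = rings_per_atom(rings)
    return (branching(adjacency(bonds)) <= max_branching
            and cyclization(rings) <= max_cyclization
            and len(rings) <= max_rings
            and all(c <= max_rings_per_atom for c in counts.values())
            and all(len(ring) in ring_sizes for ring in rings))


# Two fused six-membered rings sharing the 4-5 bond (a decalin-like skeleton).
FUSED_BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
               (5, 6), (6, 7), (7, 8), (8, 9), (9, 4)]
FUSED_RINGS = [(0, 1, 2, 3, 4, 5), (4, 5, 6, 7, 8, 9)]

if __name__ == "__main__":
    print(branching(adjacency(FUSED_BONDS)), cyclization(FUSED_RINGS))
    print(satisfies(FUSED_BONDS, FUSED_RINGS))
```

Under these readings, the fused bicyclic skeleton has a branching of 2.2 and a cyclization of 1.2 and passes the limits, while a three-membered ring is rejected on ring size alone.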

For constructing ``drug-like'' structures, drug databases were profiled and the following constraints were devised: Atoms: C 17-99%, O 1-50%, N 1-45%, F 0-17%, Cl 0-16%, S 0-13%, P 0-5%, Br 0-3%, I 0-2%; Bonds: single 42-100%, double 0-30%, triple not allowed. Additionally, attachments to carbon were limited to three.
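A composition check of this kind might look like the sketch below. The ranges are copied from the text; the assumption that each percentage is a fraction of the heavy atoms (or bonds) in a single structure, the rejection of elements outside the profile, and the names ATOM_LIMITS, BOND_LIMITS, and within_limits are illustrative only. How aromatic bonds are counted is not specified in the text and is left out here.

```python
# Percentage ranges (min, max) copied from the text.
ATOM_LIMITS = {"C": (17, 99), "O": (1, 50), "N": (1, 45), "F": (0, 17),
               "Cl": (0, 16), "S": (0, 13), "P": (0, 5), "Br": (0, 3),
               "I": (0, 2)}
# Triple bonds are not allowed, expressed here as a 0-0% range.
BOND_LIMITS = {"single": (42, 100), "double": (0, 30), "triple": (0, 0)}


def within_limits(counts, limits):
    """Return True if every category's share of the total lies in its range.

    Assumptions: shares are computed per structure, and any category that
    is absent from the limits table causes rejection.
    """
    total = sum(counts.values())
    if total == 0 or any(key not in limits for key in counts):
        return False
    for key, (low, high) in limits.items():
        share = 100 * counts.get(key, 0) / total
        if not low <= share <= high:
            return False
    return True


if __name__ == "__main__":
    seed_atoms = {"C": 6, "N": 1, "O": 1}  # heavy atoms of Nc1ccc(O)cc1
    print(within_limits(seed_atoms, ATOM_LIMITS))
    print(within_limits({"single": 9, "double": 1}, BOND_LIMITS))
```

Note that the minimum of 1% for both O and N means a pure hydrocarbon is rejected under this reading.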

Tables 7 through 9 show structures generated with GENSMI for approximate mass (AM) ranges of 150-250, 250-400, and 400-500. Each table is sorted by best similarity to a WDI 03.4 structure, and lists several of the highest-ranked structures and the single lowest-ranked one.

In Table 7, several similarities are over 0.8 and the worst is under 0.4. The branching metric is around 2.5 and reflects the nature of n-methyl, n-butyl, and n-propyl straight chain groups. The complexity metric ranges from about 28 to 43, except for the least similar structure, which has a complexity of almost 60. The cyclization metric varies from 1.2 to 1.4 and parallels the number of rings in the structures.

In Table 8, several similarities are over 0.8, with an overall highest of 0.84, and the worst is under 0.4. The branching metric is around 2.7 and reflects the isopropyl groups present. The complexity metric ranges from about 27 to 48, and that of the least similar structure is nearly 80. Again, the cyclization metric parallels the number of rings.

In Table 9, the similarities are around 0.74 and the worst is under 0.4. Again, the branching metric is around 2.7 and reflects the isopropyl groups present. The complexity metric ranges from about 32 to 36, and that of the least similar structure is just over 70. Again, the cyclization metric parallels the number of rings.

Each table shows the history of assembling the structures. The syntax is a dot-separated string that starts with the letter ``S'' for the seed SMILES ``Nc1ccc(O)cc1'', followed by any mixture of ``N'' and ``E'' entries for nodes and edges added to the seed. A node entry includes the index of the atom to which the new atom is attached, the atomic number of the new atom, and the bond type between the two atoms. An edge entry includes the two atom indexes in the bond separated by a dash ``-'', equal sign ``='', or pound sign ``#'', indicating a single, double, or triple bond. A string in this form can be used as input to GENSMI, and effectively bounds the algorithm at an arbitrary place in ``chemical space''. This may be useful for parallel processing across multiple GENSMI sessions.