GADDwiCC: GA-based Druglike Database with Combinatorial Chemistry Extension

Abstract

The GA-Based Druglike Database (GADD) program enumerates virtual libraries using the Daylight Reaction Toolkit. The program has been extended to constrain reaction transformations in a combinatorial chemistry fashion, and the source code has been made openly available. Example libraries based on a 2-amino-thiazole core and a diazepine core are shown.

Introduction

Goals
- Demonstrate that the Daylight Reaction Toolkit is especially powerful and useful in the area of combinatorial virtual libraries
- Openly provide source code as an application example and starting point for other programmers in Daylight "Contrib"
Design
- Input user-defined cores and R-groups
- Support up to 4 different R-group sets
- Allow R-groups to be used multiple times on a core
Approach
- Leverage the GADD program, authored and described in a MUG '01 presentation by Jeremy Yang.

Background

Virtual library design and enumeration is currently regarded as an important capability in drug discovery. Many pharmaceutical companies have developed programs to address this need, such as that described by Leach et al. (J. Chem. Inf. Comput. Sci., 99, 6, 1161-1170), which was built on Daylight tools.

In terms of Daylight software, one reasonable tool for library enumeration is the Monomer Toolkit, which provides objects and languages for describing molecules and mixtures resulting from combinatorial synthesis. The Monomer Toolkit is well suited to exhaustive enumeration of scaffold-based or oligomeric combinatorial libraries. Its Chortles language was designed to express all possible combinations. One limitation is that steric interference cannot be expressed.

Another reasonable tool for virtual library design is the Reaction Toolkit, which provides objects and languages for describing chemical reactions. Library design with the Reaction Toolkit is particularly useful for enumerating accessible structures, especially when based on known or inferred transformations.

The original GADD program uses the Reaction Toolkit to enumerate virtual libraries and provides the starting point for this work. The program input is a set of carbon fragments (CFRAGs) and hetero fragments (FRAGs). Each fragment has one or more connection points, each requiring a mate. Mating between CFRAGs (green) and FRAGs (blue) forms UMOLs. UMOLs can mate further with CFRAGs and FRAGs. Figure 1 illustrates the mating possibilities.

Figure 1: Each fragment has one or more connection points that require a mate. Mating between CFRAGs (green) and FRAGs (blue) forms UMOLs. UMOLs can mate further with CFRAGs and FRAGs.

The method for selecting CFRAGs and FRAGs is not a true Genetic Algorithm (GA); it is GA-based in the sense that it relies on probabilities, randomness, and fitness. Each CFRAG and FRAG has a "weight" corresponding to its selection probability. A weight of one for all fragments makes every fragment equally likely to be selected. The greater the weight, the higher the probability of selection. The selection probability of fragment i (P_i) is its weight (W_i) divided by the sum of the weights of all n fragments in the set:

P_i = W_i / SUM(W_1 ... W_n)
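
As a minimal sketch of the weighted selection described above, the following Python computes P_i = W_i / SUM(W) and draws one fragment at random in proportion to its weight. The fragment SMILES, the weights, and the function names are hypothetical illustrations and are not taken from the GADD source.

```python
import random

def selection_probabilities(weights):
    """Return P_i = W_i / sum(W) for each fragment weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]

def pick_fragment(fragments, weights, rng=random):
    """Pick one fragment at random, proportionally to its weight."""
    return rng.choices(fragments, weights=weights, k=1)[0]

# Hypothetical fragments and weights, for illustration only.
frags = ["C", "CC", "c1ccccc1"]
weights = [1.0, 1.0, 2.0]
probs = selection_probabilities(weights)  # [0.25, 0.25, 0.5]
chosen = pick_fragment(frags, weights, random.Random(0))
```

Setting every weight to the same value reduces this to uniform random selection.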

Separate probabilities are used for weighting random selection of CFRAGs and FRAGs. The fitness of each UMOL is evaluated against a set of constraints, including Lipinski's "Rule of 5". Table 1 summarizes an example set of fitness constraints.

Table 1: Example Set of Constraints
Constraint	Allowed Range
Charge	-2 to 2
Heavy Atoms	11 to 50
Molecular Weight*	200 to 500
Rotatable Bonds	0 to 8
Hydrogen-bond Donors*	0 to 5
Hydrogen-bond Acceptors*	0 to 10
Undesired substructures	none
Hydrophobicity* (ClogP)	-10 to 5
An asterisk (*) denotes a component of Lipinski's "Rule of 5".
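
The range checks in Table 1 can be sketched as a simple lookup, assuming the molecular properties have already been computed elsewhere. The property names, the inclusive bounds, and the example values are assumptions made for illustration. The undesired-substructure check is omitted because it needs a SMARTS matcher.

```python
# Allowed ranges from Table 1; treating both bounds as inclusive is an assumption.
CONSTRAINTS = {
    "charge": (-2, 2),
    "heavy_atoms": (11, 50),
    "mol_weight": (200, 500),
    "rotatable_bonds": (0, 8),
    "hbond_donors": (0, 5),
    "hbond_acceptors": (0, 10),
    "clogp": (-10, 5),
}

def violations(props, constraints=CONSTRAINTS):
    """List the names of properties that fall outside their allowed range."""
    return [name for name, (lo, hi) in constraints.items()
            if not lo <= props[name] <= hi]

def is_fit(props, constraints=CONSTRAINTS):
    """Return True if every property lies within its allowed range."""
    return not violations(props, constraints)

# Hypothetical property values, for illustration only.
example = {
    "charge": 0, "heavy_atoms": 20, "mol_weight": 284.7,
    "rotatable_bonds": 1, "hbond_donors": 0, "hbond_acceptors": 3,
    "clogp": 2.9,
}
too_heavy = dict(example, mol_weight=650.0)
```

Here `is_fit(example)` passes, while `violations(too_heavy)` reports only the molecular weight.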

UMOLs that satisfy fitness constraints become members of the population. UMOLs are reacted with an unlimited supply of hydrogen atoms, creating structures with zero remaining connection points (FMOLs). Terminal CFRAGs and FRAGs can also lead to FMOLs. Figure 2 illustrates the reaction of a UMOL with hydrogens to form an FMOL.

Figure 2: UMOLs that satisfy fitness constraints become members of the population. UMOLs are reacted with an unlimited supply of hydrogen atoms, creating structures with zero remaining connection points (FMOLs). Terminal CFRAGs and FRAGs can also lead to FMOLs.

UMOLs and FMOLs that satisfy fitness constraints are checked for uniqueness using the SMILES canonicalization algorithm (J. Chem. Info. Comput. Sci. 1989, 29, 97-101). If the FMOL is unique, it is written to a Thor database.
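
The uniqueness check can be sketched as a set lookup keyed on canonical SMILES. This is only an illustration: the `canonicalize` argument is a hypothetical stand-in for a real canonicalization routine from a cheminformatics toolkit, and the toy lookup table below is not a canonicalization algorithm.

```python
def unique_structures(smiles_list, canonicalize):
    """Keep the first occurrence of each structure, keyed by canonical SMILES."""
    seen = set()
    unique = []
    for smi in smiles_list:
        key = canonicalize(smi)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

# Toy stand-in: maps one alternative spelling of ethanol to a single form.
TOY_CANONICAL = {"OCC": "CCO"}

def toy_canonicalize(smi):
    return TOY_CANONICAL.get(smi, smi)

kept = unique_structures(["CCO", "OCC", "CCN"], toy_canonicalize)
```

Only structures that survive this filter would then be written to the database.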

In the next section, extension of the GADD program with combinatorial chemistry features is described.

Methodology

The key extension of the GADD program for combinatorial chemistry is discrimination of mates. The CORE and RGROUP fragments have been introduced, which are CFRAG and FRAG fragment sets constrained by labeled connection points (R1, R2, etc.). COREs and RGROUPs can only mate at matching connection points, to form UMOLs. COREs can have more than one connection point of the same type (e.g., two R2's). RGROUPs have one connection type only. Figure 3 illustrates combinatorial chemistry mating possibilities.

Figure 3: COREs and RGROUPs can only mate at matching connection points, to form UMOLs. COREs can have more than one connection point of the same type (e.g., two R2's). RGROUPs have one connection type only.

Mating continues between UMOLs and RGROUPs until zero connection points remain (FMOLs). Unlike the original GADD program, hydrogen atoms are not implicitly involved, but can be explicitly defined as a fragment.
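
The label-matching rule can be sketched as an exhaustive enumeration over a core's connection points. This sketch only pairs R-groups with sites. It does not build product SMILES or detect symmetry-equivalent assignments. The labels, the fragment strings, and the function name are hypothetical.

```python
from itertools import product

def enumerate_assignments(core_sites, rgroups):
    """Yield one R-group assignment per connection point on the core.

    core_sites: labels, one per connection point, e.g. ["R1", "R2", "R2"].
    rgroups:    dict mapping a label to the R-groups that carry that label.
    """
    choices = [rgroups[label] for label in core_sites]
    for combo in product(*choices):
        yield tuple(zip(core_sites, combo))

# Hypothetical core with one R1 site and two R2 sites.
core_sites = ["R1", "R2", "R2"]
rgroups = {
    "R1": ["[1*]C", "[1*]CC"],
    "R2": ["[2*]O", "[2*]N", "[2*][H]"],  # hydrogen defined explicitly as a fragment
}
library = list(enumerate_assignments(core_sites, rgroups))
print(len(library))  # 2 * 3 * 3 = 18
```

Because an R-group carries a single label, it can only ever be offered to sites with that label, while a label that appears twice on the core draws from the same R-group set twice.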

Experimental

Two experiments were performed to test the ability of GADDwiCC to produce virtual combinatorial libraries. The first experiment was based on a reaction between a thiourea and an alpha-haloketone to form a 2-amino-thiazole core, as described by Leach et al. The two N-amine positions were labeled R1 and R2, and the fragments at ring positions 4 and 5 were labeled R3 and R4, respectively (Figures 4 and 5).

Figure 4: The first experiment was based on a reaction between a thiourea and an alpha-haloketone to form a 2-amino-thiazole core, as described by Leach et al.

Figure 5: The two N-amine positions were labeled R1 and R2, and the fragments at ring positions 4 and 5 were labeled R3 and R4, respectively.

A second experiment was based on the 1,4-benzodiazepin-2-one core of Diazepam, popularly known as Valium. Ring positions 1, 3, and 5 were labeled R1, R2, and R3, respectively. The phenyl substituent, typically chloride, was labeled R4 (Figure 6).

Figure 6: A second experiment was based on the 1,4-benzodiazepin-2-one core of Diazepam, popularly known as Valium. Ring positions 1, 3, and 5 were labeled R1, R2, and R3, respectively. The phenyl substituent, typically chloride, was labeled R4.

Experiment I: Thiazoles

- Search ACD Database 2001.1 for structures containing NC(=S)[NH2] and [Cl,Br][CH]([*;!R])C(=O)[*;!R]
- Hit 960 thioureas and 575 alpha-haloketones
- Perform defragmentation transforms:
  [NH2][C:11](=S)[N,*:1]>>[NH2][CH:11](=S).[1*][N,*:1]
  and
  [*,*:3][C:11](=O)[C:12]([H])([Cl,Br:13])[*,*:4]>>[C:11]([H])(=O)[C:12]([H])([H])[Cl,Br:13].[3*][*,*:3].[4*][*,*:4]
- Fragmented R1- and R2-groups into 962 SMILES
- Fragmented R3-groups into 363 SMILES
- Fragmented R4-groups into 144 SMILES
- Virtual library size is 50,285,664 structures
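
The library size for Experiment I is the product of the three fragment counts listed above, with the R1- and R2-groups counted as one pooled set. A short check, with variable names chosen only for illustration:

```python
from math import prod

# Fragment counts reported for Experiment I.
rgroup_counts = {"R1/R2": 962, "R3": 363, "R4": 144}

library_size = prod(rgroup_counts.values())
print(f"{library_size:,}")  # 50,285,664
```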

Experiment II: Diazepines

- Search WDI Database 2001.4 for structures containing [N;R1]1C(=O)CN=Cc3[cH]cc[cH][cH]31
- Hit 164 1,4-benzodiazepin-2-ones
- Perform defragmentation transform:
  [*,*:1][N;R1:11]1C(=O)[C:12]([*,*:2])N=C([*,*:3])c3cc([*,*:4])ccc31 >> [H][N;R1:11]1C(=O)[C:12]([H])N=Cc3ccccc31.[1*][*,*:1].[2*][*,*:2].[3*][*,*:3].[4*][*,*:4]
- Fragmented R1-groups into 38 SMILES
- Fragmented R2-groups into 31 SMILES
- Fragmented R3-groups into 18 SMILES
- Fragmented R4-groups into 13 SMILES
- Virtual library size is 284,544 structures
- Bad SMARTS: N1C(=O)[$([CH0]);!$(C([OH])(C(=O)[OH]));!$(C([2*]))]N=Cc3ccccc31

Defragmentation of hits and tabulation of R-group frequencies have been automated with the tdt2gadd script. The GADDwiCC algorithm is limited to reactions that form a non-ring single bond to the core. Therefore, R-groups that involve formation of rings or multiple bonds were not included in the initial search.

Running the Algorithm

The GADDwiCC algorithm is built into the GADD program and is invoked with the '-combichem' option. This is the only difference between running GADDwiCC and running the original GADD algorithm. In the original GADD algorithm, the '-cfrags' and '-frags' options specify files containing carbon and hetero fragments, respectively. When the '-combichem' option is specified, the same options specify files containing cores and R-groups, respectively. The following output shows the GADD help page.

% $DY_ROOT/contrib/src/applics/gadd/gadd -help
************************************************************
| gadd    - GA-based Druglike Database evolver using       |
|           clogp-type frags/cfrags or                     |
|           combinatorial chemistry principals.            |
************************************************************
|                                                          |
| gadd [opts] -frags <file> -cfrags <file>                 |
|                                                          |
|   -frags  <file> ... input frags smiles file (reqd)      |
|   -cfrags <file> ... input cfrags smiles file (reqd)     |
|                                                          |
| opts:                                                    |
|   -combichem ....... perform combinatorial chemistry     |
|   -exhaustive ...... deterministic core and r-groups     |
|   -db <dbspec> ..... Thor db to create                   |
|   -dbsize <npri> ... Size of db to create                |
|   -nmax <N> ........ Maximum population to achieve       |
|   -genmax <I> ...... Maximum number of generations       |
|   -sample <J> ...... Select only every jth hit           |
|   -nofraglist ...... Don't record frags used             |
|   -message <msg> ... Description of experiment           |
|   -rand <N> ........ seed for random number generator    |
|   -seed <file> ..... initial file of UMOLs               |
|   -noendgame ....... don't cleanup (leave UMOLs)         |
|   -endgameonly ..... just cleanup existing gadd db       |
|   -fmolsonly ....... delete generation 0 from db         |
|   -nofavorites ..... don't weight fragment selection     |
|   -quiet ........... minimal progress reports            |
|   -help ............ this help                           |
|                                                          |
| fitness opts:                                            |
|   -min_wt MINWT ...... min molweight [200]               |
|   -max_wt MAXWT ...... max molweight [500]               |
|   -min_charge MINQ ... min total formal charge [-2]      |
|   -max_charge MAXQ ... max total formal charge [+2]      |
|   -min_rb MINRB ...... min rotatable bond count [0]      |
|   -max_rb MAXRB ...... max rotatable bond count [8]      |
|   -min_hbd MINHBD .... min h-bond donor sites [0]        |
|   -max_hbd MAXHBD .... max h-bond donor sites [5]        |
|   -min_hba MINHBA .... min h-bond acceptor sites [0]     |
|   -max_hba MAXHBA .... max h-bond acceptor sites [10]    |
|   -min_cp MINCP ...... min clogp value [-10]             |
|   -max_cp MAXCP ...... max clogp value [+5]              |
|   -min_heavy N ....... min heavy atoms [15]              |
|   -max_heavy N ....... max heavy atoms [50]              |
|                                                          |
|   -nofitness .......... Ignore fitness considerations    |
|   -badsmarts <file> ... Reject matches with SMARTS       |
|                                                          |
| Runtime debug options:                                   |
|   -silent ... no information                             |
|   -terse .... reduce information                         |
|   -dbinfo ... additional information                     |
|   -dbmem .... show handle count                          |
|   -dbpos .... show file position                         |
|   -dbload ... show FRAG and CFRAG I/O                    |
|                                                          |
************************************************************
|                   Daylight CIS Inc.                      |
|              Toolkit Contributed Program                 |
************************************************************

A convenience script called gaddwicc has been provided to simplify command-line arguments, automatically view results, and rerun experiments. The script option "-name thiozole" will create a database named "thiozole" from the thiozole.cores and thiozole.rgroups files. After the algorithm completes, the database is loaded into a Merlin pool for viewing. The script checks for a pre-existing database before execution and automatically replaces it when the experiment is rerun.

Table 2: Script Commands and Required Files to Create the Original GADD and New GADDwiCC Databases
Algorithm	Database	Script Command	Required files
GADD	Minoxidil	gaddwicc -name minoxidil	minoxidil.cfrags, minoxidil.frags
GADDwiCC	Thiozole	gaddwicc -name thiozole -combichem	thiozole.cores, thiozole.rgroups
GADDwiCC	Diazepine	gaddwicc -name diazepine -combichem -badsmarts diazepine.bad	diazepine.cores, diazepine.rgroups, diazepine.bad
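
The naming pattern in Table 2 can be sketched as two small helpers that build the script arguments and list the input files each run expects. The helper names are hypothetical. Only the options and file extensions shown in Table 2 are used.

```python
def gaddwicc_command(name, combichem=False, badsmarts=None):
    """Build the gaddwicc argument list for an experiment name."""
    argv = ["gaddwicc", "-name", name]
    if combichem:
        argv.append("-combichem")
    if badsmarts:
        argv += ["-badsmarts", badsmarts]
    return argv

def required_files(name, combichem=False, badsmarts=None):
    """List the input files the run expects, following Table 2."""
    exts = ("cores", "rgroups") if combichem else ("cfrags", "frags")
    files = [f"{name}.{ext}" for ext in exts]
    if badsmarts:
        files.append(badsmarts)
    return files

cmd = gaddwicc_command("diazepine", combichem=True, badsmarts="diazepine.bad")
files = required_files("diazepine", combichem=True, badsmarts="diazepine.bad")
print(" ".join(cmd))
```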

Experiments were performed on a 450MHz SUN UltraSPARC-IIi.

Results

The algorithm has been characterized in terms of drug-likeness, fitness, uniqueness, and productivity. Drug-likeness (D) depends on the presence of the FMOLs in the WDI Database 2001.4. Structures not found in the WDI database decrease D.

Drug-likeness (D) = INTERSECTION(GADD, WDI) / WDI

Fitness (F) depends on the number of fit products per reaction. Unequal selection probabilities and ignorance of symmetry decrease F.

Fitness (F) = FIT(Products) / Reactions

Uniqueness (U) depends on the number of unique FMOLs relative to all FMOLs. Introduction of UMOLs into the population and ignorance of symmetry decrease U.

Uniqueness (U) = UNIQUE(FMOLs) / (FIT(Products) * # of R-groups)

Productivity Rate (R) depends on the number of unique FMOLs and the time taken. Low values of F and U decrease R.

Productivity (R) = UNIQUE(FMOLs) / Time
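
The F, U, and R ratios can be written directly as small functions, as a sketch. All counts passed in below are hypothetical values chosen for illustration, not figures from the program output. Only the 11m 7s run time comes from Experiment I.

```python
def fitness(fit_products, reactions):
    """F = FIT(Products) / Reactions."""
    return fit_products / reactions

def uniqueness(unique_fmols, fit_products, n_rgroups):
    """U = UNIQUE(FMOLs) / (FIT(Products) * number of R-groups)."""
    return unique_fmols / (fit_products * n_rgroups)

def productivity(unique_fmols, seconds):
    """R = UNIQUE(FMOLs) / Time, in structures per second."""
    return unique_fmols / seconds

# Hypothetical counts, for illustration only.
f = fitness(fit_products=230, reactions=1000)
u = uniqueness(unique_fmols=77, fit_products=25, n_rgroups=4)
# Assuming 1000 unique FMOLs produced in 11m 7s (667 s).
r = productivity(unique_fmols=1000, seconds=11 * 60 + 7)
print(f"F = {f:.0%}, U = {u:.0%}, R = {r:.1f}/s")
```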

Experiment I: Thiazoles

The Thiozole DB is a sample of TDTs drawn from the possible structures; the run completed in 11m 7s. The characterization terms can be computed from the program output.

D = 0%
F = 23%
U = 77%
R = 1.5/s

The Drug-likeness (D) term shows that none of the products are in the WDI. This is because the R-groups in the sample do not overlap between the ACD and the WDI. The Fitness (F) term shows that most reactions produce unfit products, which are removed from the population. The Uniqueness (U) term shows that a significant number of products are identical to previous results.

Cost of Fitness Constraints. The experiment was rerun with the '-nofitness' option and completed in 5m 4s. Evaluating the fitness constraints can reduce the Productivity Rate (R) by roughly a factor of 2.

D = 0%
F = N/A
U = 81%
R = 3.3/s

Efficiency of Random Sampling. The experiment was rerun with the '-exhaustive' option and completed in 1m 27s. The Productivity Rate (R) increased by almost an order of magnitude, which reveals an inefficiency in the random sampling technique.

D = 0%
F = 67%
U = 91%
R = 11.5/s

Lower Bound of Production Rate. The experiment was rerun with the '-nofitness' and '-exhaustive' options and completed in 52s. The Productivity Rate (R) reflects a lower bound for an implementation that consistently produces fit, unique products.

D = 0%
F = N/A
U = 92%
R = 19.2/s

Experiment II: Diazepines

The Diazepine DB is a sample of TDTs. Characterization terms can be computed from the program output.

D = 11%
F = 24%
U = 63%
R = 1.7/s

Review of the benzodiazepine search hits showed that one of the two substituents at position 2 was always hydrogen, except for geminal hydroxyl and carboxylate. A "Bad SMARTS" pattern was used to reject products that did not meet these criteria. This feature is perhaps the only redeeming quality over the design of a similar Diazepine virtual combinatorial library based on the Monomer Toolkit, as described in an IBC presentation by Craig James and Dr. David Weininger.

Conclusion

- The GADD program has been extended with Combinatorial Chemistry features
- Virtual libraries can be produced using the Reaction Toolkit
- Drug molecules will be produced if reactant R-groups match those found on drugs
- The "Bad SMARTS" feature allows for selective enumeration of a library
- The source code is openly available in Daylight "Contrib"

Future Directions

- Use a real GA and a reaction gene
- Perform retrosynthetic analysis to bridge the gap between starting materials and drug databases
- Explore starting materials and reactions databases
- Screen for a pharmacophore model
- Estimate absorption, distribution, metabolism, elimination, and toxicity
- Align structures

Acknowledgements

Jeremy Yang, Daylight
Jack Delany, Daylight

Daylight Chemical Information Systems, Inc.
info@daylight.com
