The Hitch Hiker's Guide

to Chemical Space
by Anthony Nicholls,

OpenEye Scientific Software, Inc.

From the Introduction to "The Hitch Hiker’s Guide to Chemical Space":

Q: What is Chemical Space?

A: It's where 10²⁰⁰ small, possibly drug-like, molecules live.

Q: 10²⁰⁰. Isn’t that a big number?

A: Yes, it's big. Really big. You just won't believe how vastly hugely mindbogglingly big it is. I mean, you think your average combinatorial library is big, but that's just peanuts compared to Chemical Space. Listen:

(it goes on for a bit here before finally settling down with things you really need to know)

Q: Sounds scary. Do you have anything helpful to say about finding your way around in Chemical Space?

A: Lots! For starters…

The important Question:

Of 10²⁰⁰ molecules, how many are really different?

Only differences in:

Scalar properties/descriptors, 1D
Chemical Bonds, aka graphs, 2D
Physical Structure, 3D

that produce differences in:

Biological Activity
Bioavailability
Toxicity

really matter (for drug design).

Fundamental OpenEye Credo:

Shape Matters

Shape = Field Properties:

Such as:

Steric

Electrostatic

Functions derived from atom types

(e.g. hydrophobic, hydrogen bond potential)

Why we feel shape is probably sufficient

Consider the following problems in traditional energetic approaches to the important problem of calculating binding energies:

General Hydrophobic Effect (binding event, plus unbound conformers ranking)
Discrete Water Effects (binding)
Entropy of Binding (Solid Body)
Conformation Entropy (ligand and protein)
Large Scale Protein Motions
Polarization of Charge Distributions
Hydrogen Bonds

i.e. why worry about "precise" vacuum energies?!?

To have a shape one needs a structure:

Computational Structure Generation

General Methods:

Rule Based, e.g. CONCORD
Distance Geometry, e.g. Rubicon
Build Up
Exhaustive Search

However, structure generation has to take account of the aqueous environment of biological processes, i.e. we need:

Solvation Modeling:

The output from Solvent Modeling has two uses:

Energies, to Rank Structures
Derivatives (Forces), to Minimize Structures

Possible Methods include:

All atom simulations (FEP)

VERY Time consuming

Accessible Area (e.g. Ooi-Scheraga)

Fast but inaccurate

GBSA (Still)

Relatively fast, accurate for small molecules

Poisson-Boltzmann (PB)/ Hydrophobic Area terms

PB much faster than FEP, slower than Area based methods

We use PB because we believe it provides the highest ratio of accuracy to CPU cycles, once certain implementation problems are addressed.

OpenEye Advances in PB

Use of Gaussian Functions to derive a smooth dielectric function

Quadratic interpolation to map properties to and from the grid upon which PB is solved
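To make the first point concrete, here is a minimal sketch of a Gaussian-smoothed dielectric. The functional form, the Gaussian widths and the values eps_in = 1 and eps_out = 80 are illustrative assumptions, not the function used in the OpenEye code. The idea it shows is only that a product of atom-centred Gaussian terms changes smoothly from the interior value to the solvent value, with no hard molecular surface for a grid point to jump across.

```python
import math

def smooth_dielectric(point, atoms, eps_in=1.0, eps_out=80.0):
    """Dielectric at `point` for atoms given as ((x, y, z), radius) pairs.

    Assumed form (for illustration only): each atom contributes a factor
    1 - exp(-r^2 / radius^2), which is 0 at the atom centre and tends to 1
    far away. The product of the factors measures how far "outside" the
    molecule the point is.
    """
    outside = 1.0
    for center, radius in atoms:
        r2 = sum((p - c) ** 2 for p, c in zip(point, center))
        outside *= 1.0 - math.exp(-r2 / radius ** 2)
    return eps_in + (eps_out - eps_in) * outside

# Two overlapping hypothetical atoms, scanned along the x axis in 0.1 A steps.
atoms = [((0.0, 0.0, 0.0), 1.5), ((1.2, 0.0, 0.0), 1.2)]
profile = [smooth_dielectric((0.1 * i, 0.0, 0.0), atoms) for i in range(100)]
```

Because the function has no discontinuity, shifting the grid by a fraction of a spacing changes each sampled value only slightly, which is the property behind the grid-displacement stability claimed under Results.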

Results:

Energies stable with respect to grid displacement (see graph below)

[Graph: acetate, vacuum]

Much faster solve and set up

Derivatives for Solvent Minimization

Comparing Molecular Shapes

Difference in Shape = Difference in Fields

(i.e. what is the overlap of A with B)

A Field is an ordered set of numbers

(e.g. think of a grid/lattice representation)

The difference between two ordered sets of numbers is a metric

(admittedly in an infinite dimensional space)

Metrics are our friends.

(aka: the triangle inequality: d_ab + d_bc >= d_ac rules, ok)
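A small sketch of the point, assuming the simplest possible choices: each "molecule" is a sum of spherical Gaussians sampled on a cubic lattice, and the field difference is the L2 norm of the pointwise difference. The atom positions, the Gaussian width and the lattice spacing are made up for illustration.

```python
import numpy as np

def gaussian_field(centers, grid, width=1.0):
    """Sum of unit-height Gaussians at `centers`, sampled at the rows of `grid`."""
    centers = np.asarray(centers, dtype=float)
    diff = grid[:, None, :] - centers[None, :, :]      # (points, atoms, 3)
    d2 = (diff ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2)).sum(axis=1)

def field_distance(f, g, cell_volume):
    """L2 distance between two fields sampled on the same lattice."""
    return float(np.sqrt(((f - g) ** 2).sum() * cell_volume))

# Cubic lattice: the field becomes one long ordered list of numbers.
spacing = 0.5
axis = np.arange(-6.0, 6.0 + spacing, spacing)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
cell = spacing ** 3

field_a = gaussian_field([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]], grid)
field_b = gaussian_field([[0.0, 0.0, 0.0], [0.0, 1.5, 0.0]], grid)
field_c = gaussian_field([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]], grid)

d_ab = field_distance(field_a, field_b, cell)
d_bc = field_distance(field_b, field_c, cell)
d_ac = field_distance(field_a, field_c, cell)
triangle_holds = d_ab + d_bc >= d_ac
```

The squared distance expands to (A,A) + (B,B) - 2(A,B), so minimizing the field difference is the same as maximizing the overlap (A,B) when the self-overlaps are fixed.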

Why are metrics so wonderful?

Example 1:

Exhaustive Search for the Best

Overlay = Minimal Field Difference

Traditionally a 6-dimensional search problem

But Consider:

Two molecules A and B
Each center of mass at (0,0,0)
Generate a set {B} of N positions and orientations of B relative to the original B
Order {B} in a distance tree, based on each member's overlap with the original B
Find the best overlap of {B} with A via the distance tree
Number of comparisons ~ log(N)
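The steps above can be sketched with a vantage-point tree, one common kind of distance tree. This is a stand-in, not necessarily the tree used in the talk. To keep the example short, each pose of B is reduced to a hypothetical 3-vector and the field difference is replaced by Euclidean distance; any function that obeys the triangle inequality could be passed as `dist` instead.

```python
import math
import random

class Node:
    def __init__(self, item, radius, inside, outside):
        self.item = item          # the pivot pose stored at this node
        self.radius = radius      # median distance from the pivot to the rest
        self.inside = inside      # subtree of poses with distance <= radius
        self.outside = outside    # subtree of poses with distance > radius

def build_tree(items, dist, rng):
    """Build the tree once; it depends only on distances among the poses of B."""
    if not items:
        return None
    items = list(items)
    pivot = items.pop(rng.randrange(len(items)))
    if not items:
        return Node(pivot, 0.0, None, None)
    dists = [dist(pivot, x) for x in items]
    radius = sorted(dists)[len(dists) // 2]
    inside = [x for x, d in zip(items, dists) if d <= radius]
    outside = [x for x, d in zip(items, dists) if d > radius]
    return Node(pivot, radius,
                build_tree(inside, dist, rng), build_tree(outside, dist, rng))

def nearest(node, query, dist, best=(math.inf, None), evals=None):
    """Return (distance, pose) of the pose closest to `query`."""
    if node is None:
        return best
    d = dist(query, node.item)
    if evals is not None:
        evals[0] += 1             # count comparisons against the query
    if d < best[0]:
        best = (d, node.item)
    if d <= node.radius:
        near, far = node.inside, node.outside
    else:
        near, far = node.outside, node.inside
    best = nearest(near, query, dist, best, evals)
    # Triangle inequality: every pose on the far side is at least
    # |d - radius| away from the query, so skip it if that cannot win.
    if abs(d - node.radius) < best[0]:
        best = nearest(far, query, dist, best, evals)
    return best

rng = random.Random(0)
poses = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(2000)]
tree = build_tree(poses, math.dist, rng)

query = (0.1, -0.3, 0.7)          # plays the role of molecule A
evals = [0]
best_d, best_pose = nearest(tree, query, math.dist, evals=evals)
```

The count in `evals[0]` is the number of comparisons against A. The ~log(N) figure is the favourable case; how much the triangle inequality can prune depends on the effective dimensionality of the data.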

Example 2:

Minimal Metrics =

Metric of a Different Order

Given a non-projective Operation T

i.e. the minimum field difference between two molecules is also a metric.

An example of T would be a rotation and/or translation, and hence the best overlay between two molecules defines an interesting minimal metric, in particular for the space it "induces".
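The formula from this slide did not survive, so the following is a reconstruction of the argument under stated assumptions rather than the original statement. Assume d is a field metric, T ranges over a group of operations that preserve it (rigid rotations and translations do), and the minimum is attained.

```latex
% Assumed definition of the minimal metric
d_{\min}(A,B) \;=\; \min_{T}\, d(A,\,TB),
\qquad d(TA,\,TB) = d(A,B) \ \text{ for all } T .

% Triangle inequality: let T_1 attain d_min(A,B) and T_2 attain d_min(B,C)
d_{\min}(A,C) \;\le\; d(A,\,T_1 T_2 C)
            \;\le\; d(A,\,T_1 B) + d(T_1 B,\,T_1 T_2 C)
            \;=\;   d(A,\,T_1 B) + d(B,\,T_2 C)
            \;=\;   d_{\min}(A,B) + d_{\min}(B,C).
```

Symmetry follows in the same way from d(A, TB) = d(T⁻¹A, B). Strictly, d_min is zero for any two rigid copies of the same molecule, so it is a metric on shapes regardless of position and orientation, which is exactly what an overlay distance should be.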

Shape Space

The N*(N-1)/2 optimal overlays of N molecular fields form an N*N distance matrix, D.

From a distance matrix one can derive what is known as the metric matrix, G. If G is diagonalized, the number of eigenvalues significantly different from zero determines the "effective" dimensionality of the geometric space that N points can inhabit to give rise to those N*(N-1)/2 distances, and the diagonalization also yields the positions of those points.

The space formed from the diagonalization, and subsequent culling of insignificant dimensions, of the metric matrix derived from a set of molecular overlay distances we refer to as Shape Space.
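A minimal sketch of this construction, using the standard distance-geometry recipe (double centring of the squared distances, then an eigendecomposition). The relative cutoff `tol` used to cull insignificant dimensions is an arbitrary assumption, and the test distances come from random points in 3D rather than from real overlays.

```python
import numpy as np

def metric_matrix(D):
    """Metric (Gram) matrix G from a distance matrix D, origin at the centroid."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring operator
    return -0.5 * J @ (D ** 2) @ J

def shape_space(D, tol=1e-8):
    """Return coordinates in the significant dimensions, plus all eigenvalues."""
    G = metric_matrix(D)
    w, v = np.linalg.eigh(G)                   # eigenvalues in ascending order
    order = np.argsort(w)[::-1]
    w, v = w[order], v[:, order]
    keep = w > tol * w[0]                      # cull insignificant dimensions
    coords = v[:, keep] * np.sqrt(w[keep])
    return coords, w

# Stand-in for overlay distances: 20 points that secretly live in 3 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

coords, eigenvalues = shape_space(D)
dimensionality = coords.shape[1]
```

For real overlay distances the eigenvalue spectrum will not drop cleanly to zero, so the choice of cutoff becomes a judgment about how much distance error is acceptable.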

Q: What is the dimensionality of the Shape Space of 10²⁰⁰ small molecules?

A: Probably that of a MUCH smaller set

Q: How much smaller? 10¹⁸⁰, 10²⁰ (if so, we’re in trouble!) or 10²?

We plan to find out via the fields of the structures corresponding to:

Random smiles strings
Combinatorial libraries
Reaction Pathways

My Guess? Dimensionality < 100. Why? Because, as we shall show later, the steric field can usually be approximated very well by 30-40 variables. We anticipate a smaller dimensionality for electrostatic fields.

Uses of a Shape Space Decomposition:

Only D+1 "difficult" overlays are ever required (where D = dimensionality of the Shape Space); after that, a ~10⁵ speed-up in finding the best overlays.
Fundamental (geometric) measures of diversity, fundamental shape descriptors
Clever organization of databases of molecular shapes
Multiple properties form product Shape Spaces

PLUS: Tames 10²⁰⁰ molecules, sort of.
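One way to read the "D+1 overlays" claim is as trilateration: in a D-dimensional Euclidean space, the distances from a new point to D+1 reference points in general position fix its coordinates. The sketch below assumes exactly that (a Euclidean Shape Space, exact distances, references not all in one hyperplane); the function name and test data are hypothetical.

```python
import numpy as np

def locate(refs, dists):
    """Coordinates of a point from its distances to D+1 reference points.

    refs  : array of shape (D+1, D), reference positions in Shape Space
    dists : array of shape (D+1,), overlay distance to each reference

    Subtracting the first sphere equation |x - r_0|^2 = d_0^2 from the others
    removes |x|^2 and leaves a D x D linear system.
    """
    refs = np.asarray(refs, dtype=float)
    dists = np.asarray(dists, dtype=float)
    r0, d0 = refs[0], dists[0]
    A = 2.0 * (refs[1:] - r0)
    b = (refs[1:] ** 2).sum(axis=1) - (r0 ** 2).sum() - dists[1:] ** 2 + d0 ** 2
    return np.linalg.solve(A, b)

# Hypothetical 4-dimensional Shape Space with 5 reference molecules.
rng = np.random.default_rng(1)
refs = rng.normal(size=(5, 4))
new_molecule = rng.normal(size=4)
dists = np.linalg.norm(refs - new_molecule, axis=1)   # the D+1 "difficult" overlays

estimate = locate(refs, dists)
```

Once a molecule has coordinates, its distance to any other placed molecule is a vector subtraction rather than an overlay search, which is where a large speed-up would come from.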

All of the above concerns global similarity, but what of:

i.e. A and B fit poorly on top of each other, and yet they both have an essential element of common shape.

Local Shape Comparisons:

Two Novel Metric Approaches:

Ellipsoidal Domain Decomposition

Surface Metrics

Surface Contour - 1D
Pointwise Characteristic Functions - 0D

Ellipsoidal Domain Decomposition

The Idea:

Use a smooth Gaussian function for the molecular field (typically steric) and seed N ellipsoidal Gaussians within the molecule. Optimize the variables describing these Gaussians so as to minimize the field difference between their sum and the molecular field.

The Result:

(for 1-3 Ellipsoidal Gaussians for a small, 2 ringed molecule)

One Ellipsoid Fit

Two Ellipsoid Fit

Three Ellipsoid Fit

By analyzing the fit of each ellipsoid to the underlying atoms we can robustly determine that the 2 ellipsoid fit is the best representation for this molecule.

Most molecules are fit by 3-4 ellipsoids to within 10-15 Å³, i.e. less than the volume of one methyl group. As each ellipsoid has 10 free parameters, the steric field can thus be well represented by 30-40 parameters, and hence our hope for a shape space of similar dimensionality.
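To make the parameter count concrete, here is one plausible parameterization, assumed for illustration: an amplitude (1), a centre (3) and a symmetric 3x3 shape matrix (6) give 10 numbers per ellipsoid. The sketch also writes down the quantity to be minimized, the integrated squared difference between the sum of ellipsoids and the molecular field. The optimization itself is left out, and the target field is built from two blobs so that the two-ellipsoid model is exact by construction.

```python
import numpy as np

def unpack(params):
    """Split 10 numbers into amplitude, centre and symmetric 3x3 shape matrix."""
    p = np.asarray(params, dtype=float)
    amplitude, center = p[0], p[1:4]
    xx, yy, zz, xy, xz, yz = p[4:10]
    shape = np.array([[xx, xy, xz],
                      [xy, yy, yz],
                      [xz, yz, zz]])
    return amplitude, center, shape

def ellipsoid_field(params, grid):
    """Ellipsoidal Gaussian amplitude * exp(-(r-c)^T M (r-c)) at the grid points."""
    amplitude, center, shape = unpack(params)
    d = grid - center
    return amplitude * np.exp(-np.einsum("ni,ij,nj->n", d, shape, d))

def misfit(param_sets, target, grid, cell_volume):
    """Integrated squared difference between the model and the target field."""
    model = sum(ellipsoid_field(p, grid) for p in param_sets)
    return float(((model - target) ** 2).sum() * cell_volume)

spacing = 0.5
axis = np.arange(-6.0, 6.0 + spacing, spacing)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
cell = spacing ** 3

# Hypothetical "two-ringed" target: two spherical blobs side by side.
left = [1.0, -1.2, 0.0, 0.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0]
right = [1.0, 1.2, 0.0, 0.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0]
target = ellipsoid_field(left, grid) + ellipsoid_field(right, grid)

# A single elongated ellipsoid (hand-picked, not optimized) versus the two-blob model.
one_fit = [[1.2, 0.0, 0.0, 0.0, 0.15, 0.5, 0.5, 0.0, 0.0, 0.0]]
two_fit = [left, right]

misfit_one = misfit(one_fit, target, grid, cell)
misfit_two = misfit(two_fit, target, grid, cell)
```

In a real fit, a general-purpose optimizer such as `scipy.optimize.minimize` could vary all 10N parameters to reduce `misfit`, starting from seeds placed inside the molecule.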

Uses of Ellipsoidal Decompositions:

Look Really Pretty

Automatic Fragmentation

Provide Injections to Shape Space

Clique Detection/ Overlay

Local Property Descriptions:

2D Ellipsoidal Surface Functions
Scalar/Vector Characterization

Ellipsoidal Diversity Measure

Many More! (Very Rich Description)

Final Conclusions:

Chemical Space is Vast but Mappable

Useful Guide is Local Domain Decomposition

Two approaches to the (almost) Ultimate Question to Life, the Universe and Everything:

"What looks like molecule X?":

Retrieval from huge databases

De Novo Generation, e.g. via GA

Finally, Like Hitch Hiking, Waiting for OpenEye Tools requires Patience!

Current State of OpenEye Code:

Structure Generation ~ α

Solvent Ranking/ Minimization ~ β

Fast Overlay > α

Docking ~ α

Ellipsoidal Decomposition ~ β

Surface Metricization < α

Until then:

KEEP

BANGING

THE

ROCKS

TOGETHER!

Thanks to:

Peter Jeffs @GlaxoWellcome and Tony Wilkinson @Zeneca for funding for some of these projects

Andy Grant @Zeneca for his work, ideas and enthusiasm

Roger Sayle for help and wise thoughts

Daylight for the chance to present

Dave-I-Think-Big-Therefore-I-Am-Weininger for the courage to try