MUG '98 -- 12th Annual Daylight User Group Meeting -- 26 February 1998

Adding non-structural data into fingerprints

John Bradshaw
GlaxoWellcome, Stevenage, Herts SG1 2NY, UK

Background

Traditionally, searching of chemical databases was restricted to a search for superstructures of a given target molecule. Whilst one can also do substructure searches in DAYLIGHT, most other vendors do not offer this. A major advance was made in chemical structure searching with the introduction of similarity searches using molecular fingerprints. This allowed scientists to ask for things 'like' their query molecule. Various measures have been introduced to enable the user to vary what is meant by 'like'.

When we are searching associated data, the tools on offer tend to be equivalent to the super/substructure matching, i.e. we do numerical range searches and a variety of string searches. With the exception of the ARES string searching and the name similarity demonstrated by Jack Delany at an earlier MUG, most effort has gone into moving the molecular data into the world of non-structure data by techniques such as non-linear mapping or multidimensional scaling. For example, see Martin, Eric J.; Blaney, Jeffrey M.; Siani, Michael A.; Spellmeyer, David C.; Wong, Alex K.; Moos, Walter H. Measuring Diversity: Experimental Design of Combinatorial Libraries for Drug Discovery. J. Med. Chem. (1995), 38(9), 1431-6.

The aim of this work is to create an environment in which one could rank compounds by similarity to a target using not only the structural data but also other data associated with the compounds.

Given the powerful tools already available for comparing binary strings (fingerprints), we have concentrated our efforts on investigating how to convert other data types into a form suitable for inclusion in such a representation.

It is, of course, essential that these binary representations are appropriate for similarity calculations, so that objects which are close in a real space will be found to have high similarity by the appropriate measure in the binary space.

This is entirely equivalent to the development of multiple linear regression for QSAR. It was learned very early that biological activity, in particular, was not dependent on a single descriptor.

A quick revision: Types of variable and hash functions.

Basically there are four types of data variable: nominal, ordinal, interval and ratio scaled. Nominal variables are often defined by strings of characters, e.g. red, green, blue; alkaloid, soluble in water, active against HIV, etc., or, as a special case within DAYLIGHT, a SMARTS or set of SMARTS, e.g. *@;!:*, which says this compound belongs to the class of compounds which have two atoms connected by a non-aromatic ring bond. What we need is an efficient way to convert such character strings into numbers which retain the properties of nominal variables.

Again we are on familiar DAYLIGHT territory. Hash functions are at the very heart of THOR and SMILES disc storage. A simple example is provided for those unfamiliar with these concepts.

The use of hash functions gives us an ideal way to take any character string describing a class and create a number which can be used to set bit(s) in a fingerprint. As this looked like the simplest class of data variable to handle, we started with nominal variables.
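As an illustration only, here is a minimal Python sketch of the idea; the MD5 hash, the 1024-bit width and the example label are arbitrary choices for the sketch, not the scheme used inside THOR.

  import hashlib

  FP_WIDTH = 1024   # fixed fingerprint width in bits; arbitrary for this sketch

  def bits_for_string(s, nbits=1):
      # Hash a character string to `nbits` positions in the fingerprint.
      # MD5 is used purely for convenience; any well-behaved string hash
      # (such as a linear congruential scheme) would serve the same purpose.
      positions = []
      for i in range(nbits):
          digest = hashlib.md5(f"{s}#{i}".encode()).digest()
          positions.append(int.from_bytes(digest[:4], "big") % FP_WIDTH)
      return positions

  def set_bits(fp, positions):
      # The fingerprint is held as a Python integer used as a bit vector.
      for p in positions:
          fp |= 1 << p
      return fp

  # e.g. a Chapman and Hall classifier string becomes one or more bit positions
  fp = set_bits(0, bits_for_string("TOCN<Pharmacological Agents - Analgesics - opioid>"))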

Data represented by nominal variables

As part of a PhD project on consensus classifiers at the Centre for Molecular Design at the University of Portsmouth, we have been making use of the data in Chapman and Hall's Dictionary of Natural Products, which we hold in a THOR database. A typical record is seen for morphine.
Of particular importance are the lines

TOCN<XA0600>
TOCN<XA0670>
TOCN<XA5730>
TOCN<XA6070>

which are a subset of the type of compound classifiers used by Chapman and Hall. These particular ones expand out to

TOCN<XA0600> TOCN<Pharmacological Agents - Anaesthetics, local>
TOCN<XA0670> TOCN<Pharmacological Agents - Analgesics - opioid>
TOCN<XA5730> TOCN<Pharmacological Agents - Opioid receptor agonists>
TOCN<XA6070> TOCN<Pharmacological Agents - Psychotropic agents>

This illustrates several points:

Fergus Lippi at the University of Portsmouth has developed weighting schemes for these classifiers. It is not yet clear how specific these schemes need to be.
We have chosen to start only from the leftmost character, and do all overlapping substrings from left to right, to follow the normal direction of interpretation.

Currently, for the string VX2900 we would assign:

String   Bits set
V        7
VX       4
VX2      3
VX29     1
VX290    1
VX2900   1

This ensures that, in the absence of other information, the class identity is maintained. In a mixed environment, it may be possible to decrease the number of bits.
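A minimal Python sketch of this overlapping-prefix scheme follows; the hash function, the fingerprint width and the per-prefix bit counts (taken straight from the VX2900 table above) are illustrative assumptions, not the weighting scheme actually in use.

  import hashlib

  FP_WIDTH = 1024

  def bit_positions(s, nbits):
      # Map a string to `nbits` positions in a FP_WIDTH-bit fingerprint.
      return [int.from_bytes(hashlib.md5(f"{s}#{i}".encode()).digest()[:4], "big") % FP_WIDTH
              for i in range(nbits)]

  def fingerprint_class_string(s, fp=0):
      # Set bits for every left-anchored prefix of a classifier string.
      # The per-prefix bit counts follow the VX2900 table above: more bits
      # for the shorter, more general prefixes.
      counts = ([7, 4, 3] + [1] * len(s))[:len(s)]
      for length, nbits in zip(range(1, len(s) + 1), counts):
          for p in bit_positions(s[:length], nbits):
              fp |= 1 << p
      return fp

  fp = fingerprint_class_string("VX2900")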

In a Daylight world, we would hash the whole string, including the data tag, to allow us to mix information.
Strings need to be ordered in some unique way for this approach to be effective. If the string is a natural-language label or a constructed hierarchical classifier, this is not a problem. However, with something like a SMARTS, which we are allowed to write in several equivalent orderings, we immediately have difficulties.
[F,Cl,Br,I] and [I,Br,Cl,F] are equivalent, as are more obtuse versions such as [F,$([Cl,Br,I])]. Either we need some smarter algorithms to create the numbers from the strings, or a systematic SMARTS representation, for example Roger Sayle's work presented at EuroMUG97.
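To make the difficulty concrete, here is a naive Python sketch (illustration only, not a proposed solution) which normalises only the simplest case:

  import re

  def naive_canonical_smarts(smarts):
      # Sort comma-separated alternatives inside simple bracket atoms so
      # that [F,Cl,Br,I] and [I,Br,Cl,F] hash to the same bits.  It does
      # NOT recognise that [F,$([Cl,Br,I])] is equivalent to either of
      # them, which is why a systematic SMARTS representation is needed.
      def sort_atom(match):
          return "[" + ",".join(sorted(match.group(1).split(","))) + "]"
      return re.sub(r"\[([^][$]+)\]", sort_atom, smarts)

  assert naive_canonical_smarts("[F,Cl,Br,I]") == naive_canonical_smarts("[I,Br,Cl,F]")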

Data represented by ordinal variables

We can represent ordinal variables by the same method used for nominal variables, with one important addition: we need to set the bits for all the classes, or variables, contained within the target.
So, for example, suppose we have a compound with two H-bond donors; we need to set the bits for one H-bond donor also. Note that, as these are counts, the class boundaries are unequivocal and exact. Measures, on the other hand, are represented by interval and ratio scaled variables; these are not exact, so any class boundaries or partitions are equivocal.
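A minimal Python sketch of this rule (the HDON tag is hypothetical, the hash is arbitrary, and the remapping of numeric labels described next is ignored for the moment):

  import hashlib

  FP_WIDTH = 1024

  def bit_position(label):
      # Map a class label to a single position in a FP_WIDTH-bit fingerprint.
      return int.from_bytes(hashlib.md5(label.encode()).digest()[:4], "big") % FP_WIDTH

  def fingerprint_ordinal(tag, level, fp=0):
      # Set the bit for the observed level and for every level below it,
      # so a compound with two H-bond donors also carries the one-donor bit.
      for i in range(1, level + 1):
          fp |= 1 << bit_position(f"{tag}<{i}>")
      return fp

  fp = fingerprint_ordinal("HDON", 2)   # sets the bits for HDON<1> and HDON<2>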

To deal with numeric labels we are adopting the following strategy. If we simply took a string representing the data for the count of rotatable bonds, say RB<1>, and adopted the strategy above, it would be very similar to RB<15>, and so on. This is not what is wanted. So we map the strings onto RB<A> and RB<O> respectively, offsetting the count by 64 and taking the character equivalent.
These have the 'tag' bits in common, as required, but no others, except by chance. The extra offset of 64 is to allow the linear congruential generator to work properly for short strings. These are the problems of working with a language without a fixed vocabulary!
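A small Python sketch of the remapping described above:

  def remap_count_label(tag, count):
      # Offset the count by 64 and use the character equivalent, so that
      # RB<1> becomes RB<A> and RB<15> becomes RB<O>.  Under the prefix
      # scheme the remapped labels share only the 'tag' bits (R, RB, RB<).
      return f"{tag}<{chr(count + 64)}>"

  print(remap_count_label("RB", 1))    # RB<A>
  print(remap_count_label("RB", 15))   # RB<O>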

Data represented by interval and ratio scaled variables

When we move to interval and ratio scaled variables, the problems increase enormously. We cannot simply bin the data and treat it as ordinal: as described above, we have problems of precision and accuracy. A more robust way would appear to be to make use of Gray codes. An n-bit Gray code is an enumeration of all n-bit strings such that successive elements differ in one bit position. This should allow the mapping of reals of defined precision onto a binary space, where numbers which are close on the real line have most bits in common. We have done no more than begin to look at the extensive literature in this area as yet, but are convinced it is a way forward.
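As a first Python sketch of the idea (the logP example, the range and the 8-bit precision are arbitrary choices for illustration):

  def gray_encode(value, lo, hi, nbits=8):
      # Bin a real value of defined precision onto 2**nbits intervals and
      # return the binary-reflected Gray code of the bin index; adjacent
      # bins then differ in exactly one bit position.
      x = min(max(value, lo), hi)                      # clamp to the range
      index = round((x - lo) / (hi - lo) * ((1 << nbits) - 1))
      return index ^ (index >> 1)                      # binary-reflected Gray code

  # e.g. a logP of 2.3 on a -2..8 scale, encoded to 8-bit precision
  print(format(gray_encode(2.3, -2.0, 8.0), "08b"))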

Mixing the data. The (mis)use of FPP

There are, in principle, two ways to handle these extra data:
  1. Mix the data into the fingerprint.
  2. Regard this new fingerprint as a different view of the molecule and create the equivalent of the FPP used for mixtures. Instead of each segment representing the structure fingerprint of a different component of the mixture, it represents a different view of the same molecule.

Currently we favour the second approach as it retains the flexibility we require: we do not have to precalculate weighting schemes, which we would with the first alternative.
It also allows us to make use of data fusion techniques we are developing with Peter Willett's group at Sheffield, and similarity measures such as the Tversky index, where appropriate, at run time. In the true spirit of Merlin, it should not be necessary for the user to know, up front, why and how a particular database was constructed; they should simply be able to explore it.
If we do need to store weights, they can be handled in the same way that stoichiometry is dealt with in mixtures and reactions.
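A minimal Python sketch of the second approach follows; the segment layout, the weights and the simple weighted-sum fusion are illustrative assumptions, not the data-fusion rules being developed.

  def tanimoto(a, b):
      # Tanimoto similarity between two fingerprints held as Python ints.
      common = bin(a & b).count("1")
      either = bin(a | b).count("1")
      return common / either if either else 1.0

  def tversky(a, b, alpha=1.0, beta=1.0):
      # Tversky index; alpha = beta = 1 reduces to the Tanimoto coefficient.
      common = bin(a & b).count("1")
      only_a = bin(a).count("1") - common
      only_b = bin(b).count("1") - common
      denom = common + alpha * only_a + beta * only_b
      return common / denom if denom else 1.0

  def fused_similarity(views_a, views_b, weights, measure=tanimoto):
      # Each segment of the FPP-like record is a different view of the
      # molecule (structure, classifiers, counts, ...); the per-view
      # similarities are fused with weights chosen at run time.
      sims = [measure(a, b) for a, b in zip(views_a, views_b)]
      return sum(w * s for w, s in zip(weights, sims)) / sum(weights)

  # e.g. two molecules, each held as a structural segment and a classifier segment
  mol1 = [0b101101, 0b1100]
  mol2 = [0b101001, 0b0100]
  print(fused_similarity(mol1, mol2, weights=[0.7, 0.3]))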

Shortcomings

The major reservation about this work is that the fingerprints become database dependent; they do not have the apparent universality of structural fingerprints.
However, I would argue that structural fingerprints only tend to be the same because, when we build a database, we use either default settings or some in-house rule. For example, within GW we tend to use the same fixed-width 1024-bit fingerprints for all our databases, for sub/superstructure searching, similarity and clustering, despite advice to the contrary.

We need to work harder to link in the datatypes database and information in the $FPG record so that at worst we have a 'def-before-ref' world.

Conclusions

We believe this is a goal worth pursuing. It has the potential to move us into the similarity equivalent of multivariate data analysis. Almost certainly we have not got it right on this first pass through but, hopefully, with input from as many folk as possible, we can develop a useful data tool.

Acknowledgements

Centre for Molecular Design at the University of Portsmouth
Fergus Lippi
David Salt
Martyn Ford
Mike Lipkin

Chapman and Hall
Fiona Macdonald for permission to use the data from the Dictionary of Natural Products.

Daylight CIS Inc
In particular Jack Delany


