The Application Of Data Fusion To Similarity Searching In Chemical Databases

The Application Of Data Fusion To Similarity Searching In Chemical Databases

Claire M.R. Ginn, Sonia S. Ranade and Peter Willett

Krebs Institute of Biomolecular Research, Department of Information Studies, University of Sheffield, Sheffield S10 2TN, United Kingdom

John Bradshaw

GlaxoWellcome Research and Development Limited, Gunnels Wood Road,

Stevenage, SG1 2NY, United Kingdom

Keywords: chemical database, fusion, molecular similarity, similarity searching

Similarity Searching

Similarity searching is an exploratory tool which is used widely in the pharmaceutical and agrochemical industries. Given a target molecule, typically one that has been shown to exhibit some biological activity of interest, a similarity search of a chemical database aids the procedure of finding a new lead, that is a chemical which exhibits the desired activity but which, to be useful, is not an obvious derivative of the target structure.

The basic hypothesis for similarity searching in chemical databases is the "similar property principle" (Johnson and Maggiora, 1990), which states that compounds that are chemically alike in some way will have similar biological activities. Similarity is obviously a subjective quality and attempts to quantify it, so that it can be processed by a computer, can only be incomplete. However, it can be said that two compounds are similar with respect to a particular descriptor or a particular feature. There are many ways in which the structural similarities between pairs of molecules can be calculated, and there has been much debate as to which similarity measure is the best for this purpose (see, e.g., Brown and Martin, 1996; Downs and Willett, 1995). However, these studies have shown that it cannot be known, a priori, which measure will be best for a given target structure or for a given type of activity. This paper considers the use of data fusion methods for combining different similarity measures, with the aim of producing a combined measure of inter-molecular structural similarity that would be more effective than the individual measures on which it is based.

Data fusion was first used to combine database search outputs in the context of text retrieval systems, where documents are ranked in order of decreasing similarity with a user’s free-text query. Thus, Belkin et al. (1995) investigated the combination of both similarity scores and ranks using a variety of simple arithmetic fusion algorithms and reported some success in terms of increasing the retrieval of relevant documents. Initial work in the field of chemical information has been reported by Kearsley et al. (1996) and by Ginn et al. (1997), both of whom have carried out similarity searches for drug molecules in the Standard Drug File database. The present study reports further work with the arithmetic fusion functions studied by Ginn et al., considering their application to a very different sort of biological activity data.

Experimental Details

The process of similarity searching in general has three stages: the generation of descriptors; the determination of the similarity of each database molecule to that of a target molecule; and then the ranking of the database in decreasing order of the calculated similarities (Downs and Willett, 1995). The most common descriptors for

Activity	Subclass	No. of Actives
Lysosome	L1	10
	L2	17
Staining	L3	3
	L4	32
Probes	L5	18
Mitochondria	M1	24
Staining Probes	M2	21
Nuclear Staining	N1	21
Probes	N2	22

Table 1 The division of the L, M and N activity classes into their respective mechanism-specific subclasses.

similarity searching are fingerprints, which are binary strings where each bit represents the presence or absence of a particular feature. Fingerprints can encode topological information (such as the number of bonds of a specific type in a molecule), 2D substructural fragments (such as specific functional groups or ring systems), or geometric 3D descriptors (such as inter-atomic distances or torsion angles), with versions of the last being used to characterise both rigid and conformationally flexible molecules. Other descriptors rely on less ‘bond-specific’ information, such as the EVA descriptor (Ginn et al., 1997) or global physicochemical properties (Kearsley et al., 1996). Having described each molecule in the database it is necessary to judge how similar its description is to the corresponding descriptor for the target molecule. This is done using a similarity coefficient such as the Euclidean distance or an association coefficient such as the Tanimoto coefficient. The molecules in the database are then ranked in decreasing similarity with the target molecule. Applying the similar property principle, the compounds near the top of the list should have a similar activity to the target compound and the top-ranking molecules, the hits or nearest neighbours, can be examined for a possible new lead chemical.

This study is based on a dataset of 140 biological dyes used to stain cells so as to visualise various organelles (Rashid, 1991). These organelles are the lysosomes, the mitochondria, the nucleus, the cytoplasm, the golgi complex, the plasmalemma, fat droplets and the endoplasmic reticulum. Of these activity classes the staining of the lysosomes (L), mitochondria (M) and nucleus (N) were thought to be the most reliable, with these activity classes being subdivided into the following mechanism-specific subclasses: L₁, L₂, L₃, L₄, L₅, M₁, M₂, N₁ and N₂ (Ranade, 1997). It can be seen from Table 1 that class L₃ has only three actives. This is too few for a meaningful analysis to be carried out, and only the remaining eight subclasses were hence considered, with each molecule being allocated an 8-bit bit-string in which the i-th bit was set to unity if the molecule exhibited the i-th activity. For example, the bit-string 01001001 for compound X would indicate that X belonged to the L₂, M₁ and N₂ activity classes.

Three different types of descriptor were used to characterise the molecules in this dataset: 2D fragments, 3D fragments and physical properties. 2D fragment descriptors provide a topological description of the molecule. The descriptors used here were the fingerprints produced by the Makefrag, Makedict and Makebits routines produced by Barnard Chemical Information Limited (Ranade, 1998). The molecular similarity between the fingerprints representing two molecules was calculated using the Tanimoto Coefficient. This is defined as

where C is the total number of common bits set in both fingerprints, A is the number of bits set in the target’s fingerprint and B is the number of bits set in the hit’s fingerprint. Fingerprints based on 3D fragment descriptors used the NBN (non-bonded, bonded and non-bonded torsion angle) measure developed by Bath et al. (1994), with the similarity between the fingerprints for two structures again being calculated using the Tanimoto Coefficient. Finally, three global physicochemical properties were considered in generating this descriptor for each structure: the logarithm of the octanol/water partition coefficient (a standard measure of hydrophobicity), the net electric charge of the molecule and the conjugated bond number (a count of the number of bonds included in the delocalised electron system of the molecule) (Horobin, 1982). These values were standardised so that direct comparisons could be made between the different descriptors without being biased by a large score from a particular property, and each molecule represented as a three-element vector, K = (a,b,c). Molecular similarity between two structures was calculated by using the generalised form of the Tanimoto coefficient:

where K and L are vectors representing molecules, j indicates the j-th component of the vector, n is the length of the vector, in this case 3.

Each molecule in the database was considered as a target for similarity searching by each of the three methods. However, incomplete data was available for four structures and so they were excluded. This left 136 compounds: 10 of which could be considered as targets for activity L₁, 17 for L₂, 32 for L₄, 18 for L₅, 24 for M₁, 21 for M₂, 21 for N₁ and 22 for N₂, as can be seen from Table 1. Similarity searches were carried out using each of the three descriptors, and the three rankings for each target molecule were fused using the SUM, MIN and MAX fusion algorithms defined below. Given a hit h with ranks r₁, r₂, ..., r_n from similarity methods s₁, s₂, ..., s_n:, the fused score is given by:

SUM , or

The output of the fusion algorithm is taken as the hit’s new similarity score and the fused lists are then re-ranked in descending order of similarity to the target.

Results

The results have been evaluated in two ways: the first considers the three principal activity subclasses individually; and the second the Hamming Distance between the bit-strings representing the activities of the target and of a hit.

The results for the individual activity subclasses were assessed by calculating the mean prediction of activity, M, at each rank n across t targets:

where a_i is the activity of the compound (either 1 or 0) at rank i for a target t_j in the activity class being considered.

Figure 1 shows the mean prediction of activity at each rank n (1 £ n £ 10) for each of the original similarity methods and each of the fusion methods for activity subclass L₄. On initial inspection these results do not appear to be significant, in so far as one or more fusion methods do not appear to be more successful at producing active compounds as hits than the original similarity methods. Similar results were obtained for the other seven activity subclasses. These results are summarised in Table 2 which shows the mean number of actives in the top ten hits for each activity subclass. It will be seen that the SUM fusion method seems to be "about as good as the best" similarity measure, and that the best similarity measure, in terms of actives being highly ranked, varies across activity subclasses. Thus, one would expect that the SUM fusion method would do well when assessed on the activity classes as a whole rather than the individual subclasses. The figures for the mean performance (mean number of actives in top ten hits) in the bottom line of this table provide some support for this belief.

The second method employed for assessing the activity data was that of the Hamming Distance, which is defined as:

where t_i and h_i are the i-th components of the activity bit-strings for the target and for the hit, respectively. The Hamming Distance denotes the number of times that the two bit-strings differ: thus, a Hamming Distance of 0 indicates that the hit matches the target’s activity in all activity subclasses whereas a Hamming Distance of 8 indicates that the hit matches the target’s activity in none of its activity subclasses. This method was used as an

Figure 1. The mean prediction of activity for activity subclass L₄ at each rank n = 1-10.

Mean Number Of Actives In Top Ten Hits
Activity	Original Methods			Fusion Methods
Subclasses	2D	Phys	3D	MAX	MIN	SUM
L1	1.40	3.05	1.12	2.96	2.02	3.25
L2	2.14	3.50	5.93	4.36	3.57	5.00
L4	5.53	6.35	3.69	6.00	5.81	6.16
L5	5.33	4.44	4.06	5.17	4.67	5.5
M1	2.29	6.17	2.50	4.96	4.08	5.04
M2	6.48	5.00	5.52	6.52	5.86	6.17
N1	2.71	4.43	2.71	3.19	3.29	3.81
N2	3.67	4.19	3.67	4.00	4.24	4.00
Mean	3.99	4.64	3.65	4.65	4.19	4.93

Table 2. The mean number of actives in the top ten hits for each activity class

for the original similarity methods and after data fusion.

evaluation criterion to assess how well a hit, produced by a similarity measure or a fusion measure, matched a target’s precise activity. For example, if the target is active for subclasses L₁, M₁ and N₁ then a Hamming Distance of 0 between a hit and the target indicates that the hit is also active in subclasses L₁, M₁ and N₁and only in those classes, and would thus be a most appropriate hit for that target molecule.

We define the mean Hamming Distance as:

where D_i is the Hamming Distance at rank i for a target t_j. Fig. 2 shows the mean Hamming Distance for each similarity method across all 136 targets for n £ 10, and it can be seen that the SUM and MAX fusion algorithms give results that are consistently better (i.e., a smaller mean Hamming Distance) than those from any of the individual similarity methods. A pairwise comparison of similarity methods was carried out using the Wilcoxon Matched-Pairs Signed-Ranks Test (Siegel and Castellan, 1988). Specifically, the test was used to compare the Hamming Distances for each fusion method with each of the original similarity methods, target by target, and thus to indicate whether the two methods that are being compared are significantly different. Table 3 shows the p values for n = 1-10. It can be seen that SUM is significantly better than each of three original similarity methods for all values of n, with 28 out of 30 comparisons being highly significant. MAX also performs well and is found to be significantly better than two of the original similarity methods for n = 1-10. The performance of MIN is noticeably inferior to the other two fusion methods studied here.

Fig. 2. The mean Hamming Distance at each rank n, 1 £ n £ 10.

p values for n=1-5

n=1

n=2

n=3

n=4

n=5

Method

MAX

MIN

SUM

MAX

MIN

SUM

MAX

MIN

SUM

MAX

MIN

SUM

MAX

MIN

SUM

<0.01

0.18

<0.05

<0.01

0.24

<0.01

0.79

<0.01

0.43

<0.01

0.58

<0.01

0.84

<0.01

0.53

<0.01

0.43

<0.01

0.43

<0.01

0.42

<0.01

Phys

<0.01

0.34

<0.01

0.58

<0.01

0.54

<0.01

0.33

<0.01

0.38

<0.01

p values for n=6-10

n=6

n=7

n=8

n=9

n=10

Method

MAX

MIN

SUM

MAX

MIN

SUM

MAX

MIN

SUM

MAX

MIN

SUM

MAX

MIN

SUM

<0.01

0.42

<0.01

0.32

<0.01

0.38

<0.01

<0.05

<0.01

0.20

<0.01

<0.05

<0.01

<0.05

<0.01

Phys

0.43

0.60

<0.01

0..07

0.53

<0.01

0.11

0.43

<0.05

0.21

0.43

<0.05

0.36

0.28

<0.05

Table 3. The p values from the Wilcoxon Test for (a) n=1-5 and (b) n=6-10 Light shading

indicates p<0.05, heavy shading indicates p<0.01.

Conclusions

In this paper, we have used data fusion to combine rankings of chemical compounds that have been generated using several different measures of inter-molecular structural similarity. Our results show that the fused similarity measures can enable better predictions to be made of the cell-staining activities of the molecules than can the original measures. The good performance of SUM and, to a lesser extent, MAX indicate that the fusion process is more successful, in terms of ranking actives highly, than are the original similarity methods. This improvement has been shown to be statistically significant in the case of the SUM Hamming Distance experiments. When we take account of the rather variable performance of the individual similarity measures from one activity to another, it can be concluded that SUM-based fusion provides effective ways of generating a reliable single ranking with respect to both a single activity and the activity classes as a whole.

The success of this study can be compared with the results obtained in two other recent studies of chemical data fusion, both of which have involved similarity searches for drug molecules in the Standard Drug File database. Kearsley et al. (1996) discuss a method for fusing similarity measure scores in any user-defined linear combination (whereas the SUM algorithm used here considers all the scores to be of equal importance, and uses ranks rather than scores) and also the MIN algorithm used here. They found that for each of the two-descriptor combinations they investigated, approximately half of the fused searches were better than the original, individual measures. It was never the case that any combination of descriptors was less successful than the worst descriptor in the combination considered on its own. Ginn et al. (1997) have reported database searching experiments in which SUM, MIN and a modified version of SUM were used to fuse rankings based on the EVA descriptor, which is based on information derived from infra-red vibrational spectra, and on 2D fingerprints. They found that the use of data fusion on the two types of ranking resulted in combined rankings that contained very different sets of nearest neighbours and that often performed better in simulated property prediction than did the individual measures. When taken with the results reported in this paper, it sems reasonable to conclude that data fusion provides a simple way of increasing the effectiveness of similarity searches in chemical databases.

Acknowledgement. We thank the Engineering and Physical Sciences Research Council and GlaxoWellcome Research and Development Limited for funding, and Barnard Chemical Information Limited and Tripos Inc. for software support.

References

Bath, P.A.; Poirrette, A.R.; Willett, P.; Allen, F.H. (1994). "Similarity Searching In Files Of Three-Dimensional Chemical Structures: Comparison of Fragment-Based Measures of Shape Similarity." Journal of Chemical Information and Computer Sciences, 34, 141-147.

Belkin, N.J.; Kantor, P.; Fox, E.A.; Shaw, J.B. (1995). "Combining The Evidence Of Multiple Query Representations For Information Retrieval"; Information and Processing Management, 31, 431-448.

Brown, R.D.; Martin, Y.C. (1996). "Use of Structure-Activity Data to Compare Clustering Methods and Descriptors for Use in Compound Selection"; Journal of Chemical Information and Computer Sciences, 36, 572-584.

Downs, G.M.; Willett, P. (1995). "Similarity Searching in Databases of Chemical Structures", Reviews in Computational Chemistry, 7, 1-66.

Ginn, C.M.R.; Turner, D.B.; Willett, P.; Ferguson, A.M.; Heritage, T.W. (1997). "Similarity Searching in Files of Three-Dimensional Chemical Structures: Evaluation of the EVA Descriptor and Combination of Rankings Using Data Fusion." Journal of Chemical Information and Computer Sciences, 37, 23-37.

Horobin, R.W. (1982). Histochemistry; Butterworths, London; UK.

Johnson , M.A.; Maggiora, G.M. (editors) (1990). "Concepts and Applications of Molecular Similarity"; John Wiley; New York.

Kearsley, S.K.; Sallamack, S.; Fluder, E.M.; Andose, J.D.; Mosely, R.T.; Sheridan, R.P. (1996). "Chemical Similarity Using Physicochemical Property Descriptors."; Journal of Chemical Information and Computer Sciences, 36, 118-127.

Ranade, S.S. (1998). Prediction of Cellular Uptake of Foreign Chemicals Using Cluster Analysis, PhD. Thesis, University of Sheffield.

Rashid, F. (1991). Choosing And Designing Biological Stains, PhD Thesis, University of Sheffield.

Siegel, S. and Castellan, N.J. (1988). Nonparametric Statistics, McGraw-Hill, New York.