Hamming-Grey Codes for Fingerprinting Descriptors
Roger Sayle
Bioinformatics Group,
Metaphorics LLC,
Santa Fe, New Mexico.
1. Introduction
This talks describes a number of schemes for encoding numeric and
non-numeric values in binary fingerprints. Encodings are suggested
that significantly reduce the number of bits required by existing
methods. These reductions are achieved by defining the required
"resolution" of similarity a priori. Different "codes" are
proposed for both the "Hamming" and "Tanimoto" measures of similarity.
Finally we present the example application of these encoding methods
to pseudo-structural similarity and to 3D pharmacophore matching.
These solutions attempt to answer some of the problems posed by
John Bradshaw in his MUG98 talk entitled "
Adding non-structural data into fingerprints".
2. Fingerprints and Similarity
2.1 Overview
A "fingerprint" is a fixed length bit vector, often used to describe the
features of a molecule in chemical information systems, such as those
provided by Daylight or MDL. Conventionally each bit is used to encode
the presence (or absence) of a particular structural "motif", such as the
occurance of a particular bond or atom type. These "binary" fingerprints
may then be used for pre-screening database searches or as a measure of
"similarity" between two database entries. The characteristics of encoding
for these two applications are outlined below.
For pre-screening database searches, a fingerprint is considered only
if it contains a prescribed bit pattern. Algebraically, if A is a query
fingerprint and B a database fingerprint, if all the features of A appear
in B then A&B == A (where "&" is the bitwise AND operator). This technique
is typically used for superstructure searching. This use of fingerprints
is most effective when no value's bits are a subset of another value's
bits, and when each value typically has several/many bits set. For example,
encoding a descriptor by all zeros, provides no ability to pre-screen a
search for that value, and encoding a descriptor by all ones, leads to
false drops on all queries over that descriptor column.
For similarity comparisons between fingerprints, several measures are
commonly used including Hamming, Euclidean, Tanimoto and Tversky indices.
In this article, we restrict ourselves to describing Hamming and Tanimoto
suitable encodings. Hamming "distance" is defined as the number of bits
that are different between two binary fingerprints, [A^B] (where "[x]" is
the number of bits set in "x", and "^" is the bitwise XOR operator).
The Tanimoto coefficient is defined as the number of bits in common
between the two fingerprints divided by the bits set in either fingerprint,
[A&B]/[A|B] (where "|" is the bitwise OR operator).
To encode descriptors suitable for Hamming comparison, similar encodings
should have similar bit patterns. Additionally for Tanimoto comparison
all encodings should preferably have the same number of bits, with
similar encodings sharing the same bits set.
2.2 Composition of Fingerprints
Finally, in this section we briefly mention composition of descriptor
encodings. The use of fingerprint composition allows multiple descriptors
to be compared in a single "logical" operation. Composition of fingerprints
(written here as A:B where A and B are "subfingerprints") by concatenation
of the two bit vectors into a single fingerprint. A number of trivial
identities hold over concatenated fingerprints, [A:B] = [A]+[B], (A:B)^(C:D)
= (A^C):(B^D), etc...
Perhaps the most useful corollary of the above idenities is that
Hamming(A:B,C:D) = Hamming(A,C) + Hamming(B,D). This allows us to
test for similarity over several descriptors simultaneously, potentially
trading similarity of one with disimilarity of another. This trade-off
may be customized by choosing encodings such the Hamming distance between
any pair of related significant descriptors is higher than between any
pair related by less significant descriptors.
One trivial way of achieving this without affecting the descriptor
encodings, is to simply encode the more significant descriptors multiple
times. Duplicating a subfingerprint twice effectively doubles the number
of bits that are different in a similarity comparison,
Hamming(A:A:B,C:C:D) = 2*Hamming(A,C) + Hamming(B,D). Although this
method is effective typically it is possible to achieve the same result
through changing the descriptor encodings.
Finally, notice that it is difficult to derive Tanimoto(A:B,C:D) from
Tanimoto(A,C) and Tanimoto(B,D). However, such a definition is possible
if any encoding has a constant number of bits. Rewriting Tanimoto(A,B)
as the more common [A&B]/([A]+[B]-[A&B]), and replacing [A] and [B] by
the constant "kAB", [A&B]/(kAB-[A&B]). Which is a function of the
number of bits in common. Composition then results in Tanimoto(A:B,C:D)
= ([A&C]+[B&D]) / ((kAC+kBD)-([A&C]+[B&D])), which once again is a function
of the number of bits in common.
2.3 Conversion of Hamming Encodings to Tanimoto Encodings
There is a clever trick for converting descriptor encodings intended
for Hamming similarity into encodings suitable for Tanimoto similarity
and database search pre-screening. This method was first introduced
for error detecting codes by Berger in 1961. The method is that given
a descriptor encoding A, represent this in a fingerprint as A:~A (where
"~" is the bitwise NOT operator), i.e. compose the encoding with its
own compliment. These Berger Codes have the property that
exactly half of their bits are set (making them a subset of Sperner
codes). Now all encodings of a descriptor have the same number of
bits. Given codes of length n, A and B which are distance
x=Hamming(A,B) apart, we can define y the Tanimoto threshold for the
corresponding Berger codes, Tanimoto(A:~A,B:~B). By analysis
[(A:~A)&(B:~B)]=n-x and [A:~A]=n, hence y=(n-x)/(n+x).
One other advantage of Berger codes (and their principal use in error
correcting and delay-insensitive asynchronous circuits) is that no code
is a subset of another. As mentioned previously this is a desirable
property for pre-screening database searches.
3. Enumeration (Nominal) Data
Nominal data is where a descriptor may take any of m different
states, but these states cannot be arranged in any meaningful order.
A special case is where it has two states, usually represented by the
digits 0 and 1 when it is called a binary or boolean variable. The
only notion of similarity for such variables is equality.
We do not consider the case where a variable may be in more than one
state simultaneously.
3.1 Single Bit (1-of-m) Codes
A trivial, but commonly used method, is to represent such descriptors
as a m-bit vector, where each encoding has only a single bit
set. Although this encoding is inefficient for large m it has
several benefits. No code is a subset of another, all encodings contain
the same number of bits, every code is Hamming distance 2 from every
other. However the greatest benefits are that it is possible to encode
variables in multiple states in the same scheme, and that the sparse
fingerprints are easily folded.
3.2 Arbitrary Conventional Codes
Another obvious choice is to encode each state as an integer between
0..m-1, and encode the resulting value in the conventional
ceil(log2(m)) bits. A Hamming distance
greater than zero indicates that the descriptors are in different states.
Note that no meaning may be assigned to distances greater than zero.
3.3 Hamming Codes
We have seen that the weighting of fingerprints can be increased by
duplicating bits. A better encoding technique is to use Hamming
Codes. A set of Hamming codes are a set of code words that have
atleast a specified number of bits different between any two non-identical
codes. The use of a parity bit can be seen as ensuring that all distinct
codes have a hamming distance of atleast two. To encode m values
with a minimum Hamming distance of k requires 2k-1
additional bits. This results in an encoding scheme requiring
ceil(log2(m))+2k-1 bits.
Once again note that no meaning may be ascribed to Hamming distances greater
than or equal to k.
Link 1: Hamming Codes for m=16
4. Numeric (Interval Scaled) Data
Numeric (or cardinal) data is where a descriptor may take any of m
sequential integer values. These are nominal states that not only may be
arranged in a meaningful order but also differences between sucessive
values are of equal sizes. Ordinal variables (as defined in John Bradshaw's
talk) where sucessive values are not equally similar may be handled by
mapping these values onto suitable numeric values. For example, the
formal charges -3,-2,-1,0,1,2,3 may be mapped onto the values 0,1,2,4,6,7,8
making same signed charges similar, neutral to charged changes less similar
and sign changes less similar still.
We do not consider the encoding of variables that may have more than one
numeric value, or range information giving upper and/or lower bounds on
values. We also do not consider cyclic values, where the largest value
is considered similar to the smallest (such as times or angles).
4.1 Explicit Run Encoding
The traditional way of encoding m distingishable values is to use
m-1 bits. Each value in the range 0..m-1 being represented
by the appropriate number of consecutive 1 bits followed by the remaining
0 bits. Two values that differ by delta values, are a Hamming
distance on delta apart. This representation becomes increasingly
inefficient with larger values of m.
4.2 Binning Values
Another less than satifactory solution is to "bin" values into a smaller
number of subintervals. For example, percentage values between zero and
one hundred, could be placed in ten bins: 0-10, 11-20, 21-30, ..., 91-100.
These 10 bin values can then be encoded either as nominal states (with no
notion of similarity between bins) or as numeric data but with a smaller
range. One problem (as demonstrated above) is that its not always possible
to place values in equal sized bins. However the greatest disadvantage
with such a scheme is the "boundary effects", where very similar values
that fall either side of a partition are considered disimilar, and where
all values within a bin are considered equal.
4.3 Hamming-Grey Codes
In this talk, we present an improved solution to encoding numeric
descriptors in binary fingerprints. The observation is initially
made that if the complete range of similarity values needs to be
ordered with respect to a query, then the "Explicit Run" encoding
of section 4.1 can be shown optimal. However in practice such
complete orderings are rarely required, most queries only consider
similarity over a restrcited range and need not rank disimilar
values precisely. This is similar to the "Horizon Effect" in Tanimoto
similarity where all database entries that have no bits in common
with a query look identical. Hence we propose a numeric descriptor
encoding that can be used to rank all neighboring values within some
delta, but is unable to distinguish values further than delta
away.
The proposed solution combines the coding methods described by Hamming
and Grey, and shall therefore be refered to as Hamming-Grey codes. As
has already been mentioned in section 3.3, Hamming codes are n
bit codes that are guarantee a Hamming distance of atleast k
between distinct values. Grey codes are an ordering of n bit
codes where each code differs from the previous code by a single bit,
i.e. Hamming(Grey(i),Grey(i+1))=1. Example Grey codes for 16 values
are given in link 2 below. Note that Grey
codes do not describe an encoding "per se", but an ordering on codes.
Grey codes themselves cannot be used directly to encode similarity as
several pairs of codes (in addition to neighboring values) have a
Hamming distance of 1, i.e. only a single bit difference. A Grey code
can be envisioned as a non-intersecting path through a n-dimensional
space traversing only a single axis on each move. Hamming codes can be
considered suitably distant points in the same n-dimensional space.
Hamming-Grey codes are an encoding of numeric values, paramterized by
two values, the length of the code and the resolution delta of
the encoding. Each code has the property that sucessive values differ
by only a single bit. Hamming(Hamming-Grey(i),Hamming-Grey(i+1)) = 1.
Similarly, if two values differ by x, less than delta, then
their two encodings have a Hamming distance of x. Any pair of values
further than delta apart are guaranteed to have a Hamming distance
of atleast delta. This is expressed by the constraint, that if
Hamming(Hamming-Grey(A),Hamming-Grey(B)) < delta, then
Hamming(Hamming-Grey(A),Hamming-Grey(B)) = abs(A-B). This
encoding allows different queries to retrieve descriptors up to a
maximum of delta-1 away from the query value. Hamming-Grey
encodings can easily be found by searching through an n-dimensional
space changing one bit at a time, and not changing that bit again
for the next delta codes, and never coming within delta
bits of a code seem more than delta values previously. Example
Hamming-Grey encodings for 16 values are given in
link 3 below.
Given the definition above, the first question that springs to mind is
how large does a Hamming-Grey code have to be to encode m
different values. Unfortunately, this is not a trivial question and
I have been unable to derive an analytical formula, however the coding
is relatively efficient an a table Hamming-Grey code encoding densities
is given in link 4 below.
An interesting reference point in the link 4
table, is that a Hamming-Grey code of length 16 and resolution 5 can be
used to encode atleast 111 values. We can use such an encoding to solve
the percentage problem described in section 4.2. With a resolution of
5, a query can retrieve all values within the 11 percentile interval
x-5..x+5 by limiting the Hamming distance from x to be 5 bits. This
scheme doesn't suffer from the boundary problem 79 is encoded one bit
away from 80, and there are no binning problems 70 is disimilar to 89.
Such a percentage encoding using a Hamming-Grey(16,5) code is given in
link 5 below.
Link 2: Example Grey Codes for m=16
Link 3: Hamming Grey Codes for m=16
Link 4: Table of Hamming-Grey Code Ranges
Link 5: Example Encoding of Percentage Values
5. Example Applications
One potential application is in the selection of "similar" compounds
based on higher-level structural features. For example, it would be
easy to encode fingerprints containing the number of hydrogen bond
donors and acceptors, the molecular weight, the number of rotatable
bonds etc... Then using the rules, given a particular query molecule
select similar molecules that have no more than one or two fewer/more
rotable bonds and/or approximately the same molecular weight purely
by specifying the appropriate Hamming or Tanimoto threshold.
Another potential application is in 3D pharmacophore searching.
Consider encoding all 6 pairwise distances in a four-point pharmacophore
as a single fingerprint. Each distance could be represented as one of
200 values with a resolution of 0.1 Angstroms (giving a 20 Angstrom
maximum). Using either Hamming or Tanimoto similarity it is then
possible to select molecule conformations whose total pairwise error
over all size distances is less than 0.8 Angstroms. Such an encoding
can be achieved with a Hamming-Grey(24,8) encoding, only requiring
18 bytes per small molecule conformation.
6. Acknowledgements
I wish to thank John Bradshaw from Glaxo Wellcome, Stevenage, U.K., and
David Weininger and Jack Delany from Daylight Chemical Information Systems
for the discussions that first described the problem to me and much of
the inspiration that led to the solutions described above.
7. Bibliography
- J.M. Berger, "A Note on Error Detection Codes for Asymmetric
Channels", Information and Control, Vol. 4, pp. 68-73,
1961.
- B. Bose and T.R.N. Rao, "Theory of Unidirectional Error
Correcting/Detecting Codes", IEEE Transactions on Computers,
Vol. C-31, No. 6, pp. 521-530, June 1982.
- R.W. Hamming, "Error Detecting and Error Correcting Codes",
Bell Systems Technology Journal, Vol. 29, pp. 147-160, April
1950.
- D.E. Knuth, "Efficient Balanced Codes", IEEE Transactions
on Information Theory, Vol. IT-32, pp. 51-53, 1986.
- E. Sperner, "Ein Satz uber Untermengen Einer Endlichen Menge",
Math. Z., Vol. 27, pp. 544-548, 1928.
- Tom Verhoeff, "Delay-insensitive Codes: An Overview",
Distributed Computing, Vol 3, pp. 1-8, 1988.
roger@metaphorics.com