Daylight Summer School 2001, June 5-7, Santa Fe, NM

Daylight Fingerprints

The fingerprint program is included in both the Database Package and the Cluster Package, and there is also a Fingerprint Toolkit.

Fingerprints are bit arrays (aka "bitmaps"), and were devised to enable high-speed structural screening and similarity measurement.


Fingerprint program syntax/options:

fingerprint [-b minbits] [-c maxsize] [-d dens] [-id fpid] [-t TAG] [-x]
            [-z] [-s minstep/maxstep]
            [-m [-mb minbits] [-mt TAG] [-md dens] [-mz]]
            [ in.tdt [ out.tdt ] ]

in.tdt ....... .tdt file containing $SMI data (default: stdin)
out.tdt ...... .tdt file with $FPG and FP data added (default: stdout)

standard options:
 -b minbits .. minimum fingerprint size allowed, bits (default: 64)
 -c maxbits .. creation size of fingerprint, bits (default: 2048)
 -d dens ..... density below which fingerprints are folded (default: 0.3)
 -id fpid .... identify this run by `fpid'
 -t TAG ...... use `TAG' instead of `FP' for fingerprint dataitems
 -x .......... generate difference fingerprints (XFP<>)
 -z .......... zap existing FP and $FPG data
 -s min/max .. Compute bits for pathlength in this range (default: 0/7)

options for mixtures:
 -m .......... generate fingerprints for mixture components ("parts")
 -mb minbits . minimum fingerprint size allowed, bits (default: 64)
 -mt TAG ..... use `TAG' instead of `FPP' for mixture fingerprints
 -md dens .... density below which fingerprints are folded (default: 0.3)
 -mz ......... zap existing FPP data from TDT stream

produces: FP<fp;obits;oset;nbits;nset;ver;fpid>
          FPP<part-ntuple;fpid>
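
The -c, -b and -d options read together suggest a folding loop: the fingerprint starts at its creation size and is folded while its bit density is below dens and its size is still above minbits. The Python sketch below illustrates that reading. It assumes that one fold ORs the upper half of the bitmap onto the lower half and that the bitmap is held as a Python integer; the function names are hypothetical, and this is not the fingerprint program's own code.

```python
def popcount(bits):
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(bits).count("1")


def fold_once(bits, nbits):
    """Fold a bitmap of nbits in half by OR-ing the upper half onto the lower half.

    Assumption: this is what "folding" means here; nbits must be even.
    """
    half = nbits // 2
    low = bits & ((1 << half) - 1)   # lower half of the bitmap
    high = bits >> half              # upper half, shifted down
    return low | high, half


def fold(bits, nbits=2048, minbits=64, dens=0.3):
    """Fold until the density reaches dens or the size reaches minbits.

    The default values mirror the documented -c, -b and -d defaults.
    """
    while nbits > minbits and popcount(bits) / nbits < dens:
        bits, nbits = fold_once(bits, nbits)
    return bits, nbits


# A sparse 2048-bit fingerprint with three bits set folds down to minbits.
sparse = (1 << 0) | (1 << 100) | (1 << 2000)
folded, size = fold(sparse)
```

Folding trades resolution for compactness: distinct bits can collide when the halves are ORed, so the folded bit count may be lower than the original.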

Reaction Fingerprints

Structural Reaction Fingerprints - for structural screening =
the fingerprint of the reactant part
+ the fingerprint of the product part
+ the bit-shifted fingerprint of the product part

Difference Fingerprints - reflect atom/bond changes in a reaction =
reactant fingerprint XOR product fingerprint

Example

Sn2 displacement reaction:

[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI

The paths generated for the molecules would be as follows:

Enumerated Fingerprint Paths:

Path Length   Reactant (count/path)          Product (count/path)
0             1 I, 1 Na, 3 C, 1 Br           1 I, 1 Na, 3 C, 1 Br
1             1 C=C, 1 C-C, 1 C-Br           1 C=C, 1 C-C, 1 C-I
2             1 C=C-C, 1 C-C-Br              1 C=C-C, 1 C-C-I
3             1 C=C-C-Br                     1 C=C-C-I

Difference in Path Counts:

Path Length   Difference (count/path)
0             0 I, 0 Na, 0 C, 0 Br
1             0 C=C, 0 C-C, 1 C-Br, 1 C-I
2             0 C=C-C, 1 C-C-Br, 1 C-C-I
3             1 C=C-C-Br, 1 C=C-C-I

After generating the difference in counts, we use only the six paths with non-zero differences to set bits in the difference fingerprint. These are the paths that walk through bonds that change during the reaction. By considering only these paths, we get a fingerprint that reflects the overall bond changes in the reaction.
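
The path-count bookkeeping for the Sn2 example can be reproduced with a short Python sketch. The path counts are copied from the tables above. The helper name is hypothetical, and the step that maps each surviving path to a bit is left out because the hashing scheme is not described here.

```python
from collections import Counter

# Path counts for the reactant and product sides, taken from the example tables.
reactant_paths = Counter({
    "I": 1, "Na": 1, "C": 3, "Br": 1,
    "C=C": 1, "C-C": 1, "C-Br": 1,
    "C=C-C": 1, "C-C-Br": 1,
    "C=C-C-Br": 1,
})
product_paths = Counter({
    "I": 1, "Na": 1, "C": 3, "Br": 1,
    "C=C": 1, "C-C": 1, "C-I": 1,
    "C=C-C": 1, "C-C-I": 1,
    "C=C-C-I": 1,
})


def path_count_difference(reactant, product):
    """Return {path: |count difference|} for paths whose counts differ.

    A Counter returns 0 for a missing path, so a path present on only one
    side of the reaction shows up with its full count.
    """
    all_paths = set(reactant) | set(product)
    return {
        path: abs(reactant[path] - product[path])
        for path in all_paths
        if reactant[path] != product[path]
    }


changed = path_count_difference(reactant_paths, product_paths)
for path in sorted(changed):
    print(path, changed[path])
```

Only the paths in `changed` would go on to set bits in the difference fingerprint; the unchanged atoms and the C=C, C-C and C=C-C paths drop out.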


Mixture Fingerprints - part-tuple fingerprints


Comparing Fingerprints

Three Similarity Metrics: Tanimoto, Euclidean, and Tversky

Terms:

Symbol    Definition               Description
bits(F)   -                        A function that returns the number of "1" bits in a bitmap
BT        -                        The total number of bits (the fingerprint's size); a constant
B1        bits(F1)                 The number of 1's in F1
B2        bits(F2)                 The number of 1's in F2
BC        bits(F1 AND F2)          The number of 1's in common between F1 and F2
BI        bits(F1 XOR (NOT F2))    The number of identical bits (1's and 0's) between F1 and F2
BU1       bits(F1 AND (NOT F2))    The number of unique bits (1's) in F1
BU2       bits(F2 AND (NOT F1))    The number of unique bits (1's) in F2

Tanimoto Coefficient
The number of bits in common divided by the total number of bits that could be in common. Scale: 1.0 = identical fingerprints, 0.7 = highly similar, 0.5 = roughly similar.

TC = BC / (B1 + B2 - BC)
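
As a worked illustration of the TC formula, the Python sketch below stores each fingerprint as an integer bitmap. The function names and the two 8-bit example fingerprints are made up for the illustration, and returning 1.0 for two empty fingerprints is an assumption, since the formula is undefined in that case.

```python
def bits(f):
    """Number of 1 bits in a non-negative integer bitmap (the bits(F) term)."""
    return bin(f).count("1")


def tanimoto(f1, f2):
    """TC = BC / (B1 + B2 - BC) for two integer bitmaps."""
    bc = bits(f1 & f2)                  # BC: 1's in common
    denom = bits(f1) + bits(f2) - bc    # B1 + B2 - BC
    if denom == 0:
        return 1.0  # assumption: two all-zero fingerprints count as identical
    return bc / denom


# Hypothetical 8-bit fingerprints: B1 = 4, B2 = 4, BC = 2.
f1 = 0b11110000
f2 = 0b11001100
tc = tanimoto(f1, f2)   # 2 / (4 + 4 - 2)
```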

Euclidean Distance
A measure of the geometric distance between two fingerprints. Scale: 0.0 = identical fingerprints, 0.3 = highly similar, 0.5 = roughly similar.

DE(F1,F2) = (BT - BI) / BT

The distance-as-substructure metric is:

DSE(F1,F2) = (B1 - bits(F1 AND F2)) / B1
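
The two distance formulas above can be checked with a small Python sketch that holds fingerprints as integer bitmaps. The function names and example values are hypothetical. Because Python integers have no fixed width, NOT F2 is masked to BT bits, and returning 0.0 from the substructure distance when F1 is empty is an assumption.

```python
def bits(f):
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")


def euclidean_distance(f1, f2, bt):
    """DE(F1,F2) = (BT - BI) / BT, with BI = bits(F1 XOR (NOT F2))."""
    mask = (1 << bt) - 1
    bi = bits((f1 ^ ~f2) & mask)   # identical positions, limited to BT bits
    return (bt - bi) / bt


def substructure_distance(f1, f2):
    """DSE(F1,F2) = (B1 - bits(F1 AND F2)) / B1."""
    b1 = bits(f1)
    if b1 == 0:
        return 0.0  # assumption: an empty F1 is trivially contained in F2
    return (b1 - bits(f1 & f2)) / b1


# Hypothetical 8-bit fingerprints that differ in 4 of 8 positions.
f1 = 0b11110000
f2 = 0b11001100
de = euclidean_distance(f1, f2, bt=8)
dse = substructure_distance(f1, f2)
```

DSE is asymmetric: it is 0.0 whenever every bit set in F1 is also set in F2, regardless of how many extra bits F2 has.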

Tversky Similarity
For a complete description of Tversky similarity see John Bradshaw's MUG '97 presentation, "Introduction to Tversky similarity measure".

Tversky similarity compares features in a given structure (the "prototype") to features in database structures (the "variants"), with a user-specified weighting for each set of features: alpha for the prototype features (F1) and beta for the variant features (F2).

TS = BC / (alpha*BU1 + beta*BU2 + BC)

Example: Setting the weighting of prototype features to 100% and variant features to 100%, i.e., alpha=1, beta=1, produces a symmetrical similarity metric identical to the Tanimoto metric.

Example: Setting the weighting of prototype and variant features asymmetrically produces a similarity metric in a more-substructural or more-superstructural sense. Setting the weighting of prototype features to 100% (alpha=1) and variant features to 0% (beta=0) means that only the prototype features are important, i.e., this produces a "superstructure-likeness" metric. In this case, a Tversky similarity value of 1.0 means that all prototype features are represented in the variant, 0.0 that none are.

Example: Setting the weights to 0% prototype (alpha=0) / 100% variant (beta=1) produces a "substructure-likeness" metric, where completely embedded structures have a 1.0 value and "near-substructures" have values near 1.0.

Tversky metrics where the two weightings add up to 100% (1.0) are of special interest (e.g., the 50/50 metric is known as the Dice index).
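
The special cases described in the examples can be verified numerically. In the Python sketch below, the prototype weight is called alpha and the variant weight beta, and the weights are assumed to multiply BU1 and BU2 in the denominator, which is consistent with the Tanimoto and Dice cases mentioned above. The function names and bitmaps are hypothetical.

```python
def bits(f):
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")


def tversky(prototype, variant, alpha, beta):
    """TS = BC / (alpha*BU1 + beta*BU2 + BC) for two integer bitmaps."""
    bc = bits(prototype & variant)     # BC: bits in common
    bu1 = bits(prototype & ~variant)   # BU1: bits unique to the prototype
    bu2 = bits(variant & ~prototype)   # BU2: bits unique to the variant
    denom = alpha * bu1 + beta * bu2 + bc
    if denom == 0:
        return 1.0  # assumption: nothing to compare is treated as a full match
    return bc / denom


# Hypothetical 8-bit fingerprints: BC = 2, BU1 = 2, BU2 = 2.
proto = 0b11110000
var = 0b11001100

as_tanimoto = tversky(proto, var, 1.0, 1.0)   # same value as BC / (B1 + B2 - BC)
as_dice = tversky(proto, var, 0.5, 0.5)       # same value as 2*BC / (B1 + B2)

# With alpha=1, beta=0 the score is 1.0 when every prototype bit is in the variant.
embedded = tversky(0b1100, 0b1110, 1.0, 0.0)
```

Swapping the weights to alpha=0, beta=1 scores the same pair by how much of the variant is contained in the prototype instead.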


Daylight Chemical Information Systems Inc.