Daylight Fingerprints-
The fingerprint program is included in the Database Package and the Cluster
Package, and there is also a Fingerprint Toolkit.
Fingerprints were devised to enable high-speed structural screening.
Daylight Molecular Fingerprints contain:
-
a pattern for each atom
-
a pattern representing each atom and its nearest neighbors (plus the bonds
that join them)
-
a pattern representing each group of atoms and bonds connected by paths
up to 2 bonds long
-
... atoms and bonds connected by paths up to 3 bonds long
-
... continuing, with paths up to 4, 5, 6, and 7 bonds long. The default
maximum path length is 7 bonds; it can be set as high as 31.
Example:
the molecule OC=CN
would generate the following patterns:
0-bond paths: | C | O | N |
1-bond paths: | OC | C=C | CN |
2-bond paths: | OC=C | C=CN |
3-bond paths: | OC=CN |
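The enumeration in this example can be sketched in a few lines of Python. This is only an illustration, not Daylight's implementation: the graph for OC=CN is built by hand, the names ATOMS, BONDS, path_string and enumerate_paths are hypothetical, and only linear paths are walked. A path and its reverse count as one pattern, so the sketch keeps whichever spelling sorts first (it reports "CO" where the table above writes "OC").

```python
from collections import defaultdict

# Hand-built graph for OC=CN (an assumption for this sketch, not parsed SMILES).
ATOMS = ["O", "C", "C", "N"]
BONDS = {(0, 1): "", (1, 2): "=", (2, 3): ""}  # "" = single bond, "=" = double


def neighbors(atom):
    """Yield the index of every atom bonded to `atom`."""
    for a, b in BONDS:
        if a == atom:
            yield b
        elif b == atom:
            yield a


def render(path):
    """Spell a path (list of atom indices) as atoms joined by bond symbols."""
    text = ATOMS[path[0]]
    for prev, cur in zip(path, path[1:]):
        symbol = BONDS[(prev, cur)] if (prev, cur) in BONDS else BONDS[(cur, prev)]
        text += symbol + ATOMS[cur]
    return text


def path_string(path):
    """A path and its reverse are the same pattern; keep the smaller spelling."""
    return min(render(path), render(path[::-1]))


def enumerate_paths(max_bonds=7):
    """Return {number of bonds: set of patterns} for all linear paths."""
    found = defaultdict(set)

    def walk(path):
        found[len(path) - 1].add(path_string(path))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors(path[-1]):
            if nxt not in path:  # never revisit an atom within one path
                walk(path + [nxt])

    for start in range(len(ATOMS)):
        walk([start])
    return dict(found)


paths = enumerate_paths()
for n in sorted(paths):
    print(f"{n}-bond paths:", sorted(paths[n]))
```

Because the molecule has only three bonds, the walk stops at 3-bond paths even though the limit is 7.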
-
Each pattern produces a set of bits (typically 4 or 5 bits per pattern), which
is added to the fingerprint.
-
If a pattern is a substructure of a molecule, every bit that is
set in the pattern's fingerprint will be set in the molecule's fingerprint.
-
Fingerprints can be of variable length (folded) to increase the information
density and decrease the size. Folding saves storage without creating false
negative results (see Fingerprint Density).
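The three properties above (bits per pattern, the substructure guarantee, and folding) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the SHA-256 hash stands in for Daylight's own pattern-to-bits mapping, which this document does not describe, and the stopping rule in fold_to_density (fold until a target density or a minimum size is reached) is one plausible reading of the density and size options, not a confirmed algorithm. All function names are hypothetical.

```python
import hashlib


def pattern_bits(pattern, size, bits_per_pattern=4):
    """Map a pattern to a few bit positions (stand-in hash, not Daylight's)."""
    digest = hashlib.sha256(pattern.encode("utf-8")).digest()
    return {
        int.from_bytes(digest[4 * i:4 * i + 4], "big") % size
        for i in range(bits_per_pattern)
    }


def fingerprint(patterns, size=1024):
    """OR the bits of every pattern into one integer used as a bitmap."""
    fp = 0
    for pattern in patterns:
        for pos in pattern_bits(pattern, size):
            fp |= 1 << pos
    return fp


def density(fp, size):
    """Fraction of bits that are set."""
    return bin(fp).count("1") / size


def fold(fp, size):
    """Halve the fingerprint by OR-ing its upper half onto its lower half."""
    half = size // 2
    return (fp & ((1 << half) - 1)) | (fp >> half), half


def fold_to_density(fp, size, min_density=0.3, min_size=64):
    """Assumed rule: fold until the density target or the minimum size is hit."""
    while size > min_size and density(fp, size) < min_density:
        fp, size = fold(fp, size)
    return fp, size


def passes_screen(query_fp, mol_fp):
    """Every bit set for the query must also be set for the molecule."""
    return (query_fp & mol_fp) == query_fp


# Patterns of OC=CN and of its substructure OC=C, taken from the example above.
molecule = fingerprint(["C", "O", "N", "OC", "C=C", "CN", "OC=C", "C=CN", "OC=CN"])
query = fingerprint(["C", "O", "OC", "C=C", "OC=C"])

folded_mol, new_size = fold_to_density(molecule, 1024)
folded_query, query_size = query, 1024
while query_size > new_size:  # fold the query to the same size as the molecule
    folded_query, query_size = fold(folded_query, query_size)

print(passes_screen(query, molecule), passes_screen(folded_query, folded_mol))
```

Folding only ORs bits together, so a bit that passed the screen before folding still passes afterwards. That is why folding can add false positives but never false negatives.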
Reaction Fingerprints
-
Structural Reaction Fingerprints-For Superstructural Matching
-
the fingerprint of the reactant part
-
the fingerprint of the product part
-
the bit-shifted fingerprint of the product part
-
Difference Fingerprints- Reflects bond changes in a Reaction
-
count of each path in the reactant
-
count of each path in the product
-
subtract the counts of a given path
-
if the difference is non-zero, then a bit is set in the difference fingerprint
-
if == 0, then no bit is set in the difference fingerprint
Example
Sn2 displacement reaction:
[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI
The paths generated for the molecules would be as follows:
Enumerated Fingerprint Paths:
Path Length: | Reactant (count/path): | Product (count/path): |
0 | 1 I, 1 Na, 3 C, 1 Br | 1 I, 1 Na, 3 C, 1 Br |
1 | 1 C=C, 1 C-C, 1 C-Br | 1 C=C, 1 C-C, 1 C-I |
2 | 1 C=C-C, 1 C-C-Br | 1 C=C-C, 1 C-C-I |
3 | 1 C=C-C-Br | 1 C=C-C-I |
Difference in Path Counts:
Path Length: | Difference (count/path): |
0 | 0 I, 0 Na, 0 C, 0 Br |
1 | 0 C=C, 0 C-C, 1 C-Br, 1 C-I |
2 | 0 C=C-C, 1 C-C-Br, 1 C-C-I |
3 | 1 C=C-C-Br, 1 C=C-C-I |
After generating the difference in counts, we only use the
six paths with non-zero differences to set bits in the difference fingerprint.
These are the paths which walk through bonds that change during the reaction.
By considering only these paths, we get a fingerprint which reflects the
overall bond changes in the reaction.
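The counting and subtraction steps for this example can be reproduced with a short Python sketch. The path counts are copied by hand from the tables above rather than enumerated from the reaction SMILES, and the function name changed_paths is hypothetical.

```python
from collections import Counter

# Path counts copied from the reactant and product columns above.
reactant = Counter({
    "I": 1, "Na": 1, "C": 3, "Br": 1,
    "C=C": 1, "C-C": 1, "C-Br": 1,
    "C=C-C": 1, "C-C-Br": 1,
    "C=C-C-Br": 1,
})
product = Counter({
    "I": 1, "Na": 1, "C": 3, "Br": 1,
    "C=C": 1, "C-C": 1, "C-I": 1,
    "C=C-C": 1, "C-C-I": 1,
    "C=C-C-I": 1,
})


def changed_paths(reactant_counts, product_counts):
    """Return the paths whose counts differ between reactant and product."""
    every_path = set(reactant_counts) | set(product_counts)
    # A Counter returns 0 for a missing path, so one-sided paths show up too.
    return sorted(p for p in every_path if reactant_counts[p] != product_counts[p])


difference = changed_paths(reactant, product)
print(len(difference), difference)
```

Only the paths in `difference` would go on to set bits in the difference fingerprint; the unchanged paths contribute nothing.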
Mixture Fingerprints-Fingerprint tuples
-
Mixtures stored as Dot Disconnected SMILES are fingerprinted
-
Each component is fingerprinted
-
The FPP datatype stores the resulting combination of component fingerprints
Example:
$SMI<CCC(C)C(N)C(=O)NCC(=O)NC(CO)C(=O)O.CCC(C)C(N)C(=O)NCC(=O)NC(CCCCN)C(=O)O.
CCC(C)C(N)C(=O)NCC(=O)NC(CCSC)C(=O)O.CCC(C)C(N)C(=O)NCC(=O)NC(CC(C)C)C(=O)O.
CCC(C)C(N)C(=O)NCC(=O)NC(Cc1c[nH]cn1)C(=O)O....> FPP<63,59,60,58,56,61,57,62,7,3,4,2,0,5,1,6,39,35,36,34,32,37,33,38,15,11,12,
10,8,13,9,47,43,44,42,40,45,41,55,51,52,50,48,53,49,14,46,54,23,19,20,18,16,21,17,31,27,28,26,24,29,25
,22,30;....E..kcb6Aoe6aF87,0,rr68W,EW0Y.aVYC0J8UQAAedM7.67,VSA,6f.N,FEInJ0Q6ZmUiNZo4kmHJCM0.,CI6...>
$D<FPP>
_V<"Component FP indicies;Component fingerprints;FPP ID">
_B<"FPP/nos;FPP/fps;FPP/id">
_N<"PART_NTUPLE 1;BINARY;">
_P<"*;*;">
_S<"Component fingerprint indicies;Component fingerprints;FPP ID">
_M<System>
_O<Daylight Chemical Information Systems Inc.>
Fingerprint options:
-
Minimum/Maximum Size (power of 2, typically 1024 for small molecules)
-
Density (0.3)
-
Minimum/Maximum Path length
-
Difference Fingerprints
-
Mixture Fingerprints
Comparing Fingerprints-
Three Similarity Metrics
-
Tanimoto Coefficient
-
Euclidean Distance
-
Tversky Similarity
Symbol | Definition | Description |
bits(F) | | A function that returns the number of "1" bits in a bitmap |
BT | | The total number of bits (the fingerprint's size); a constant |
B1 = | bits(F1) | The number of 1's in F1 |
B2 = | bits(F2) | The number of 1's in F2 |
BC = | bits( F1 AND F2 ) | The number of 1's in common between F1 and F2 |
BI = | bits(F1 XOR (NOT F2)) | The number of identical bits (1's and 0's) between F1 and F2 |
BU1 = | bits(F1 AND (NOT F2)) | The number of unique bits (1's) in F1 |
BU2 = | bits(F2 AND (NOT F1)) | The number of unique bits (1's) in F2 |
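The quantities in this table translate directly into bit operations. The sketch below assumes fingerprints are held as non-negative Python integers of BT bits; the function name bit_counts and the two 8-bit sample values are made up for illustration.

```python
def bits(f):
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")


def bit_counts(f1, f2, bt):
    """Compute the table's quantities for two fingerprints of bt bits each."""
    mask = (1 << bt) - 1  # keeps NOT within the fingerprint's size
    return {
        "BT": bt,
        "B1": bits(f1),
        "B2": bits(f2),
        "BC": bits(f1 & f2),
        "BI": bits(f1 ^ (~f2 & mask)),
        "BU1": bits(f1 & ~f2 & mask),
        "BU2": bits(f2 & ~f1 & mask),
    }


# Two made-up 8-bit fingerprints.
F1 = 0b11001010
F2 = 0b10101010
counts = bit_counts(F1, F2, 8)
print(counts)
```

The mask matters because Python integers have no fixed width: without it, NOT F2 would be a negative number rather than an 8-bit complement.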
Tanimoto Coefficient- the number of bits in common divided by the number
of bits set in either fingerprint. Scale: 1.0 = identical fingerprints,
0.7 = highly similar, 0.5 = roughly similar
T(F1,F2) = BC / (B1 + B2 - BC) |
Euclidean distance- a measure of the geometric distance between
two fingerprints. Scale: 0.0 = identical fingerprints, 0.3 = highly similar,
0.5 = roughly similar
DE(F1,F2) = (BT - BI) / BT |
The distance-as-substructure metric is:
DSE(F1,F2) = (B1 - bits(F1 AND F2)) / B1 |
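As a rough check of the scales, the metrics can be computed for two small bitmaps. The sketch below uses the Euclidean and distance-as-substructure formulas exactly as written above, and takes the Tanimoto coefficient as BC / (B1 + B2 - BC), the usual reading of "bits in common over bits that could be in common". It assumes fingerprints are non-negative Python integers with at least one bit set; the function names and sample values are hypothetical.

```python
def bits(f):
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")


def tanimoto(f1, f2):
    """BC / (B1 + B2 - BC): common bits over bits set in either fingerprint."""
    bc = bits(f1 & f2)
    return bc / (bits(f1) + bits(f2) - bc)


def euclidean_distance(f1, f2, bt):
    """DE = (BT - BI) / BT, as written above."""
    mask = (1 << bt) - 1
    bi = bits(f1 ^ (~f2 & mask))  # identical bits, both 1's and 0's
    return (bt - bi) / bt


def substructure_distance(f1, f2):
    """DSE = (B1 - bits(F1 AND F2)) / B1: share of F1's bits missing from F2."""
    b1 = bits(f1)
    return (b1 - bits(f1 & f2)) / b1


# Two made-up 8-bit fingerprints.
F1 = 0b11001010
F2 = 0b10101010
print(tanimoto(F1, F2), euclidean_distance(F1, F2, 8), substructure_distance(F1, F2))
```

Note that the Tanimoto coefficient and the substructure distance ignore BT, while the Euclidean distance counts shared 0's as agreement and so depends on the fingerprint size.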
Tversky Similarity-
For a complete description of Tversky similarity see John Bradshaw's
MUG '97 presentation, "Introduction
to Tversky similarity measure".
-
Tversky similarity compares features in a given structure (the "prototype")
to features in database structures (the "variants"), with a user-specified
weighting for each set of features: α for the prototype features and β for
the variant features. With F1 as the prototype and F2 as the variant:
TS = BC / (α BU1 + β BU2 + BC) |
Example: Setting the weighting of prototype features to 100%
and variant features to 100%, i.e. α = 1, β = 1,
produces a symmetrical similarity metric identical to the Tanimoto metric.
Example: Setting the weighting of prototype and variant features
asymmetrically produces a similarity metric in a more-substructural or
more-superstructural sense. Setting the weighting of prototype features
to 100% (α = 1) and variant features to 0% (β = 0)
means that only the prototype features are important, i.e., this produces
a "superstructure-likeness" metric. In this case, a Tversky similarity value
of 1.0 means that all prototype features are represented in the variant,
0.0 that none are.
Example: Setting the weights to 0% prototype (α = 0)
/ 100% variant (β = 1) produces a
"substructure-likeness" metric, where completely embedded structures have
a 1.0 value and "near-substructures" have values near 1.0.
-
Tversky metrics where the two weightings add up to 100% (1.0) are
of special interest (e.g., the 50/50 metric is known as the Dice index).
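The special cases described above can be checked numerically. In this sketch the prototype and variant weightings are written as alpha and beta, the first fingerprint is taken to be the prototype, and a zero denominator returns 0.0; that last convention is an assumption of the sketch, not something the source specifies. The function names and sample bitmaps are hypothetical.

```python
def bits(f):
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")


def tversky(prototype, variant, alpha, beta):
    """TS = BC / (alpha * BU1 + beta * BU2 + BC), prototype taken as F1."""
    bc = bits(prototype & variant)
    bu1 = bits(prototype & ~variant)  # bits unique to the prototype
    bu2 = bits(variant & ~prototype)  # bits unique to the variant
    denominator = alpha * bu1 + beta * bu2 + bc
    if denominator == 0:
        return 0.0  # convention chosen for this sketch only
    return bc / denominator


F1 = 0b11001010
F2 = 0b10101010
as_tanimoto = tversky(F1, F2, 1, 1)   # symmetric, same value as Tanimoto
as_dice = tversky(F1, F2, 0.5, 0.5)   # 50/50 weighting, the Dice index

# Every bit of `small` is also set in `large`.
small = 0b00001110
large = 0b01111110
superstructure_like = tversky(small, large, 1, 0)  # only prototype bits count
substructure_like = tversky(large, small, 0, 1)    # only variant bits count
print(as_tanimoto, as_dice, superstructure_like, substructure_like)
```

Both asymmetric calls return 1.0 here because the fingerprint carrying the full weight is completely contained in the other one.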
Daylight Chemical Information Systems Inc.