Fingerprints are bit arrays (aka "bitmaps"), and were devised to enable high-speed structural screening and similarity measurement.
For example, the molecule OC=CN would generate the following patterns:
0-bond paths: | C | O | N |
1-bond paths: | OC | C=C | CN |
2-bond paths: | OC=C | C=CN | |
3-bond paths: | OC=CN |
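The enumeration above can be sketched in a few lines of Python. This is only an illustration, not the toolkit's implementation: the hand-built graph (`ATOMS`, `BONDS`) and the function `enumerate_paths` are hypothetical names introduced here. A path and its reverse are treated as the same pattern and written in whichever direction sorts first, so OC appears as CO and OC=CN as NC=CO.

```python
from collections import defaultdict

# Hypothetical hand-built graph for OC=CN (not parsed from SMILES):
# atom symbols by index, and bonds as (atom_i, atom_j, bond_symbol).
ATOMS = ["O", "C", "C", "N"]
BONDS = [(0, 1, ""), (1, 2, "="), (2, 3, "")]


def enumerate_paths(atoms, bonds, max_bonds):
    """Return {bond_count: set of pattern strings} for all simple linear paths."""
    neighbors = defaultdict(list)
    for i, j, sym in bonds:
        neighbors[i].append((j, sym))
        neighbors[j].append((i, sym))

    found = defaultdict(set)

    def walk(atom, visited, tokens):
        # tokens alternates: atom symbol, bond symbol, atom symbol, ...
        forward = "".join(tokens)
        backward = "".join(reversed(tokens))
        # Keep a single spelling for a path and its reverse.
        found[len(visited) - 1].add(min(forward, backward))
        if len(visited) - 1 == max_bonds:
            return
        for nxt, sym in neighbors[atom]:
            if nxt not in visited:
                walk(nxt, visited | {nxt}, tokens + [sym, atoms[nxt]])

    for start in range(len(atoms)):
        walk(start, {start}, [atoms[start]])
    return dict(found)


paths = enumerate_paths(ATOMS, BONDS, max_bonds=3)
for n_bonds in sorted(paths):
    print(n_bonds, sorted(paths[n_bonds]))
```

The counts per path length (3, 3, 2, 1) match the listing above; only the direction in which some patterns are written differs.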
fingerprint [-b minbits] [-c maxsize] [-d dens] [-id fpid] [-t TAG] [-x] [-z]
            [-s minstep/maxstep]
            [-m [-mb minbits] [-mt TAG] [-md dens] [-mz]]
            [ in.tdt [ out.tdt ] ]

  in.tdt ....... .tdt file containing $SMI data (default: stdin)
  out.tdt ...... .tdt file with $FPG and FP data added (default: stdout)

standard options:
  -b minbits .. minimum fingerprint size allowed, bits (default: 64)
  -c maxbits .. creation size of fingerprint, bits (default: 2048)
  -d dens ..... density below which fingerprints are folded (default: 0.3)
  -id fpid .... identify this run by `fpid'
  -t TAG ...... use `TAG' instead of `FP' for fingerprint dataitems
  -x .......... generate difference fingerprints (XFP<>)
  -z .......... zap existing FP and $FPG data
  -s min/max .. Compute bits for pathlength in this range (default: 0/7)

options for mixtures:
  -m .......... generate fingerprints for mixture components ("parts")
  -mb minbits . minimum fingerprint size allowed, bits (default: 64)
  -mt TAG ..... use 'TAG' instead of `FPP' for mixture fingerprints
  -md dens .... density below which fingerprints are folded (default: 0.3)
  -mz ......... zap existing FPP data from TDT stream

produces:
  FP<fp;obits;oset;nbits;nset;ver;fpid>
  FPP<part-ntuple;fpid>
[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI
The paths generated for the reactant and product molecules of this reaction would be as follows:
Enumerated fingerprint paths:

Path length | Reactant (count/path) | Product (count/path) |
---|---|---|
0 | 1 I, 1 Na, 3 C, 1 Br | 1 I, 1 Na, 3 C, 1 Br |
1 | 1 C=C, 1 C-C, 1 C-Br | 1 C=C, 1 C-C, 1 C-I |
2 | 1 C=C-C, 1 C-C-Br | 1 C=C-C, 1 C-C-I |
3 | 1 C=C-C-Br | 1 C=C-C-I |

Path length | Difference (count/path) |
---|---|
0 | 0 I, 0 Na, 0 C, 0 Br |
1 | 0 C=C, 0 C-C, 1 C-Br, 1 C-I |
2 | 0 C=C-C, 1 C-C-Br, 1 C-C-I |
3 | 1 C=C-C-Br, 1 C=C-C-I |
After generating the difference in counts, we only use the six paths with non-zero differences to set bits in the difference fingerprint. These are the paths which walk through bonds that change during the reaction. By considering only these paths, we get a fingerprint which reflects the overall bond changes in the reaction.
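The counting step can be illustrated with a short Python sketch. The path counts are copied from the enumeration for this reaction; the names `reactant`, `product` and `difference_paths` are hypothetical, and how the surviving paths are then hashed into bits is not shown.

```python
from collections import Counter

# Path counts for the reactant and product sides of the example reaction.
reactant = Counter({
    "I": 1, "Na": 1, "C": 3, "Br": 1,
    "C=C": 1, "C-C": 1, "C-Br": 1,
    "C=C-C": 1, "C-C-Br": 1,
    "C=C-C-Br": 1,
})
product = Counter({
    "I": 1, "Na": 1, "C": 3, "Br": 1,
    "C=C": 1, "C-C": 1, "C-I": 1,
    "C=C-C": 1, "C-C-I": 1,
    "C=C-C-I": 1,
})


def difference_paths(reactant_counts, product_counts):
    """Return {path: |reactant count - product count|}, non-zero entries only."""
    diff = {}
    for path in set(reactant_counts) | set(product_counts):
        # A Counter returns 0 for a path that is absent on one side.
        delta = abs(reactant_counts[path] - product_counts[path])
        if delta:
            diff[path] = delta
    return diff


changed = difference_paths(reactant, product)
print(sorted(changed))
```

Only the six paths that pass through the C-Br or C-I bond survive; the atoms and the unchanged C=C and C-C paths cancel out.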
$D<FPP>
$SMI<"[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI">
FPP<..2..2...E2.2,..+.06...+..2,G60.EoH0e+o.2,..+U1A...+U.2,1.1.ME......2,vA5U.qPXXFw.2;>
|
_V<"Component fingerprints;Component fingerprints/ID">
_B<"FPP;FPP/ID">
_N<"PART_NTUPLE 1;">
_P<"*;">
_S<Component fingerprints>
_M<System>
_O<Daylight Chemical Information Systems Inc.>
|
Symbol | Definition | Description |
---|---|---|
bits(F) | A function that returns the number of "1" bits in a bitmap | |
BT | The total number of bits (the fingerprint's size); a constant | |
B1 | bits(F1) | The number of 1's in F1 |
B2 | bits(F2) | The number of 1's in F2 |
BC | bits( F1 AND F2 ) | The number of 1's in common between F1 and F2 |
BI | bits(F1 XOR (NOT F2)) | The number of identical bits (1's and 0's) between F1 and F2 |
BU1 | bits(F1 AND (NOT F2)) | The number of unique bits (1's) in F1 |
BU2 | bits(F2 AND (NOT F1)) | The number of unique bits (1's) in F2 |
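As an illustration of these definitions, the sketch below stores each fingerprint as a Python integer, which is an assumption made here for brevity and not the toolkit's representation. The helper names `bits` and `bit_counts` are hypothetical. Because NOT on a Python integer is unbounded, the fingerprint size BT is used to mask results back to BT bits.

```python
def bits(f):
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")


def bit_counts(f1, f2, bt):
    """Compute the quantities from the symbol table for two bt-bit fingerprints."""
    mask = (1 << bt) - 1  # keeps only the low bt bits after a NOT
    return {
        "BT": bt,
        "B1": bits(f1),
        "B2": bits(f2),
        "BC": bits(f1 & f2),
        "BI": bits((f1 ^ ~f2) & mask),
        "BU1": bits(f1 & ~f2 & mask),
        "BU2": bits(f2 & ~f1 & mask),
    }


# Two made-up 8-bit fingerprints, for illustration only.
F1 = 0b11001010
F2 = 0b10011010
counts = bit_counts(F1, F2, bt=8)
print(counts)
```

Two identities make useful sanity checks: B1 = BC + BU1, and BI = BT - BU1 - BU2.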
The distance-as-substructure metric is:
Tversky similarity compares features in a given structure (the "prototype") to features in database structures (the "variants"), with user-specified weighting for each set of features.
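TS = BC / (α × BU1 + β × BU2 + BC)
where, taking F1 as the prototype and F2 as the variant, α is the weighting of the prototype features and β is the weighting of the variant features.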
Example: Setting the weighting of prototype features to 100%
and variant features to 100%, i.e., prototype weight α = 1 and
variant weight β = 1,
produces a symmetrical similarity metric identical to the Tanimoto metric.
Example: Setting the weighting of prototype and variant features
asymmetrically produces a similarity metric in a more-substructural or
more-superstructural sense. Setting the weighting of prototype features
to 100% (α = 1) and variant features to 0% (β = 0)
means that only the prototype features are important, i.e., this produces
a "superstructure-likeness" metric. In this case, a Tversky similarity value
of 1.0 means that all prototype features are represented in the variant,
0.0 that none are.
Example: Setting the weights to 0% prototype (α = 0)
/ 100% variant (β = 1) produces a
"substructure-likeness" metric, where completely embedded structures have
a 1.0 value and "near-substructures" have values near 1.0.
Tversky metrics where the two weightings add up to 100% (1.0) are of special interest (e.g., the 50/50 metric is known as the Dice index).
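The weighting examples can be checked numerically with a small sketch. It assumes the Tversky form BC / (α × BU1 + β × BU2 + BC), with F1 as the prototype and F2 as the variant, and stores fingerprints as Python integers. The function name `tversky`, the zero-denominator convention, and the sample bitmaps are all hypothetical.

```python
def tversky(f1, f2, alpha, beta):
    """Tversky similarity of prototype f1 and variant f2 (integer bitmaps)."""
    bc = bin(f1 & f2).count("1")    # bits in common
    bu1 = bin(f1 & ~f2).count("1")  # bits unique to the prototype
    bu2 = bin(f2 & ~f1).count("1")  # bits unique to the variant
    denominator = alpha * bu1 + beta * bu2 + bc
    if denominator == 0:
        return 0.0  # assumed convention when nothing is left to compare
    return bc / denominator


# Made-up bitmaps: every prototype bit is also set in the variant.
prototype = 0b00001110
variant = 0b01101111

print(tversky(prototype, variant, 1, 1))      # Tanimoto: 0.5
print(tversky(prototype, variant, 1, 0))      # prototype features only: 1.0
print(tversky(prototype, variant, 0, 1))      # variant features only: 0.5
print(tversky(prototype, variant, 0.5, 0.5))  # Dice index
```

With α = β = 1 the denominator equals B1 + B2 - BC, and with α = β = 0.5 the value equals 2 × BC / (B1 + B2), which is how the Tanimoto and Dice metrics fall out as special cases.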