Cluster Package
The Daylight Cluster Package generates clusters based on Daylight
fingerprints, Tanimoto similarity, and a non-hierarchical clustering algorithm, Jarvis-Patrick.
Motivation:
-
Automatically group compounds into structurally
related families. The result is a set of clusters.
-
Determine the degree of clustering of structures
in a large set in order to characterize that set. The result is a set of
clustering statistics.
-
Select a limited number of compounds to represent
all structural classes in the total set.
-
Locate unique or unusual compounds in a set. The
result is a list of compounds that don't cluster.
-
Provide a very fast and compact form of similarity
searching (i.e. given one structure, find all members of its cluster).
Methodology:
-
Characterize substructural content as generally
as possible. The basic descriptor is the Daylight fingerprint.
-
Establish intermolecular similarity based on substructural
characterization. The similarity measure is Tanimoto.
-
Perform non-parametric clustering based on similarity. The clustering
algorithm is Jarvis-Patrick.
-
Postprocess the resulting data for improved usability: cluster
statistics.
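The Tanimoto measure named above can be sketched in a few lines. This is a minimal illustration and not Daylight's implementation: it assumes a fingerprint is held as a Python integer used as a bit set, and the function names `popcount` and `tanimoto` are made up for this example. The similarity is the number of bits set in both fingerprints divided by the number of bits set in either.

```python
def popcount(x: int) -> int:
    """Count the bits that are set in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bit sets."""
    union = popcount(fp_a | fp_b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as dissimilar.
        return 0.0
    return popcount(fp_a & fp_b) / union


# Toy 6-bit fingerprints (hypothetical data, far shorter than real ones).
a = 0b101101
b = 0b100111
similarity = tanimoto(a, b)  # 3 shared bits out of 5 set in either one
print(similarity)
```

Identical fingerprints give 1.0 and fingerprints with no bits in common give 0.0.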
Jarvis-Patrick clustering
-
For each item, find its J nearest neighbors. This needs to be done only
once. (nearneighbors)
-
Two structures cluster together if: (jarpat)
(a) They are in each other's list of J nearest
neighbors, and
(b) At least K of their J nearest neighbors are in
common.
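The two conditions above can be sketched as follows. This is an illustrative version under stated assumptions, not the jarpat source: neighbor lists are assumed to be precomputed and to exclude the item itself, "K in common" is read as "at least K", and pairs that satisfy both conditions are merged transitively with a small union-find. The function name `jarvis_patrick` and its arguments are hypothetical.

```python
from typing import List, Sequence


def jarvis_patrick(neighbors: Sequence[Sequence[int]], j: int, k: int) -> List[int]:
    """Return one cluster label per item.

    neighbors[i] lists the indices of item i's nearest neighbors, nearest
    first, not including i itself (an assumption of this sketch).
    """
    n = len(neighbors)
    near = [set(lst[:j]) for lst in neighbors]  # keep only the J nearest
    parent = list(range(n))                     # union-find forest

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x

    for a in range(n):
        for b in near[a]:
            # (a) mutual neighbors, (b) at least K shared neighbors
            if b > a and a in near[b] and len(near[a] & near[b]) >= k:
                parent[find(b)] = find(a)

    return [find(i) for i in range(n)]


# Toy data: items 0-2 are mutual neighbors; 3 and 4 share no neighbors.
toy = [[1, 2], [0, 2], [0, 1], [4, 0], [3, 1]]
labels = jarvis_patrick(toy, j=2, k=1)
print(labels)
```

Items that end up alone under their label are the singletons. Because the neighbor lists are computed once, rerunning with different J and K values only repeats the cheap merging step.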
Cluster statistics (listclusters, showclusters)
-
number of clusters
-
number and percentage of structures clustered
-
frequency distribution of clusters by cluster size.
-
the amount of variance of the cluster that is unexplained by each member.
-
If n is the number of members of a cluster and Tij is the Tanimoto metric
between members i and j, the statistic for member i is:
-
centroid, a representative structure of the group.
-
singletons, structures which are not assigned to a cluster
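The counting statistics in this list can be derived from a plain list of cluster labels. The sketch below is a hypothetical helper (`cluster_summary` is not a Daylight program) and assumes one label per structure, with a singleton being any label that occurs exactly once. The per-member variance statistic and the centroid are left out because they need the pairwise Tanimoto values.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def cluster_summary(labels: Sequence[Hashable]) -> Dict[str, object]:
    """Summarize a clustering given one cluster label per structure."""
    sizes = Counter(labels)          # label -> number of members
    freq = Counter(sizes.values())   # group size -> number of groups
    singletons = freq.get(1, 0)
    clusters = sum(count for size, count in freq.items() if size > 1)
    clustered = len(labels) - singletons
    return {
        "clusters": clusters,
        "singletons": singletons,
        "clustered": clustered,
        "percent_clustered": 100.0 * clustered / len(labels) if labels else 0.0,
        "average_size": clustered / clusters if clusters else 0.0,
        "largest": max((s for s in freq if s > 1), default=0),
        "size_frequencies": dict(sorted(freq.items())),
    }


# Toy labels: one cluster of 3, one cluster of 2, two singletons.
summary = cluster_summary([0, 0, 0, 3, 4, 5, 5])
print(summary)
```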
Included programs and clustering procedure:
-
smi2tdt -- create file of Thor datatrees
from SMILES
-
fingerprint -- characterize substructural
content
-
nearneighbors -- find nearest neighbors
by comparing fingerprints
-
(mergeneighbors -- combine nearest-neighbor
lists)
-
jpscan -- tabulate J-P clustering results
with varying parameters
-
jarpat -- Jarvis-Patrick clustering
-
showclusters -- process cluster data for
textual display, ASCII output
-
listclusters -- sort and reformat clustering
data for further processing, TDT output
Example
% smi2tdt -t '$CAS' smicas.smi > smicas.tdt
% fingerprint -id 7DEC smicas.tdt > fingers.tdt
% nearneighbors -FID 7DEC fingers.tdt > neighbors.tdt
% jpscan neighbors.tdt > /dev/printer
% jarpat -JP_NEED 8 -JP_NEAR 14 -NNID 7DEC neighbors.tdt > clusters.tdt
% showclusters -h -q -v clusters.tdt | more
% showclusters -h -q -x clusters.tdt > /dev/printer
Example jpscan output (for 11 nearest neighbors):
Program ......... jpscan
Version ......... Daylight Software Release 4.51
Function ........ scan Jarvis-Patrick clustering parameters
Input ........... NN (nearest neighbors) data
NN data set ..... na
created by ... nearneighbors
version ...... 4.51
from ......... FP
with params .. 16
Trees read in ... 1999
Clustered by .... standard Jarvis-Patrick method
NUMBER OF STRUCTURES CLUSTERED
------- NEED ---------------------------------------------------------
2 3 4 5 6 7 8 9 10 11
NEAR ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
2: 980 - - - - - - - - -
3: 1338 779 - - - - - - - -
4: 1506 1186 663 - - - - - - -
5: 1633 1424 1072 579 - - - - - -
6: 1705 1563 1332 965 525 - - - - -
7: 1749 1648 1459 1209 877 467 - - - -
8: 1788 1707 1571 1381 1120 796 394 - - -
9: 1823 1761 1657 1503 1288 1057 741 368 - -
10: 1850 1797 1715 1582 1401 1208 972 679 344 -
11: 1866 1823 1748 1639 1502 1326 1134 898 596 319
PERCENTAGE OF STRUCTURES CLUSTERED
------- NEED ---------------------------------------------------------
2 3 4 5 6 7 8 9 10 11
NEAR ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
2: 49.02 - - - - - - - - -
3: 66.93 38.96 - - - - - - - -
4: 75.33 59.32 33.16 - - - - - - -
5: 81.69 71.23 53.62 28.96 - - - - - -
6: 85.29 78.18 66.63 48.27 26.26 - - - - -
7: 87.49 82.44 72.98 60.48 43.87 23.36 - - - -
8: 89.44 85.39 78.58 69.08 56.02 39.81 19.70 - - -
9: 91.19 88.09 82.89 75.18 64.43 52.87 37.06 18.40 - -
10: 92.54 89.89 85.79 79.13 70.08 60.43 48.62 33.96 17.20 -
11: 93.34 91.19 87.44 81.99 75.13 66.33 56.72 44.92 29.81 15.95
NUMBER OF CLUSTERS
------- NEED ---------------------------------------------------------
2 3 4 5 6 7 8 9 10 11
NEAR ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
2: 490 - - - - - - - - -
3: 466 333 - - - - - - - -
4: 365 378 264 - - - - - - -
5: 280 333 317 220 - - - - - -
6: 209 278 318 273 193 - - - - -
7: 154 200 267 271 241 172 - - - -
8: 121 142 213 259 246 221 145 - - -
9: 94 118 164 223 240 240 202 130 - -
10: 71 92 133 188 226 234 226 184 136 -
11: 61 78 107 147 194 219 232 209 171 127
AVERAGE CLUSTER SIZE
------- NEED ---------------------------------------------------------
2 3 4 5 6 7 8 9 10 11
NEAR ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
2: 2.0 - - - - - - - - -
3: 2.8 2.3 - - - - - - - -
4: 4.1 3.1 2.5 - - - - - - -
5: 5.8 4.2 3.3 2.6 - - - - - -
6: 8.1 5.6 4.1 3.5 2.7 - - - - -
7: 11.3 8.2 5.4 4.4 3.6 2.7 - - - -
8: 14.7 12.0 7.3 5.3 4.5 3.6 2.7 - - -
9: 19.3 14.9 10.1 6.7 5.3 4.4 3.6 2.8 - -
10: 26.0 19.5 12.8 8.4 6.1 5.1 4.3 3.6 2.5 -
11: 30.5 23.3 16.3 11.1 7.7 6.0 4.8 4.2 3.4 2.5
SIZE OF LARGEST CLUSTER
------- NEED ---------------------------------------------------------
2 3 4 5 6 7 8 9 10 11
NEAR ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
2: 3 - - - - - - - - -
3: 9 4 - - - - - - - -
4: 31 11 5 - - - - - - -
5: 79 15 10 6 - - - - - -
6: 371 106 17 9 7 - - - - -
7: 946 204 27 16 11 8 - - - -
8: 1114 829 133 34 16 12 9 - - -
9: 1264 1009 460 58 23 18 13 10 - -
10: 1337 1177 873 123 35 21 18 15 11 -
11: 1459 1239 1018 306 60 33 23 18 15 11
NUMBER OF SINGLETONS
------- NEED ---------------------------------------------------------
2 3 4 5 6 7 8 9 10 11
NEAR ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
2: 1019 - - - - - - - - -
3: 661 1220 - - - - - - - -
4: 493 813 1336 - - - - - - -
5: 366 575 927 1420 - - - - - -
6: 294 436 667 1034 1474 - - - - -
7: 250 351 540 790 1122 1532 - - - -
8: 211 292 428 618 879 1203 1605 - - -
9: 176 238 342 496 711 942 1258 1631 - -
10: 149 202 284 417 598 791 1027 1320 1655 -
11: 133 176 251 360 497 673 865 1101 1403 1680
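The six tables above are related by simple arithmetic, which the following sketch checks for three cells copied from the output. It assumes the total is the 1999 trees reported in the header. The tolerances are a choice made for this example, since some printed percentages appear to be truncated and not rounded. The helper name `check_row` is hypothetical.

```python
TOTAL = 1999  # "Trees read in" from the jpscan header

# (NEAR, NEED): (clustered, percent, clusters, average size, singletons)
rows = {
    (2, 2): (980, 49.02, 490, 2.0, 1019),
    (10, 8): (972, 48.62, 226, 4.3, 1027),
    (11, 11): (319, 15.95, 127, 2.5, 1680),
}


def check_row(clustered, percent, clusters, avg_size, singletons, total=TOTAL):
    """Return True if one table cell is consistent across all tables."""
    return (
        clustered + singletons == total                       # every tree counted once
        and abs(100.0 * clustered / total - percent) < 0.01   # percentage table
        and abs(clustered / clusters - avg_size) < 0.05       # average-size table
    )


results = {key: check_row(*values) for key, values in rows.items()}
print(results)
```

For example, at NEAR 10 and NEED 8, 972 clustered structures plus 1027 singletons gives 1999, and 972 / 226 clusters gives an average size of about 4.3.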
Example showclusters output (showclusters -h -q -v clusters.tdt | more):
HEADER AND SUMMARY:
program ................... showclusters
function .................. analysis and display of structure clusters
version ................... DCIS Release 4.61 (c) 1995
output requested .......... Summary Frequencies Sorted lists
singletons to be listed ... no
datatype(s) to show ....... all SMILES
display long data items ... normal
input file ................ jp810.tdt
tree allocation, initial .. 10000
tree allocation, final .... 10000
total datatrees read ...... 2002
trees with SMILES ......... 1999
cluster id required ....... none
trees with CL data ........ 1999
trees with FP data ........ 1999
trees with other data ..... 0 (0 items read)
trees used ................ 1999
clusters + singletons ..... 1253
number of singletons ...... 1027
number of clusters ........ 226
average cluster size ...... 4.3
largest cluster ........... 17
Generation of CLUSTERS:
ID ........... na
Program ...... jarpat
Version ...... 4.61
Source ....... NN (near neighbors)
Parameters ... 8,10,0
Generation of NEAR NEIGHBORS:
ID ........... na
Program ...... nearneighbors
Version ...... 4.61
Source ....... FP (fingerprints)
Parameters ... 16
Generation of FINGERPRINTS:
ID ........... na
Program ...... fingerprint
Version ...... 4.61
Source ....... med98.tdt
Parameters ... 2048,64,0.30,0/7
FREQUENCIES OF CLUSTER SIZES:
size | frequency size | frequency size | frequency
----------+---------- ----------+---------- ----------+----------
1 | 1027 7 | 10 13 | 1
2 | 107 8 | 1 14 | 4
3 | 35 9 | 9 15 | 5
4 | 15 10 | 8 16 | .
5 | 15 11 | 2 17 | 1
6 | 10 12 | 3 . | .
CLUSTERS LISTED BY SIZE, SMILES BY VAR(TANIMOTO):
CLUSTER 0 (64) size 17
0.0 0.0219 CC1(C)SC2C(NC(=O)Cc3ccccc3)C(=O)N2C1C(=O)O
0.1 0.0383 CC1(C)SC2C(NC(=O)C(C(=O)O)c3ccccc3)C(=O)N2C1C(=O)O
0.2 0.0477 CC1(C)SC2C(NC(=O)C(N=[N+]=[N-])c3ccccc3)C(=O)N2C1C(=O)O
0.3 0.0486 CC(=O)OCOC(=O)C1N2C(SC1(C)C)C(NC(=O)Cc3ccccc3)C2=O
0.4 0.0497 CC(C)(C)C(=O)OCOC(=O)C1N2C(SC1(C)C)C(NC(=O)C(N)c3ccccc3)C2=O
0.5 0.0507 CC1(C)SC2C(NC(=O)C3(N)CCCCC3)C(=O)N2C1C(=O)O
0.6 0.0566 CC1(C)SC2C(NC(=O)C(C(=O)Oc3ccccc3)c4ccccc4)C(=O)N2C1C(=O)O
0.7 0.0589 COC(C(=O)NC1C2SC(C)(C)C(N2C1=O)C(=O)O)c3ccc(Cl)c(Cl)c3
0.8 0.0657 CC1(C)SC2C(NC(=O)C(C(=O)O)c3ccsc3)C(=O)N2C1C(=O)O
0.9 0.0677 CC1(C)SC2C(NC(=O)COc3ccccc3)C(=O)N2C1C(=O)O
0.10 0.0720 CCC(Oc1ccccc1)C(=O)NC2C3SC(C)(C)C(N3C2=O)C(=O)O
0.11 0.0740 CC1(C)NC(C(=O)N1C2C3SC(C)(C)C(N3C2=O)C(=O)O)c4ccccc4
0.12 0.0791 COc1cccc(OC)c1C(=O)NC2C3SC(C)(C)C(N3C2=O)C(=O)O
0.13 0.0920 COC1(NC(=O)C(C(=O)O)c2ccsc2)C3SC(C)(C)C(N3C1=O)C(=O)O
0.14 0.0954 CC1(C)SC2C(NC(=O)C(N)c3ccccc3)C(=O)N2C1C(=O)OC4OC(=O)c5ccccc45
0.15 0.0977 CC1(C)SC2C(NC(=O)C(N)C3=CCC=CC3)C(=O)N2C1C(=O)O
0.16 0.0986 CC1(C)SC2C(NC(=O)C(NC(=O)N3CCN(C3=O)S(=O)(=O)C)c4ccccc4)C(=O)N2C1C(=O)O
CLUSTER 1 (5) size 15
1.0 0.0020 Cn1c(=O)n(C)c2[nH]c(=O)[nH]c2c1=O
1.1 0.0021 Cn1c(=O)[nH]c2[nH]c(=O)[nH]c2c1=O
1.2 0.0022 Cn1cnc2[nH]c(=O)n(C)c(=O)c12
1.3 0.0023 Cn1cnc2n(C)c(=O)[nH]c(=O)c12
1.4 0.0023 Cn1c(=O)[nH]c2[nH]c(=O)n(C)c(=O)c12
1.5 0.0024 Cn1c(=O)[nH]c2n(C)c(=O)[nH]c(=O)c12
1.6 0.0024 Cn1cnc2c(=O)n(C)c(=O)[nH]c12
1.7 0.0025 Cn1c(=O)[nH]c2nc[nH]c2c1=O
1.8 0.0026 Cn1c(=O)[nH]c(=O)c2[nH]cnc12
1.9 0.0027 Cn1c(=O)[nH]c(=O)c2[nH]c(=O)[nH]c12
1.10 0.0028 Cn1cnc2n(C)c(=O)n(C)c(=O)c12
1.11 0.0030 Cn1c(=O)[nH]c2n(C)c(=O)n(C)c(=O)c12
1.12 0.0035 Cn1cnc2c(=O)[nH]c(=O)[nH]c12
1.13 0.0051 Cn1c(=O)[nH]c2[nH]c(=O)[nH]c(=O)c12
1.14 0.0105 Cn1cnc2c(=O)[nH]cnc12
CLUSTER 2 (75) size 15
2.0 0.0181 CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
2.1 0.0256 CN1C(=O)CN=C(c2ccccc2F)c3cc(Cl)ccc13
2.2 0.0259 Clc1ccc2NC(=O)CN=C(c3ccccc3)c2c1
2.3 0.0259 CN1C(=O)C(O)N=C(c2ccccc2)c3cc(Cl)ccc13
2.4 0.0282 Clc1ccc2N(CC#C)C(=O)CN=C(c3ccccc3)c2c1
2.5 0.0325 Clc1ccc2NC(=O)CN=C(c3ccccc3Cl)c2c1
2.6 0.0332 Clc1ccc2N(CC3CC3)C(=O)CN=C(c4ccccc4)c2c1
2.7 0.0336 CN1C(=O)C(O)N=C(c2ccccc2Cl)c3cc(Cl)ccc13
2.8 0.0342 FC(F)(F)CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
2.9 0.0367 CCN(CC)CCN1C(=O)CN=C(c2ccccc2F)c3cc(Cl)ccc13
2.10 0.0374 Clc1ccc2NC(=O)CN(=O)=C(c3ccccc3)c2c1
2.11 0.0391 OCCN1C(=O)C(O)N=C(c2ccccc2F)c3cc(Cl)ccc13
2.12 0.0525 CN1CCN=C(c2ccccc2)c3cc(Cl)ccc13
2.13 0.0575 CN(C)C(=O)OC1N=C(c2ccccc2)c3cc(Cl)ccc3N(C)C1=O
2.14 0.0766 CN1C(=O)CN=C(c2ccccc2F)c3cc(ccc13)N(=O)=O
CLUSTER 3 (286) size 15
3.0 0.0245 CC1CC2C3CC(F)C4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO
3.1 0.0300 CC1CC2C3CC(F)C4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)COC(=O)C(C)(C)C
3.2 0.0308 CC12CC(O)C3(F)C(CCC4=CC(=O)C=CC43C)C2CC(O)C1(O)C(=O)CO
3.3 0.0339 CC1CC2C3CC(F)(F)C4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)COC(=O)C
3.4 0.0361 CC1CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(C)C(=O)CO
3.5 0.0361 CC(OC(=O)C)C(=O)C1(O)CCC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC21C
3.6 0.0361 CC1CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)COC(=O)C
3.7 0.0403 CC12CC(O)C3C(CC(F)C4=CC(=O)C=CC34C)C2CCC1(O)C(=O)CO
3.8 0.0425 CCCCC(=O)OC1(CCC2C3CC(F)C4=CC(=O)C=CC4(C)C3C(O)CC21C)C(=O)CO
3.9 0.0450 CC1CC2C3CC(F)C4=CC(=O)C=CC4(C)C3C(O)CC2(C)C1C(=O)CO
3.10 0.0518 CCC(=O)OC1(C(C)CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC21C)C(=O)CCl
3.11 0.0612 CC(=O)OCC(=O)C1(CCC2C3CC(F)C4=CC(=O)C(=CC4(C)C3(F)C(O)CC21C)Br)OC(=O)C
3.12 0.0642 CCC(=O)OC1(C(C)CC2C3CC(F)C4=CC(=O)C=CC4(C)C3(F)C(O)CC21C)C(=O)SC
3.13 0.0697 CC1CC2C3CC(F)C4=CC(=O)C=CC4(C)C3(Cl)C(O)CC2(C)C1C(=O)COC(=O)C(C)(C)C
3.14 0.1074 CCSC1(CCC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC21C)SC
...
228 singletons suppressed
...
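As a cross-check, the summary figures in the showclusters header can be recomputed from the FREQUENCIES OF CLUSTER SIZES table. The dictionary below is transcribed by hand from that table (size 16 has no entries), and the variable names are chosen for this sketch only.

```python
# size -> frequency, copied from the "FREQUENCIES OF CLUSTER SIZES" table
freq = {1: 1027, 2: 107, 3: 35, 4: 15, 5: 15, 6: 10, 7: 10, 8: 1, 9: 9,
        10: 8, 11: 2, 12: 3, 13: 1, 14: 4, 15: 5, 17: 1}

singletons = freq[1]
clusters = sum(count for size, count in freq.items() if size > 1)
clustered = sum(size * count for size, count in freq.items() if size > 1)
total = clustered + singletons
average = clustered / clusters
largest = max(freq)

print(singletons, clusters, clustered, total, round(average, 1), largest)
```

The recomputed values agree with the header: 1027 singletons, 226 clusters, 1999 trees used, an average cluster size of 4.3, and a largest cluster of 17.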
Daylight Chemical Information Systems Inc.