Cluster Package

The Daylight Cluster Package generates clusters based on Daylight Fingerprints, Tanimoto similarity, and a non-hierarchical clustering algorithm, Jarvis-Patrick.

Motivation:

Automatically group compounds into structurally related families. The result is a set of clusters.
Determine the degree of clustering of structures in a large set for the purpose of characterizing that set of structures. The result is a set of clustering statistics.
Select a limited number of compounds to represent all structural classes in the total set.
Locate unique or unusual compounds in a set. The result is a list of compounds that don't cluster.
Provide a very fast and compact form of similarity searching (i.e. given one structure, find all members of its cluster).

Methodology:

Characterize substructural content as generally as possible. The basic descriptor is the Daylight Fingerprint.
Establish intermolecular similarity based on substructural characterization. The similarity measure is Tanimoto.
Perform non-parametric clustering based on similarity. The clustering algorithm is Jarvis-Patrick.
Postprocess the resultant data for improved usability. The output is cluster statistics.
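
As an illustration of the similarity step, here is a minimal Python sketch of the Tanimoto coefficient. It assumes a toy representation in which a fingerprint is a set of "on" bit positions; this is not the Daylight toolkit API, and the names tanimoto, fp1 and fp2 are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: bits set in both / bits set in either.

    fp_a and fp_b are sets of 'on' bit positions (a toy stand-in
    for a real fingerprint bit string).
    """
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(fp_a & fp_b) / union


fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8, 9, 12}
print(tanimoto(fp1, fp2))  # 3 shared bits / 6 bits in the union = 0.5
```

A value of 1.0 means the two bit sets are identical and 0.0 means they share no bits.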

Jarvis-Patrick clustering

For each item, find its J nearest neighbors. This needs to be done only once. (nearneighbors)

Two structures cluster together if: (jarpat)
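
The clustering conditions themselves are not reproduced in this text. The sketch below uses the usual textbook statement of Jarvis-Patrick, which is an assumption rather than a description of jarpat: two items join the same cluster if each appears in the other's list of J nearest neighbors and they share at least K neighbors. J and K presumably correspond to the NEAR and NEED parameters in the jpscan tables. The function name, the toy neighbor lists, and the choice to exclude an item from its own list are all illustrative.

```python
def jarvis_patrick(neighbors, need):
    """Cluster items from precomputed nearest-neighbor lists.

    neighbors: dict mapping item -> list of its J nearest neighbors
               (the item itself is excluded in this sketch).
    need:      minimum number of shared neighbors (K).
    Returns a list of sets, largest first; singletons are 1-member sets.
    """
    parent = {item: item for item in neighbors}

    def find(item):
        # Union-find lookup with path halving.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    nbr_sets = {item: set(lst) for item, lst in neighbors.items()}
    for a in neighbors:
        for b in neighbors[a]:
            mutual = a in nbr_sets[b]
            shared = len(nbr_sets[a] & nbr_sets[b])
            if mutual and shared >= need:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for item in neighbors:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy data with J = 3: "e" lists others, but nobody lists "e".
toy = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
    "e": ["d", "a", "b"],
}
print(jarvis_patrick(toy, need=2))  # one 4-member cluster plus the singleton "e"
```

Because the neighbor lists are computed once, the clustering can be rerun cheaply with different values of K, which is the kind of scan that jpscan tabulates.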

Cluster statistics (listclusters, showclusters)

number of clusters,
number and percentage of structures clustered,
frequency distribution of clusters by cluster size,
the amount of variance of the cluster that is unexplained by each member.

If n is the number of members of a cluster and Tij is the Tanimoto metric between members i and j, the statistic for member i is:

centroid, a representative structure of the group.
singletons, structures which are not assigned to a cluster
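
The counting statistics in this list can be sketched in a few lines of Python. This is an illustration only, not the listclusters or showclusters implementation; the function name, the input format (a non-empty dict from structure name to cluster id) and the toy data are assumptions. The variance statistic and the centroid are left out because their formulas are not reproduced in this text.

```python
from collections import Counter


def cluster_summary(labels):
    """Summarize a clustering given as a dict: structure -> cluster id.

    A cluster id used by only one structure is counted as a singleton.
    """
    sizes = Counter(labels.values())            # cluster id -> member count
    real = [s for s in sizes.values() if s > 1]
    clustered = sum(real)
    return {
        "clusters": len(real),
        "singletons": len(sizes) - len(real),
        "clustered": clustered,
        "percent_clustered": 100.0 * clustered / len(labels),
        # cluster size -> number of clusters of that size
        "size_frequencies": dict(sorted(Counter(sizes.values()).items())),
        "largest": max(sizes.values()),
    }


toy_labels = {"s1": 0, "s2": 0, "s3": 0, "s4": 1, "s5": 1,
              "s6": 2, "s7": 3, "s8": 3}
print(cluster_summary(toy_labels))
```

For the toy data this reports 3 clusters, 1 singleton, 7 of 8 structures clustered (87.5 percent), and a largest cluster of size 3.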

Included programs and clustering procedure:

smi2tdt -- create file of Thor datatrees from SMILES
fingerprint -- characterize substructural content
nearneighbors -- find nearest neighbors by comparing fingerprints
(mergeneighbors -- combine nearest-neighbor lists)
jpscan -- tabulate J-P clustering results with varying parameters
jarpat -- Jarvis-Patrick clustering
showclusters -- process cluster data for textual display, ascii output
listclusters -- sort and reformat clustering data for further processing, TDT output

Example

smi2tdt -t '$CAS' smicas.smi > smicas.tdt

fingerprint -id 7DEC smicas.tdt > fingers.tdt

nearneighbors -FID 7DEC fingers.tdt > neighbors.tdt

jpscan neighbors.tdt > /dev/printer

jarpat -JP_NEED 8 -JP_NEAR 14 -NNID 7DEC neighbors.tdt > clusters.tdt

showclusters -h -q -v clusters.tdt | more

showclusters -h -q -x clusters.tdt > /dev/printer

Example jpscan output (for 11 nearest neighbors):

Program ......... jpscan
Version ......... Daylight Software Release 4.51
Function ........ scan Jarvis-Patrick clustering parameters
Input ........... NN (nearest neighbors) data
NN data set ..... na
   created by ... nearneighbors
   version ...... 4.51
   from ......... FP
   with params .. 16
Trees read in ... 1999

Clustered by .... standard Jarvis-Patrick method

NUMBER OF STRUCTURES CLUSTERED

      ------- NEED ---------------------------------------------------------
           2      3      4      5      6      7      8      9     10     11
NEAR  ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
   2:    980      -      -      -      -      -      -      -      -      -
   3:   1338    779      -      -      -      -      -      -      -      -
   4:   1506   1186    663      -      -      -      -      -      -      -
   5:   1633   1424   1072    579      -      -      -      -      -      -
   6:   1705   1563   1332    965    525      -      -      -      -      -
   7:   1749   1648   1459   1209    877    467      -      -      -      -
   8:   1788   1707   1571   1381   1120    796    394      -      -      -
   9:   1823   1761   1657   1503   1288   1057    741    368      -      -
  10:   1850   1797   1715   1582   1401   1208    972    679    344      -
  11:   1866   1823   1748   1639   1502   1326   1134    898    596    319

PERCENTAGE OF STRUCTURES CLUSTERED

      ------- NEED ---------------------------------------------------------
           2      3      4      5      6      7      8      9     10     11
NEAR  ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
   2:  49.02      -      -      -      -      -      -      -      -      -
   3:  66.93  38.96      -      -      -      -      -      -      -      -
   4:  75.33  59.32  33.16      -      -      -      -      -      -      -
   5:  81.69  71.23  53.62  28.96      -      -      -      -      -      -
   6:  85.29  78.18  66.63  48.27  26.26      -      -      -      -      -
   7:  87.49  82.44  72.98  60.48  43.87  23.36      -      -      -      -
   8:  89.44  85.39  78.58  69.08  56.02  39.81  19.70      -      -      -
   9:  91.19  88.09  82.89  75.18  64.43  52.87  37.06  18.40      -      -
  10:  92.54  89.89  85.79  79.13  70.08  60.43  48.62  33.96  17.20      -
  11:  93.34  91.19  87.44  81.99  75.13  66.33  56.72  44.92  29.81  15.95



NUMBER OF CLUSTERS

      ------- NEED ---------------------------------------------------------
           2      3      4      5      6      7      8      9     10     11
NEAR  ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
   2:    490      -      -      -      -      -      -      -      -      -
   3:    466    333      -      -      -      -      -      -      -      -
   4:    365    378    264      -      -      -      -      -      -      -
   5:    280    333    317    220      -      -      -      -      -      -
   6:    209    278    318    273    193      -      -      -      -      -
   7:    154    200    267    271    241    172      -      -      -      -
   8:    121    142    213    259    246    221    145      -      -      -
   9:     94    118    164    223    240    240    202    130      -      -
  10:     71     92    133    188    226    234    226    184    136      -
  11:     61     78    107    147    194    219    232    209    171    127

AVERAGE CLUSTER SIZE

      ------- NEED ---------------------------------------------------------
           2      3      4      5      6      7      8      9     10     11
NEAR  ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
   2:    2.0      -      -      -      -      -      -      -      -      -
   3:    2.8    2.3      -      -      -      -      -      -      -      -
   4:    4.1    3.1    2.5      -      -      -      -      -      -      -
   5:    5.8    4.2    3.3    2.6      -      -      -      -      -      -
   6:    8.1    5.6    4.1    3.5    2.7      -      -      -      -      -
   7:   11.3    8.2    5.4    4.4    3.6    2.7      -      -      -      -
   8:   14.7   12.0    7.3    5.3    4.5    3.6    2.7      -      -      -
   9:   19.3   14.9   10.1    6.7    5.3    4.4    3.6    2.8      -      -
  10:   26.0   19.5   12.8    8.4    6.1    5.1    4.3    3.6    2.5      -
  11:   30.5   23.3   16.3   11.1    7.7    6.0    4.8    4.2    3.4    2.5


SIZE OF LARGEST CLUSTER

      ------- NEED ---------------------------------------------------------
           2      3      4      5      6      7      8      9     10     11
NEAR  ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
   2:      3      -      -      -      -      -      -      -      -      -
   3:      9      4      -      -      -      -      -      -      -      -
   4:     31     11      5      -      -      -      -      -      -      -
   5:     79     15     10      6      -      -      -      -      -      -
   6:    371    106     17      9      7      -      -      -      -      -
   7:    946    204     27     16     11      8      -      -      -      -
   8:   1114    829    133     34     16     12      9      -      -      -
   9:   1264   1009    460     58     23     18     13     10      -      -
  10:   1337   1177    873    123     35     21     18     15     11      -
  11:   1459   1239   1018    306     60     33     23     18     15     11


NUMBER OF SINGLETONS

      ------- NEED ---------------------------------------------------------
           2      3      4      5      6      7      8      9     10     11
NEAR  ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
   2:   1019      -      -      -      -      -      -      -      -      -
   3:    661   1220      -      -      -      -      -      -      -      -
   4:    493    813   1336      -      -      -      -      -      -      -
   5:    366    575    927   1420      -      -      -      -      -      -
   6:    294    436    667   1034   1474      -      -      -      -      -
   7:    250    351    540    790   1122   1532      -      -      -      -
   8:    211    292    428    618    879   1203   1605      -      -      -
   9:    176    238    342    496    711    942   1258   1631      -      -
  10:    149    202    284    417    598    791   1027   1320   1655      -
  11:    133    176    251    360    497    673    865   1101   1403   1680
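
The six tables are related by simple arithmetic, which the following Python sketch checks for two cells copied from the output above. The helper names are hypothetical, and the use of truncation instead of rounding is an inference from the printed numbers (for example 1338/466 = 2.87 is printed as 2.8), not something the jpscan output states.

```python
import math

TOTAL_TREES = 1999  # "Trees read in" from the jpscan header above

# (NEAR, NEED) -> values copied from the tables above
CELLS = {
    (3, 2): {"clustered": 1338, "clusters": 466, "singletons": 661,
             "percent": 66.93, "avg_size": 2.8},
    (10, 8): {"clustered": 972, "clusters": 226, "singletons": 1027,
              "percent": 48.62, "avg_size": 4.3},
}


def truncate(value, places):
    """Drop digits beyond the given number of decimal places."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


def derive(clustered, clusters, total=TOTAL_TREES):
    """Recompute the dependent table entries from two counts."""
    return {
        "singletons": total - clustered,
        "percent": truncate(100.0 * clustered / total, 2),
        "avg_size": truncate(clustered / clusters, 1),
    }


for key, cell in CELLS.items():
    derived = derive(cell["clustered"], cell["clusters"])
    matches = all(derived[name] == cell[name] for name in derived)
    print(key, derived, "matches printed values:", matches)
```

Both cells reproduce the printed singleton count, percentage, and average cluster size, so only the number of structures clustered, the number of clusters, and the largest cluster carry independent information.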

Example showclusters output (showclusters -h -q -v clusters.tdt | more):

HEADER AND SUMMARY:

program ................... showclusters
function .................. analysis and display of structure clusters
version ................... DCIS Release 4.61 (c) 1995
output requested .......... Summary  Frequencies  Sorted lists
singletons to be listed ... no
datatype(s) to show ....... all SMILES
display long data items ... normal
input file ................ jp810.tdt
tree allocation, initial .. 10000
tree allocation, final .... 10000
total datatrees read ...... 2002
trees with SMILES ......... 1999
cluster id required ....... none
trees with CL data ........ 1999
trees with FP data ........ 1999
trees with other data ..... 0 (0 items read)
trees used ................ 1999
clusters + singletons ..... 1253
number of singletons ...... 1027
number of clusters ........ 226
average cluster size ...... 4.3
largest cluster ........... 17

Generation of CLUSTERS:
   ID ........... na
   Program ...... jarpat
   Version ...... 4.61
   Source ....... NN (near neighbors)
   Parameters ... 8,10,0

Generation of NEAR NEIGHBORS:
   ID ........... na
   Program ...... nearneighbors
   Version ...... 4.61
   Source ....... FP (fingerprints)
   Parameters ... 16

Generation of FINGERPRINTS:
   ID ........... na
   Program ...... fingerprint
   Version ...... 4.61
   Source ....... med98.tdt
   Parameters ... 2048,64,0.30,0/7


FREQUENCIES OF CLUSTER SIZES:

         size | frequency         size | frequency         size | frequency
    ----------+----------    ----------+----------    ----------+----------
            1 | 1027                 7 | 10                  13 | 1
            2 | 107                  8 | 1                   14 | 4
            3 | 35                   9 | 9                   15 | 5
            4 | 15                  10 | 8                   16 | .
            5 | 15                  11 | 2                   17 | 1
            6 | 10                  12 | 3                    . | .


CLUSTERS LISTED BY SIZE, SMILES BY VAR(TANIMOTO):

CLUSTER 0 (64) size 17
    0.0     0.0219 CC1(C)SC2C(NC(=O)Cc3ccccc3)C(=O)N2C1C(=O)O
    0.1     0.0383 CC1(C)SC2C(NC(=O)C(C(=O)O)c3ccccc3)C(=O)N2C1C(=O)O
    0.2     0.0477 CC1(C)SC2C(NC(=O)C(N=[N+]=[N-])c3ccccc3)C(=O)N2C1C(=O)O
    0.3     0.0486 CC(=O)OCOC(=O)C1N2C(SC1(C)C)C(NC(=O)Cc3ccccc3)C2=O
    0.4     0.0497 CC(C)(C)C(=O)OCOC(=O)C1N2C(SC1(C)C)C(NC(=O)C(N)c3ccccc3)C2=O
    0.5     0.0507 CC1(C)SC2C(NC(=O)C3(N)CCCCC3)C(=O)N2C1C(=O)O
    0.6     0.0566 CC1(C)SC2C(NC(=O)C(C(=O)Oc3ccccc3)c4ccccc4)C(=O)N2C1C(=O)O
    0.7     0.0589 COC(C(=O)NC1C2SC(C)(C)C(N2C1=O)C(=O)O)c3ccc(Cl)c(Cl)c3
    0.8     0.0657 CC1(C)SC2C(NC(=O)C(C(=O)O)c3ccsc3)C(=O)N2C1C(=O)O
    0.9     0.0677 CC1(C)SC2C(NC(=O)COc3ccccc3)C(=O)N2C1C(=O)O
    0.10    0.0720 CCC(Oc1ccccc1)C(=O)NC2C3SC(C)(C)C(N3C2=O)C(=O)O
    0.11    0.0740 CC1(C)NC(C(=O)N1C2C3SC(C)(C)C(N3C2=O)C(=O)O)c4ccccc4
    0.12    0.0791 COc1cccc(OC)c1C(=O)NC2C3SC(C)(C)C(N3C2=O)C(=O)O
    0.13    0.0920 COC1(NC(=O)C(C(=O)O)c2ccsc2)C3SC(C)(C)C(N3C1=O)C(=O)O
    0.14    0.0954 CC1(C)SC2C(NC(=O)C(N)c3ccccc3)C(=O)N2C1C(=O)OC4OC(=O)c5ccccc45
    0.15    0.0977 CC1(C)SC2C(NC(=O)C(N)C3=CCC=CC3)C(=O)N2C1C(=O)O
    0.16    0.0986 CC1(C)SC2C(NC(=O)C(NC(=O)N3CCN(C3=O)S(=O)(=O)C)c4ccccc4)C(=O)N2C1C(=O)O
CLUSTER 1 (5) size 15
    1.0     0.0020 Cn1c(=O)n(C)c2[nH]c(=O)[nH]c2c1=O
    1.1     0.0021 Cn1c(=O)[nH]c2[nH]c(=O)[nH]c2c1=O
    1.2     0.0022 Cn1cnc2[nH]c(=O)n(C)c(=O)c12
    1.3     0.0023 Cn1cnc2n(C)c(=O)[nH]c(=O)c12
    1.4     0.0023 Cn1c(=O)[nH]c2[nH]c(=O)n(C)c(=O)c12
    1.5     0.0024 Cn1c(=O)[nH]c2n(C)c(=O)[nH]c(=O)c12
    1.6     0.0024 Cn1cnc2c(=O)n(C)c(=O)[nH]c12
    1.7     0.0025 Cn1c(=O)[nH]c2nc[nH]c2c1=O
    1.8     0.0026 Cn1c(=O)[nH]c(=O)c2[nH]cnc12
    1.9     0.0027 Cn1c(=O)[nH]c(=O)c2[nH]c(=O)[nH]c12
    1.10    0.0028 Cn1cnc2n(C)c(=O)n(C)c(=O)c12
    1.11    0.0030 Cn1c(=O)[nH]c2n(C)c(=O)n(C)c(=O)c12
    1.12    0.0035 Cn1cnc2c(=O)[nH]c(=O)[nH]c12
    1.13    0.0051 Cn1c(=O)[nH]c2[nH]c(=O)[nH]c(=O)c12
    1.14    0.0105 Cn1cnc2c(=O)[nH]cnc12
CLUSTER 2 (75) size 15
    2.0     0.0181 CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
    2.1     0.0256 CN1C(=O)CN=C(c2ccccc2F)c3cc(Cl)ccc13
    2.2     0.0259 Clc1ccc2NC(=O)CN=C(c3ccccc3)c2c1
    2.3     0.0259 CN1C(=O)C(O)N=C(c2ccccc2)c3cc(Cl)ccc13
    2.4     0.0282 Clc1ccc2N(CC#C)C(=O)CN=C(c3ccccc3)c2c1
    2.5     0.0325 Clc1ccc2NC(=O)CN=C(c3ccccc3Cl)c2c1
    2.6     0.0332 Clc1ccc2N(CC3CC3)C(=O)CN=C(c4ccccc4)c2c1
    2.7     0.0336 CN1C(=O)C(O)N=C(c2ccccc2Cl)c3cc(Cl)ccc13
    2.8     0.0342 FC(F)(F)CN1C(=O)CN=C(c2ccccc2)c3cc(Cl)ccc13
    2.9     0.0367 CCN(CC)CCN1C(=O)CN=C(c2ccccc2F)c3cc(Cl)ccc13
    2.10    0.0374 Clc1ccc2NC(=O)CN(=O)=C(c3ccccc3)c2c1
    2.11    0.0391 OCCN1C(=O)C(O)N=C(c2ccccc2F)c3cc(Cl)ccc13
    2.12    0.0525 CN1CCN=C(c2ccccc2)c3cc(Cl)ccc13
    2.13    0.0575 CN(C)C(=O)OC1N=C(c2ccccc2)c3cc(Cl)ccc3N(C)C1=O
    2.14    0.0766 CN1C(=O)CN=C(c2ccccc2F)c3cc(ccc13)N(=O)=O
CLUSTER 3 (286) size 15
    3.0     0.0245 CC1CC2C3CC(F)C4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO
    3.1     0.0300 CC1CC2C3CC(F)C4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)COC(=O)C(C)(C)C
    3.2     0.0308 CC12CC(O)C3(F)C(CCC4=CC(=O)C=CC43C)C2CC(O)C1(O)C(=O)CO
    3.3     0.0339 CC1CC2C3CC(F)(F)C4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)COC(=O)C
    3.4     0.0361 CC1CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(C)C(=O)CO
    3.5     0.0361 CC(OC(=O)C)C(=O)C1(O)CCC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC21C
    3.6     0.0361 CC1CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)COC(=O)C
    3.7     0.0403 CC12CC(O)C3C(CC(F)C4=CC(=O)C=CC34C)C2CCC1(O)C(=O)CO
    3.8     0.0425 CCCCC(=O)OC1(CCC2C3CC(F)C4=CC(=O)C=CC4(C)C3C(O)CC21C)C(=O)CO
    3.9     0.0450 CC1CC2C3CC(F)C4=CC(=O)C=CC4(C)C3C(O)CC2(C)C1C(=O)CO
    3.10    0.0518 CCC(=O)OC1(C(C)CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC21C)C(=O)CCl
    3.11    0.0612 CC(=O)OCC(=O)C1(CCC2C3CC(F)C4=CC(=O)C(=CC4(C)C3(F)C(O)CC21C)Br)OC(=O)C
    3.12    0.0642 CCC(=O)OC1(C(C)CC2C3CC(F)C4=CC(=O)C=CC4(C)C3(F)C(O)CC21C)C(=O)SC
    3.13    0.0697 CC1CC2C3CC(F)C4=CC(=O)C=CC4(C)C3(Cl)C(O)CC2(C)C1C(=O)COC(=O)C(C)(C)C
    3.14    0.1074 CCSC1(CCC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC21C)SC
...
228 singletons suppressed
...

Daylight Chemical Information Systems Inc.