
Daylight Summer School 2001, June 5-7, Santa Fe, NM

Daylight Worksheet - Cluster Package ... WITH HINTS!

The Cluster Package generates clusters of compounds using the Daylight fingerprint descriptor and the Jarvis-Patrick clustering algorithm. It can select subsets of large datasets and add clustering data to TDT files for insertion into Daylight databases. Keep the files from this exercise for use in the Day 2 labs.

Generate a TDT file containing a clustered dataset from the "cluster.smi" dataset, using fixed-length fingerprints, 5 nearest neighbors, a Tanimoto threshold of 0.7, and a "reasonable" JP clustering level chosen from the jpscan output.

Generate Nearneighbors Table

$DY_ROOT/bin/smi2tdt -t '$SMI' cluster.smi cluster.tdt
fingerprint -b 1024 -c 1024 -id test cluster.tdt >cluster.fp.tdt
nearneighbors -fid test -NEIGHBORS 5 cluster.fp.tdt cluster.nn.tdt
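The nearneighbors step ranks structures by fingerprint similarity. As a rough illustration of the idea (not the Daylight implementation), the Python sketch below computes the Tanimoto coefficient between fingerprints represented as sets of "on" bit positions and builds a brute-force neighbor list. The function names and the set-based representation are assumptions made for this example. Each structure is kept in its own list, which appears to match the first entry of the NN data shown in this worksheet (similarity 1.0000).

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def nearest_neighbors(fps: List[Set[int]], n: int) -> List[List[Tuple[int, float]]]:
    """Return, for every fingerprint, its n most similar entries.

    Each entry is (index, similarity), most similar first. This is a
    brute-force O(N^2) comparison, and a structure is listed as its own
    first neighbor (assumption based on the sample NN data).
    """
    result = []
    for fp in fps:
        scored = [(j, tanimoto(fp, other)) for j, other in enumerate(fps)]
        scored.sort(key=lambda t: (-t[1], t[0]))  # ties broken by index
        result.append(scored[:n])
    return result
```

For example, `nearest_neighbors([{1, 2, 3}, {2, 3, 4}, {10}], 2)` ranks the second fingerprint as the closest non-self neighbor of the first, with a similarity of 0.5.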

sun1% $DY_ROOT/bin/smi2tdt -t '$SMI' cluster.smi cluster.tdt
sun1% fingerprint -b 1024 -c 1024 -id test cluster.tdt >cluster.fp.tdt
..................................................500 TDTs, 500 fingerprints added
..................................................1000 TDTs, 1000 fingerprints added

1002 TDTs, 1002 fingerprints added, 0 errors
Done.

sun1% nearneighbors -fid test -NEIGHBORS 5 cluster.fp.tdt cluster.nn.tdt
nearneighbors: reading input file (cluster.fp.tdt)
vvvvvvvvvv 1002 in 0.703 sec
 1003 trees in, 1002 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
      1003 datatrees read in so far
      2004 datatrees contain SMILES ($SMI) data
      1002 datatrees contain valid fingerprints
nearneighbors: sorting 1002 fingerprints
nearneighbors: finding neighbors of 1002 new structures
^^^^^^^^^^ 1002 in 7.577 sec
 1002 new structures processed
nearneighbors: normal exit

sun1% more cluster.nn.tdt
$NNG<na;nearneighbors;4.71;FP,test;5>
|
$FPG<test;fingerprint;4.71;cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CC(C)(C)CC(C)(C)c1ccc(O)c(Cc2ccc(Cl)cc2Cl)c1>
FP<E+.0Y.+E..F5cIG3U.+..22.60+00U....U3E.EE0E...A.U8..W.2.c.1.A..c8.6IE2..0.5..EA..6+2d0.eU0+6.....6+0..2.0.0c...22...VE4.2+8..0+2+..0....8.U0+..M.2....2..+.7.U....630...F.+6.2;1024;127;1024;127;1;test>
NN<na;0,39,14,694,17;1.0000,0.5746,0.5588,0.5426,0.5401>
$SMI<CLOFOCTOL>
|
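To work with the neighbor table outside the Daylight tools, an NN data item can be split into its parts. The sketch below assumes, from the sample record above only, that the three semicolon-separated fields are a dataset ID, a list of neighbor indices, and a list of the corresponding similarities; the function name `parse_nn` is hypothetical.

```python
import re
from typing import List, Tuple

# NN<id;index,index,...;similarity,similarity,...>  (layout inferred from the sample)
NN_PATTERN = re.compile(r"^NN<([^;]*);([^;]*);([^>]*)>$")


def parse_nn(line: str) -> Tuple[str, List[int], List[float]]:
    """Split one NN data item into (id, neighbor indices, similarities)."""
    match = NN_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not an NN data item: %r" % line)
    set_id, index_field, similarity_field = match.groups()
    neighbors = [int(x) for x in index_field.split(",")]
    similarities = [float(x) for x in similarity_field.split(",")]
    if len(neighbors) != len(similarities):
        raise ValueError("neighbor and similarity counts differ")
    return set_id, neighbors, similarities
```

Applied to the record above, this yields the neighbor indices 0, 39, 14, 694 and 17 paired with their similarities.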

Choose JP level from jpscan output

jpscan -NN_BEST_THRESHOLD 0.7 -JP_NEAR 5 cluster.nn.tdt >jpscan.out
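The NEAR and NEED values scanned by jpscan correspond to the two parameters of Jarvis-Patrick clustering. The sketch below follows the textbook formulation: two structures are joined when each appears in the other's list of NEAR nearest neighbors and the two lists share at least NEED entries. It is an illustration under those assumptions, not the jarpat source. It takes neighbor lists as given (whether a structure counts as its own neighbor changes the shared count) and does not model the similarity threshold.

```python
from typing import List


def jarvis_patrick(neighbors: List[List[int]], near: int, need: int) -> List[int]:
    """Return one cluster label per item.

    neighbors[i] lists the neighbor indices of item i, most similar first.
    Items i and j are merged when they are in each other's top `near`
    lists and those lists have at least `need` members in common.
    """
    top = [set(lst[:near]) for lst in neighbors]
    parent = list(range(len(neighbors)))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(neighbors)):
        for j in top[i]:
            if j <= i:
                continue  # visit each unordered pair once, skip self
            if i in top[j] and len(top[i] & top[j]) >= need:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(neighbors))]
```

Items that end up alone under their label are singletons. Raising `need` makes merging stricter, which is consistent with the larger singleton counts in the NEED = NEAR diagonal of the jpscan tables.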

sun1% jpscan -NN_BEST_THRESHOLD 0.7 -JP_NEAR 5 cluster.nn.tdt >jpscan.out
vvvvvvvvvv 1002
jpscan: note, 1002 of 1004 input trees contain valid NN data
PERCENTAGE OF STRUCTURES CLUSTERED

     ------ NEED ------------------------------
       2   3   4   5
NEAR --- --- --- ---
  2:  46   -   -   -
  3:  60  47   -   -
  4:  65  62  43   -
  5:  69  67  63  40
sun1% more jpscan.out
Program ......... jpscan
Version ......... Daylight Software Release 4.71
Function ........ scan Jarvis-Patrick clustering parameters
Input ........... NN (nearest neighbors) data
NN data set ..... na
   created by ... nearneighbors
   version ...... 4.71
   from ......... FP
   with params .. test
Trees read in ... 1002

Clustered by .... standard Jarvis-Patrick method
Similarity threshold ... 0.700000
resulted in ... 266 automatic singletons.


NUMBER OF STRUCTURES CLUSTERED

      ------- NEED ---------------
           2      3      4      5
NEAR  ------ ------ ------ ------
   2:    462      -      -      -
   3:    600    467      -      -
   4:    656    621    433      -
   5:    689    675    629    401



PERCENTAGE OF STRUCTURES CLUSTERED

      ------- NEED ---------------
           2      3      4      5
NEAR  ------ ------ ------ ------
   2:  46.10      -      -      -
   3:  59.88  46.60      -      -
   4:  65.46  61.97  43.21      -
   5:  68.76  67.36  62.77  40.01



NUMBER OF CLUSTERS

      ------- NEED ---------------
           2      3      4      5
NEAR  ------ ------ ------ ------
   2:    231      -      -      -
   3:    212    198      -      -
   4:    184    193    170      -
   5:    155    169    187    156



AVERAGE CLUSTER SIZE

      ------- NEED ---------------
           2      3      4      5
NEAR  ------ ------ ------ ------
   2:    2.0      -      -      -
   3:    2.8    2.3      -      -
   4:    3.5    3.2    2.5      -
   5:    4.4    3.9    3.3    2.5


 
SIZE OF LARGEST CLUSTER
 
      ------- NEED ---------------
           2      3      4      5
NEAR  ------ ------ ------ ------
   2:      3      -      -      -
   3:      6      4      -      -
   4:     12      8      5      -
   5:     18     13      9      6



NUMBER OF SINGLETONS

      ------- NEED ---------------
           2      3      4      5
NEAR  ------ ------ ------ ------
   2:    540      -      -      -
   3:    402    535      -      -
   4:    346    381    569      -
   5:    313    327    373    601
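The jpscan tables are related by simple arithmetic, which helps when comparing levels. The sketch below copies the "number of structures clustered" and "number of clusters" values from the output above and derives the remaining quantities; the names `LEVELS` and `summarize` are made up for this example.

```python
from typing import Dict, Tuple

TOTAL = 1002  # "Trees read in" from the jpscan output

# (NEAR, NEED) -> (structures clustered, number of clusters), copied from above
LEVELS: Dict[Tuple[int, int], Tuple[int, int]] = {
    (2, 2): (462, 231),
    (3, 2): (600, 212), (3, 3): (467, 198),
    (4, 2): (656, 184), (4, 3): (621, 193), (4, 4): (433, 170),
    (5, 2): (689, 155), (5, 3): (675, 169), (5, 4): (629, 187), (5, 5): (401, 156),
}


def summarize(near: int, need: int) -> Dict[str, float]:
    """Derive singletons, percentage clustered and average cluster size."""
    clustered, clusters = LEVELS[(near, need)]
    return {
        "singletons": TOTAL - clustered,
        "percent_clustered": 100.0 * clustered / TOTAL,
        "average_cluster_size": clustered / clusters,
    }
```

For NEAR 5 and NEED 3 this gives 327 singletons, roughly 67.37 percent clustered and an average size of roughly 3.99. The tables print 67.36 and 3.9, which suggests that jpscan truncates rather than rounds these values.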

Generate clusters with jarpat, then show them in table form with showclusters

jarpat -JP_NEED 3 -JP_NEAR 5 cluster.nn.tdt >cluster.cl35.tdt

sun1%  jarpat -JP_NEED 3 -JP_NEAR 5 cluster.nn.tdt >cluster.cl35.tdt
vvvvvvvvvv 1002
jarpat: note, 1002 of 1004 input trees contain valid NN data
^^^^^^^^^^ 1002
1002 total: 155 singletons; 847 (84.5%) in 180 clusters
sun1% more cluster.cl35.tdt

$CLG<na;jarpat;4.71;NN;3,5,0>
|
$NNG<na;nearneighbors;4.71;FP,test;5>
|
$FPG<test;fingerprint;4.71;cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CC(C)(C)CC(C)(C)c1ccc(O)c(Cc2ccc(Cl)cc2Cl)c1>
FP<E+.0Y.+E..F5cIG3U.+..22.60+00U....U3E.EE0E...A.U8..W.2.c.1.A..c8.6IE2..0.5
..EA..6+2d0.eU0+6.....6+0..2.0.0c...22...VE4.2+8..0+2+..0....8.U0+..M.2....2..+.
7.U....630...F.+6.2;1024;127;1024;127;1;test>
CL<0;3>
$SMI<CLOFOCTOL>
|
$SMI<CC(C)(C)NCC(O)c1cc(O)cc(O)c1>
FP<...020+E.U73UIE..2+...2.60+......EU+0EE....2.A2U7.2W....2+6M.0W0..oM2..G0+
..Q6.0.0U30..U.0............U1.8U.M.2..0.VE0.U.8...62+..8.2..+...+..EU.+.U....+.
3k..0..630...+.+..2;1024;108;1024;108;1;test>
CL<1;11>
$SMI<TERBUTALI>
|
...
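The TDT listing above can be read with a few lines of Python for quick inspection. This is a minimal sketch that assumes only what the listing shows: data items have the form `TAG<data>`, a line containing just `|` ends a datatree, and long items may be wrapped across lines in the display. It does not handle data that itself contains `<` or `>`, and `parse_tdt` is a hypothetical helper, not part of the Daylight toolkit.

```python
import re
from typing import Dict, List

# One data item: optional "$", a tag name, then data between "<" and ">".
ITEM = re.compile(r"(\$?[A-Za-z0-9_]+)<([^>]*)>")


def parse_tdt(text: str) -> List[Dict[str, List[str]]]:
    """Split TDT text into datatrees, each a dict of tag -> list of data strings."""
    records: List[Dict[str, List[str]]] = []
    pending: List[str] = []
    for line in text.splitlines():
        if line.strip() == "|":
            joined = "".join(pending)  # rejoin items wrapped across lines
            items: Dict[str, List[str]] = {}
            for tag, data in ITEM.findall(joined):
                items.setdefault(tag, []).append(data)
            if items:
                records.append(items)
            pending = []
        else:
            pending.append(line.strip())
    return records
```

With the records parsed this way, the `CL` item of each datatree (for example `1;11` for TERBUTALI) can be compared against the cluster listing that showclusters produces.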

showclusters -h -q -v cluster.cl35.tdt >cluster.cl35.out

sun1% showclusters -h -q -v cluster.cl35.tdt >cluster.cl35.out
sun1% more cluster.cl35.out

 
HEADER AND SUMMARY:

program ................... showclusters
function .................. analysis and display of structure clusters
version ................... DCIS Release 4.71 (c) 2000
output requested .......... Summary  Frequencies  Sorted lists  
singletons to be listed ... no
datatype(s) to show ....... all SMILES
display long data items ... normal
input file ................ cluster.cl35.tdt
tree allocation, initial .. 10000
tree allocation, final .... 10000
total datatrees read ...... 1005
trees with SMILES ......... 2004
cluster id required ....... none
trees with CL data ........ 1002
trees with FP data ........ 1002
trees with other data ..... 0 (0 items read)
trees used ................ 1002
clusters + singletons ..... 335
number of singletons ...... 155
number of clusters ........ 180
average cluster size ...... 4.7
largest cluster ........... 12

Generation of CLUSTERS:
   ID ........... na
   Program ...... jarpat
   Version ...... 4.71
   Source ....... NN (near neighbors)
   Parameters ... 3,5,0
 
Generation of NEAR NEIGHBORS:
   ID ........... na
   Program ...... nearneighbors
   Version ...... 4.71
   Source ....... FP,test
   Parameters ... 5

Generation of FINGERPRINTS:
   ID ........... test
   Program ...... fingerprint
   Version ...... 4.71
   Source ....... cluster.tdt
   Parameters ... 1024,1024,0.30,0/7


FREQUENCIES OF CLUSTER SIZES:

         size | frequency         size | frequency         size | frequency
    ----------+----------    ----------+----------    ----------+----------
            1 | 155                  5 | 24                   9 | 5        
            2 | 44                   6 | 27                  10 | 4        
            3 | 20                   7 | 17                  11 | 3        
            4 | 28                   8 | 7                   12 | 1        
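A frequency table like the one above can be rebuilt from per-structure cluster IDs, and its totals can be checked against the summary. The sketch below is illustrative only; `size_frequencies` and `totals` are names invented here, and `PUBLISHED` is copied from the table above.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def size_frequencies(cluster_ids: Iterable[Hashable]) -> Dict[int, int]:
    """Map cluster size -> number of clusters of that size (size 1 = singleton)."""
    members_per_cluster = Counter(cluster_ids)
    return dict(Counter(members_per_cluster.values()))


# size -> frequency, copied from the showclusters output above
PUBLISHED = {1: 155, 2: 44, 3: 20, 4: 28, 5: 24, 6: 27,
             7: 17, 8: 7, 9: 5, 10: 4, 11: 3, 12: 1}


def totals(freq: Dict[int, int]) -> Tuple[int, int, int]:
    """Return (singletons, clusters of size > 1, total structures)."""
    singletons = freq.get(1, 0)
    clusters = sum(count for size, count in freq.items() if size > 1)
    structures = sum(size * count for size, count in freq.items())
    return singletons, clusters, structures
```

`totals(PUBLISHED)` gives 155 singletons, 180 clusters and 1002 structures, in agreement with the header and summary section.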


CLUSTERS LISTED BY SIZE, SMILES BY VAR(TANIMOTO):

CLUSTER 0 (100) size 12
    0.0     0.0468 CEFMENOXI
    0.1     0.0607 CEFOTIAM
    0.2     0.0672 CEFETAMET
    0.3     0.0764 CEFIXIME
    0.4     0.0771 CEFTERAM
    0.5     0.0810 CEFOTAXIM
    0.6     0.0848 CEFTAZIDI
    0.7     0.0930 CEFAMANDO
    0.8     0.1011 CEFORANID
    0.9     0.1100 CEFPIROME
    0.10    0.1253 CEFOTETAN
    0.11    0.1316 CEFBUPERA
CLUSTER 1 (1) size 11
    1.0     0.0171 ISOPRENAL
    1.1     0.0201 COLTEROL
    1.2     0.0256 ETILEFRIN
    1.3     0.0270 PHENYLEPH
    1.4     0.0278 DIOXIFEDR
    1.5     0.0314 NORADRENA
    1.6     0.0326 TERBUTALI
    1.7     0.0370 DIMETOFRI
    1.8     0.0446 DENOPAMIN
    1.9     0.0466 NORMETANE
    1.10    0.0480 ISOETARIN
...

Pick a representative subset of the clustered dataset from step one by selecting only the cluster centroids and the singletons.

listclusters -a cluster.cl35.tdt >cluster.cl.tdt

sun1% listclusters -a cluster.cl35.tdt >cluster.cl.tdt
sun1% more cluster.cl.tdt

$CLG<na;jarpat;4.71;NN;3,5,0>
|
$NNG<na;nearneighbors;4.71;FP,test;5>
|
$FPG<test;fingerprint;4.71;cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CEFTAZIDI>
CL<0;12>
|
$SMI<CEFORANID>
CL<0;12>
|
$SMI<CEFAMANDO>
CL<0;12>
|
$SMI<CEFOTETAN>
CL<0;12>
|
$SMI<CEFBUPERA>
CL<0;12>
|
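The worksheet does not spell out how a cluster centroid is determined. As one common stand-in, the sketch below picks the medoid of each cluster (the member with the highest total similarity to the other members) and keeps every singleton. This is an assumption made for illustration, not a description of what listclusters does; the function name and the `similarity` callback are hypothetical.

```python
from typing import Callable, Dict, Hashable, List


def pick_representatives(
    clusters: Dict[Hashable, List[str]],
    similarity: Callable[[str, str], float],
) -> List[str]:
    """Return one representative per cluster: the medoid, or the lone singleton."""
    representatives = []
    for members in clusters.values():
        if len(members) == 1:
            representatives.append(members[0])  # singletons are always kept
            continue

        def total_similarity(candidate: str) -> float:
            # Sum of similarities to every other member of the same cluster.
            return sum(similarity(candidate, other)
                       for other in members if other != candidate)

        # max() returns the first member on ties, so the result is deterministic.
        representatives.append(max(members, key=total_similarity))
    return representatives
```

The similarity callback could be a Tanimoto function over fingerprints; any symmetric measure works for this selection.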

Update the nearneighbors table generated from the cluster.tdt dataset with the "drugs.smi" dataset, fingerprinted with the same parameters used in step one.

sun1% smi2tdt drugs.smi >drugs.tdt
sun1% fingerprint -b 1024 -c 1024 -id test drugs.tdt > drugs.fp.tdt
.
10 TDTs, 10 fingerprints added, 0 errors
Done.
sun1% nearneighbors -NEIGHBORS 5 -UPDATE_FILE cluster.nn.tdt drugs.fp.tdt \
cluster.nn.update.tdt
nearneighbors: reading update file (cluster.nn.tdt)
vvvvvvvvvv 1002 in 0.800 sec
 1004 trees in, 1002 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
      1004 datatrees read in so far
      2004 datatrees contain SMILES ($SMI) data
      1002 datatrees contain valid fingerprints
nearneighbors: reading input file (drugs.fp.tdt)
 1012 in 0.008 sec
 1015 trees in, 1012 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
      1015 datatrees read in so far
      2014 datatrees contain SMILES ($SMI) data
      1012 datatrees contain valid fingerprints
nearneighbors: updating old neighbor lists
^^^^^^^^^^ 1002 in 1.945 sec
 1002 old structures processed
nearneighbors: sorting 1012 fingerprints
nearneighbors: finding neighbors of 10 new structures
 10 in 0.073 sec
 10 new structures processed
nearneighbors: normal exit
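The log above shows two phases: old neighbor lists are updated, then neighbors are found for the new structures. The sketch below mirrors that split in Python under simple assumptions (fingerprints as sets of "on" bits, new structures numbered after the old ones). It is meant to show why an update is cheaper than a full rebuild, not to reproduce the tool's behavior; all names are invented for the example.

```python
from typing import List, Set, Tuple

Neighbor = Tuple[int, float]  # (structure index, similarity)


def tanimoto(a: Set[int], b: Set[int]) -> float:
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 1.0


def top_n(candidates: List[Neighbor], n: int) -> List[Neighbor]:
    """Keep the n most similar candidates, ties broken by index."""
    return sorted(candidates, key=lambda t: (-t[1], t[0]))[:n]


def update_neighbors(old_nn: List[List[Neighbor]], old_fps: List[Set[int]],
                     new_fps: List[Set[int]], n: int) -> List[List[Neighbor]]:
    """Return neighbor lists for old and new structures combined."""
    offset = len(old_fps)
    all_fps = list(old_fps) + list(new_fps)
    updated = []
    # Phase 1: old structures only need to be compared with the new ones.
    for i, current in enumerate(old_nn):
        extra = [(offset + k, tanimoto(old_fps[i], fp))
                 for k, fp in enumerate(new_fps)]
        updated.append(top_n(list(current) + extra, n))
    # Phase 2: new structures are compared with everything.
    for fp in new_fps:
        scored = [(j, tanimoto(fp, other)) for j, other in enumerate(all_fps)]
        updated.append(top_n(scored, n))
    return updated
```

With 1002 old and 10 new structures, phase 1 needs about 1002 x 10 comparisons instead of the roughly 1002 x 1002 of a full rebuild, which is in line with the shorter run time reported in the log.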

Daylight Chemical Information Systems Inc.
support@daylight.com