The Cluster Package generates clusters of compounds from Daylight fingerprint descriptors using the Jarvis-Patrick clustering algorithm. It can be used to select subsets of large datasets and to add clustering data to TDT files for insertion into Daylight databases. Keep the files from this exercise; they are used again in the Day 2 labs.
$DY_ROOT/bin/smi2tdt -t '$SMI' cluster.smi cluster.tdt
fingerprint -b 1024 -c 1024 -id test cluster.tdt >cluster.fp.tdt
nearneighbors -fid test -NEIGHBORS 5 cluster.fp.tdt cluster.nn.tdt
sun1% $DY_ROOT/bin/smi2tdt -t '$SMI' cluster.smi cluster.tdt
sun1% fingerprint -b 1024 -c 1024 -id test cluster.tdt >cluster.fp.tdt
..................................................500 TDTs, 500 fingerprints added
..................................................1000 TDTs, 1000 fingerprints added
1002 TDTs, 1002 fingerprints added, 0 errors
Done.
sun1% nearneighbors -fid test -NEIGHBORS 5 cluster.fp.tdt cluster.nn.tdt
nearneighbors: reading input file (cluster.fp.tdt)
vvvvvvvvvv 1002 in 0.703 sec
1003 trees in, 1002 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
1003 datatrees read in so far
2004 datatrees contain SMILES ($SMI) data
1002 datatrees contain valid fingerprints
nearneighbors: sorting 1002 fingerprints
nearneighbors: finding neighbors of 1002 new structures
^^^^^^^^^^ 1002 in 7.577 sec
1002 new structures processed
nearneighbors: normal exit
sun1% more cluster.nn.tdt
$NNG<na;nearneighbors;4.71;FP,test;5>
|
$FPG<test;fingerprint;4.71;cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CC(C)(C)CC(C)(C)c1ccc(O)c(Cc2ccc(Cl)cc2Cl)c1>
FP<E+.0Y.+E..F5cIG3U.+..22.60+00U....U3E.EE0E...A.U8..W.2.c.1.A..c8.6IE2..0.5..EA..6+2d0.eU0+6.....6+0..2.0.0c...22...VE4.2+8..0+2+..0....8.U0+..M.2....2..+.7.U....630...F.+6.2;1024;127;1024;127;1;test>
NN<na;0,39,14,694,17;1.0000,0.5746,0.5588,0.5426,0.5401>
$SMI<CLOFOCTOL>
|
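The NN<> item in the output above pairs a list of neighbor indices with a list of similarity values, the first of which appears to be the structure itself at 1.0000. A minimal Python sketch of how such a list could be computed is given here, assuming the similarities are Tanimoto coefficients over fingerprint bits. The function names and the use of Python integers as bit sets are illustrative assumptions, not part of the Daylight toolkit, and the brute-force search is not a description of how nearneighbors works internally.

```python
from typing import List, Tuple


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints held as Python ints (bit sets)."""
    common = bin(a & b).count("1")   # bits set in both fingerprints
    total = bin(a | b).count("1")    # bits set in either fingerprint
    # Returning 0.0 for two empty fingerprints is a convention of this sketch.
    return common / total if total else 0.0


def nearest_neighbors(fps: List[int], k: int) -> List[List[Tuple[int, float]]]:
    """For each fingerprint, return its k most similar entries as (index, similarity).

    The structure itself is kept in its own list (assumption, based on the
    NN<> item above). Ties are broken by lower index, another assumption.
    """
    result = []
    for fp in fps:
        scored = [(j, tanimoto(fp, other)) for j, other in enumerate(fps)]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result.append(scored[:k])
    return result


if __name__ == "__main__":
    # Hypothetical 4-bit fingerprints, for illustration only.
    toy = [0b1111, 0b0111, 0b1000]
    for index, neighbors in enumerate(nearest_neighbors(toy, 2)):
        print(index, neighbors)
```

Comparing every pair is quadratic in the number of structures, which is fine for a toy set but is only meant to show what the neighbor list contains.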
jpscan -NN_BEST_THRESHOLD 0.7 -JP_NEAR 5 cluster.nn.tdt >jpscan.out
sun1% jpscan -NN_BEST_THRESHOLD 0.7 -JP_NEAR 5 cluster.nn.tdt >jpscan.out
vvvvvvvvvv 1002
jpscan: note, 1002 of 1004 input trees contain valid NN data

PERCENTAGE OF STRUCTURES CLUSTERED

       ------ NEED ------------------------------
         2    3    4    5
NEAR    ---  ---  ---  ---
  2:     46    -    -    -
  3:     60   47    -    -
  4:     65   62   43    -
  5:     69   67   63   40

sun1% more jpscan.out
Program ......... jpscan
Version ......... Daylight Software Release 4.71
Function ........ scan Jarvis-Patrick clustering parameters
Input ........... NN (nearest neighbors) data
NN data set ..... na
  created by ... nearneighbors
  version ...... 4.71
  from ......... FP
  with params .. test
Trees read in ... 1002
Clustered by .... standard Jarvis-Patrick method
Similarity threshold ... 0.700000
  resulted in ... 266 automatic singletons.

NUMBER OF STRUCTURES CLUSTERED

        ------- NEED ---------------
           2       3       4       5
NEAR    ------  ------  ------  ------
  2:      462       -       -       -
  3:      600     467       -       -
  4:      656     621     433       -
  5:      689     675     629     401

PERCENTAGE OF STRUCTURES CLUSTERED

        ------- NEED ---------------
           2       3       4       5
NEAR    ------  ------  ------  ------
  2:    46.10       -       -       -
  3:    59.88   46.60       -       -
  4:    65.46   61.97   43.21       -
  5:    68.76   67.36   62.77   40.01

NUMBER OF CLUSTERS

        ------- NEED ---------------
           2       3       4       5
NEAR    ------  ------  ------  ------
  2:      231       -       -       -
  3:      212     198       -       -
  4:      184     193     170       -
  5:      155     169     187     156

AVERAGE CLUSTER SIZE

        ------- NEED ---------------
           2       3       4       5
NEAR    ------  ------  ------  ------
  2:      2.0       -       -       -
  3:      2.8     2.3       -       -
  4:      3.5     3.2     2.5       -
  5:      4.4     3.9     3.3     2.5

SIZE OF LARGEST CLUSTER

        ------- NEED ---------------
           2       3       4       5
NEAR    ------  ------  ------  ------
  2:        3       -       -       -
  3:        6       4       -       -
  4:       12       8       5       -
  5:       18      13       9       6

NUMBER OF SINGLETONS

        ------- NEED ---------------
           2       3       4       5
NEAR    ------  ------  ------  ------
  2:      540       -       -       -
  3:      402     535       -       -
  4:      346     381     569       -
  5:      313     327     373     601
showclusters
jarpat -JP_NEED 3 -JP_NEAR 5 cluster.nn.tdt >cluster.cl35.tdt
sun1% jarpat -JP_NEED 3 -JP_NEAR 5 cluster.nn.tdt >cluster.cl35.tdt
vvvvvvvvvv 1002
jarpat: note, 1002 of 1004 input trees contain valid NN data
^^^^^^^^^^ 1002
1002 total: 155 singletons; 847 (84.5%) in 180 clusters
sun1% more cluster.cl35.tdt
$CLG<na;jarpat;4.71;NN;3,5,0>
|
$NNG<na;nearneighbors;4.71;FP,test;5>
|
$FPG<test;fingerprint;4.71;cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CC(C)(C)CC(C)(C)c1ccc(O)c(Cc2ccc(Cl)cc2Cl)c1>
FP<E+.0Y.+E..F5cIG3U.+..22.60+00U....U3E.EE0E...A.U8..W.2.c.1.A..c8.6IE2..0.5..EA..6+2d0.eU0+6.....6+0..2.0.0c...22...VE4.2+8..0+2+..0....8.U0+..M.2....2..+.7.U....630...F.+6.2;1024;127;1024;127;1;test>
CL<0;3>
$SMI<CLOFOCTOL>
|
$SMI<CC(C)(C)NCC(O)c1cc(O)cc(O)c1>
FP<...020+E.U73UIE..2+...2.60+......EU+0EE....2.A2U7.2W....2+6M.0W0..oM2..G0+..Q6.0.0U30..U.0............U1.8U.M.2..0.VE0.U.8...62+..8.2..+...+..EU.+.U....+.3k..0..630...+.+..2;1024;108;1024;108;1;test>
CL<1;11>
$SMI<TERBUTALI>
|
...
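The jarpat run above is driven by two parameters, -JP_NEAR 5 and -JP_NEED 3. A minimal Python sketch of the Jarvis-Patrick criterion as it is commonly stated follows: two structures join the same cluster when each appears in the other's near-neighbor list and the two lists share at least NEED entries. The function name, the input layout and the union-find bookkeeping are assumptions made for illustration. How jarpat treats details such as a structure appearing in its own list is not confirmed here.

```python
from typing import List


def jarvis_patrick(neighbors: List[List[int]], need: int) -> List[List[int]]:
    """Cluster items given neighbors[i], the near-neighbor index list of item i.

    Items i and j are linked when they are in each other's lists and the
    lists have at least `need` indices in common. Linked items are merged
    transitively, so a cluster is a connected group of linked items.
    """
    n = len(neighbors)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    sets = [set(lst) for lst in neighbors]
    for i in range(n):
        for j in sets[i]:
            if j <= i:
                continue  # each mutual pair is examined once, from its lower index
            if i in sets[j] and len(sets[i] & sets[j]) >= need:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, then by lowest member index.
    return sorted(clusters.values(), key=lambda c: (-len(c), c[0]))


if __name__ == "__main__":
    # Hypothetical neighbor lists with NEAR = 3, each including the item itself.
    toy = [[0, 1, 2], [1, 0, 2], [2, 0, 1], [3, 4, 0], [4, 3, 1], [5, 0, 1]]
    print(jarvis_patrick(toy, need=3))
    print(jarvis_patrick(toy, need=2))
```

Raising NEED makes the linking test stricter, which matches the general trend in the jpscan tables, where more structures end up as singletons as NEED approaches NEAR.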
showclusters -h -q -v cluster.cl35.tdt >cluster.cl35.out
sun1% showclusters -h -q -v cluster.cl35.tdt >cluster.cl35.out
sun1% more cluster.cl35.out
HEADER AND SUMMARY:
program ................... showclusters
function .................. analysis and display of structure clusters
version ................... DCIS Release 4.71 (c) 2000
output requested .......... Summary Frequencies Sorted lists
singletons to be listed ... no
datatype(s) to show ....... all SMILES
display long data items ... normal
input file ................ cluster.cl35.tdt
tree allocation, initial .. 10000
tree allocation, final .... 10000
total datatrees read ...... 1005
trees with SMILES ......... 2004
cluster id required ....... none
trees with CL data ........ 1002
trees with FP data ........ 1002
trees with other data ..... 0 (0 items read)
trees used ................ 1002
clusters + singletons ..... 335
number of singletons ...... 155
number of clusters ........ 180
average cluster size ...... 4.7
largest cluster ........... 12

Generation of CLUSTERS:
ID ........... na
Program ...... jarpat
Version ...... 4.71
Source ....... NN (near neighbors)
Parameters ... 3,5,0

Generation of NEAR NEIGHBORS:
ID ........... na
Program ...... nearneighbors
Version ...... 4.71
Source ....... FP,test
Parameters ... 5

Generation of FINGERPRINTS:
ID ........... test
Program ...... fingerprint
Version ...... 4.71
Source ....... cluster.tdt
Parameters ... 1024,1024,0.30,0/7

FREQUENCIES OF CLUSTER SIZES:

   size   | frequency     size   | frequency     size   | frequency
----------+----------  ----------+----------  ----------+----------
      1   |    155           5   |     24           9   |      5
      2   |     44           6   |     27          10   |      4
      3   |     20           7   |     17          11   |      3
      4   |     28           8   |      7          12   |      1

CLUSTERS LISTED BY SIZE, SMILES BY VAR(TANIMOTO):

CLUSTER 0 (100) size 12
0.0 0.0468 CEFMENOXI
0.1 0.0607 CEFOTIAM
0.2 0.0672 CEFETAMET
0.3 0.0764 CEFIXIME
0.4 0.0771 CEFTERAM
0.5 0.0810 CEFOTAXIM
0.6 0.0848 CEFTAZIDI
0.7 0.0930 CEFAMANDO
0.8 0.1011 CEFORANID
0.9 0.1100 CEFPIROME
0.10 0.1253 CEFOTETAN
0.11 0.1316 CEFBUPERA

CLUSTER 1 (1) size 11
1.0 0.0171 ISOPRENAL
1.1 0.0201 COLTEROL
1.2 0.0256 ETILEFRIN
1.3 0.0270 PHENYLEPH
1.4 0.0278 DIOXIFEDR
1.5 0.0314 NORADRENA
1.6 0.0326 TERBUTALI
1.7 0.0370 DIMETOFRI
1.8 0.0446 DENOPAMIN
1.9 0.0466 NORMETANE
1.10 0.0480 ISOETARIN
...
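The summary figures reported by showclusters can be cross-checked against its cluster-size frequency table. The short Python sketch below does that arithmetic. The dictionary is copied from the table in the output above, and the function name is an illustrative assumption.

```python
# size -> frequency, copied from the FREQUENCIES OF CLUSTER SIZES table above
SIZE_FREQUENCY = {
    1: 155, 2: 44, 3: 20, 4: 28,
    5: 24, 6: 27, 7: 17, 8: 7,
    9: 5, 10: 4, 11: 3, 12: 1,
}


def summarize(freq):
    """Derive summary counts from a {cluster size: number of clusters} table."""
    singletons = freq.get(1, 0)
    clusters = sum(count for size, count in freq.items() if size > 1)
    clustered = sum(size * count for size, count in freq.items() if size > 1)
    return {
        "singletons": singletons,
        "clusters": clusters,
        "clusters_plus_singletons": clusters + singletons,
        "structures_clustered": clustered,
        "structures_total": clustered + singletons,
        "average_cluster_size": clustered / clusters if clusters else 0.0,
        "largest_cluster": max(freq) if freq else 0,
    }


if __name__ == "__main__":
    for key, value in summarize(SIZE_FREQUENCY).items():
        print(key, value)
```

The totals agree with the transcripts: 155 singletons, 180 clusters, 335 clusters plus singletons, 847 clustered structures out of 1002, and an average cluster size of 847/180, which is about 4.7.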
listclusters -a cluster.cl35.tdt >cluster.cl.tdt
sun1% listclusters -a cluster.cl35.tdt >cluster.cl.tdt
sun1% more cluster.cl.tdt
$CLG<na;jarpat;4.71;NN;3,5,0>
|
$NNG<na;nearneighbors;4.71;FP,test;5>
|
$FPG<test;fingerprint;4.71;cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CEFTAZIDI>
CL<0;12>
|
$SMI<CEFORANID>
CL<0;12>
|
$SMI<CEFAMANDO>
CL<0;12>
|
$SMI<CEFOTETAN>
CL<0;12>
|
$SMI<CEFBUPERA>
CL<0;12>
|
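The listclusters output above can be post-processed outside the Daylight tools, for example to group entries by cluster number. The Python sketch below assumes only what the excerpt shows: records end with "|", items have the form TAG<data>, and the first field of CL<> is the cluster number. It does not handle quoted data containing ">" or "|", and the function name is hypothetical.

```python
import re
from collections import defaultdict

# One TDT item as it appears in the excerpt: TAG<data>, with an optional
# leading "$" on the tag. Quoted data containing ">" is not handled.
ITEM = re.compile(r"(\$?[A-Za-z0-9_]+)<([^>]*)>")


def clusters_from_tdt(text):
    """Group the first $SMI value of each record by its CL<> cluster number."""
    clusters = defaultdict(list)
    for record in text.split("|"):
        items = ITEM.findall(record)
        root = next((data for tag, data in items if tag == "$SMI"), None)
        cl = next((data for tag, data in items if tag == "CL"), None)
        if root is None or cl is None:
            continue  # header records such as $CLG<> carry no cluster data
        cluster_id = int(cl.split(";")[0])
        clusters[cluster_id].append(root)
    return dict(clusters)


if __name__ == "__main__":
    # Two records taken from the listclusters output shown above.
    sample = (
        "$CLG<na;jarpat;4.71;NN;3,5,0>\n|\n"
        "$SMI<CEFTAZIDI>\nCL<0;12>\n|\n"
        "$SMI<CEFORANID>\nCL<0;12>\n|\n"
    )
    print(clusters_from_tdt(sample))
```

To run it on the real file, read cluster.cl.tdt into a string and pass it to clusters_from_tdt.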
sun1% smi2tdt drugs.smi >drugs.tdt
sun1% fingerprint -b 1024 -c 1024 -id test drugs.tdt > drugs.fp.tdt
.
10 TDTs, 10 fingerprints added, 0 errors
Done.
sun1% nearneighbors -NEIGHBORS 5 -UPDATE_FILE cluster.nn.tdt drugs.fp.tdt \
cluster.nn.update.tdt
nearneighbors: reading update file (cluster.nn.tdt)
vvvvvvvvvv 1002 in 0.800 sec
1004 trees in, 1002 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
1004 datatrees read in so far
2004 datatrees contain SMILES ($SMI) data
1002 datatrees contain valid fingerprints
nearneighbors: reading input file (drugs.fp.tdt)
1012 in 0.008 sec
1015 trees in, 1012 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
1015 datatrees read in so far
2014 datatrees contain SMILES ($SMI) data
1012 datatrees contain valid fingerprints
nearneighbors: updating old neighbor lists
^^^^^^^^^^ 1002 in 1.945 sec
1002 old structures processed
nearneighbors: sorting 1012 fingerprints
nearneighbors: finding neighbors of 10 new structures
10 in 0.073 sec
10 new structures processed
nearneighbors: normal exit