K-modes Clustering:
K-modes is a variant of the well-known K-means clustering algorithm, adapted to categorical data. Here the data are fingerprints, and each cluster is represented by a modal fingerprint rather than a mean.
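The general idea can be illustrated with a minimal sketch. This is not the kmodes program's implementation: it assumes fingerprints are held as Python integers (one bit per fingerprint bit), that the modal of a cluster is a per-bit majority vote, and that ties go to the lowest-numbered cluster. The function names `tanimoto`, `modal` and `kmodes_sketch` are hypothetical.

```python
from typing import List, Tuple


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two bit fingerprints stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union


def modal(members: List[int], nbits: int) -> int:
    """Per-bit majority vote (assumption): a bit is set in the modal
    when more than half of the member fingerprints have it set."""
    result = 0
    for bit in range(nbits):
        on = sum((fp >> bit) & 1 for fp in members)
        if 2 * on > len(members):
            result |= 1 << bit
    return result


def kmodes_sketch(fps: List[int], seeds: List[int], nbits: int,
                  max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Assign each fingerprint to its most similar modal, recompute the
    modals, and repeat until a pass relocates nothing."""
    modals = list(seeds)
    assign = [-1] * len(fps)
    for _ in range(max_iter):
        moved = 0
        for i, fp in enumerate(fps):
            best = max(range(len(modals)),
                       key=lambda c: tanimoto(fp, modals[c]))
            if best != assign[i]:
                assign[i] = best
                moved += 1
        if moved == 0:
            break  # no structure moved in this relocation pass
        for c in range(len(modals)):
            members = [fp for fp, a in zip(fps, assign) if a == c]
            if members:  # keep the old modal for an empty cluster
                modals[c] = modal(members, nbits)
    return assign, modals
```

A call such as `kmodes_sketch(fps, seeds, nbits=512)` returns the cluster index of each fingerprint together with the final modals. Options such as cluster splitting or elimination of small clusters are deliberately left out of the sketch.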
$ kmodes -help

KMODES USAGE

SYNOPSIS: kmodes [options] in.tdt [out.tdt]

   options ...... options, see below
   in.tdt ....... readable .tdt file containing FP data
   out.tdt ...... writable file to receive .tdt output (default: stdout)

options:
   -k <modals> .......... number of modes to find (default: 100 or count of input seeds)
   -km <max_modals> ..... maximum number of modes; causes splitting (default: don't)
   -seeds <seed_file> ... tdt file of initial modals/seeds (default: none)
   -d <cluster_size> .... if a cluster drops below this size due to relocations, eliminate it (default: 0)
   -fast <threshold> .... percentage of relocations in a pass to terminate processing (default: 0.0)
   -partition ........... don't relocate modals; one assignment only
   -iter <max_iter> ..... maximum relocation iterations (default: none)
   -nomove .............. don't relocate during initial assignments
   -random .............. pick random seeds (default: don't)
   -randseed <###> ...... pick random seeds, use value as randomizer seed
   -EXPRESSION <expr> ... use expr for comparison (default: tanimoto)
   -COMPARISON [DISTANCE|SIMILARITY]
              ........ relative goodness of expr values (default: similarity)
   -JP_RUNID val ........ identify run by `val' (default: don't) [-id]
   -in val .............. use fingerprints with id `val' (default: first)
   -min val ............. use seed fingerprints with id `val' (default: first)
   -RECORD_COUNT val .... expect `val' structures (default: 10000) [-m]

NOTE: For the comparison option, DISTANCE means "lower is better" while
SIMILARITY means "higher is better" for the computed expression values.
The program will attempt to figure out the direction of the expression.
This option is only needed if the automatic computation is incorrect.
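The effect of a -fast threshold can be explored with a small sketch. It rests on two assumptions that the help text does not spell out: that the percentage is taken relative to the total number of structures, and that processing stops at the first pass whose relocation percentage is at or below the threshold. The relocation counts are copied from the sample run in this section; the function name `first_stopping_pass` is hypothetical.

```python
from typing import List, Optional


def first_stopping_pass(relocated: List[int], n_structures: int,
                        threshold_pct: float) -> Optional[int]:
    """Return the 1-based relocation pass at which processing would stop
    under the assumed rule, or None if no pass meets the threshold."""
    for pass_no, moved in enumerate(relocated, start=1):
        if 100.0 * moved / n_structures <= threshold_pct:
            return pass_no
    return None


# Relocation counts per pass, taken from the sample run in this section.
RELOCATED = [9158, 4574, 2987, 2751, 2437, 3171, 3953, 2757, 2154, 1835,
             1466, 1046, 869, 844, 566, 399, 376, 237, 108, 0]
N_STRUCTURES = 48776
```

Calling `first_stopping_pass(RELOCATED, N_STRUCTURES, 1.0)` shows how many of the 20 passes a 1% threshold would have kept under these assumptions. A threshold of 0.0 corresponds to the default behavior of running until no structure moves.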
$ time kmodes med03.fp_512 > med03.cl
kmodes: reading input file (/sfhome/jjdelany/TMP/zz/med03.fp_512)
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.118 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.100 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.099 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.102 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.111 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.100 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.099 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.098 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.098 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 3776 in 0.076 sec
48777 trees in, 48776 w/FP's processed
kmodes: Computing clusters of 48776 structures using 200 modals.
kmodes: Terminate when no structures move in a relocation pass.
Initial assignment in: 4.405 sec
relocated: 9158 in 3.247 sec
relocated: 4574 in 2.841 sec
relocated: 2987 in 2.702 sec
relocated: 2751 in 2.654 sec
relocated: 2437 in 2.653 sec
relocated: 3171 in 2.726 sec
relocated: 3953 in 2.799 sec
relocated: 2757 in 2.680 sec
relocated: 2154 in 2.637 sec
relocated: 1835 in 2.606 sec
relocated: 1466 in 2.576 sec
relocated: 1046 in 2.518 sec
relocated: 869 in 2.496 sec
relocated: 844 in 2.502 sec
relocated: 566 in 2.478 sec
relocated: 399 in 2.469 sec
relocated: 376 in 2.468 sec
relocated: 237 in 2.480 sec
relocated: 108 in 2.459 sec
relocated: 0 in 2.419 sec
ivar: 0.178263 ovar: 0.165578 kstart: 200 kfinal: 200 reloc: 20
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.109 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.097 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.093 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.092 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.091 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.092 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.092 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.092 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.091 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3776 in 0.069 sec
48776 new structures processed
kmodes: normal exit

real 0m59.46s
user 0m58.89s
sys 0m0.09s

$ more med03.cl
$CLG<;kmodes;4.83;FP;TANIMOTO,SIMILARITY,200/200>
|
$FPG<na;fingerprint;4.83;med03.smi;512,512,0.30,0/7>
|
$SMI<N#COc1ccccc1>
FP<.0204.EE.U6.U...W.7.6.2..G.26.3...U.w.E..E2..A.U..2WE.2.3+.......0IoA4......E.6...2+4...1;512;57;512;57;1>
CL<79;208;>
FPM<.0604.EE2U..U...0.6.M.20.G..6.5...UUs.E..E2..A.2..2WE.2.3+..0....0II24......E.6...2+4...1;512;333;512;56;1>
|
$SMI<CCCc1cc(I)cc(CNCCCN(C)C)c1O>
FP<.100gA+O.kFZUKEXUY7+8F2.Sm++M03.6Mc+w.F.UE2..Q2UQk.W.+263P.OU.k.07KI456OG9G.tQe..+2F4...1;512;135;512;135;1>
CL<1;203;>
FPM<.1.0A.+E.k+2U.E0.26.8.2.0G+.M03.6Ec+s.E.YE2..A2.Q.2W.+2.3FE6..U.07II44.G070.NM8...2+4...1;512;135;512;87;1>
|
...
$SMI<COC(=O)C(C)NP(=O)(OCC1OC(C=C1)n2cc(C)c(=O)[nH]c2=O)Oc3cccc(c3)C(=O)C>
FP<DPOqQcbsDcfZiSrqb5zJ8wjSTytIRPTIzKWRwcznsux99g0uuxLXYShz3xzt0SfzhDyqBrNHCRTruRSzutTzqU..1;512;330;512;330;1>
CL<159;4923;>
|
$SMI<O=N(=O)C=C1SCCN1Cc2ccccc2>
FP<0XEKB.Fd2k7Mk6SB448EeU6V0me2c02.AEXHtUE1UkyUU25HO..W.MI0bHF8YU1E1BJEF4GGed4UtQ8F20.55...1;512;165;512;165;1>
CL<161;436;>
|
...
For each structure, the output includes its cluster number in a CL<> datatype. The modal fingerprint for a cluster is written as an FPM<> datatype on the first entry seen for that cluster. The output can be further processed with listclusters(1) and showclusters(1) to summarize results, generate statistics, and so on.
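For a quick look at the result without those tools, the CL<> data can also be tallied with a few lines of Python. This is only a rough sketch and not a replacement for listclusters(1) or showclusters(1). It assumes that the first field of each CL<> datum is the cluster number; the meaning of the remaining fields is not interpreted. The function name `cluster_counts` is hypothetical.

```python
import re
from collections import Counter

# Matches "CL<" followed by its first field; "$CLG<" does not match
# because the character after "CL" there is "G", not "<".
CL_RE = re.compile(r"CL<([^;>]*);")


def cluster_counts(tdt_text: str) -> Counter:
    """Count how many records carry each cluster number, taking the
    first CL<> field as the cluster number (an assumption)."""
    counts: Counter = Counter()
    for match in CL_RE.finditer(tdt_text):
        counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    with open("med03.cl") as handle:  # output file from the run above
        sizes = cluster_counts(handle.read())
    for cluster, n in sizes.most_common(10):
        print(cluster, n)
```

Reading the whole file into memory keeps the sketch short; for very large outputs, iterating line by line would be the more careful choice.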