
Jack Delany, John Bradshaw
DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA
Tie handling in Jarvis-Patrick
Jarvis-Patrick clustering is based on the premise that items with shared neighbors must also be close to one another. The method is described in the original reference (Jarvis).
It's a non-parametric method; the clustering operates on the lists of shared neighbors. Provided that one can generate lists of near neighbors for a given set of items, they can be clustered. Near-neighbor generation is O(n^2) and is typically the slow step. In our world, we run Tanimoto similarity searches to generate the near-neighbor lists. The method is fast for large datasets and has been applied in numerous ways to clustering and compound selection. It does, however, have some peculiarities in its behavior.
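A minimal sketch of the idea in Python, assuming fingerprints are held as sets of on-bit indices. The function names, the parameters `list_len` (length of each near-neighbor list) and `min_shared` (shared neighbors required), and the rule that two items must appear in each other's lists follow the commonly described form of Jarvis-Patrick; they are assumptions for illustration, not the Daylight implementation.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def near_neighbors(fps, list_len):
    """For each item, the set of its list_len most similar items, itself first.

    This is the O(n^2) step: every item is compared with every other item.
    Ties in similarity are broken by index order, which is an arbitrary choice.
    """
    lists = []
    for i, fp in enumerate(fps):
        ranked = sorted(
            range(len(fps)),
            key=lambda m: (m != i, -tanimoto(fp, fps[m]), m),
        )
        lists.append(set(ranked[:list_len]))
    return lists


def jarvis_patrick(fps, list_len, min_shared):
    """Merge two items when each is in the other's list and the two lists
    share at least min_shared members. Returns one cluster label per item."""
    lists = near_neighbors(fps, list_len)
    parent = list(range(len(fps)))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(range(len(fps)), 2):
        mutual = a in lists[b] and b in lists[a]
        if mutual and len(lists[a] & lists[b]) >= min_shared:
            parent[find(a)] = find(b)
    return [find(i) for i in range(len(fps))]
```

Note that the clustering loop never looks at the similarity values themselves, only at list membership, which is why the way each list is cut off matters so much.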
Fingerprints are an information-reduced representation of structural information. It is not rare for different structures to produce the same fingerprint, for example because of the path-length limit or repeated structural units. Furthermore, fingerprint generation parameters (folding, size) cause additional collisions between fingerprints. On top of this comes the occurrence of ties in proximity between items within the dataset.
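Folding can be pictured as OR-ing the two halves of a bit string together. The sketch below illustrates that general technique with fingerprints held as Python integers; `fold` is a hypothetical helper, not a Daylight toolkit call, and the 8-bit width is chosen only to keep the example readable.

```python
def fold(fp, nbits):
    """Fold an nbits-wide fingerprint (held as an int) to half its width
    by OR-ing the upper half onto the lower half."""
    half = nbits // 2
    mask = (1 << half) - 1
    return (fp >> half) | (fp & mask)


# Two distinct 8-bit fingerprints.
fp_a = 0b0000_0110   # bits 1 and 2 set
fp_b = 0b0010_0100   # bits 2 and 5 set; bit 5 lands on bit 1 after folding

folded_a = fold(fp_a, 8)
folded_b = fold(fp_b, 8)
print(fp_a == fp_b, folded_a == folded_b)   # False True
```

Once two fingerprints collide, their Tanimoto similarity is 1.0 and they tie with each other in every near-neighbor list.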
| Database | Unique SMILES (#1) | Unique FPS (2048 bits) (#2) | Unique FPS (folded) (#3) | 
|---|---|---|---|
| WDI034 | 63009 | 58786 | 57870 | 
| ACD033 | 167768 | 153559 | 149949 | 
| SPRESI00 | 3082319 | 2869375 | 2795899 | 
The counts in columns #1 to #3 were obtained with the following commands:
1. thorlist dbname | grep -c "^FP<"
2. thorlist dbname | fingerprint -z -b 2048 -c 2048 | \
     grep "^FP<" | sort -u | wc
3. thorlist dbname | grep "^FP<" | cut -d ';' -f1 | \
     sort -u | wc
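The counts in the table can be turned into collision rates directly. A short sketch using only the numbers above; the dictionary layout and the `collision_rate` helper are illustrative names, not part of any toolkit.

```python
# (unique SMILES, unique 2048-bit fingerprints, unique folded fingerprints)
counts = {
    "WDI034": (63009, 58786, 57870),
    "ACD033": (167768, 153559, 149949),
    "SPRESI00": (3082319, 2869375, 2795899),
}


def collision_rate(n_structures, n_unique_fps):
    """Fraction of structures whose fingerprint duplicates one already seen."""
    return 1.0 - n_unique_fps / n_structures


rates = {
    db: (collision_rate(smiles, fps_2048), collision_rate(smiles, fps_folded))
    for db, (smiles, fps_2048, fps_folded) in counts.items()
}

for db, (r_2048, r_folded) in rates.items():
    print(f"{db}: {r_2048:.1%} at 2048 bits, {r_folded:.1%} folded")
```

In every database, roughly 7 to 11 percent of the structures do not contribute a distinct fingerprint, and folding always makes the rate worse.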
     Item: 0   2    44   6    8    12   | 54   | 56   78   125 
     Sim:  1.0 0.95 0.95 0.85 0.82 0.8  | 0.8  | 0.8  0.8  0.8
     Item: 6   77   56   0    44   78   | 12   | 8    133  200
     Sim:  1.0 0.88 0.85 0.85 0.75 0.75 | 0.70 | 0.65 0.65 0.60
                                       4/6    5/7
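One way to see the problem is to cut item 0's list at six neighbors. A minimal sketch, assuming that counting all ties means extending the list to every item whose similarity equals that of the last neighbor kept. The `truncate` function is hypothetical, and although its `count_all_ties` flag echoes the option name in the results table, the behavior shown here is an assumption rather than the documented implementation.

```python
def truncate(neighbors, list_len, count_all_ties=False):
    """Cut a similarity-sorted list of (item, similarity) pairs to list_len entries.

    With count_all_ties=True, also keep every following entry whose similarity
    equals that of the last entry kept.
    """
    kept = neighbors[:list_len]
    if count_all_ties and kept:
        cutoff = kept[-1][1]
        for item, sim in neighbors[list_len:]:
            if sim != cutoff:
                break
            kept.append((item, sim))
    return [item for item, _ in kept]


# Item 0's near-neighbor list from the example above.
item0 = [(0, 1.0), (2, 0.95), (44, 0.95), (6, 0.85), (8, 0.82),
         (12, 0.8), (54, 0.8), (56, 0.8), (78, 0.8), (125, 0.8)]

print(truncate(item0, 6))                       # [0, 2, 44, 6, 8, 12]
print(truncate(item0, 6, count_all_ties=True))  # all ten items
```

Without tie handling, item 12 makes the list while the equally similar items 54, 56, 78 and 125 do not, so the result depends on nothing more than sort order.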
| J/K, version options | AVG Cluster Size | # Clusters | # Singletons | Ten largest cluster sizes |
|---|---|---|---|---|
| 10/16, v4.8 | 8.6 | 15884 | 31181 | 132, 111, 104, 102, 102, 99, 99, 93, 90, 89 |
| 10/15, v4.8 | 7.1 | 18376 | 37088 | 89, 76, 73, 72, 67, 66, 65, 64, 63, 63 |
| 10/16, v4.9 | 9.2 | 15025 | 29138 | 211, 164, 150, 145, 133, 128, 128, 126, 116, 111 |
| 10/15, v4.9 | 7.7 | 17211 | 34373 | 111, 99, 92, 83, 83, 81, 81, 80, 80, 77 |
| 10/16, v4.9 -count_all_ties | 9.6 | 14529 | 28227 | 276, 261, 168, 159, 158, 157, 156, 150, 144, 142 |
| 10/15, v4.9 -count_all_ties | 8.2 | 16517 | 33065 | 215, 144, 136, 120, 111, 103, 102, 101, 99, 96 |