Using Descriptor Counts in Clustering

BOOST: Close-Up of Leaf Nodes

1. Even with a small (2,000-compound), fairly diverse dataset, the leaf nodes capture most high-similarity pairings:

  • 80% of all pairings with > 90% similarity
  • 94% of all pairings with > 95% similarity
  • 96% of all pairings with > 98% similarity
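
A capture figure like the ones above could be computed roughly as follows. This is a minimal sketch, not the author's code: it assumes Tanimoto similarity (the slides do not name the measure), stores each fingerprint as a Python int, and uses a hypothetical `leaf_of` list giving each compound's leaf node. The toy fingerprints are invented for illustration.

```python
from itertools import combinations

def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints held as Python ints (assumed measure)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union

def capture_rate(fingerprints, leaf_of, threshold):
    """Fraction of pairs with similarity > threshold whose two members share a leaf node."""
    total = captured = 0
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) > threshold:
            total += 1
            if leaf_of[i] == leaf_of[j]:
                captured += 1
    # None signals that no pair exceeded the threshold
    return captured / total if total else None

# Toy 8-bit fingerprints and a hypothetical leaf assignment (illustration only)
fps = [0b11110000, 0b11110001, 0b00001111, 0b00011111, 0b11100000]
leaf_of = [0, 0, 1, 1, 1]
rate = capture_rate(fps, leaf_of, threshold=0.7)
```

The all-pairs loop is quadratic, which is fine for a 2,000-compound check but would need a smarter approach for much larger sets.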


BOOST Leaf Node for a 2,000-Compound Dataset
(1024-bit Daylight Fingerprints)

[Figure: node contents shown as Group 1, Group 2, and 12 Unmatched Compounds]

  • 16 compounds in node.
  • 3 compounds each have 1 hi-sim neighbor (> 90% similarity) in the dataset, and that neighbor is also in the node.
  • 1 compound has 2 hi-sim neighbors in the dataset, but only 1 in the node.
  • The 12 unmatched compounds in the node have no hi-sim neighbors in the dataset.
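
The per-compound tallies for a node (hi-sim neighbors in the whole dataset versus in the node) can be sketched as below. This is an illustration under stated assumptions: Tanimoto similarity, fingerprints as Python ints, and a hypothetical `neighbor_counts` helper. The toy data is invented and does not come from the datasets described here.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints held as Python ints (assumed measure)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union

def neighbor_counts(fingerprints, node_members, threshold=0.9):
    """Map each compound in the node to (hi-sim neighbors in dataset, hi-sim neighbors in node)."""
    node = set(node_members)
    counts = {}
    for i in node_members:
        in_dataset = in_node = 0
        for j, fp in enumerate(fingerprints):
            if j != i and tanimoto(fingerprints[i], fp) > threshold:
                in_dataset += 1
                if j in node:
                    in_node += 1
        counts[i] = (in_dataset, in_node)
    return counts

# Toy 8-bit fingerprints; the node holds compounds 0 and 1 (illustration only)
fps = [0b11110000, 0b11110001, 0b00001111, 0b00011111, 0b11100000]
counts = neighbor_counts(fps, node_members=[0, 1], threshold=0.7)
```

In this toy case compound 0 has two neighbors in the dataset but only one in the node, the same kind of miss reported for one compound in the list above.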




2. In a larger dataset, condensation into small, high-similarity groups is very evident:

BOOST Leaf Node for a 10,000-Compound Dataset
(1024-bit Daylight Fingerprints)

[Figure: node contents shown as Group 1, Group 2, Group 3, and 1 Unmatched Compound]

  • 15 compounds in node.
  • 7 compounds each have 6 hi-sim neighbors in the dataset, and all are in the node.
  • 5 compounds each have 4 hi-sim neighbors in the dataset, and all are in the node.
  • 2 compounds each have 2 hi-sim neighbors in the dataset, but only 1 in the node.
  • The 1 unmatched compound has no hi-sim neighbors in the dataset.
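
One way to recover groups like these from a node's contents is to link every hi-sim pair and take the connected components. The sketch below does that with a small union-find. Treating a "group" as a connected component is an assumption, as are the Tanimoto measure and the hypothetical `similarity_groups` helper; the toy fingerprints are invented.

```python
from itertools import combinations

def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints held as Python ints (assumed measure)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union

def similarity_groups(fingerprints, node_members, threshold=0.9):
    """Split a node into hi-sim groups (connected components) and unmatched compounds."""
    members = list(node_members)
    parent = {i: i for i in members}

    def find(i):
        # Follow parent links to the root, halving the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(members, 2):
        if tanimoto(fingerprints[a], fingerprints[b]) > threshold:
            parent[find(a)] = find(b)

    components = {}
    for i in members:
        components.setdefault(find(i), []).append(i)
    groups = sorted(g for g in components.values() if len(g) > 1)
    unmatched = sorted(g[0] for g in components.values() if len(g) == 1)
    return groups, unmatched

# Toy fingerprints; compound 5 shares no bits with the others (illustration only)
fps = [0b11110000, 0b11110001, 0b00001111, 0b00011111, 0b11100000,
       0b101010100000000]
groups, unmatched = similarity_groups(fps, range(len(fps)), threshold=0.7)
```

Connected components will merge chains of compounds that are not all mutually similar; a stricter definition (for example, requiring every pair in a group to exceed the threshold) would give smaller groups.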




3. With very large datasets, high-similarity groupings can be larger than the leaf nodes, so we run several iterations.


Robin Hewitt (rhewitt@acm.org), Feb 2003