BOOST: Close-Up of Leaf Nodes

Using Descriptor Counts in Clustering

1. Even with a small (2,000-compound), fairly diverse dataset, the leaf nodes capture most high-similarity pairings:

80% of all pairings with > 90% similarity
94% of all pairings with > 95% similarity
96% of all pairings with > 98% similarity
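
As a rough illustration of how capture percentages like these could be tallied, the sketch below counts every compound pair above a similarity threshold and checks whether both members landed in the same leaf node. It assumes Tanimoto similarity on bit-string fingerprints (the slide does not name the measure), stores each fingerprint as a Python integer, and uses hypothetical names (`tanimoto`, `capture_rate`, `leaf_of`) and toy data that are not from the original work.

```python
from itertools import combinations


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bit sets.

    Assumption: the slide's "similarity" is Tanimoto; the source does not say.
    """
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def capture_rate(fingerprints, leaf_of, threshold):
    """Fraction of pairs with similarity > threshold that share a leaf node.

    fingerprints: list of integer bit sets, one per compound.
    leaf_of: list of leaf-node ids in the same order (hypothetical output
             of the clustering step).
    Returns None when no pair exceeds the threshold.
    """
    total = captured = 0
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) > threshold:
            total += 1
            captured += leaf_of[i] == leaf_of[j]
    return captured / total if total else None


# Toy data only: four 8-bit fingerprints assigned to two leaf nodes.
fps = [0b11111111, 0b11111110, 0b11111101, 0b00001111]
leaves = [0, 0, 1, 1]
rate = capture_rate(fps, leaves, 0.70)
```

The all-pairs loop is quadratic in the number of compounds, which is workable for a 2,000-compound set but would need a smarter neighbor search for much larger ones.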

BOOST Leaf Node for a 2,000-Compound Dataset
(1024-bit Daylight Fingerprints)

16 compounds in node.

3 compounds have 1 hi-sim neighbor (> 90% similarity) in the dataset, and that neighbor is also in the node.

1 compound has 2 hi-sim neighbors in the dataset, but only 1 is in the node.

The 12 unmatched compounds in the node have no hi-sim neighbors in the dataset.

[Figure: compounds in the node, shown as Group 1, Group 2, and 12 Unmatched Compounds]

2. In a larger dataset, condensation into small, high-similarity groups is very evident:

BOOST Leaf Node for a 10,000-Compound Dataset
(1024-bit Daylight Fingerprints)

[Figure: compounds in the node, shown as Group 1, Group 2, Group 3, and 1 Unmatched Compound]

15 compounds in node.

7 compounds have 6 hi-sim neighbors in the dataset, and all are in the node.

5 compounds have 4 hi-sim neighbors in the dataset, and all are in the node.

2 compounds have 2 hi-sim neighbors in the dataset, but only 1 is in the node.

The 1 unmatched compound has no hi-sim neighbors in the dataset.
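
The per-compound tallies in these node summaries (hi-sim neighbors in the whole dataset versus in the node) could be produced along the following lines. This is a minimal sketch under stated assumptions: Tanimoto similarity on integer-encoded fingerprints, a 90% default threshold taken from the slide's definition of hi-sim, and hypothetical names (`neighbor_counts`, `leaf_of`) with toy data rather than the datasets shown here.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bit sets.

    Assumption: the slide's "similarity" is Tanimoto; the source does not say.
    """
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def neighbor_counts(fingerprints, leaf_of, node, threshold=0.90):
    """For each compound in `node`, count its hi-sim neighbors.

    Returns a list of (compound_index, neighbors_in_dataset, neighbors_in_node).
    leaf_of is a hypothetical list of leaf-node ids, one per compound.
    """
    rows = []
    for i, fp in enumerate(fingerprints):
        if leaf_of[i] != node:
            continue
        in_dataset = in_node = 0
        for j, other in enumerate(fingerprints):
            if j != i and tanimoto(fp, other) > threshold:
                in_dataset += 1
                in_node += leaf_of[j] == node
        rows.append((i, in_dataset, in_node))
    return rows


# Toy data only: four 8-bit fingerprints assigned to two leaf nodes.
fps = [0b11111111, 0b11111110, 0b11111101, 0b00001111]
leaves = [0, 0, 1, 1]
node0 = neighbor_counts(fps, leaves, node=0, threshold=0.70)
node1 = neighbor_counts(fps, leaves, node=1, threshold=0.70)
```

A compound whose two counts are equal has all of its hi-sim neighbors in the node; a compound with zero neighbors in the dataset corresponds to an unmatched compound.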

3. With very large datasets, a high-similarity grouping can be larger than a single leaf node, so we do several iterations.

Robin Hewitt (rhewitt@acm.org), Feb 2003