Theory: by holding the most variable bits constant,
- We should get a fairly well balanced set of child nodes.
- If a node has too many compounds, we build the tree another level at that node.
- In each leaf node, we've raised the probability that an arbitrary pair
of compounds will be highly similar, because all compounds in the leaf already
agree on the bits that would otherwise be most likely to differ.
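
The splitting idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: fingerprints are represented as Python integers, "most variable" is taken to mean bits whose set-frequency is closest to 50%, and the names `most_variable_bits`, `build_leaves`, `k`, and `max_leaf` are hypothetical.

```python
from collections import defaultdict

def most_variable_bits(fps, n_bits, k):
    """Return up to k bit positions whose set-frequency is closest to 50%.

    Bits that are constant across all fingerprints are skipped,
    since splitting on them would not separate anything.
    """
    n = len(fps)
    scored = []
    for b in range(n_bits):
        ones = sum((fp >> b) & 1 for fp in fps)
        if 0 < ones < n:
            scored.append((abs(ones / n - 0.5), b))
    scored.sort()
    return [b for _, b in scored[:k]]

def build_leaves(fps, n_bits, k=2, max_leaf=100):
    """Split compounds on the k most variable bits; recurse on oversized nodes.

    Returns a list of leaves, each a list of fingerprints (ints).
    """
    if len(fps) <= max_leaf:
        return [fps]
    bits = most_variable_bits(fps, n_bits, k)
    if not bits:
        # Every fingerprint is identical, so no further split is possible.
        return [fps]
    children = defaultdict(list)
    for fp in fps:
        # Compounds sharing the same values at the chosen bits go to one child.
        key = tuple((fp >> b) & 1 for b in bits)
        children[key].append(fp)
    leaves = []
    for child in children.values():
        leaves.extend(build_leaves(child, n_bits, k, max_leaf))
    return leaves
```

Within a child node the bits used for the split are constant, so the next level automatically picks the most variable of the remaining bits.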
If we make our leaf nodes fairly small, say 10-100 compounds each, we can quickly do all-by-all
comparisons within each one and pull out the groups of very similar compounds we find there.
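
A minimal sketch of the all-by-all comparison inside one leaf might look like this. The text does not name a similarity measure, so Tanimoto similarity on integer-encoded fingerprints is an assumption here, as are the names `tanimoto`, `similar_pairs`, and the `threshold` value.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as ints (assumed measure)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(a & b).count("1") / union

def similar_pairs(leaf, threshold=0.8):
    """Compare every pair in a leaf; return (i, j, similarity) for pairs at or above threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(leaf), 2):
        s = tanimoto(a, b)
        if s >= threshold:
            pairs.append((i, j, s))
    return pairs
```

With leaves of 10-100 compounds, each leaf needs at most 4,950 pairwise comparisons, which is why the quadratic cost stays manageable.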