Selecting Seeds

The choice and number of the initial seeds are key to using this method of clustering. In the current implementation the number of seeds is the same as or greater than the number of final clusters.

kmodes -d n

implements a rule to delete clusters if their membership falls below a certain value. This is useful to protect against inadvertently choosing an outlier as a seed.
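
The deletion rule can be illustrated with a minimal Python sketch. This is not the kmodes source: the function name, the data layout (a mapping from item to cluster id) and the reading of "falls below" as strictly fewer than min_size members are all assumptions made for illustration. What kmodes does with the members of a deleted cluster is not stated here, so the sketch simply returns them.

```python
from collections import Counter


def prune_small_clusters(assignments, min_size):
    """Drop clusters with fewer than min_size members (hypothetical helper).

    assignments maps each item to a cluster id. Returns the set of cluster
    ids that survive and a sorted list of the items whose cluster was deleted.
    """
    sizes = Counter(assignments.values())
    kept = {cid for cid, size in sizes.items() if size >= min_size}
    orphans = sorted(item for item, cid in assignments.items() if cid not in kept)
    return kept, orphans
```

A seed that happened to be an outlier typically attracts very few members, so its cluster is the kind this rule removes.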

We hope to implement a rule to increase the number of clusters, although this would bring a concomitant increase in execution time.

In an ideal world one would choose the number and nature of the seeds to reflect the density of points in the similarity space. For most real-world situations this is not a practical option.

Where the collection is well known, choosing a prototype or modal from each class of compound will be a good start. Only the FP data, not the SMILES, is read from the <seeds_file>, so there is no reason that these seeds should not themselves be modals.

If you are clustering to understand the collection content, then a random set of seeds is a good choice. The best way of finding a random set of structures from the database, and hence a random set of seeds, is to use the -DEAL option of the thorlist program.

thorlist -DEAL n/N my_database

If you do not have our thor programs, then use

kmodes -random

This guarantees only that the selection is pseudo-random with respect to the input order.
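
As a rough illustration of a selection that is pseudo-random on the input order, the Python sketch below draws seed positions from the list of input records. It is not how kmodes implements its random option; the function name and the rng_seed parameter are hypothetical, and the records are assumed to be held in memory as a list.

```python
import random


def pick_random_seeds(records, n_seeds, rng_seed=None):
    """Choose n_seeds records by pseudo-random position (hypothetical helper).

    Positions are sampled without replacement, then sorted so that the
    chosen seeds keep their relative input order.
    """
    if n_seeds > len(records):
        raise ValueError("cannot pick more seeds than there are records")
    rng = random.Random(rng_seed)  # a fixed rng_seed makes the choice repeatable
    positions = sorted(rng.sample(range(len(records)), n_seeds))
    return [records[i] for i in positions]
```

Because only positions are sampled, the choice knows nothing about the structures themselves, so it cannot by itself exclude an outlier from the seed set.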

Daylight Chemical Information Systems, Inc.
support@daylight.com

John Bradshaw.