Pseudo-random Sampling for Combinatorial Libraries

Roger Sayle
Bioinformatics Group,
Metaphorics LLC,
Santa Fe, New Mexico.

1. Abstract

Combinatorial libraries, either specified by reactions or templates, and Markush structures are frequently used to represent thousands or even millions of molecular strcutures. These structures are often enumerated, or if there are too many randomly sampled, to determine ensemble features of the library. This talk describes the use of pseudo-random number generators as an efficient alternative to enumeration and/or sampling.

2. Combinatorial Library Enumeration

The enumeration (or random sampling) of combinatorial libraries is a frequent task in combinatorial library design. Enumeration is often required to determine the ensemble properties of a virtual library, such as redundancy or "drug-likeness", distributions (or profiles) of rotatable bonds, molecular weight, ClogP or predicted binding energy. Library design software often incorperates filters that reject a library should it not fit suitability criteria, e.g. less than 2% redundancy and no more than 5% of compounds with molecular weight above 500. Other potential ensemble properties include modal fingerprints for such libraries or Markush patents. Val Gillet in her EuroMug97 talk entitled "Selecting Combinatorial Libraries to Optimize Diversity and Physical Properties" showed that such ensemble properties should only be determined from analysis of the products and not just the reactants.

Traditionally, these libraries are either enumerated sequentially or by random sampling. In both cases, the software typically forms a mapping from a range of integers to a particular product or structure. As an example, we'll consider a three component library, such as Watson et al.'s Thiazoline-2-Imines library, mentioned in Val's EuroMug97 talk. This has 12 R1 isothiocyanates, 99 R2 amines and 59 R3 haloketones, producing a 70092 component library. Typically, we can map the integers 0..70091 to a unique component, which can be encoded or deconvoluted with the following rules (where R1=0..11, R2=0..98 and R3=0..58).

Encoding:
product = 5841*R1 + 59*R2 + R3
Decoding:
R1 = floor(product/5841)
R2 = (product/59) mod 99
R3 = product mod 59

2.1 Sequential Enumeration

During sequential enumeration each component is considered sequentially ordered by its encoded value. First component zero, then one, then two and so on. This effectively holds R1 and R2 constant while varying R3. After 59 components, the next R2 is considered, and so on until all 70092 components are examined. Sequential numeration of specific libraries is often implemented as nested loops over each reactant or attachment point.

One disadvantage of the above approach is that 64801 components must be examined before all R1 substituents are first seen. Should the 11th R1 group be too flexible or two heavy, or be related by symmetry to a prior R1 then virtually the entire library needs to be examined before it can be rejected.

2.2 Random Sampling

During random sampling, a random number is generated between 0 and one less than the number of components in the library (70091 in our example). This value indicates the component to consider next. These values are typically using a suitable random number generator such as the "rand" function provided by the standard C library. An example implementation is outlined below.


#include <stdlib.h>
#include <math.h>

int RandomComponentNumber( void ) {
    return (int)rint(70091.0*rand()/RAND_MAX);
}

Such an approach suffers from a number of disadvantages. The first is that the random generator can easily sample the same component multiple times. Therfore the application requires a table of components that have already been encountered in order to avoid processing them twice. The next disadvantage is that the period of most generators is relativey short. The standard UNIX rand(3), for example, has a period of 32768 which implies that less than 47% of our example library could ever be sampled. Even if period were not a problem, it is almost impossible to predict when and whether a library has been fully sampled. Finally, one minor point is that the use of floating point representations used to investigate the more randomly distributed higher order bits both slows down the sampling and results in an uneven sampling probability distribution.

2.3 Proposed Pseudo-Random Sampling

The proposed solution to the above problems is the use of customized pseudo-random number generators. Rather than suffer the problems of a generic random number generator, construct a random number generator specifically for the number at hand. In this case, this requires discovering a pseudo-random number generator exactly the same period as the size of the library. This effectively produces an algorithmic permutation of the components of a library, and merely processes them in an order that appears random to an observer.

This has the advantages that there is no need not test that a component has been encountered previously, every component is only visited once allowing progress to be reported, all substituents are visited relatively quickly and the entire library has been "enumerated" after the appropriate number of components have been sampled.

Note on of the major applications of this technique is the use of digital screen disolve effects in computer graphics, where each pixel of an image must fade from one frame to the next.

3. Linear Congruential Random Number Generators

A linear congruential random number generator is a method for algorithmically generating a pseudo-random sequence of numbers suitable for the application described above. Each number in the sequence is generated by the following recurence equation:

X_n+1 = (aX_n+c) mod m

where "m" is the modulus (m>0), "a" is the multiplier (0<=a<m) and "c" is the increment (0<=c<m). For simplicity, in encoding and decoding the result we are interested in such linear congruential generators where m is the number of components in our library, and the generator has period m (its maximum possible period). The starting value of the sequence X₀ may be randomly choosen as any value between 0 and m-1 inclusive, effectively the generator's "seed". The task is to find the parameters "a" and "c" for a given "m" such that the period is "m" and the sequence is suitably random. For example, the values a=1 and c=1, results in the sequential enumeration given above, i.e. it has the correct period but is far from "random".

Luckily, the conditions for a maximum period linear congruential sequence are given by Knuth in Volume 2 "Seminumerical Algorithms" of his classic "The Art of Computer Programming" series. These conditions are presented as Theorem A of section 3.2.1.2 on page 16.

1). c is relatively prime to m
2). a-1 is a multiple of p, for every prime p dividing m
3). a-1 is a multiple of 4, if m is a multiple of 4

For our demonstration example, m=70092 has factors 2,3,11 and 59. Because 70092 is a multiple of 4, a-1 must also be a multiple of 4. There are 8 possible values for a: 7789,15577, 23365,31153,38941, 46729,54517 and 62305. Each of these multipliers are then evaluated in terms of their "potency" of the modulii 70092, 5841 and 59. These modulii are the effective periods (after decoding the a compound) number for R1, R2 and R3. The potency of a linear congruential generator is defined by Knuth as the smallest s, such that (a-1)^s mod m = 0. Potency is shown by Knuth to be a good indicator of the pseudo-random nature of a generator. For the above multipliers, no single multiplier is significantly better than the rest, but multipliers 23365 and 46729 and shown to be slightly worse. Finally selection of increment "c" is made by finding the value closest to 1/2+sqrt(3)/6 or 1/2-sqrt(3)/6 that is relatively prime to m. Luckily, this 0.788675*70092=55279 is relatively prime, and hence we arbitrarily choose "a=38941" and "c=55279".

Given the linear congruential generator parameters a=38941, c=55729 and m=70092, we can now study some of the properties of the sequence. The first 10 compounds sampled from the library using the initial seed 0 are listed below:

Compound   R1  R2  R3
00000       0   0   0
55279       9  45  55
09314       1  58  51
25653       4  38  47
57568       9  84  43
58331       9  97  39
51306       8  77  35
59857      10  24  31
37256       6  37  27
06867       1  17  23
...

Further study of this generator reveals that all 12 R1 substituents are encountered after 21 samples, all 99 R2 substituents are encountered after 218 samples and all 59 R3 substituents in the first 59 samples.

4. Bibliography

D.E. Knuth, "The Linear Congruential Method", In "The Art of Computer Programming: Seminumerical Algorithms", Addison Welsey Publishers, 2^nd Edition, Vol. 2, Section 3.2.1, pp. 9-25, 1969.

Example C source code

roger@metaphorics.com