Traditionally, these libraries are either enumerated sequentially or by random sampling. In both cases, the software typically forms a mapping from a range of integers to a particular product or structure. As an example, we'll consider a three component library, such as Watson et al.'s Thiazoline-2-Imines library, mentioned in Val's EuroMug97 talk. This has 12 R1 isothiocyanates, 99 R2 amines and 59 R3 haloketones, producing a 70092 component library. Typically, we can map the integers 0..70091 to a unique component, which can be encoded or deconvoluted with the following rules (where R1=0..11, R2=0..98 and R3=0..58).
Encoding:
product = 5841*R1 + 59*R2 + R3Decoding:
R1 = floor(product/5841)
R2 = (product/59) mod 99
R3 = product mod 59
One disadvantage of the above approach is that 64801 components must be examined before all R1 substituents are first seen. Should the 11th R1 group be too flexible or two heavy, or be related by symmetry to a prior R1 then virtually the entire library needs to be examined before it can be rejected.
Such an approach suffers from a number of disadvantages. The first is that the random generator can easily sample the same component multiple times. Therfore the application requires a table of components that have already been encountered in order to avoid processing them twice. The next disadvantage is that the period of most generators is relativey short. The standard UNIX rand(3), for example, has a period of 32768 which implies that less than 47% of our example library could ever be sampled. Even if period were not a problem, it is almost impossible to predict when and whether a library has been fully sampled. Finally, one minor point is that the use of floating point representations used to investigate the more randomly distributed higher order bits both slows down the sampling and results in an uneven sampling probability distribution.#include <stdlib.h> #include <math.h> int RandomComponentNumber( void ) { return (int)rint(70091.0*rand()/RAND_MAX); }
This has the advantages that there is no need not test that a component has been encountered previously, every component is only visited once allowing progress to be reported, all substituents are visited relatively quickly and the entire library has been "enumerated" after the appropriate number of components have been sampled.
Note on of the major applications of this technique is the use of digital screen disolve effects in computer graphics, where each pixel of an image must fade from one frame to the next.
Xn+1 = (aXn+c) mod mwhere "m" is the modulus (m>0), "a" is the multiplier (0<=a<m) and "c" is the increment (0<=c<m). For simplicity, in encoding and decoding the result we are interested in such linear congruential generators where m is the number of components in our library, and the generator has period m (its maximum possible period). The starting value of the sequence X0 may be randomly choosen as any value between 0 and m-1 inclusive, effectively the generator's "seed". The task is to find the parameters "a" and "c" for a given "m" such that the period is "m" and the sequence is suitably random. For example, the values a=1 and c=1, results in the sequential enumeration given above, i.e. it has the correct period but is far from "random".
Luckily, the conditions for a maximum period linear congruential sequence are given by Knuth in Volume 2 "Seminumerical Algorithms" of his classic "The Art of Computer Programming" series. These conditions are presented as Theorem A of section 3.2.1.2 on page 16.
1). c is relatively prime to mFor our demonstration example, m=70092 has factors 2,3,11 and 59. Because 70092 is a multiple of 4, a-1 must also be a multiple of 4. There are 8 possible values for a: 7789,15577, 23365,31153,38941, 46729,54517 and 62305. Each of these multipliers are then evaluated in terms of their "potency" of the modulii 70092, 5841 and 59. These modulii are the effective periods (after decoding the a compound) number for R1, R2 and R3. The potency of a linear congruential generator is defined by Knuth as the smallest s, such that (a-1)s mod m = 0. Potency is shown by Knuth to be a good indicator of the pseudo-random nature of a generator. For the above multipliers, no single multiplier is significantly better than the rest, but multipliers 23365 and 46729 and shown to be slightly worse. Finally selection of increment "c" is made by finding the value closest to 1/2+sqrt(3)/6 or 1/2-sqrt(3)/6 that is relatively prime to m. Luckily, this 0.788675*70092=55279 is relatively prime, and hence we arbitrarily choose "a=38941" and "c=55279".
2). a-1 is a multiple of p, for every prime p dividing m
3). a-1 is a multiple of 4, if m is a multiple of 4
Given the linear congruential generator parameters a=38941, c=55729 and m=70092, we can now study some of the properties of the sequence. The first 10 compounds sampled from the library using the initial seed 0 are listed below:
Compound R1 R2 R3 00000 0 0 0 55279 9 45 55 09314 1 58 51 25653 4 38 47 57568 9 84 43 58331 9 97 39 51306 8 77 35 59857 10 24 31 37256 6 37 27 06867 1 17 23 ...Further study of this generator reveals that all 12 R1 substituents are encountered after 21 samples, all 99 R2 substituents are encountered after 218 samples and all 59 R3 substituents in the first 59 samples.