"Markush structures" are the kinds of structure diagrams one finds
in patents with R's and X's, which may represent a few molecules or
a large infinity of molecules. The Markush concept is defined in
terms of legal precedent rather than in mathematical or chemical
terms, which is a bother. For instance, there is no requirement
that they make chemical sense.
Most often, Markush structures represent variable graphs with
node labels which are nearly arbitrary, e.g., a substituent might
be labeled "-CH3", "heterocycle", "blocking group" or
"non-aromatic substituent of any size".
The most common and important Markush problems involve establishing
relationships between variable-graph sets, e.g.:
identity -- is one identical to another?
membership -- is a given structure included in a given set?
conjunction -- do these sets overlap?
similarity -- which sets are most similar to a given set?
All four problems are known to be intractable in the general
case (even identity).
Simplifying the problem helps (e.g., disallowing general boolean
logic as in MD) but even so, a rigorous solution to the simplified
problem has not been discovered. Much work has been done on
developing algorithms which work in restricted domains.
The currently-existing set of Markush structures is extremely complex,
and these problems remain intractable in practice.
Example
Changing the problem
The "ground rules" for Markush searching (and for chemical structure
searching in general) have been: a database exists which contains a
potentially large number of structures; a single, rigorously defined
query is posed; the search function's job is to find the exact
set of structures which satisfy the query. This is hard!
Searches following other ground rules might be just as useful and
much less difficult to implement. For instance, one might consider
relaxing the "exact set of structures" requirement by allowing a small
percentage of near-answers to be reported (false drops).
In recent years, a tremendous amount of work has been expended in
development of large-scale search engines, primarily for searching the WWW.
The NI2 method, developed at DEC and deployed as
Alta Vista, was one of the earliest such methods and is still one
of the most successful. NI2 uses a compact in-memory hashing scheme
which does not resolve hash collisions. This means that some small
percentage of results reported will inevitably be false drops.
For Alta Vista as of about a year ago, the false-drop rate was about 3%.
(Example: ~88,884 hits for +patent +searching.)
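The essential trade-off can be sketched in a few lines. The toy index below (all names invented for illustration, not NI2's actual design) gives each document a fixed-size bit signature and never resolves hash collisions, so a query can match a document that doesn't actually contain the query words -- a false drop. The key property is asymmetric: true hits are never missed; only extra hits leak in.

```python
# Toy sketch of a hash index that does not resolve collisions.
# NBITS is deliberately tiny so collisions are plausible; a real
# system would use a much larger, carefully tuned signature.
NBITS = 64

def signature(words):
    """Build a bit signature for a document: one hashed bit per word."""
    sig = 0
    for w in words:
        sig |= 1 << (hash(w) % NBITS)
    return sig

def might_contain(sig, query_words):
    """True if every query word's bit is set.
    Guaranteed true for real matches; occasionally true for
    non-matches (false drops) when hashes collide."""
    return all(sig & (1 << (hash(q) % NBITS)) for q in query_words)

docs = {
    "d1": ["patent", "searching", "markush"],
    "d2": ["red", "patent", "leather", "shoes"],
    "d3": ["chemical", "structure"],
}
sigs = {name: signature(words) for name, words in docs.items()}
hits = [name for name, sig in sigs.items()
        if might_contain(sig, ["patent", "searching"])]
```

Because "d1" really contains both query words, it is always reported; "d2" and "d3" are reported only if a collision fakes the missing word's bit.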
The astonishing thing is that this type of imprecision is not
perceived as a disadvantage by users ... most don't even notice it.
The reason is that most "true" hits are effectively false drops
anyway, i.e., they aren't what the user was looking for. Of the
~88K web pages containing "patent" and "searching", almost 2%
are about shoes ... only a few of those concern footwear patents ...
most are simply about footwear, e.g., "searching for red patent leather shoes".
In such an environment, 3% additional "real" false-drops are acceptable.
Most patent searching situations are very similar to those in
which Alta Vista is used. Queries are rarely framed precisely.
Many (most?) false drops are in fact real hits, but for the wrong
reason. "Real" false drops are not considered a disadvantage, as long
as they are "close" (in fact, a certain amount of such imprecision
appears to be viewed as an advantage.)
Is it possible to create an NI2-like Markush structure searching
system which solves practical search questions? If so, how much
imprecision would be acceptable?
Alta-Vista-like systems are not trivial to implement on a large
scale, but they are fast, robust and scale up very well.
Such methodology is very attractive when compared with extant systems
which try (and fail) to solve intractable problems with complex
methods and brute force.
Do the structure tools already exist?
Imagine creating a very large, but otherwise normal Daylight
fingerprint for each molecule described in a Markush set.
Then imagine AND-ing and OR-ing all the molecules to form fingerprints
which describe the set. AND fingerprints describe features which are
common to all structures in a set (these would include the parent structure, if any).
OR fingerprints describe the range of the structures in a set.
Now imagine doing so for all Markush descriptions in the patent literature.
Comparing two such OR fingerprints would provide a measure of the
similarity of two Markush sets.
Comparing two such AND fingerprints would provide a measure of the
similarity of Markush parent structures.
Comparing the bits set "on" in one Markush's fingerprint with the
same bits in another is a measure of how much of one set appears
in the other.
Comparing the AND of two OR fingerprints provides a measure of
conjunction (overlap) of the two sets.
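The AND/OR construction above can be sketched concretely. In the toy version below, fingerprints are Python integers used as bit vectors, and each "molecule" is just a set of feature strings hashed to bits -- real Daylight fingerprints hash molecular paths, so everything here is illustrative, not toolkit API.

```python
# Toy AND/OR fingerprints for a Markush set, using ints as bit vectors.
NBITS = 1024

def fp(features):
    """Hash a molecule's feature strings into a bit fingerprint."""
    bits = 0
    for f in features:
        bits |= 1 << (hash(f) % NBITS)
    return bits

def set_fingerprints(molecules):
    """AND and OR fingerprints over all molecules in a set."""
    fps = [fp(m) for m in molecules]
    and_fp = or_fp = fps[0]
    for f in fps[1:]:
        and_fp &= f   # features common to every member (parent-like)
        or_fp |= f    # the full range of features in the set
    return and_fp, or_fp

def popcount(x):
    return bin(x).count("1")

# Two hypothetical Markush sets sharing a phenyl parent.
set1 = [{"phenyl", "Cl"}, {"phenyl", "Br"}, {"phenyl", "I"}]
set2 = [{"phenyl", "Cl"}, {"phenyl", "F"}]
and1, or1 = set_fingerprints(set1)
and2, or2 = set_fingerprints(set2)

# Conjunction measure: AND of the two OR fingerprints.
conj = or1 & or2
```

Comparing and1 with and2 measures parent similarity; popcount(conj) measures set overlap, exactly as described above.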
Notice that this looks a lot like our friendly Tversky search.
In fact, when using large, fixed-size, normal Daylight fingerprints,
it is exactly the Tversky search which is implemented in Merlin.
(Thanks, John!)
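For reference, the Tversky index over bit fingerprints is a one-liner. With alpha = beta = 1 it reduces to the Tanimoto coefficient; with alpha = 1, beta = 0 it measures how much of one fingerprint is contained in the other, which matches the subset-style comparison above. (This is a generic sketch, not Merlin's implementation.)

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky similarity of two bit fingerprints (Python ints)."""
    c = bin(a & b).count("1")   # bits in common
    na = bin(a).count("1")      # bits set in a
    nb = bin(b).count("1")      # bits set in b
    return c / (alpha * (na - c) + beta * (nb - c) + c)
```

For example, tversky(a, b, 1, 0) equals 1.0 exactly when every bit of a also appears in b.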
So the answer is a conditional "yes".
Given Merlin's Tversky search and the SMILES, CHORTLES, and
fingerprint toolkits, it appears that we have essentially all the
algorithms and infrastructure needed to try this out.
Problems
The thought-experiment above has a serious flaw:
Many Markush structures represent an infinite (or near infinite) set
of structures. We'd never get past the first step, which requires
enumerating all of them. This is a bit of a problem.
One way around this quandary is to sample the set of
molecules instead of enumerating every one in the set.
This is useful only if we have an algorithm which can efficiently
sample a Markush's theoretical universe without doing the enumeration
first. It must also do so with a known sampling distribution.
Fortunately, almost all Markush data are theoretically well-suited
for such treatment (basically, constituent combinatorics).
Although extant data format(s) are not well-suited for this task,
conversion is certainly possible.
Example gifs,
movie.
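When a Markush is (in the simplest case) a scaffold with independent substituent choices, this kind of sampling is straightforward: choosing each substituent uniformly at random gives a uniform sample over the full product set, and the universe size is computable without enumeration. The sketch below assumes that idealized combinatorial form; the R-group lists and SMILES-like strings are purely illustrative.

```python
import random
from math import prod

# Hypothetical Markush: a scaffold with three independent R-groups.
r_groups = {
    "R1": ["C", "CC", "CCC"],        # small alkyls
    "R2": ["F", "Cl", "Br", "I"],    # halogens
    "R3": ["O", "N", "S"],
}

# Universe size is the product of the choice counts: 3 * 4 * 3 = 36.
universe_size = prod(len(subs) for subs in r_groups.values())

def sample_member(rng=random):
    """Draw one member uniformly at random, without enumeration.
    Independent uniform choices over each R-group give a known
    (uniform) sampling distribution over the product set."""
    return {name: rng.choice(subs) for name, subs in r_groups.items()}

samples = [sample_member() for _ in range(10)]
```

Real Markush logic (dependent substituents, boolean conditions) would need a more careful sampler, but the principle is the same: sample the combinatorics, never enumerate them.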
A second flaw in the enumeration/sampling approach is that Markush
structure components are not "simple" variable molecular graphs, but
usually include generic expressions such as "halogen", "alkyl",
"heterocyclic", etc. (Chuck and his friends). This poses no new problem
in the case of fixed sets (e.g., halogens), we just sample those in
the normal manner. But for infinite sets, we must define special
artificial sets to represent generics (e.g., "alkyl" represented by
20 specific alkyl substituents).
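The two cases collapse into one mechanism if every generic label maps to a concrete sampling set: finite generics map to their actual members, infinite ones to a fixed artificial set of representatives. A minimal sketch (the label names and representative lists are illustrative choices, not an established standard):

```python
# Map generic labels to concrete sampling sets. Finite generics
# (halogen) get their real members; infinite generics (alkyl) get
# a fixed artificial subset of representatives (a few shown here,
# standing in for e.g. 20 in a real system).
GENERICS = {
    "halogen": ["F", "Cl", "Br", "I"],
    "alkyl":   ["C", "CC", "CCC", "C(C)C", "CCCC"],
}

def expand(substituent):
    """Return the concrete sampling set for a substituent label.
    Specific substituents pass through as singleton sets."""
    return GENERICS.get(substituent, [substituent])
```

With this in place, the sampler treats "halogen", "alkyl", and a specific "-CH3" identically: each is just a set to draw from.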
The Experiment
We will be conducting experiments in the upcoming year to determine
if a practical Markush search system can be built along these lines.
These experiments are specifically designed to discover the following:
Is it possible to sample the structures in all extant Markush
sets without enumerating them?
If so, which sampling distribution works best?
Can binary fingerprints effectively represent the diversity and
commonality of Markush structures?
If so, what size and bit density are optimal?
If not, what other representation might work (e.g., frequency counts)?
Can various forms of Tversky similarity encapsulate the types
of variable graph comparisons that are needed?
If not, what other metrics might be useful?
Is it reasonable to represent generics with sets of specific
partial structures? If so, which ones?
Given that these experiments will be done on a specific subset
of Markush sets, can we extrapolate to "all" Markush's in the
extant patent literature? If so, what would the performance and
hardware requirements be for a comprehensive system?
Can an Alta-Vista-like interface be created which people will
actually use? If so, what level of imprecision is acceptable?
As per usual, we are asking for feedback, criticisms, suggestions and
perspective from the muggers. Please also contact us if you'd like to
be involved in the alpha/beta tests of a Markush search server.