"Markush structures" are the kinds of structure diagrams one finds
in patents with R's and X's, which may represent a few molecules or
a large infinity of molecules. The Markush concept is defined in
terms of legal precedent rather than in mathematical or chemical
terms, which is a bother. For instance, there is no requirement
that they make chemical sense.
Most often, Markush structures represent variable graphs with
node labels which are nearly arbitrary, e.g., a substituent might
be labeled "-CH3", "heterocycle", "blocking group" or
"non-aromatic substituent of any size".
The most common and important Markush problems involve establishing
relationships between variable-graph sets, e.g.:
identity -- is one identical to another?
membership -- is a given structure included in a given set?
conjunction -- do these sets overlap?
similarity -- which sets are most similar to a given set?
All four problems are known to be intractable in the general
case (even identity).
Simplifying the problem helps (e.g., disallowing general boolean
logic as in MD) but even so, a rigorous solution to the simplified
problem has not been discovered. Much work has been done on
developing algorithms which work in restricted domains.
The currently-existing set of Markush structures is extremely complex,
and these problems remain intractable in practice.
Example
Changing the problem
The "ground rules" for Markush searching (and for chemical structure
searching in general) have been: a database exists which contains a
potentially large number of structures; a single, rigorously defined
query is posed; the search function's job is to find the exact
set of structures which satisfy the query. This is hard!
Searches following other ground rules might be just as useful and
much less difficult to implement. For instance, one might consider
relaxing the "exact set of structures" requirement by allowing a small
percentage of near-answers to be reported (false drops).
In recent years, a tremendous amount of work has been expended in
development of large-scale search engines, primarily for searching the WWW.
The NI2 method, developed at DEC and deployed as
Alta Vista, was one of the earliest such methods and is still one
of the most successful. NI2 uses a compact in-memory hashing scheme
which does not resolve hash collisions. This means that some small
percentage of results reported will inevitably be false drops.
For Alta Vista as of about a year ago, the false-drop rate was about 3%.
(Example: ~88,884 hits for +patent +searching.)
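The essential trade-off can be sketched in a few lines. The toy index below (all names invented for illustration, not NI2's actual design) gives each document a fixed-size bit signature and never resolves hash collisions, so a query can match a document that doesn't actually contain the query words -- a false drop. The key property is asymmetric: true hits are never missed; only extra hits leak in.

```python
# Toy sketch of a hash index that does not resolve collisions.
# NBITS is deliberately tiny so collisions are plausible; a real
# system would use a much larger, carefully tuned signature.
NBITS = 64

def signature(words):
    """Build a bit signature for a document: one hashed bit per word."""
    sig = 0
    for w in words:
        sig |= 1 << (hash(w) % NBITS)
    return sig

def might_contain(sig, query_words):
    """True if every query word's bit is set.
    Guaranteed true for real matches; occasionally true for
    non-matches (false drops) when hashes collide."""
    return all(sig & (1 << (hash(q) % NBITS)) for q in query_words)

docs = {
    "d1": ["patent", "searching", "markush"],
    "d2": ["red", "patent", "leather", "shoes"],
    "d3": ["chemical", "structure"],
}
sigs = {name: signature(words) for name, words in docs.items()}
hits = [name for name, sig in sigs.items()
        if might_contain(sig, ["patent", "searching"])]
```

Because "d1" really contains both query words, it is always reported; "d2" and "d3" are reported only if a collision fakes the missing word's bit.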
The astonishing thing is that this type of imprecision is not
perceived as a disadvantage by users ... most don't even notice it.
The reason is that most "true" hits are effectively false drops
anyway, i.e., they aren't what the user was looking for. Of the
~88K web pages containing "patent" and "searching", almost 2%
are about shoes ... only a few of those concern footwear patents ...
most are simply about footwear, e.g., "searching for red patent leather shoes".
In such an environment, 3% additional "real" false-drops are acceptable.
Most patent searching situations are very similar to those in
which Alta Vista is used. Queries are rarely framed precisely.
Many (most?) false drops are in fact real hits, but for the wrong
reason. "Real" false drops are not considered a disadvantage, as long
as they are "close" (in fact, a certain amount of such imprecision
appears to be viewed as an advantage.)
Is it possible to create an NI2-like Markush structure searching
system which solves practical search questions? If so, how much
imprecision would be acceptable?
Alta-Vista-like systems are not trivial to implement on a large
scale, but they are fast, robust and scale up very well.
Such methodology is very attractive when compared with extant systems
which try (and fail) to solve intractable problems with complex
methods and brute force.
Do the structure tools already exist?
Imagine creating a very large, but otherwise normal Daylight
fingerprint for each molecule described in a Markush set.
Then imagine AND-ing and OR-ing all the molecules to form fingerprints
which describe the set. AND fingerprints describe features which are
common to all structures in a set (these would include the parent structure, if any).
OR fingerprints describe the range of the structures in a set.
Now imagine doing so for all Markush descriptions in the patent literature.
Comparing two such OR fingerprints would provide a measure of the
similarity of two Markush sets.
Comparing two such AND fingerprints would provide a measure of the
similarity of Markush parent structures.
Comparing the bits set "on" in one Markush's fingerprint with the
same bits in another is a measure of how much of one set appears
in the other.
Comparing the AND of two OR fingerprints provides a measure of
conjunction (overlap) of the two sets.
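The AND/OR construction above can be sketched concretely. In the toy version below, fingerprints are Python integers used as bit vectors, and each "molecule" is just a set of feature strings hashed to bits -- real Daylight fingerprints hash molecular paths, so everything here is illustrative, not toolkit API.

```python
# Toy AND/OR fingerprints for a Markush set, using ints as bit vectors.
NBITS = 1024

def fp(features):
    """Hash a molecule's feature strings into a bit fingerprint."""
    bits = 0
    for f in features:
        bits |= 1 << (hash(f) % NBITS)
    return bits

def set_fingerprints(molecules):
    """AND and OR fingerprints over all molecules in a set."""
    fps = [fp(m) for m in molecules]
    and_fp = or_fp = fps[0]
    for f in fps[1:]:
        and_fp &= f   # features common to every member (parent-like)
        or_fp |= f    # the full range of features in the set
    return and_fp, or_fp

def popcount(x):
    return bin(x).count("1")

# Two hypothetical Markush sets sharing a phenyl parent.
set1 = [{"phenyl", "Cl"}, {"phenyl", "Br"}, {"phenyl", "I"}]
set2 = [{"phenyl", "Cl"}, {"phenyl", "F"}]
and1, or1 = set_fingerprints(set1)
and2, or2 = set_fingerprints(set2)

# Conjunction measure: AND of the two OR fingerprints.
conj = or1 & or2
```

Comparing and1 with and2 measures parent similarity; popcount(conj) measures set overlap, exactly as described above.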
Notice that this looks a lot like our friendly Tversky search.
In fact, when using large, fixed-size, normal Daylight fingerprints,
it is exactly the Tversky search which is implemented in Merlin.
(Thanks, John!)
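For reference, the Tversky index over bit fingerprints is a one-liner. With alpha = beta = 1 it reduces to the Tanimoto coefficient; with alpha = 1, beta = 0 it measures how much of one fingerprint is contained in the other, which matches the subset-style comparison above. (This is a generic sketch, not Merlin's implementation.)

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky similarity of two bit fingerprints (Python ints)."""
    c = bin(a & b).count("1")   # bits in common
    na = bin(a).count("1")      # bits set in a
    nb = bin(b).count("1")      # bits set in b
    return c / (alpha * (na - c) + beta * (nb - c) + c)
```

For example, tversky(a, b, 1, 0) equals 1.0 exactly when every bit of a also appears in b.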
So the answer is a conditional "yes".
Given Merlin's Tversky search and the SMILES, CHORTLES, and
fingerprint toolkits, it appears that we have essentially all the
algorithms and infrastructure needed to try this out.
Problems
The thought-experiment above has a serious flaw:
Many Markush structures represent an infinite (or near infinite) set
of structures. We'd never get past the first step, which requires
enumerating all of them. This is a bit of a problem.
One way around this quandary is to sample the set of
molecules instead of enumerating every one in the set.
This is useful only if we have an algorithm which can efficiently
sample a Markush's theoretical universe without doing the enumeration
first. It must also do so with a known sampling distribution.
Fortunately, almost all Markush data are theoretically well-suited
for such treatment (basically, constituent combinatorics).
Although extant data format(s) are not well-suited for this task,
conversion is certainly possible.
Example gifs,
movie.
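When a Markush is (in the simplest case) a scaffold with independent substituent choices, this kind of sampling is straightforward: choosing each substituent uniformly at random gives a uniform sample over the full product set, and the universe size is computable without enumeration. The sketch below assumes that idealized combinatorial form; the R-group lists and SMILES-like strings are purely illustrative.

```python
import random
from math import prod

# Hypothetical Markush: a scaffold with three independent R-groups.
r_groups = {
    "R1": ["C", "CC", "CCC"],        # small alkyls
    "R2": ["F", "Cl", "Br", "I"],    # halogens
    "R3": ["O", "N", "S"],
}

# Universe size is the product of the choice counts: 3 * 4 * 3 = 36.
universe_size = prod(len(subs) for subs in r_groups.values())

def sample_member(rng=random):
    """Draw one member uniformly at random, without enumeration.
    Independent uniform choices over each R-group give a known
    (uniform) sampling distribution over the product set."""
    return {name: rng.choice(subs) for name, subs in r_groups.items()}

samples = [sample_member() for _ in range(10)]
```

Real Markush logic (dependent substituents, boolean conditions) would need a more careful sampler, but the principle is the same: sample the combinatorics, never enumerate them.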
A second flaw in the enumeration/sampling approach is that Markush
structure components are not "simple" variable molecular graphs, but
usually include generic expressions such as "halogen", "alkyl",
"heterocyclic", etc. (Chuck and his friends). This poses no new problem
in the case of fixed sets (e.g., halogens), we just sample those in
the normal manner. But for infinite sets, we must define special
artificial sets to represent generics (e.g., "alkyl" represented by
20 specific alkyl substituents).
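The two cases collapse into one mechanism if every generic label maps to a concrete sampling set: finite generics map to their actual members, infinite ones to a fixed artificial set of representatives. A minimal sketch (the label names and representative lists are illustrative choices, not an established standard):

```python
# Map generic labels to concrete sampling sets. Finite generics
# (halogen) get their real members; infinite generics (alkyl) get
# a fixed artificial subset of representatives (a few shown here,
# standing in for e.g. 20 in a real system).
GENERICS = {
    "halogen": ["F", "Cl", "Br", "I"],
    "alkyl":   ["C", "CC", "CCC", "C(C)C", "CCCC"],
}

def expand(substituent):
    """Return the concrete sampling set for a substituent label.
    Specific substituents pass through as singleton sets."""
    return GENERICS.get(substituent, [substituent])
```

With this in place, the sampler treats "halogen", "alkyl", and a specific "-CH3" identically: each is just a set to draw from.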
The Experiment
We will be conducting experiments in the upcoming year to determine
if a practical Markush search system can be built along these lines.
These experiments are specifically designed to discover the following:
Is it possible to sample the structures in all extant Markush
sets without enumerating them?
If so, which sampling distribution works best?
Can binary fingerprints effectively represent the diversity and
commonality of Markush structures?
If so, what size and bit density are optimal?
If not, what other representation might work (e.g., frequency counts)?
Can various forms of Tversky similarity encapsulate the types
of variable graph comparisons that are needed?
If not, what other metrics might be useful?
Is it reasonable to represent generics with sets of specific
partial structures? If so, which ones?
Given that these experiments will be done on a specific subset
of Markush sets, can we extrapolate to "all" Markush's in the
extant patent literature? If so, what would the performance and
hardware requirements be for a comprehensive system?
Can an Alta-Vista-like interface be created which people will
actually use? If so, what level of imprecision is acceptable?
As per usual, we are asking for feedback, criticisms, suggestions and
perspective from the muggers. Please also contact us if you'd like to
be involved in the alpha/beta tests of a Markush search server.