Hi Kevin,
> 1-methyl-2-hydroxy benzene with either a Cl or H at the 5 position.
> We tried "Cc1:c(O):c:c:c([Cl,H]):c1" and got all the Cl substituted
> molecules and none of the hydrogen substituted molecules.
> Any suggestions?
Sure. Ah, you've been caught by an all-too-common SMARTS gotcha. Sorry.
EXPLANATION
The "H" primitive in SMARTS means "total number of attached hydrogens", e.g., [C] will match the C in [CH4] methane, [CH3] methyl, [CH2] methylene, etc., whereas [CH3] will only match methyl. This is similar to the use of "H" in SMILES to specify hydrogen count. The default value for the SMARTS "H" primitive is 1 (same as SMILES, e.g., [CH2]=[CH]-[OH] is the same as C=CO). This H-specification value includes all attached hydrogens: implicit and explicit (e.g., isotopic [2H]). The "H" in your SMARTS literally means "any atom with one attached hydrogen", so the 5-substituent spec [Cl,H] means "chlorine OR any atom with one attached hydrogen". This is probably broader than you intended, e.g., it would hit molecules with a 5-isopropyl substituent.
ASIDE
To make SMILES and SMARTS rules as similar as possible, there is an important exception to the above rule: H in brackets is taken to be a hydrogen atomic symbol if it is the *first* elemental primitive in the atomic expression, e.g., "[H]O[H]" means "an explicit hydrogen atom connected to oxygen connected to another hydrogen" and will match isotopic water [2H]O[2H], [3H]O[3H], etc., but not C=COC=C. (The SMARTS "[*H]O[*H]" means "any atom with one hydrogen connected to oxygen ..." and matches the other way around.)
This exception can lead to confusion, e.g., your specification "[Cl,H]" is different from "[H,Cl]"! To help avoid such confusion, we provide an atomic number primitive '#', e.g., "#1" always means a "hydrogen atom", and "[Cl,#1]" means the same thing as "[#1,Cl]". For the really twisted: "[Cl,#1,H]" means "chlorine or explicit hydrogen or any atom with one attached hydrogen". Whew. However, replacing your "[Cl,H]" with "[Cl,#1]" doesn't do what you want ... you're looking for a "total H-count of one", rather than an explicit hydrogen.
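To make the order dependence concrete, here is a toy model in Python. It is only a sketch of the rules described above, not a SMARTS parser: the `Atom` class and the `matches` function are made up for illustration, and the model handles nothing beyond a comma-separated OR list of "H", "#1" and element symbols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    symbol: str       # element symbol, e.g. "C", "Cl", "H"
    total_h: int = 0  # total attached hydrogens (implicit + explicit)


def matches(or_list: str, atom: Atom) -> bool:
    """Evaluate a comma-separated OR list such as "Cl,H" against one atom.

    Toy rules, taken from the discussion above:
      "#1"              a hydrogen atom
      "H" listed first  a hydrogen atom (the SMILES-compatibility exception)
      "H" elsewhere     any atom with exactly one attached hydrogen
      anything else     an element symbol
    """
    for position, primitive in enumerate(or_list.split(",")):
        if primitive == "#1" or (primitive == "H" and position == 0):
            hit = atom.symbol == "H"
        elif primitive == "H":
            hit = atom.total_h == 1
        else:
            hit = atom.symbol == primitive
        if hit:
            return True
    return False


chlorine = Atom("Cl")
hydrogen = Atom("H")                 # an explicit hydrogen atom
isopropyl_ch = Atom("C", total_h=1)  # the CH carbon of an isopropyl group

for expr in ("Cl,H", "H,Cl", "Cl,#1", "Cl,#1,H"):
    print(expr, [matches(expr, a) for a in (chlorine, hydrogen, isopropyl_ch)])
```

In this model "Cl,H" accepts the isopropyl CH but not an explicit hydrogen, while "H,Cl" and "Cl,#1" do the reverse, which is the asymmetry described above.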
HOW TO DO IT IN A SINGLE SMARTS
This is pretty simple to do in general if you break it down. A SMARTS atomic expression for an "aromatic carbon with attached chlorine" is:
cCl
The "total H-count" primitive is "H". A SMARTS expression for an "aromatic carbon with exactly one attached hydrogen" is:
[cH] (or "[cH1]" if you prefer).
You can combine SMARTS recursively using the $() syntax. A SMARTS for "an aromatic carbon with attached chlorine OR an aromatic carbon with exactly one attached hydrogen" is:
[$(cCl),$([cH])]
So, a valid SMARTS to specify what (I think) you want would be:
Cc1:c(O):c:c:[$(cCl),$([cH])]:c1
though my style is to write something like this:
 +------------------------ aromatic carbon
 |+----------------------- AND (
 ||   +------------------- atom with attached chlorine
 ||   |  +---------------- or
 ||   |  |    +----------- atom with one attached hydrogen )
 ||   |  |    |
[c;$(*Cl),$([*H1])]1ccc(O)c(C)c1
-------------------
         |
         +---------------- we're matching only one atom inside brackets
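One way to read the recursive form "[$(cCl),$([cH])]" is as an OR of two tests on a single atom's environment. The Python below is a rough sketch of that reading only; the `Atom` class and the function names are hypothetical, and no real SMARTS engine is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    symbol: str
    aromatic: bool = False
    total_h: int = 0                               # total attached hydrogens
    neighbors: list = field(default_factory=list)  # symbols of attached heavy atoms


def aromatic_c_with_cl(atom: Atom) -> bool:
    """Toy stand-in for $(cCl): aromatic carbon with an attached chlorine."""
    return atom.symbol == "C" and atom.aromatic and "Cl" in atom.neighbors


def aromatic_c_with_one_h(atom: Atom) -> bool:
    """Toy stand-in for $([cH]): aromatic carbon with exactly one hydrogen."""
    return atom.symbol == "C" and atom.aromatic and atom.total_h == 1


def position5_ok(atom: Atom) -> bool:
    """Toy stand-in for [$(cCl),$([cH])]: either environment is acceptable."""
    return aromatic_c_with_cl(atom) or aromatic_c_with_one_h(atom)


chloro_c = Atom("C", aromatic=True, neighbors=["C", "C", "Cl"])
plain_ch = Atom("C", aromatic=True, total_h=1, neighbors=["C", "C"])
methyl_c = Atom("C", aromatic=True, neighbors=["C", "C", "C"])
isopropyl_ch = Atom("C", aromatic=False, total_h=1, neighbors=["C", "C", "C"])

for name, atom in [("chloro", chloro_c), ("plain CH", plain_ch),
                   ("methyl-substituted", methyl_c), ("isopropyl CH", isopropyl_ch)]:
    print(name, position5_ok(atom))
```

Unlike the "[Cl,H]" reading, the isopropyl CH is rejected here because both alternatives insist on an aromatic carbon.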
HOW TO DO IT EFFICIENTLY
Since you are using Merlin, another method is available to you which is more general, faster and simpler to understand than recursive SMARTS: using boolean hitlist operations. To do the above search:
1. Find your chloro-derivatives of interest, i.e.: Select superstructures of "Cc1c(O)ccc(Cl)c1" as SMARTS (OK) or SMILES (faster).
2. Store the hits (select "Store" item in "Hitlist" menu).
3. Find your 5-H-derivatives, i.e.: Select superstructures of "Cc1c(O)cc[cH]c1" (as SMARTS).
4. Take union with stored hitlist ("Union" item in "Hitlist" menu).
You'll be left with the same set of molecules as specified by the recursive SMARTS mentioned above. This approach is more general because you can combine any set of molecules you like (the patterns need not be related).
Merlin's "Store" feature is enabled by default, but you or your system manager may have disabled it to save server memory (if so, the "Store" menu item will be grayed out). Ha! You can still do it:
1. Find your chloro-derivatives of interest as above, with the default "Action" choice "Make new hitlist".
2b. Select the "Action" choice "Add matches to hitlist".
3. Find your 5-H-derivatives (same as above) and you've got it.
4b. Change the "Action" choice back to "Make new hitlist".
This does the same thing in an even simpler way. [I use the "Store" functionality because I tend to forget to reset the action per 4b :-]
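If it helps to see why the two routes agree, both boil down to a set union. The sketch below uses plain Python sets, not Merlin; the molecule IDs are invented, and the variable names only mirror the menu items loosely.

```python
# Hypothetical hit sets; in Merlin these would come from the two searches.
chloro_hits = {"mol-003", "mol-017", "mol-042"}  # superstructures of the 5-Cl pattern
five_h_hits = {"mol-008", "mol-017", "mol-051"}  # superstructures of the 5-H pattern

# Route 1: "Store" the first hitlist, search again, then take the "Union".
stored = set(chloro_hits)          # Hitlist > Store
current = set(five_h_hits)         # second search makes a new hitlist
via_store = current | stored       # Hitlist > Union

# Route 2: switch the "Action" to "Add matches to hitlist".
hitlist = set(chloro_hits)         # first search: "Make new hitlist"
hitlist |= five_h_hits             # second search: "Add matches to hitlist"
via_action = hitlist

print(sorted(via_store))
print(via_store == via_action)
```

A molecule that turns up in both searches is kept only once, since a union has no duplicates.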
Cheers, Dave.