SMARTS

In computer parlance, SMILES is a language that represents a molecule as a graph in which the atoms are the nodes and the bonds are the edges connecting the nodes. Atom properties such as atomic number and atom type are represented as node properties, and bond properties are represented as edge properties.

    SMARTS can be thought of as an extension of this mechanism to represent variabilities and choices in the node and edge properties in this graph, i.e. a way to specify variable choices for any of the atoms or bonds in the molecule. Using SMARTS one can specify a pattern to search for.

There are 4 parts to understanding and creating a SMARTS expression:

    1. Specifying atom variabilities
    2. Bond properties
    3. Recursive SMARTS
    4. Component SMARTS


Specifying atom variabilities:

The SMARTS CCN will match ethylamine or any molecule containing the group

      X3   Y1       Z1
      |    |       /
      |    |      /
      |    |     /
X2----C----C----N
      |    |     \
      |    |      \
      |    |       \
      X1   Y2       Z2
 

The SMARTS CC[O,N]  will match ethanol or ethylamine. If we examine how, we see

anything in square braces is a description for 1 atom only
               __|__
              |      |
CH2----CH2----[OH,NH2]
               | ||
               | ||
          Oxygen ||
                OR|
                  Nitrogen
 

Expressions within [ and ] are therefore atom SMARTS expressions.

Atom SMARTS may appear wherever an atom may occur in a SMILES.

This SMARTS CC[O,N] is read as

Aliphatic Carbon singly bonded to
    an aliphatic Carbon singly bonded to
        an atom that is (a Nitrogen OR an Oxygen)
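The reading above can be sketched in a few lines of Python. This is only an illustration of the idea of a graph with per-atom choices, not how any SMARTS toolkit is implemented: the molecule dictionaries, the PATTERN list and the helper names are all made up for this example, hydrogens are left implicit, and bond orders and aromaticity are ignored.

```python
# Heavy-atom graphs: atoms are the nodes (element symbols), bonds are the edges.
ETHANOL = {"atoms": ["C", "C", "O"], "bonds": [(0, 1), (1, 2)]}
ETHYLAMINE = {"atoms": ["C", "C", "N"], "bonds": [(0, 1), (1, 2)]}
PROPANE = {"atoms": ["C", "C", "C"], "bonds": [(0, 1), (1, 2)]}

# CC[O,N] written as a chain of choices: each set lists the elements allowed
# at that position (hypothetical representation, for illustration only).
PATTERN = [{"C"}, {"C"}, {"O", "N"}]


def neighbours(mol, i):
    """Return the indices of the atoms bonded to atom i."""
    return [b if a == i else a for a, b in mol["bonds"] if i in (a, b)]


def path_matches(mol, pattern):
    """True if some simple path in mol satisfies the chain pattern atom by atom."""

    def extend(path):
        if len(path) == len(pattern):
            return True
        allowed = pattern[len(path)]
        return any(
            mol["atoms"][n] in allowed and extend(path + [n])
            for n in neighbours(mol, path[-1])
            if n not in path  # each pattern atom needs its own target atom
        )

    return any(
        mol["atoms"][i] in pattern[0] and extend([i])
        for i in range(len(mol["atoms"]))
    )


for name, mol in [("ethanol", ETHANOL), ("ethylamine", ETHYLAMINE), ("propane", PROPANE)]:
    print(name, path_matches(mol, PATTERN))
```

Ethanol and ethylamine satisfy the pattern because their third atom falls in the allowed set; propane does not.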

',' is a logical operator representing OR.
Here is the complete list of logical operators in SMARTS, in order of precedence (highest first):
 

SMARTS Logical Operators

Symbol   Expression   Meaning
!        !N           NOT Nitrogen
&        N&a          Nitrogen AND aromatic (high precedence)
,        N,a          Nitrogen OR aromatic
;        N;a          Nitrogen AND aromatic (low precedence)

The default logical (when unstated) is '&'

'a' signifies an aromatic atom in SMARTS

Using this information let us analyse a few atom SMARTS expressions

        an atom
         _|__
        |    |
        [!N&a] : an atom that is ( (NOT a Nitrogen)AND is aromatic)
        ||||           i.e. an aromatic atom that is not a Nitrogen.
       NOT|||
          |||
   Nitrogen||
           ||
         AND|
            |
            is aromatic

[N,C&a] : an atom that is ( a Nitrogen OR a ( Carbon AND is aromatic ))
                  i.e. a Nitrogen or an aromatic Carbon

The number of Hydrogens attached to an atom is expressed in SMARTS as H<n> where <n> is any number.
<n> is optional and is by default 1 when not stated.

Therefore
    [H1] represents any atom with 1 attached hydrogen

Using this construct let us try figuring out the atomic SMARTS below:

[NH1]  = [N&H1] = atom that ( is a Nitrogen AND has 1 attached Hydrogen )
                e.g. a secondary amine

[nH1] = [n&H1] = atom that (is an aromatic Nitrogen AND has 1 hydrogen)
            = a pyrrole Nitrogen

[C,n&H1] = atom that (
                            is a Carbon OR
                                is an aromatic Nitrogen AND
                                    has 1 attached Hydrogen ) (prior to applying logicals)

                    = atom that (
                           is a Carbon OR
                           (is an aromatic Nitrogen AND
                                has 1 attached Hydrogen) )

                    = atom that (is a Carbon OR is a pyrrole Nitrogen )

[C,n;H1] = atom that (
                            is a Carbon OR
                                is an aromatic Nitrogen
                                    AND has 1 attached Hydrogen ) (prior to applying logicals)

                = atom that (
                       (is a Carbon OR
                            is an aromatic Nitrogen) AND
                               has 1 attached Hydrogen )

                = atom that ( is a Carbon with 1 attached Hydrogen OR is a pyrrole Nitrogen )
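The effect of precedence in the last two expressions can be checked with a small evaluator. The sketch below is an assumption-laden toy, not a SMARTS parser: it understands only the primitives C, N, O, S, c, n, o, s, a, A, H<n> and R<n>, it describes an atom as a plain dictionary with made-up keys (element, aromatic, h, rings), and it reads upper-case element symbols as aliphatic, as in the reading of CC[O,N] earlier.

```python
import re

# One primitive, optionally negated. Only a small illustrative subset is handled.
TOKEN = re.compile(r"(!?)([CNOScnosaA]|[HR]\d*)")


def primitive(tok, atom):
    """Evaluate a single primitive against an atom described by a dict."""
    if tok in {"C", "N", "O", "S"}:      # upper case: aliphatic element
        return atom["element"] == tok and not atom["aromatic"]
    if tok in {"c", "n", "o", "s"}:      # lower case: aromatic element
        return atom["element"] == tok.upper() and atom["aromatic"]
    if tok == "a":
        return atom["aromatic"]
    if tok == "A":
        return not atom["aromatic"]
    if tok[0] == "H":                    # H<n>, where <n> defaults to 1
        return atom["h"] == int(tok[1:] or 1)
    if tok == "R":                       # bare R: any ring atom
        return atom["rings"] > 0
    return atom["rings"] == int(tok[1:])  # R<n>


def implicit_and(term, atom):
    """Adjacent primitives with no operator between them are ANDed."""
    pos, result = 0, True
    while pos < len(term):
        m = TOKEN.match(term, pos)
        if m is None:
            raise ValueError(f"unsupported primitive in {term!r}")
        value = primitive(m.group(2), atom)
        result = result and ((not value) if m.group(1) else value)  # '!' binds tightest
        pos = m.end()
    return result


def atom_matches(expr, atom):
    """Apply ';' last, ',' before it and '&' before that, mirroring the precedence table."""
    body = expr.strip("[]")
    return all(                                              # ';' low-precedence AND
        any(                                                 # ',' OR
            all(implicit_and(t, atom) for t in alt.split("&"))  # '&' high-precedence AND
            for alt in low.split(",")
        )
        for low in body.split(";")
    )


methyl_c = {"element": "C", "aromatic": False, "h": 3, "rings": 0}
pyrrole_n = {"element": "N", "aromatic": True, "h": 1, "rings": 1}

print(atom_matches("[C,n&H1]", methyl_c))  # True: the Carbon alternative alone is enough
print(atom_matches("[C,n;H1]", methyl_c))  # False: H1 now applies to both alternatives
print(atom_matches("[C,n;H1]", pyrrole_n))  # True
```

Splitting on ';' first, then ',', then '&' is a simple way to get the precedence right without writing a full parser; it works here only because the toy grammar has no nested brackets or recursive SMARTS.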

Here's the complete list of atom property symbols in SMARTS
 
SMARTS Atomic Primitives

Symbol     Symbol name        Atomic property requirements       Default
*          wildcard           any atom                           (no default)
a          aromatic           aromatic                           (no default)
A          aliphatic          aliphatic                          (no default)
D<n>       degree             <n> explicit connections           (no default)
H<n>       total-H-count      <n> attached hydrogens             exactly one
h<n>       implicit-H-count   <n> implicit hydrogens             exactly one
R<n>       ring membership    in <n> SSSR rings                  any ring atom
r<n>       ring size          in smallest SSSR ring of size <n>  any ring atom
v<n>       valence            total bond order <n>               (no default)
X<n>       connectivity       <n> total connections              (no default)
-<n>       negative charge    -<n> charge                        -1 charge (-- is -2, etc)
+<n>       positive charge    +<n> formal charge                 +1 charge (++ is +2, etc)
#<n>       atomic number      atomic number <n>                  (no default)
@          chirality          anticlockwise                      anticlockwise, default class
@@         chirality          clockwise                          clockwise, default class
@<c><n>    chirality          chiral class <c> chirality <n>     (no default)
@<c><n>?   chiral or unspec   chirality <c><n> or unspecified    (no default)
<n>        atomic mass        explicit atomic mass               unspecified mass

Here are some atom SMARTS composed with these symbols.

  [#6] = a carbon atom

  [Ca] = a calcium atom

  [++] = any atom with a +2 charge

  [CH2] = atom that is (an aliphatic carbon and has two hydrogens)
                  = ( a methylene carbon)

  [35*] = any atom of mass 35

  [F,Cl,Br,I] = the 1st four halogens.

  [!C;R] = atom that is (( NOT aliphatic carbon ) AND is in a ring)
 


Bond Properties:

      The default bond in a SMARTS expression is a single or aromatic bond. Hence :

cc = c:c = any pair of attached aromatic carbons
':' is the symbol for an aromatic bond

CC = C-C = any pair of attached aliphatic carbons
'-' is the symbol for a single bond

c-c = 2 aromatic Carbons joined by a non-aromatic single bond
            e.g. as in biphenyl

As with atom SMARTS, bonds can be made variable by combining bond symbols with the logical operators, e.g.

C-,=,#N = a Carbon bonded via a single or double or triple bond to
                a Nitrogen

C~N = a Carbon bonded via any bond to a Nitrogen
~ is the symbol for a wildcard bond

C@N = a Carbon bonded via a ring bond to a Nitrogen
@ is the symbol for any ring bond

C/?N=C/O     = C bonded via trans or unspecified chirality to a N double-bonded to a
                            C singly bonded to an oxygen

Here's a list of SMARTS primitives for bonds:
 

SMARTS Bond Primitives

Symbol   Bond property requirements
-        single bond (aliphatic)
/        directional single bond "up"
\        directional single bond "down"
/?       directional bond "up or unspecified"
\?       directional bond "down or unspecified"
=        double bond
#        triple bond
:        aromatic bond
~        any bond (wildcard)
@        any ring bond

 
 
 

Hence one might represent a C ortho to a N via the SMARTS

CaaN = Carbon singly bonded to an aromatic atom bonded via an aromatic
                bond to an aromatic atom singly bonded to a Nitrogen.

CaaaN = C meta to an N

CaaaaN  = C para to N
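The link between the number of 'a' atoms in these chains and the ortho/meta/para relation is simple ring arithmetic. The helper below is a hypothetical illustration that assumes every aromatic atom of the chain lies on one ring of the given size (six by default) and that the pattern has the plain form C, a run of 'a', then N; it does no real SMARTS parsing.

```python
def ring_relation(smarts, ring_size=6):
    """Classify a chain pattern such as 'CaaaN' on a single ring of ring_size atoms."""
    n_ring_atoms = smarts.count("a")       # aromatic atoms walked along the ring
    if not 2 <= n_ring_atoms <= ring_size:
        return "undefined"                 # the two substituents need two distinct ring atoms
    steps = n_ring_atoms - 1               # ring bonds between the two attachment atoms
    separation = min(steps % ring_size, -steps % ring_size)  # shorter way round the ring
    return {1: "ortho", 2: "meta", 3: "para"}.get(separation, "undefined")


for pattern in ("CaaN", "CaaaN", "CaaaaN"):
    print(pattern, ring_relation(pattern))
```

Two ring atoms in the chain put the substituents on adjacent atoms (ortho), three put them one atom apart (meta), and four put them opposite each other (para).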


Recursive SMARTS:

To say "an atom that is a C ortho or meta to N", one can use

        an atom that is
         _______|________
        |                |
        [$(CaaN),$(CaaaN)]
        |_____|||______|
           |    |    |
           |    |    |
           |    OR   |
           |         |
           |         a Carbon meta to Nitrogen
           |
  a Carbon ortho to a Nitrogen
 

This is a recursive SMARTS: in place of an atom property in an atom SMARTS, one can use any valid SMARTS, with the rule that
a recursive SMARTS must always appear between the symbols '$(' and ')'. The atom being described is the first atom of the nested SMARTS.

The above SMARTS would read
atom that is ( ( a Carbon ortho to a N ) OR ( a Carbon meta to a N ) )
 

A few examples follow:

Caa(O)aN = Carbon ortho to O and meta to N (but in a single path i.e. 2O,3N only)

Ca(aO)aaN = Carbon ortho to O and meta to N (but in differing paths i.e. 2O,5N only)

C[$(aaO);$(aaaN)]  = C ortho to O and C meta to N (all cases)
 
 


Component SMARTS:

Every SMILES expression is technically a valid SMARTS. However the meaning of the SMARTS expression is often quite different.

A SMILES like

C.C would be 2 methanes

However its SMARTS meaning is a pattern that matches

2 aliphatic Carbons, which may be in the same molecule or in different molecules; it therefore matches ethane (CH3CH3) or propane (CH3CH2CH3)

To restrict the Carbons to be in a single molecule, one can use the parentheses operators to group dot-disconnected fragments e.g.

(CC.C) would not match ethane, but would match propane or butane

(C).(C) would not match ethane, propane or butane alone, since the 2 C's must come from differing components.
(C).(C) would match the SMILES CCCC.COC (i.e. CH3CH2CH2CH3 + CH3OCH3)

(CC).C would match butane  and propane since there is no restriction on the second Carbon
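The grouping rules in these examples can be written down as a small check. The sketch below assumes the fragments of a pattern have already been matched somewhere; it only decides whether the parentheses are respected. The function name and the two-list encoding (a group label per fragment, None for an ungrouped fragment, and the index of the molecule each fragment landed in) are invented for this illustration.

```python
def grouping_ok(groups, components):
    """Check component-level grouping for an already matched pattern.

    groups[i]     -- label of the parenthesised group holding fragment i, or None
    components[i] -- index of the target molecule in which fragment i was matched
    """
    seen = {}  # group label -> molecule it was matched in
    for group, comp in zip(groups, components):
        if group is None:
            continue                      # ungrouped fragment: no restriction
        if seen.setdefault(group, comp) != comp:
            return False                  # one group spread over two molecules
    # different groups must come from different molecules
    return len(set(seen.values())) == len(seen)


# (CC.C): both fragments in one group, matched inside a single molecule
print(grouping_ok([0, 0], [0, 0]))
# (C).(C): two groups, but both Carbons found in the same molecule
print(grouping_ok([0, 1], [0, 0]))
# (C).(C) against CCCC.COC: one Carbon from each molecule
print(grouping_ok([0, 1], [0, 1]))
# (CC).C: the second Carbon is ungrouped and may come from anywhere
print(grouping_ok([0, None], [0, 0]))
```

An ungrouped pattern such as C.C corresponds to all labels being None, so it passes the check whichever molecules the atoms come from.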

In SMARTS one might represent general esterifications (and thereby match reactions that are esterifications) as

C(=O)O.OCC>>C(=O)OCC.O     an acid and an alcohol in the reactant go to an ester and water in the product.

        O--H          Y2 Z1
        |            |  |
        |            |  |
   X2---C===O   + Y1--C--C--O--H
        |             |  |
        |             |  |
        X1           Y3  Z2
 

To convert this to represent and match intermolecular esterifications only
 

           in a separate mol
in 1 mol.  |
 __|___   _|_
|      | |   |
(C(=O)O).(OCC)>>C(=O)OCC.O     both acid and alcohol come from different
|            |  |        | molecules
 ------------    --------
      |               |
 in the reactant      |
                      in the product
 

To represent and match intramolecular esterifications only:

in 1 mol.
 __|_______
|          |
(C(=O)O.OCC)>>C(=O)OCC.O     both acid and alcohol come from the same
|           | |        | molecule
 -----------   --------
      |             |
 in the reactant    |
                    in the product
 
 

With the current state of SMARTS editors in order to use SMARTS powerfully a chemist needs to either

Daylight would love to assist you in both. We have an interesting set of SMARTS examples in the SMARTS tutorial; please feel free to work through them. Queries on any part of SMARTS that is unclear to you are always welcome. For a clarifying explanation of SMARTS, do remember to check out Dave's one-page lesson on SMARTS.

Examples:
 

A hydrogen bond donor

An aliphatic polar atom such as Oxygen or Nitrogen or Sulfur having at least 1 hydrogen and not in a ring

= (aliphatic O or N or S) AND (having at least 1 Hydrogen) AND (NOT in a Ring)

an atom that is
 ______|_____
|            |
[O,N,S;!H0;R0]
  | | ||| ||_____not in a ring
  OR| |||_____________with 0 Hydrogens
    | ||__________NOT
    OR|
      |
      AND (Lower precedence than OR)
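Read clause by clause, the expression is three tests joined by AND. A hand translation into Python might look like the sketch below, where the atom is a plain dictionary with hypothetical keys (element, aromatic, h for the attached hydrogen count, rings for the number of rings the atom belongs to); it is meant only to mirror the diagram, not to replace a SMARTS matcher.

```python
def is_hbond_donor(atom):
    """Hand translation of [O,N,S;!H0;R0] for an atom described by a dict."""
    polar = atom["element"] in {"O", "N", "S"} and not atom["aromatic"]  # O,N,S (aliphatic)
    has_h = atom["h"] != 0                                               # !H0
    acyclic = atom["rings"] == 0                                         # R0
    return polar and has_h and acyclic                                   # the two ';' are ANDs


hydroxyl_o = {"element": "O", "aromatic": False, "h": 1, "rings": 0}
ether_o = {"element": "O", "aromatic": False, "h": 0, "rings": 0}
ring_nh = {"element": "N", "aromatic": False, "h": 1, "rings": 1}

for name, atom in [("hydroxyl O", hydroxyl_o), ("ether O", ether_o), ("ring NH", ring_nh)]:
    print(name, is_hbond_donor(atom))
```

Only the hydroxyl oxygen passes: the ether oxygen has no hydrogen, and the ring NH fails the R0 test.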
 

A halogenated phenol

A halogen connected to a phenol
= A halogen connected ortho or meta or para to the hydroxy in a phenol
= A F OR Cl OR Br OR I connected to an aromatic atom that is ORTHO or META or PARA to the hydroxy in a phenol

=
                   an atom that is
 ______________________________|_____________________________
|                                                            |
[F,Cl,Br,I;$(*[$(c1c(O)cccc1),$(c1cc(O)ccc1),$(c1ccc(O)cc1)])]
 |       ||
  ------- |
     |    AND (LOWER ORDER THAN OR)
a F OR Cl OR Br OR I

 
               an atom that is
 ________________________|_________________________
|                                                  |
[$(*[$(c1c(O)cccc1),$(c1cc(O)ccc1),$(c1ccc(O)cc1)])]
   | --------------| -------------| -------------
   |        |      |      |      OR        |
any atom    |     OR meta in a phenol      para in a phenol
connected   |
       ortho in a phenol

This can be further simplified to

[F,Cl,Br,I;$(*c1[$(c(O)cccc1),$(cc(O)ccc1),$(ccc(O)cc1)])]
 

Now why can't the above be further simplified to:
 

[F,Cl,Br,I;$(*c1[$(c(O)cc),$(cc(O)c),$(ccc(O))]cc1)]

Think and let me know your answer later.

A Primary or secondary amine and not an amide

   an atom that is
 ________|___________
|                    |
[C&!$(*=O)&$(*[N!H0])]
 |||------|  -------
 C||  |   |    |
  ||  | AND    |
AND|  |      connected to a (N with at least 1 Hydrogen)
   |  |
 NOT  |
double bonded to O