SMILES Tutorial: Conventions
Hydrogen atoms do not normally need to be specified
when writing SMILES for most organic structures.
The presence of hydrogens may be specified in three ways:
- Implicitly: for atoms specified without brackets, from normal valence assumptions.
- Explicitly by count: inside brackets, by the hydrogen count supplied; zero if unspecified.
- As explicit atoms: i.e., as explicit [H] atoms.
There is no distinction between "organic" and "inorganic" SMILES nomenclature.
One may specify the number of attached hydrogens for any atom in any SMILES.
For example, ethane may be written as
CC or [CH3][CH3] or [H]C([H])([H])C([H])([H])[H].
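The first two ways of specifying hydrogens can be sketched in a few lines of Python. This is a minimal illustration, not a SMILES parser: the table of normal valences is the usual one for the SMILES organic subset and is an assumption here (this section does not list it), and the function names are hypothetical.

```python
# Normal valences for atoms that may be written without brackets.
# Assumption: the usual SMILES "organic subset" values, not given in this section.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(element, bond_order_sum):
    """Hydrogens implied for an unbracketed atom.

    Uses the lowest normal valence that is at least the sum of the
    bond orders already written for the atom.
    """
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bonds already exceed every normal valence


def bracket_hydrogens(h_count=None):
    """Hydrogens on a bracketed atom: the count supplied, zero if unspecified."""
    return 0 if h_count is None else h_count


# Ethane written three ways gives three hydrogens per carbon each time.
print(implicit_hydrogens("C", 1))      # CC: one bond to the other carbon
print(bracket_hydrogens(3))            # [CH3][CH3]
print(implicit_hydrogens("C", 4) + 3)  # [H]C([H])([H])...: three explicit [H] atoms
```

All three print statements report 3, which is why the three spellings of ethane describe the same molecule.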
There are four situations where explicit hydrogen
specification is required:
- charged hydrogen, i.e. a proton, [H+]
- hydrogens connected to other hydrogens, e.g., molecular hydrogen, [H][H]
- hydrogens connected to other than one other atom,
e.g., bridging hydrogens
- isotopic hydrogen specifications, e.g. in heavy water, [2H]O[2H]
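The four situations above can be expressed as a small predicate. This is only a sketch: the function name and the way a hydrogen is described (a list of neighbor element symbols, a charge, an optional isotope) are assumptions made for illustration.

```python
def hydrogen_needs_explicit_atom(neighbors, charge=0, isotope=None):
    """Return True if this hydrogen must be written as an explicit [H] atom.

    neighbors: element symbols of the atoms bonded to the hydrogen
    charge:    formal charge on the hydrogen
    isotope:   mass number if one is specified, else None
    """
    if charge != 0:           # charged hydrogen, e.g. [H+]
        return True
    if isotope is not None:   # isotopic hydrogen, e.g. [2H]
        return True
    if len(neighbors) != 1:   # bridging (or unattached) hydrogen
        return True
    if neighbors[0] == "H":   # hydrogen bonded to hydrogen, e.g. [H][H]
        return True
    return False


print(hydrogen_needs_explicit_atom([], charge=1))        # proton
print(hydrogen_needs_explicit_atom(["H"]))               # molecular hydrogen
print(hydrogen_needs_explicit_atom(["B", "B"]))          # bridging hydrogen
print(hydrogen_needs_explicit_atom(["O"], isotope=2))    # heavy water
print(hydrogen_needs_explicit_atom(["C"]))               # ordinary C-H
```

Only the last case returns False; an ordinary hydrogen on carbon can be left implicit or given as a bracket count.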
Most of the confusion in using SMILES arises from the SMILES definition
of aromaticity. That's a shame, because in virtually all cases,
one can simply (and safely) ignore aromaticity.
When should I specify a structure as aromatic?
You never need to do so.
If you find yourself typing in SMILES, it's a bit easier to type
"c1ccccc1" for benzene instead of "C1=CC=CC=C1" for cyclohexatriene,
but it's just a matter of convenience,
since they mean exactly the same thing.
What does "aromatic" mean, anyway?
"Aromatic" means "it smells nice".
No kidding, that's the only defensible definition.
There is no single rigorous definition of aromaticity in chemistry.
To a synthetic chemist, aromaticity implies something about reactivity;
to a thermodynamicist, about heat of formation;
to a spectroscopist, about NMR ring current;
to a molecular modeler, about geometrical planarity;
to a cosmetic chemist, it probably means "smells nice".
The SMILES definition of aromaticity has nothing to do with the other
definitions, except that we'd all agree that benzene is "aromatic".
Why does SMILES provide an "aromatic" concept at all?
The SMILES language was specifically designed to be "uniquifiable",
i.e., not only to provide an unambiguous chemical nomenclature,
but also to be able to express a single, unique SMILES for every structure
in the same language.
This implies a fundamental requirement to express the symmetry of a
molecule correctly. Consider the problem of generating a unique
SMILES for orthofluorophenol, Oc1ccccc1F, but without aromatic bonds.
There are two ways to write it, OC1=CC=CC=C1F
(with the substituted carbons joined by a single bond)
and OC1=C(F)C=CC=C1
(with the substituted carbons joined by a double bond).
These are two different molecular graphs:
the SMILES for these will always differ.
For purposes of unique nomenclature, it's not OK to have two
different "unique SMILES" for the same molecule.
The SMILES language provides an "aromatic" concept to avoid this conundrum.
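The conundrum can be made concrete with hand-built bond tables for the two Kekule forms of orthofluorophenol. The sketch below does not parse SMILES; the atom numbering (O is atom 0, the ring carbons are 1 to 6, F is atom 7) and the helper names are assumptions chosen for the illustration.

```python
RING = [1, 2, 3, 4, 5, 6]  # ring carbons; O (atom 0) is on C1, F (atom 7) is on C6
SUBSTITUENT_BONDS = {frozenset((0, 1)): 1, frozenset((6, 7)): 1}


def kekule_bonds(double_bonds):
    """Bond table {atom pair: order} for one Kekule form of the ring."""
    bonds = dict(SUBSTITUENT_BONDS)
    for i, a in enumerate(RING):
        b = RING[(i + 1) % len(RING)]
        pair = frozenset((a, b))
        bonds[pair] = 2 if pair in double_bonds else 1
    return bonds


def aromatize(bonds):
    """Relabel every bond between two ring atoms as 'aromatic'."""
    ring = set(RING)
    return {pair: ("aromatic" if pair <= ring else order)
            for pair, order in bonds.items()}


# OC1=CC=CC=C1F: the substituted carbons (C1, C6) are joined by a single bond.
form_a = kekule_bonds({frozenset(p) for p in [(1, 2), (3, 4), (5, 6)]})
# OC1=C(F)C=CC=C1: the substituted carbons are joined by a double bond.
form_b = kekule_bonds({frozenset(p) for p in [(2, 3), (4, 5), (6, 1)]})

print(form_a == form_b)                        # False: two different graphs
print(aromatize(form_a) == aromatize(form_b))  # True: one graph once aromatic
```

With explicit single and double bonds the two forms are different labeled graphs, so any canonical ordering would produce two strings. Once the ring bonds carry a single "aromatic" label, both forms collapse to the same graph and a unique SMILES becomes possible.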
How does SMILES determine "aromaticity"?
Unfortunately it's not as trivial as "alternating single and double
bonds", but it's not rocket science, either.
The SMILES algorithm uses an extended version of Hueckel's rule
to identify aromatic molecules and ions.
To qualify as aromatic, all atoms in a ring must be sp2 hybridized
and the number of available "shared" p-electrons must satisfy Hueckel's
4N+2 criterion.
For example, an sp2 carbon shares one pi-electron,
so benzene (or cyclohexatriene) is aromatic (6 = 4(1) + 2).
Conversely, C1=CC=C1 (cyclobutadiene) and C1=CC=CC=CC=C1 (cyclooctatetraene)
are (correctly) not aromatic, with 4 and 8 shared electrons, respectively.
Note that these are anti-aromatic compounds, so their single and double
bonds are not interchangeable, i.e.,
FC1=CC=CC=CC=C1O and FC1=C(O)C=CC=CC=C1 are not the same structure.
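The 4N+2 test itself is a one-line check. The sketch below applies it to all-carbon rings, where each sp2 carbon shares one pi-electron so the electron count equals the ring size; the function name is an assumption.

```python
def satisfies_huckel(shared_electrons):
    """True if the count can be written as 4N + 2 for an integer N >= 0."""
    return shared_electrons >= 2 and (shared_electrons - 2) % 4 == 0


# Each sp2 carbon shares one pi-electron, so the count equals the ring size.
for name, ring_size in [("cyclobutadiene", 4),
                        ("benzene", 6),
                        ("cyclooctatetraene", 8)]:
    print(name, ring_size, satisfies_huckel(ring_size))
```

Only benzene passes: 6 = 4(1) + 2, while 4 and 8 cannot be written in that form.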
The rules get a little hairy for heterocycles:
Oxygen and sulfur can share a pair of pi-electrons.
Nitrogen can also share a pair,
if three-connected as in methylpyrrole,
otherwise sp2 nitrogen shares just one electron (as in pyridine).
An exocyclic double bond to an electronegative atom "consumes" one
shared pi-electron, as in 2-pyridone or coumarin.
But that's about it.
Add up the electrons in rings (and ring systems, such as azulene);
if they meet the 4N+2 criterion, it's "aromatic".
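The counting rules above can be collected into a short sketch. It is not the actual SMILES aromaticity algorithm: it assumes every atom passed in is already known to be sp2 and in one ring or ring system, it pools the electrons of a fused system as the text does for azulene, and its treatment of formal charge (a negative charge adds a shared electron, a positive charge removes one) is an assumption read off the remarks in Table 12. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RingAtom:
    element: str                    # "C", "N", "O" or "S"
    connections: int = 3            # attached atoms, hydrogens included
    charge: int = 0                 # formal charge
    exocyclic_double: bool = False  # double bond to an electronegative atom outside the ring


def shared_pi_electrons(atom):
    """Pi-electrons one sp2 ring atom shares, following the rules in the text."""
    if atom.element == "C":
        count = 1
    elif atom.element == "N":
        count = 2 if atom.connections == 3 else 1  # pyrrole-type vs pyridine-type
    elif atom.element in ("O", "S"):
        count = 2
    else:
        raise ValueError(f"no rule for element {atom.element!r}")
    if atom.exocyclic_double:
        count -= 1              # the exocyclic bond "consumes" one electron
    return count - atom.charge  # assumption: charge shifts the count by one


def is_smiles_aromatic(ring_atoms):
    """Add up the shared electrons and apply the 4N+2 criterion."""
    total = sum(shared_pi_electrons(a) for a in ring_atoms)
    return total >= 2 and (total - 2) % 4 == 0


carbon = RingAtom("C")
examples = {
    "benzene": [carbon] * 6,
    "pyridine": [RingAtom("N", connections=2)] + [carbon] * 5,
    "pyrrole": [RingAtom("N", connections=3)] + [carbon] * 4,
    "furan": [RingAtom("O", connections=2)] + [carbon] * 4,
    "2-pyridone": [RingAtom("C", exocyclic_double=True),
                   RingAtom("N", connections=3)] + [carbon] * 4,
    "cyclopentadienyl anion": [RingAtom("C", charge=-1)] + [carbon] * 4,
    "azulene": [carbon] * 10,
    "cyclobutadiene": [carbon] * 4,
}
for name, atoms in examples.items():
    print(name, is_smiles_aromatic(atoms))
```

Every example except cyclobutadiene comes out aromatic. In 2-pyridone the carbonyl carbon contributes nothing and the three-connected nitrogen contributes a pair, so the ring still holds six shared electrons.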
Table 12.
Examples of aromatic compounds and their SMILES.
SMILES | Name | Remark
C1=CC=CC=C1 same as c1ccccc1 | cyclohexatriene same as benzene | 6 = 4N + 2 shared pi electrons.
FC1=CC=CC=C1O, FC1=C(O)C=CC=C1, Fc1ccccc1O | orthofluorophenol | All the same molecule, however you write it.
n1ccccc1 same as N1=CC=CC=C1 | pyridine | "Normal" aromatic "n" nitrogen is pyridyl-N.
[nH]1cccc1 same as N1C=CC=C1 | 1-H-pyrrole | Pyrrolyl-N is written [nH] and shares two pi-electrons.
O=n1ccccc1 same as O=N1=CC=CC=C1 | pyridine-N-oxide, neutral representation | Exocyclic =O "consumes" one pi electron from a N that would otherwise share 2 pi electrons.
[O-][n+]1ccccc1 same as [O-][N+]1=CC=CC=C1 | pyridine-N-oxide, charge-separated representation | One electron is missing (+) from a N that would otherwise share 2 pi electrons.
o1cccc1 same as O1C=CC=C1 | furan | Oxygen shares a pair of pi electrons, so furan is aromatic.
s1cccc1 same as S1C=CC=C1 | thiophene | Sulfur shares a pair of pi electrons, so thiophene is aromatic.
[cH-]1cccc1 same as [CH-]1C=CC=C1 | cyclopentadienyl anion | The - charge is an extra electron, making 6.
c1cc2cccccc2c1 same as C1=CC2=CC=CC=CC2=C1 | azulene | 3 + 2 + 5 = 10 = 4N+2, so azulene is aromatic.
Tautomeric structures are explicitly
specified in SMILES. There are no "tautomeric bond",
"mobile hydrogen", or "mobile charge" specifications.
Selection of one or all tautomeric structures is left to the user and
strongly depends on the application. Given one tautomeric form, most
chemical information systems will report data for all known tautomers
as needed. The role of SMILES is to specify exactly which tautomeric
form is requested, and for which there are data. A simple example,
with two possible tautomeric forms, is shown below:
Daylight Chemical Information Systems, Inc.
info@daylight.com