SMILES Tutorial: Conventions


Hydrogen specification

Hydrogen atoms do not normally need to be specified when writing SMILES for most organic structures. The presence of hydrogens may be specified in three ways:

Implicitly
for atoms specified without brackets, from normal valence assumptions.

Explicitly by count
inside brackets, by the hydrogen count supplied; zero if unspecified.

As explicit atoms
i.e., as explicit [H] atoms.

There is no distinction between "organic" and "inorganic" SMILES nomenclature. One may specify the number of attached hydrogens for any atom in any SMILES. For example, ethane can be written as CC or [CH3][CH3] or [H]C([H])([H])C([H])([H])[H].
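As a rough illustration of the implicit case, the sketch below derives a hydrogen count for an unbracketed atom from "normal valence" assumptions. The valence table and the function name `implicit_hydrogens` are assumptions made for this example; they are not given in this section, and a real SMILES toolkit may handle edge cases differently.

```python
# Assumed "normal" valences for atoms that may be written without brackets.
# This table is an illustration, not taken from this tutorial section.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Return the hydrogens implied for an unbracketed atom.

    The atom is filled up to the lowest normal valence that is at least
    as large as the sum of its explicit bond orders.
    """
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # explicit bonds already exceed every normal valence


# Each carbon in "CC" (ethane) has one explicit single bond.
print(implicit_hydrogens("C", 1))  # 3, matching [CH3][CH3]
```

A bracketed atom bypasses this rule entirely: its hydrogen count is whatever is written inside the brackets, and zero if nothing is written.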

There are four situations where explicit hydrogen specification is required:

Aromaticity

Most of the confusion in using SMILES arises from the SMILES definition of aromaticity. That's a shame, because in virtually all cases, one can simply (and safely) ignore aromaticity.

When should I specify a structure as aromatic?

You never need to do so. If you find yourself typing in SMILES, it's a bit easier to type "c1ccccc1" for benzene instead of "C1=CC=CC=C1" cyclohexatriene, but it's just a matter of convenience, since they mean exactly the same thing.

What does "aromatic" mean, anyway?

"Aromatic" means "it smells nice". No kidding, that's the only defensible definition. There is no single rigorous definition of aromaticity in chemistry. To a synthetic chemist, aromaticity implies something about reactivity; to a thermodynamicist, about heat of formation; to a spectroscopist, about NMR ring current; to a molecular modeler, about geometrical planarity; to a cosmetic chemist, it probably means "smells nice". The SMILES definition of aromaticity has nothing to do with the other definitions, except that we'd all agree that benzene is "aromatic".

Why does SMILES provide an "aromatic" concept at all?

The SMILES language was specifically designed to be "uniquifiable", i.e., not only to provide an unambiguous chemical nomenclature, but also to be able to express a single, unique SMILES for every structure in the same language. This implies a fundamental requirement to express the symmetry of a molecule correctly. Consider the problem of generating a unique SMILES for orthofluorophenol, Oc1ccccc1F, but without aromatic bonds. There are two ways to write it, OC1=CC=CC=C1F (with the substituted carbons joined by a single bond) and OC1=C(F)C=CC=C1 (with the substituted carbons joined by a double bond). These are two different molecular graphs, so the SMILES for them will always differ. For purposes of unique nomenclature, it's not OK to have two different "unique SMILES" for the same molecule. The SMILES language provides an "aromatic" concept to avoid this conundrum.
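The orthofluorophenol problem can be made concrete with a toy model. The sketch below stores only the six ring bond orders, starting at the bond between the two substituted carbons, and shows that the two Kekule forms differ until the alternating bonds are relabelled as aromatic. The helper names `kekule_ring` and `aromatize` are hypothetical, and the "alternating bonds in a six-membered ring" test is a deliberate simplification for this one example, not the electron-counting procedure SMILES actually uses.

```python
def kekule_ring(first_order):
    """Bond orders around a six-membered ring of alternating bonds.

    Index 0 is the bond between the two substituted carbons.
    """
    other = 3 - first_order  # 1 <-> 2
    return tuple(first_order if i % 2 == 0 else other for i in range(6))


def aromatize(bonds):
    """Relabel a fully alternating six-membered ring as aromatic (toy rule)."""
    neighbours = zip(bonds, bonds[1:] + bonds[:1])
    if len(bonds) == 6 and all(a + b == 3 for a, b in neighbours):
        return ("aromatic",) * 6
    return bonds


form_a = kekule_ring(1)  # OC1=CC=CC=C1F: substituted carbons share a single bond
form_b = kekule_ring(2)  # OC1=C(F)C=CC=C1: substituted carbons share a double bond

print(form_a == form_b)                        # False: two different graphs
print(aromatize(form_a) == aromatize(form_b))  # True: one aromatic graph
```

Once both inputs collapse to the same graph, a canonical numbering of that graph can yield a single unique SMILES.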

How does SMILES determine "aromaticity"?

Unfortunately it's not as trivial as "alternating single and double bonds", but it's not rocket science, either. The SMILES algorithm uses an extended version of Hueckel's rule to identify aromatic molecules and ions. To qualify as aromatic, all atoms in a ring must be sp2 hybridized and the number of available "shared" p-electrons must satisfy Hueckel's 4N+2 criterion.

For example, an sp2 carbon shares one pi-electron, so benzene (or cyclohexatriene) is aromatic (6 = 4(1) + 2). Conversely, C1=CC=C1 cyclobutadiene and C1=CC=CC=CC=C1 cyclooctatetraene are (correctly) not aromatic, with 4 and 8 shared electrons, respectively. Note that these are anti-aromatic compounds, so the positions of their single and double bonds are significant: FC1=CC=CC=CC=C1O and FC1=C(O)C=CC=CC=C1 are not the same structure.

The rules get a little hairy for heterocycles: Oxygen and sulfur can share a pair of pi-electrons. Nitrogen can also share a pair, if three-connected as in methylpyrrole, otherwise sp2 nitrogen shares just one electron (as in pyridine). An exocyclic double bond to an electronegative atom "consumes" one shared pi-electron, as in 2-pyridone or coumarin. But that's about it. Add up the electrons in rings (and ring systems, such as azulene); if they meet the 4N+2 criterion, it's "aromatic".
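The electron bookkeeping described above can be sketched in a few lines. The atom labels in `PI_ELECTRONS` (including the made-up label "c=X" for a ring carbon with an exocyclic double bond to an electronegative atom) and the function names are assumptions for illustration only. The sketch also assumes every atom passed in is already known to be sp2 and part of the ring or ring system being tested.

```python
# Shared pi-electrons contributed by each kind of ring atom, following the
# rules in the text. The keys are hypothetical shorthand, not SMILES syntax.
PI_ELECTRONS = {
    "c": 1,      # sp2 carbon
    "c=X": 0,    # carbon whose electron is "consumed" by an exocyclic C=O etc.
    "n": 1,      # pyridine-type nitrogen
    "[nH]": 2,   # three-connected (pyrrole-type) nitrogen shares a pair
    "o": 2,      # oxygen shares a pair
    "s": 2,      # sulfur shares a pair
    "[cH-]": 2,  # carbanion: one electron plus the extra one from the charge
}


def shared_pi_electrons(ring_atoms):
    """Add up the shared pi-electrons for a ring or fused ring system."""
    return sum(PI_ELECTRONS[atom] for atom in ring_atoms)


def satisfies_hueckel(ring_atoms):
    """True if the electron count has the form 4N+2 for some N >= 0."""
    total = shared_pi_electrons(ring_atoms)
    return total >= 2 and (total - 2) % 4 == 0


print(satisfies_hueckel(["c"] * 6))           # benzene, 6 electrons: True
print(satisfies_hueckel(["c"] * 4))           # cyclobutadiene, 4: False
print(satisfies_hueckel(["[nH]"] + ["c"] * 4))  # pyrrole, 2 + 4 = 6: True
```

For 2-pyridone the same count gives 0 (ring carbon of C=O) + 2 (ring NH) + 4 (remaining carbons) = 6, so the ring qualifies.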

Table 12. Examples of aromatic compounds and their SMILES.
| SMILES | Name | Remark |
|---|---|---|
| C1=CC=CC=C1 same as c1ccccc1 | cyclohexatriene same as benzene | 6 = 4N + 2 shared pi electrons. |
| FC1=CC=CC=C1O, FC1=C(O)C=CC=C1, Fc1ccccc1O | orthofluorophenol | All the same molecule, however you write it. |
| n1ccccc1 same as N1=CC=CC=C1 | pyridine | "Normal" aromatic "n" nitrogen is pyridyl-N. |
| [nH]1cccc1 same as N1C=CC=C1 | 1-H-pyrrole | Pyrrolyl-N is written [nH] and shares two pi-electrons. |
| O=n1ccccc1 same as O=N1=CC=CC=C1 | pyridine-N-oxide, neutral representation | Exocyclic =O "consumes" one pi electron from a N that would otherwise share 2 pi electrons. |
| [O-][n+]1ccccc1 same as [O-][N+]1=CC=CC=C1 | pyridine-N-oxide, charge-separated representation | One electron is missing (+) from a N that would otherwise share 2 pi electrons. |
| o1cccc1 same as O1C=CC=C1 | furan | Oxygen shares a pair of pi electrons, so furan is aromatic. |
| s1cccc1 same as S1C=CC=C1 | thiophene | Sulfur shares a pair of pi electrons, so thiophene is aromatic. |
| [cH-]1cccc1 same as [CH-]1C=CC=C1 | cyclopentadienyl anion | The - charge is an extra electron, making 6. |
| c1cc2cccccc2c1 same as C1=CC2=CC=CC=CC2=C1 | azulene | 3 + 2 + 5 = 10 = 4N+2, so azulene is aromatic. |

Tautomers

Tautomeric structures are explicitly specified in SMILES. There are no "tautomeric bond", "mobile hydrogen", or "mobile charge" specifications. Selection of one or all tautomeric structures is left to the user and strongly depends on the application. Given one tautomeric form, most chemical information systems will report data for all known tautomers as needed. The role of SMILES is to specify exactly which tautomeric form is requested, and for which there are data. A simple example, with two possible tautomeric forms, is shown below:

Table 13. Examples of tautomers.
| SMILES | Name |
|---|---|
| O=c1[nH]cccc1 | 2-pyridone |
| Oc1ncccc1 | 2-pyridinol |
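Because SMILES has no "mobile hydrogen" notation, the two tautomers above are simply two different strings. A quick way to see that they still describe the same set of heavy atoms is to count element symbols, as in the sketch below. The helper `heavy_atoms` is hypothetical and recognizes only a small set of element symbols, which is enough for these two examples but not for SMILES in general.

```python
import re
from collections import Counter


def heavy_atoms(smiles):
    """Count non-hydrogen atoms in a simple SMILES string.

    Assumption: only B, C, N, O, P, S, F, Cl, Br, I and the lowercase
    aromatic symbols b, c, n, o, p, s occur in the input.
    """
    tokens = re.findall(r"Cl|Br|[BCNOPSFI]|[bcnops]", smiles)
    # Fold aromatic (lowercase) symbols onto their element.
    return Counter(token.capitalize() for token in tokens)


pyridone = "O=c1[nH]cccc1"
pyridinol = "Oc1ncccc1"

print(pyridone == pyridinol)                            # False: distinct SMILES
print(heavy_atoms(pyridone) == heavy_atoms(pyridinol))  # True: same heavy atoms
```

Only the placement of one hydrogen and the bond pattern differ, so an application that wants to treat the two forms as equivalent has to do so explicitly.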

Daylight Chemical Information Systems, Inc.
info@daylight.com