SMILES Tutorial

A full SMILES language tutorial.

Table of Contents

  1. Introduction
  2. Atoms
  3. Bonds
  4. Branching
  5. Rings
  6. Disconnections
  7. Isomerism
  8. Reactions
  9. Conventions
  10. Related languages
  11. &etc

Introduction

SMILES is a simple yet comprehensive chemical nomenclature.

The answer to the most commonly asked question about SMILES is: yes, it is an acronym, meaning Simplified Molecular Input Line Entry Specification. (SMILES originated in the depths of the US government, where humorous names for things are frowned upon unless they are acronyms.)

This document is intended to serve as a comprehensive tutorial for the SMILES language itself. This page is not intended to provide other functions, e.g., summaries of SMILES-compliant software, test suites, language reference standards, pointers to SMILES parsers, etc. However, pointers to pages providing such functions will be updated here.

Fundamental concepts

SMILES is widely used as a general-purpose chemical nomenclature and data exchange format. However, SMILES differs in several fundamental ways from most chemical nomenclatures and other chemical formats. It is useful to review a few fundamental concepts before digging into the specifics of the SMILES language.

SMILES represents a valence model

SMILES specifically represents a valence model of a molecule, not a computer data structure, a mathematical abstraction, or an "actual substance". It works well for anything that a molecular valence model represents well. The valence model of molecular structure has proved to be an incredibly useful model for chemistry, a "universal hook" upon which we hang our chemical information and chemical intelligence. Once SMILES syntax is understood, any book explaining valence theory can be used as a "SMILES user manual". See virtually any introductory college level chemistry textbook or, failing that, use Coulson's classic Valence (Oxford University Press).

The flip side is that SMILES is not useful for describing things that cannot be well-represented by a valence model. SMILES is not suitable for representing many common substances, e.g., turpentine (distilled trees), Skelly-B or gasoline (distilled fossils), beer, or milk. It isn't just that these substances are complex mixtures, but rather that a description of their properties is more useful than a description of their structure.

SMILES does not dictate a valence model

A valence model of a molecule is a way of allocating a molecule's protons, neutrons and electrons into atoms and bonds in a way that makes sense. Not too surprisingly, what makes sense to one chemist doesn't always make sense to another. This is perfectly reasonable (within limits) -- it's the language of chemistry. The function of SMILES is to clearly represent a particular valence model, not dictate which one should be used.

In practice, one chemist might represent nitromethane as C[N+](=O)[O-], a charge-separated structure with a positively charged nitrogen of valence 4, while another might represent it as CN(=O)=O with a neutral 5-valent nitrogen. Which SMILES is correct? Both are. Is it ever possible to make an incorrect SMILES? Yes: for instance, the SMILES CN([O])[O] does not represent nitromethane (the electrons don't add up; it represents some weird diradical).
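
The electron bookkeeping behind that verdict can be sketched in a few lines of Python. This is a minimal illustration, not a SMILES parser: the three structures are transcribed by hand into (element, formal charge, bond-order sum) tuples, and the names MODELS and unpaired_sites are hypothetical. For each heavy atom the sketch computes the electrons left over after bonding; an odd remainder signals an unpaired electron.

```python
# Valence electrons of the neutral atom, by element.
VALENCE_ELECTRONS = {"C": 4, "N": 5, "O": 6}

# Heavy atoms only, transcribed by hand (assumption: no parser is involved).
# Each tuple is (element, formal charge, sum of bond orders incl. bonds to H).
MODELS = {
    "C[N+](=O)[O-]": [("C", 0, 4), ("N", +1, 4), ("O", 0, 2), ("O", -1, 1)],
    "CN(=O)=O":      [("C", 0, 4), ("N", 0, 5), ("O", 0, 2), ("O", 0, 2)],
    "CN([O])[O]":    [("C", 0, 4), ("N", 0, 3), ("O", 0, 1), ("O", 0, 1)],
}

def unpaired_sites(atoms):
    """Return the elements whose non-bonding electron count is odd."""
    radicals = []
    for element, charge, bond_order_sum in atoms:
        nonbonding = VALENCE_ELECTRONS[element] - charge - bond_order_sum
        if nonbonding % 2 == 1:
            radicals.append(element)
    return radicals

for smiles, atoms in MODELS.items():
    print(smiles, unpaired_sites(atoms))
```

Under these assumptions the first two models leave every atom with paired electrons, while CN([O])[O] leaves an odd electron on each oxygen, which is the diradical mentioned above.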

SMILES is not defined in terms of a computer program

SMILES doesn't look like a dump of a C-struct or Fortran common block; it doesn't act that way either. Computer programs which read SMILES use it in a great variety of ways: a character string, a list of tokens, a tree, a graph, a molecular graph, a database index, source code for a substructure search program, etc. To intuitively understand why SMILES is organized the way it is, one must understand that SMILES represents a chemist's model of a molecule, not a computer scientist's model of a data structure.
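
As an illustration of just one of those views, the "list of tokens", here is a minimal Python sketch. The token classes are an assumption inferred from the examples in this introduction (bracket atoms, organic-subset atoms, aromatic atoms, bonds, branches, ring-closure labels, dots and reaction separators); the name tokenize is hypothetical, and nothing here defines what a valid SMILES is.

```python
import re

# Assumed token classes; this is a sketch, not a language definition.
TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atom, e.g. [N+] or [OH2:1]
    r"|Br|Cl"            # two-letter organic-subset atoms
    r"|[BCNOPSFI]"       # one-letter organic-subset atoms
    r"|[bcnops]"         # aromatic atoms
    r"|%\d\d|\d"         # ring-closure labels
    r"|[-=#:/\\]"        # bonds, including directional bonds
    r"|[().>]"           # branches, disconnection, reaction separator
)

def tokenize(smiles):
    """Split a SMILES string into tokens; reject anything left unmatched."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens

print(tokenize("CC(=O)O"))
print(tokenize("c1ccccc1[N+](=O)[O-]"))
```

A different program could just as reasonably build a tree or a molecular graph from the same string; the token list is only one convenient intermediate form.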

The practical side of this is that the correctness of a given SMILES can't be determined by what happens when it is input to any particular program, even to the extent of asking, "Is this a valid SMILES?" For instance, nowhere in this document will you find specified limits such as the maxima for SMILES length, atoms per molecule, branch nesting depth, etc. -- they don't exist except in specific implementations. So it goes.

SMILES is a universal nomenclature

We give up a certain amount of simplicity by defining SMILES in terms of chemistry rather than in terms of data structure. But having done so, there is the great reward that SMILES becomes a "universal" nomenclature, i.e., given the SMILES definition (e.g., this document) an Australian chemist in 2025 will be able to understand a SMILES generated by a Japanese chemist in 1985. There's no assumption that they share common computer software, hardware architecture, etc. That's neat.

Examples and graphical index

SMILES for simple molecules are shown in the following table. The table also serves as an index: the "Section" column names the sections of this tutorial that cover each example.

Table 1. Prototypical SMILES and section references.

SMILES | Name | Section
[H+] | proton | atoms, hydrogens
C | methane | atoms
O | water | atoms
[OH3+] | hydronium cation | atoms
[2H]O[2H] | deuterium oxide (heavy water) | atoms, isotopes
[Au] | elemental gold | atoms
CCO | ethanol | bonds
O=C=O | carbon dioxide | bonds
C#N | hydrogen cyanide | bonds
CC(=O)O | acetic acid | bonds, branching
C1CCCCC1 | cyclohexane | rings
C1CC2CCCCC2CC1 | decalin | rings
c1ccccc1 | benzene | aromaticity, rings
[Na+].[O-]c1ccccc1 | sodium phenoxide | aromaticity, disconnects, rings
c1ccccc1[N+](=O)[O-] | nitrobenzene | aromaticity, disconnects, rings, valence model
CC(=O)O.CCO>>CC(=O)OCC | esterification of acetic acid and ethanol to ethyl acetate | disconnects, components
CC(=[O:1])[OH:2].CC[OH:3]>[H+]>CC(=[O:2])[O:3]CC.[OH2:1] | stoichiometric esterification with [H+] agent and atom-mapped O's | disconnects, components, atom-mapping
C/C=C/C | trans-2-butene | DB/chirality
N[C@@H](C)C(=O)O | L-alanine | Th/chirality
O[C@H]1CCCC[C@H]1O | cis-1,2-cyclohexanediol | Th/chirality
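
The two reaction entries in Table 1 share a simple outer layout: reactants, agents and products separated by ">", with "." separating the components inside each part. A minimal Python sketch of that split follows. The function name split_reaction is hypothetical, and the sketch assumes that "." and ">" never occur inside a bracket atom, which holds for every example in the table.

```python
def split_reaction(reaction_smiles):
    """Split 'reactants>agents>products' into three lists of components."""
    # Any spaces are stripped first, then the three parts are separated.
    parts = reaction_smiles.replace(" ", "").split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    reactants, agents, products = (
        [component for component in part.split(".") if component]
        for part in parts
    )
    return reactants, agents, products

print(split_reaction("CC(=O)O.CCO>>CC(=O)OCC"))
print(split_reaction("CC(=[O:1])[OH:2].CC[OH:3]>[H+]>CC(=[O:2])[O:3]CC.[OH2:1]"))
```

An empty agents part, as in the first reaction, comes back as an empty list.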

Daylight Chemical Information Systems, Inc.
info@daylight.com