7. SMILES Toolkit: Molecules


A molecule object represents the atoms, bonds, cycles and chiral centers of a molecule. Because it is such a fundamental object in computational chemistry, there are more functions that operate on molecules than any other object. One can:

  • Produce a molecule from a SMILES string.
  • Produce a SMILES string or a unique SMILES string from a molecule.
  • Build a molecule "from scratch" using functions to create an empty molecule, then adding atoms and bonds.
  • Add and delete atoms and bonds.
  • Change the properties of atoms and bonds.
  • Test for aromaticity of a molecule, atom, or bond. Aromaticity is determined automatically for Kekulé structures.
  • Find symmetry classes for atoms.
  • Test for and set chiral features.
  • Generate streams of the atoms, bonds, and cycles of a molecule, and streams of atoms of a cycle, bonds of a cycle, and so forth.

7.1 Creating Molecules

There are two ways to create a molecule object: "from scratch" (by allocating an empty molecule), or by parsing a SMILES string:

dt_alloc_mol() => molecule
Returns a new, empty molecule.

dt_smilin(string smiles) => molecule
Interprets the given SMILES string and returns a handle for the resulting molecule structure.

Efficiency Note: The Toolkit's internal representation of molecule objects is designed for efficient analysis of the molecule's properties, and for responding to queries about the molecule quickly. It is not intended to be a compact representation of the molecule, and uses many times more memory to store than a compact representation such as a SMILES string. Applications that require many thousands of molecules in memory simultaneously should use a more compact representation for those molecules that are not of immediate interest.

7.2 Constituents of a Molecule

These functions provide ways to enumerate (generate streams of) the atoms, bonds, cycles, and chiral features of molecules. Also included are two functions, dt_bond() and dt_xatom(), for accessing related constituents without the necessity of creating a stream.

dt_stream(Handle ob, integer typeval) => stream
Generate a stream of atoms, bonds or cycles -- a stream that contains all of the objects of the specified type that are part of the object.

The object can be a molecule, atom, bond or cycle. For example, the stream returned by dt_stream(bond, TYP_ATOM) contains the two atoms at either end of the bond; the stream returned by dt_stream(cycle, TYP_BOND) contains all the bonds that are part of the cycle.

Note: remember, dt_stream() is polymorphic -- it applies to other objects, too. Here, we are only discussing the molecule and its constituent parts.

dt_canstream(Handle object, integer type, boolean iso, boolean addh) => stream
Allocates a stream of type 'type', in canonical order, for the molecule or reaction 'object'. Object can be a molecule, atom, bond or cycle.

dt_origstream(Handle object, integer type) => stream
Returns a stream of objects in which the objects appear in "original" order. That is, dt_next() will return atoms in the same order as they appear in the original string used to create the object molecule via dt_smilin(), or in the order in which they were added to the molecule using dt_addatom().

dt_bond(Handle at1, Handle at2) => bond
Returns the handle of the bond joining the two atoms.

dt_xatom(Handle a, Handle b) => atom
Returns the atom that is across the bond b from the atom a.
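
The idea behind dt_bond() and dt_xatom() can be illustrated with a toy model. The sketch below is plain Python and does not call the Toolkit; toy_bond() and toy_xatom() are hypothetical stand-ins, bonds are modeled as simple pairs of atom labels, and returning None for unbonded atoms is an assumption of the toy, not a statement about the Toolkit's behavior.

```python
def toy_bond(bonds, at1, at2):
    """Return the bond joining at1 and at2, or None if they are not bonded."""
    for bond in bonds:
        if set(bond) == {at1, at2}:
            return bond
    return None


def toy_xatom(atom, bond):
    """Return the atom at the other end of `bond` from `atom`."""
    first, second = bond
    if atom == first:
        return second
    if atom == second:
        return first
    raise ValueError("atom is not an end of this bond")


# Ethanol skeleton with hypothetical atom labels: C1-C2-O3.
bonds = [("C1", "C2"), ("C2", "O3")]
co_bond = toy_bond(bonds, "O3", "C2")
across = toy_xatom("C2", co_bond)
```

Here across is "O3": the lookup goes straight from two atoms to their bond, and from a bond and one end to the other end, without enumerating anything.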

dt_uid(Handle abc) => integer
Returns the unique id of an atom, bond or cycle within the containing molecule. A unique id is a smallish non-negative integer (i.e. it can be zero) that is guaranteed to not change for as long as the object abc exists. The intention is that unique id's, unlike handles, be reasonably dense; for this reason the uid makes a good array index but a handle does not. Note that unlike handles, uid's are only unique across a single containing object; for example, atoms from two different molecules may have the same uid.

dt_uidrange(Handle molecule, integer typ) => integer
Returns a number that is at least 1 greater than the largest uid currently associated with any constituent having type typ contained in the molecule.
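
The remark that a uid makes a good array index can be illustrated with a toy model. The sketch below is plain Python and does not call the Toolkit; ToyMolecule, add_atom(), remove_atom() and uid_range() are hypothetical names that only mimic the behavior described for dt_uid() and dt_uidrange().

```python
class ToyMolecule:
    """Hypothetical stand-in that hands out small, stable ids to its atoms."""

    def __init__(self):
        self._next_uid = 0
        self.atoms = {}              # uid -> atomic number

    def add_atom(self, atomic_number):
        uid = self._next_uid         # uids are never reused in this toy model
        self._next_uid += 1
        self.atoms[uid] = atomic_number
        return uid

    def remove_atom(self, uid):
        del self.atoms[uid]          # surviving atoms keep their uids

    def uid_range(self):
        # At least 1 greater than the largest uid currently in use.
        return self._next_uid


mol = ToyMolecule()
c1 = mol.add_atom(6)
o = mol.add_atom(8)
c2 = mol.add_atom(6)
mol.remove_atom(o)

# A per-atom property table sized by the uid range and indexed directly by uid.
visited = [False] * mol.uid_range()
visited[c2] = True
```

Sizing the table by the uid range rather than by the atom count keeps every surviving uid a valid index, even after atoms have been removed.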

7.3 Modifying Molecules

7.3.1 Derived Properties

Many molecule properties are derived properties. Derived properties are not explicitly specified as you create the molecule; rather, they are computed once the molecule is assembled. For example, you don't directly add a cycle (a ring) to a molecule; instead you add various bonds between the molecule's atoms; the Toolkit detects the existence of cycles after a molecule's atoms and bonds are completely specified. Cycles are thus a derived property. Other derived properties include aromaticity, chirality and, in some cases, bond type (see also dt_bondtype() and dt_bondorder()).

7.3.2 The Modify-on and Modify-off States

Before a molecule object can be modified it must be put into the modify-on state; when modifications are complete, the molecule object is returned to the modify-off state. Generally speaking, functions that modify significant properties of a molecule or its constituents may be applied only in the modify-on state. These functions are further divided into structural-modification functions (described below), which change the structure of the molecule, and non-structural-modification functions, which merely change the properties of the existing structure of the molecule.

These modify-on and modify-off states serve two purposes. First, when modifying a molecule or building one "from scratch," the molecule may enter temporary configurations in which it does not represent a valid chemical compound. The modify-on state indicates that the molecule may be in such a state, and prevents the application from asking questions (such as questions about derived properties) that the Toolkit may not be able to answer. Second, some of the derived properties take a significant amount of time to compute (e.g. finding a "smallest set of smallest rings" is a computationally difficult task for which no fast algorithm exists). The transition from modify-on to modify-off tells the Toolkit to recompute derived properties as necessary.

dt_mod_on(Handle m) => boolean
Puts the given molecule into the modify-on state; molecules in this state may be modified.

dt_mod_off(Handle m) => boolean
Puts the given molecule into the modify-off state. This function causes the molecule's structure to be analyzed; its properties may be changed as a result. The most notable change is to the aromaticities of constituents (atoms, bonds, and cycles). A recalculation of contained cycles may also take place.

If there is an error, the molecule is deallocated just as though dt_dealloc() had been called. (This is an unfortunate side-effect of the structure-analysis functions: if they fail, they leave the molecule in an unusable state. Molecules that are "precious" should be copied just prior to invoking dt_mod_off(); if it returns TRUE the copy can be discarded. The copy-and-discard operation is "cheap" (i.e. fast) compared to the structural analysis.)

dt_mod_is_on(Handle m) => boolean
Returns TRUE if the molecule is in the modify-on state, FALSE otherwise.
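
The modify-on/modify-off protocol can be pictured as a small state machine. The following is a minimal sketch in plain Python, not Toolkit code: ToyMolecule and its methods are hypothetical, the toy raises exceptions where the Toolkit reports failure through return values, and the only derived property it recomputes is a ring count (bonds minus atoms plus connected components).

```python
class ToyMolecule:
    """Hypothetical model of the modify-on / modify-off protocol."""

    def __init__(self):
        self.atoms = []            # atomic numbers
        self.bonds = []            # (atom index, atom index) pairs
        self.modifying = False
        self._ring_count = 0       # derived property

    def mod_on(self):
        self.modifying = True

    def mod_off(self):
        # Leaving modify-on is the moment derived properties are recomputed.
        self._ring_count = len(self.bonds) - len(self.atoms) + self._components()
        self.modifying = False

    def add_atom(self, atomic_number):
        if not self.modifying:
            raise RuntimeError("structural change requires modify-on")
        self.atoms.append(atomic_number)
        return len(self.atoms) - 1

    def add_bond(self, a, b):
        if not self.modifying:
            raise RuntimeError("structural change requires modify-on")
        self.bonds.append((a, b))

    def ring_count(self):
        if self.modifying:
            raise RuntimeError("derived properties require modify-off")
        return self._ring_count

    def _components(self):
        # Count connected components with a simple union-find.
        parent = list(range(len(self.atoms)))

        def find(i):
            while parent[i] != i:
                i = parent[i]
            return i

        for a, b in self.bonds:
            parent[find(a)] = find(b)
        return sum(1 for i in range(len(parent)) if find(i) == i)


mol = ToyMolecule()
mol.mod_on()
ring = [mol.add_atom(6) for _ in range(3)]
for i in range(3):
    mol.add_bond(ring[i], ring[(i + 1) % 3])    # cyclopropane skeleton
mol.mod_off()
```

After mod_off() the toy reports one ring; asking for the ring count while still in modify-on, or adding an atom after mod_off(), is rejected.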

7.3.3 Functions Applicable Only During Modify-On

These functions can only be applied to a molecule or its constituent parts when the molecule is in the modify-on state. Generally speaking, such functions modify the structure of a molecule in some significant way.

     dt_dealloc() (when applied to an atom or bond)

7.3.4 Functions Applicable Only During Modify-Off

These functions can only be applied to a molecule or its constituent parts when the molecule is in the modify-off state. Generally speaking, such functions only make sense when applied to well-formed molecules.


7.3.5 Functions Applicable At All Times

All functions that normally apply to molecules and are not listed in either of the two previous sections can be applied to a molecule in either the modify-on or the modify-off state.

7.4 Structural-Modification Functions

The three functions dt_addatom(), dt_addbond(), and dt_dealloc() (when applied to atoms or bonds) are collectively referred to as structural modification functions. After calling a structural modification function, future streams returned by dt_stream() are no longer guaranteed to return objects in the same order that they were returned before the modification. Note that this remains true even if the structure of the molecule is later restored to an equivalent form.

Also, remember that any structural modification to a molecule causes all streams of atoms, bonds or cycles over the molecule to be deallocated.

dt_addatom(molecule m, integer atno, integer hcount) => atom
Add an atom with atomic number atno and hcount hydrogens to the given molecule.

dt_addbond(atom a1, atom a2, integer btype) => bond
Add a bond with the given bond type between the two atoms.

dt_dealloc(object ab) => boolean
Atoms and bonds are removed from a molecule by deallocating them.

7.5 Properties of Atoms

Arbitrary SMILES: An Arbitrary SMILES is derived by the same algorithm as a unique SMILES, except that a user-specified set of labelings is used, allowing the generation of a SMILES in an arbitrary order. The user-specified labeling of each atom is called the arbitrary order of the atom. The SMILES begins with the atom whose arbitrary order is lowest; when branch points are reached, the branch with the atom whose arbitrary order is lowest is written first. The following functions are related to Arbitrary SMILES:

dt_setarborder(atom at, integer order) => boolean
Sets the atom's arbitrary order value.

dt_arborder(atom at) => integer
Returns arbitrary order value for the given atom.

dt_arbsmiles(molecule m, boolean iso) => string
Returns an Arbitrary SMILES string for the given molecule. The iso parameter indicates whether the returned SMILES string should contain isomeric labelings.
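
The ordering rule for Arbitrary SMILES (start at the lowest-labeled atom; at each branch point, take the lowest-labeled branch first) can be sketched as a depth-first walk. The code below is a toy in plain Python, not the Toolkit's algorithm: arbitrary_visit_order() and the dictionary-based graph are hypothetical, the molecule is assumed to be connected, and only the atom visiting order is produced, not a SMILES string.

```python
def arbitrary_visit_order(neighbors, arb_order):
    """Depth-first atom order driven by user-supplied labels.

    neighbors: dict mapping each atom to the list of atoms bonded to it
    arb_order: dict mapping each atom to its integer label (lower goes first)
    """
    start = min(neighbors, key=lambda atom: arb_order[atom])
    visited, order = set(), []

    def walk(atom):
        visited.add(atom)
        order.append(atom)
        for nxt in sorted(neighbors[atom], key=lambda n: arb_order[n]):
            if nxt not in visited:
                walk(nxt)

    walk(start)
    return order


# Isobutane skeleton: central carbon "c" bonded to "a", "b" and "d".
graph = {"a": ["c"], "b": ["c"], "c": ["a", "b", "d"], "d": ["c"]}
order = arbitrary_visit_order(graph, {"d": 0, "c": 1, "b": 2, "a": 3})
```

With these labels the walk visits d, c, b, a. Relabeling the atoms changes the order without changing the molecule, which is the point of an arbitrary ordering.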

Atomic Charge: Two functions are provided to set and get the charge on an atom:

dt_setcharge(atom at, integer charge) => boolean
Sets the atom's formal charge.

dt_charge(atom at) => integer
Returns the atom's formal charge.

Hydrogen Count: The graphs used to represent molecules are usually hydrogen-suppressed: hydrogens are represented as a property of the "heavy" atoms to which they are attached rather than as separate atom objects. Such hydrogens are called implicit hydrogens. In some cases hydrogens must be actual objects (e.g. when there is isotopic information or more than one bond to the hydrogen); in other cases it may be convenient to have hydrogen objects (e.g. when data, such as xyz coordinates, are known about them). Such hydrogens are called explicit hydrogens.

The following functions are used for implicit and explicit hydrogens (also see dt_addatom()):

dt_hcount(atom at) => integer
Returns the total number of hydrogen atoms (implicit and explicit hydrogens) bonded to the atom.

dt_imp_hcount(atom at) => integer
Returns the number of implicit hydrogens bonded to the atom.

dt_setimp_hcount(atom at, integer count) => boolean ok
Sets the number of implicit hydrogens on the atom.
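
The relationship between the total, implicit and explicit hydrogen counts can be shown with a toy model. The sketch below is plain Python rather than Toolkit code; ToyAtom and total_hcount() are hypothetical, and explicit hydrogens are simply neighbor objects with atomic number 1.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class ToyAtom:
    """Hypothetical atom: implicit hydrogens are a count, not objects."""
    atomic_number: int
    implicit_h: int = 0
    neighbors: list = field(default_factory=list)


def total_hcount(atom):
    # Total = implicit hydrogens + explicit hydrogen atoms bonded to the atom.
    explicit = sum(1 for n in atom.neighbors if n.atomic_number == 1)
    return atom.implicit_h + explicit


# A carbon with three implicit hydrogens and one explicit hydrogen object
# (for example, one that would carry isotopic information).
carbon = ToyAtom(6, implicit_h=3)
labeled_h = ToyAtom(1)
carbon.neighbors.append(labeled_h)
labeled_h.neighbors.append(carbon)
```

For this carbon the implicit count is 3 and the total count is 4.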

Atomic Number, Symbol, and Weight: An atom's atomic number and weight are independent in the Daylight Toolkit. In real life, only certain isotopes exist for each atomic number; the Daylight Toolkit imposes no such constraint.

The atomic symbol is derived directly from the atomic number; the Toolkit doesn't provide a way to set it independently.

dt_number(atom at) => integer
Returns the atom's atomic number.

dt_setnumber(atom at, integer num) => boolean
Sets the atom's atomic number.

dt_symbol(atom at) => string
Returns the atom's atomic symbol (e.g. "C", "Si").

dt_weight(atom at) => integer
Returns the atom's atomic weight. The returned weight is 0 if the weight is unspecified (e.g. for an atom with the default weight). For atoms that have been set to a specific isotope value with dt_setweight(), that integer weight is returned.

dt_setweight(atom at, integer weight) => boolean
Sets the atom's atomic weight.

7.6 Properties of Bonds

Bond type and bond order are closely related but not identical properties of a bond object.

Bond order: a formal property of the bond, which can only be one of DX_BTY_SINGLE, DX_BTY_DOUBLE, DX_BTY_TRIPLE, representing single bonds, double bonds, and triple bonds, respectively.

Bond type: a derived property, which is normally computed by the Toolkit when the molecule goes from modify-on to modify-off. The primary situation where the bond type will differ from bond order is in aromatic structures, in which single and double bonds can be converted to aromatic bonds. Bond type can be any of DX_BTY_SINGLE, DX_BTY_DOUBLE, DX_BTY_TRIPLE, or DX_BTY_AROMAT.

When a molecule is in the modify-on state, a bond's type or order can be changed. Normally, one specifies a bond's type, and lets the Toolkit generate the bond order from that. If you specify a bond's order via dt_setbondorder(), its type will be changed too. If you specify its type via dt_setbondtype(), the bond order may not agree until dt_mod_off() is called.

dt_bondtype(Handle bond) => integer
Returns the bond's type. This value can change when a molecule changes from the modify-on state to the modify-off state (for example, it might change from single or double to aromatic).

dt_setbondtype(Handle bond, integer type) => boolean ok
Sets the bond's type. Also may affect the bond's order; if the bond type is set to single, double, or triple, the bond order is too; if the bond type is set to aromatic, bond order becomes unknown.

dt_bondorder(Handle bond) => integer order
Returns the bond's order.

dt_setbondorder(Handle bond, integer order) => boolean ok
Sets the bond's order. Also affects the bond's type, which is also set to the value order.
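
The coupling between bond type and bond order described for dt_setbondtype() and dt_setbondorder() can be summarized in a toy model. The sketch below is plain Python, not Toolkit code: ToyBond is hypothetical, the constants are local strings rather than the Toolkit's DX_BTY_* values, and an unknown bond order is represented by None as an assumption of the toy.

```python
SINGLE, DOUBLE, TRIPLE, AROMATIC = "single", "double", "triple", "aromatic"


class ToyBond:
    """Hypothetical bond that keeps type and order consistent on each set."""

    def __init__(self, order):
        self.set_order(order)

    def set_order(self, order):
        if order not in (SINGLE, DOUBLE, TRIPLE):
            raise ValueError("bond order must be single, double or triple")
        self.order = order
        self.type = order            # setting the order also sets the type

    def set_type(self, btype):
        if btype not in (SINGLE, DOUBLE, TRIPLE, AROMATIC):
            raise ValueError("unknown bond type")
        self.type = btype
        # An aromatic type leaves the order unknown; other types fix it.
        self.order = None if btype == AROMATIC else btype


bond = ToyBond(DOUBLE)
bond.set_type(AROMATIC)
order_after_aromatic = bond.order    # unknown in this toy model
bond.set_order(SINGLE)
```

In the Toolkit the unknown order is resolved when the molecule returns to modify-off; the toy has no such analysis step and simply leaves it unset until an order is given.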

7.7 Properties of Cycles

There are no specific functions for accessing or modifying cycles in a molecule, as cycles are a derived property of the bonds. The general function dt_stream() will return the cycles of a molecule, bond, or atom.

7.8 Generating SMILES

dt_cansmiles(molecule m, boolean iso) => string
Returns a canonical SMILES string for the given molecule. (Note that this causes calculation of the canonical labelings if it has not yet been done, a potentially time-consuming operation.) The iso parameter tells whether the SMILES string should contain isomeric labelings (isotopic and chiral information). (A canonical SMILES string with isomeric labelings is called an Absolute SMILES. Without isomeric labelings, it is called a Unique SMILES.) The molecule must be in the modify-off state (see dt_mod_off()).

Note: The string returned is part of the molecule object and may change or be discarded if the molecule is modified or deallocated. In general, you should copy the string if you will need it later.

dt_xsmiles(molecule m, boolean iso, boolean explicit) => string
Returns an exchange SMILES string for the given molecule. An exchange SMILES is a SMILES with Daylight aromaticity conventions eliminated. The iso parameter tells whether the SMILES string should contain isomeric labelings (isotopic and chiral information). The explicit parameter tells whether to also explicitly list attached hydrogens for all atoms. The molecule must be in the modify-off state (see dt_mod_off()).

Note: The string returned is part of the molecule object and may change or be discarded if the molecule is modified or deallocated. In general, you should copy the string if you will need it later.

7.9 Aromaticity

These functions test the aromaticity of molecules, atoms, bonds and cycles, and where appropriate, allow you to set those attributes.

Aromaticity in the Daylight Toolkit is a complex subject. For a more thorough discussion of aromaticity in the Daylight System, see the SMILES chapter of the Daylight Theory Manual.

dt_aromatic(object ob) => boolean
Returns TRUE if the given object (an atom, bond, cycle or molecule) is considered aromatic.

dt_setaromatic(atom at, boolean isarom)
Sets the aromaticity of the atom at to TRUE or FALSE according to the value of isarom.

7.10 Symmetry

The Daylight Toolkit can compute the symmetry of a molecule. There are two different symmetry values you can access.

Symmetry Class: Two atoms in a molecule will be in the same symmetry class if and only if they are symmetrically equivalent. The actual number assigned to a symmetry class is arbitrary -- the only significance of the numbers is whether two atoms have the same class number or not.

Symmetry Order: The algorithm that generates the symmetry order uses graph invariants (including the symmetry classes described above) to generate a unique labeling (the symmetry order) of the molecule's graph. An atom's symmetry order controls the generation of the Unique SMILES (see dt_cansmiles()).

Note that any change, however slight, to the molecule may cause the symmetry class and/or symmetry order values to change.

dt_symclass(atom at) => integer
Returns the unique symmetry class of an atom in its parent molecule.

dt_symorder(atom at) => integer
Returns the unique symmetry order of an atom in its parent molecule.
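
To give a feel for how atoms can be grouped into classes from graph invariants, the sketch below uses a simple iterative refinement on a hydrogen-suppressed graph. This is not the Toolkit's algorithm: toy_symmetry_classes() is hypothetical, and this kind of refinement is only an approximation that can leave non-equivalent atoms in the same class for some highly regular graphs. As in the text above, the class numbers themselves carry no meaning beyond equal or not equal.

```python
def toy_symmetry_classes(elements, bonds):
    """Group atoms by repeatedly refining (own class, neighbor classes)."""
    n = len(elements)
    neighbors = [[] for _ in range(n)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def rank(keys):
        # Replace each distinct key by a small integer.
        table = {key: i for i, key in enumerate(sorted(set(keys)))}
        return [table[key] for key in keys]

    # Initial invariant: element symbol and number of heavy-atom neighbors.
    classes = rank([(elements[i], len(neighbors[i])) for i in range(n)])
    while True:
        keys = [(classes[i], tuple(sorted(classes[j] for j in neighbors[i])))
                for i in range(n)]
        refined = rank(keys)
        if len(set(refined)) == len(set(classes)):
            return classes           # no further splitting: stop
        classes = refined


# 2-methylbutane skeleton: C0-C1(-C4)-C2-C3
classes = toy_symmetry_classes(["C"] * 5, [(0, 1), (1, 2), (2, 3), (1, 4)])
```

The two methyl carbons attached to the branch carbon (atoms 0 and 4) end up in the same class, while the terminal carbon at the other end of the chain (atom 3) is separated from them once neighbor information is taken into account.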

7.11 Chirality

The most complex attributes are chirality attributes, which are specified by single integer codes called chiral values. These values combine two separate pieces of information, a chiral class (corresponding to a geometric configuration such as tetrahedral, octahedral, and so on) and a chiral order (a particular ordering around the chiral center, such as clockwise, counter-clockwise, and so on).

Symbolic constants are defined to simplify the specification of chiral values. In the current implementation, only cis/trans and tetrahedral chirality are supported. The following symbolic constants combine the chiral class and chiral order information for convenience:

Cis/Trans Chirality
DX_CHI_NO_DBO cis/trans situation, but chirality is unspecified
DX_CHI_CIS cis configuration around a double bond
DX_CHI_TRANS trans configuration around a double bond
Tetrahedral Chirality
DX_CHI_NONE unspecified chirality
DX_CHI_THCCW tetrahedral center with counterclockwise configuration
DX_CHI_THCW tetrahedral center with clockwise configuration

dt_dbo(bond db, bond b1, bond b2) => integer
Returns the "double-bond orientation" between b1 and b2. The bond db should be a double bond that is at the center of a cis/trans configuration. Bonds b1 and b2 should be single bonds attached to the atoms at the ends of db, one on each of the two atoms. The return value will be equal to one of the symbolic constants DX_CHI_CIS, DX_CHI_TRANS, or DX_CHI_NO_DBO. The last of these indicates that the cis/trans configuration around db is unspecified.

dt_setdbo(bond db, bond b1, bond b2, integer dboval) => boolean
Sets the "double-bond orientation" between b1 and b2 to the given value. The first three parameters are as described above for dt_dbo(). The last parameter is one of DX_CHI_CIS, DX_CHI_TRANS, or DX_CHI_NO_DBO.

dt_chival(atom at, sequence seq) => integer
Returns the chiral value around the given chiral center at, determined with respect to the order of the bonds in this sequence. See the function's full description for details.

dt_chiseq(atom at, integer chival) => sequence
Returns a sequence of bonds having the chirality given by chival around the given atom at (the chiral center). The chiral order portion of the value is used to determine the ordering of the returned sequence. See the function's full description for details.

dt_setchival(atom at, sequence seq, integer chival) => boolean
Sets the chiral value at the given chiral center at. The parameter seq is a sequence of bonds that meets the conditions specified for dt_chival(); the chiral value is set with respect to the order of bonds in this sequence.

dt_chiperm(sequence seq, bond start, integer chival) => sequence
Given a sequence of bonds having the given chiral value, modify it (i.e., permute it) so that the chiral value is preserved, but so that it begins with the given bond start.
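
For a tetrahedral center, reordering the four neighbors preserves the handedness exactly when the reordering is an even permutation. The sketch below uses that general stereochemistry fact to build a reordered sequence that starts with a chosen element. It is a plain-Python toy, not the Toolkit's implementation: toy_chiperm() and permutation_is_even() are hypothetical names, bonds are represented by labels, and the sequence is assumed to have exactly four distinct entries.

```python
def permutation_is_even(original, permuted):
    """True if `permuted` is an even permutation of `original`."""
    position = {item: i for i, item in enumerate(original)}
    perm = [position[item] for item in permuted]
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return inversions % 2 == 0


def toy_chiperm(seq, start):
    """Reorder four items so `start` comes first, using an even permutation."""
    result = list(seq)
    i = result.index(start)
    if i != 0:
        result[0], result[i] = result[i], result[0]        # first swap
        result[-1], result[-2] = result[-2], result[-1]    # second swap restores parity
    return result


bonds = ["b1", "b2", "b3", "b4"]
reordered = toy_chiperm(bonds, "b3")
```

Two swaps make an even permutation, so the reordered sequence describes the same tetrahedral arrangement as the original. A plain rotation of four items would not do: it is an odd permutation and would describe the mirror image.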

dt_chiclass(integer chival) => integer
Returns an integer code for just the chiral class portion of the given chiral value.

dt_chiorder(integer chival) => integer
Returns an integer code for just the chiral order portion of the given chiral value.

dt_isohydro() => atom
Returns a hydrogen-atom object that is useful for representing implicit-hydrogen atoms in calls to the isomeric functions. Each call to this function returns the same special atom. The atom may not be modified (attempts will fail) and it has no parent molecule (calls to dt_parent() will return NULL_OB). In general, applications should not attempt to play around with it too much; its only intended use is in calls to the isomeric functions defined above.