Euromug02 24th-26th September 2002, Cambridge UK

Introduction to Chemical Info Systems

John Bradshaw
Daylight CIS Inc., Sheraton House, Castle Park, Cambridge, CB3 0AX, UK

The Philosophy Bit

It is a fundamental of any chemical information system that it can store chemical structures. These structures should be searchable, so that one can retrieve

In addition, if one is interested in chemoinformatics, the system should

All these requirements should be fulfilled using chemical structure ( the natural language of the chemist ) as an entry point.

It is essential that the method used to store the chemical structures is robust enough to allow for completion of these tasks.

There are three rôles involved in implementing and running a chemical information system:

  1. The person who assigns the structure to a sample, which may be fictitious.
  2. The person who curates and archives the structure on some medium.
  3. The person who searches and retrieves the structure.

If we are using paper as our medium, then it is possible, if person 1 and person 3 have similar training, to interpret a picture or an icon such as the structure diagram. So to quote the example from Pierre Laszlo

...the structural formula of, say, p-rosaniline represents the same substance to Robert B. Woodward, say, in 1979 as it did to Emil Fischer in 1879, and even though such a signifier pointed to the same signified, nevertheless the entity signified had been enriched in meaning with time.

P. Laszlo in "Tools and Modes of Representation in the Laboratory Sciences"; Klein, U. Ed. Kluwer Academic Publishers; London 2001; p52

However for this communication to be possible it requires

The other point which Laszlo makes is that, in the intervening years, science has moved on. As techniques developed, much more became known about the dyestuff: its UV-visible spectrum, its crystal structure, pKa, NMR and so on, none of which could possibly have been known to Fischer. Although these data were collected on different samples over the years, they were unified by being about the "entity signified" described by the structural formula of p-rosaniline.

This is where the curator, the second person in our triumvirate, comes in. It is incumbent upon the curator/archivist to store the structure in such a way that new data can be "attached" to the appropriate part of the molecule. An inappropriate choice of the level of abstraction for storage can lead to data loss. You cannot store 13C NMR chemical shift information about a peptide if the structure is represented as Ala-Phe-Gly. Chemical information science has had the advantage of over 200 years to learn these lessons. The answer is usually to store the structure at the most basic primitive level of atoms and bonds.

This is of more than passing philosophical interest. It is salutary to note that most of the chemistry used to prepare Zantac® is the best part of 100 years old ( Bradshaw J, "Ranitidine" In Lednicer, D. Ed. Chronicles of Drug Discovery, Vol 3 American Chemical Society, Washington DC 1993 p 45 ). The information was accessible to the chemists involved, as Allen and Hanburys Ltd had maintained complete runs of the paper copies of Beilstein and Chemical Abstracts in their library. As the saying goes, "two months in the laboratory saves two hours in the library."

The History Bit

Those who would question the present should investigate the past. Those who do not understand what is to come should look at what has gone before.

The Guanzi

The point at which chemical structures and formulae become recognizable today is in the early 19th century. The essays by Berzelius starting in 1813 ( Annals of Philosophy 2 443-454 ) paved the way. ( My emphasis )

But, though we must acknowledge that these [alchemic] signs were very well contrived, and very ingenious, they were of no use; because it is easier to write an abbreviated word than to draw a figure, which has but little analogy with letters, and which, to be legible, must be made of a larger size than our ordinary writing. In proposing new chemical signs, I shall endeavour to avoid the inconveniences which rendered the old ones of little utility. I must observe here that the object of the new signs is not that, like the old ones, they should be employed to label vessels in the laboratory: they are destined solely to facilitate the expression of chemical proportions, and to enable us to indicate, without long periphrases, the relative number of volumes of the different constituents contained in each compound body. By determining the weight of the elementary volumes, these figures will enable us to express the numeric result of an analysis as simply, and in a manner as easily remembered, as the algebraic formulas in mechanical philosophy.

The chemical signs ought to be letters, for the greater facility of writing, and not to disfigure a printed book. Though this last circumstance may not appear of any great importance, it ought to be avoided whenever it can be done. I shall take, therefore, for the chemical sign, the initial letter of the Latin name of each elementary substance: but as several have the same initial letter, I shall distinguish them in the following manner:-- 1. In the class which I call metalloids, I shall employ the initial letter only, even when this letter is common to the metalloid and some metal. 2. In the class of metals, I shall distinguish those that have the same initials with another metal, or a metalloid, by writing the first two letters of the word. 3. If the first two letters be common to two metals, I shall, in that case, add to the initial letter the first consonant which they have not in common: for example, S = sulphur, Si = silicium, St = stibium (antimony), Sn = stannum (tin), C = carbonicum, Co = cobaltum (cobalt), Cu = cuprum (copper), O = oxygen, Os = osmium, &c.

Note that Berzelius was suggesting a system appropriate to the communication medium ( paper ) and method ( manuscript or print ) but also, more importantly, that the "word" he produced ( the collection of characters and numbers ) represented the underlying chemistry ( mainly analysis ). The idea was that this was a sort of algebra. Even today we "balance" chemical equations in the way pioneered by Berzelius, using, for the most part, the same symbols to represent atoms. In the early part of the 19th century figures and illustrations were printed separately, often on different paper, and bound at the end of books or journals. Berzelius's notation allowed the chemical information to be integrated with the rest of the text.
For a more detailed discussion of these formulae as tools in chemistry see Klein, U. "Berzelian Formulas as Paper Tools in early Nineteenth Century Chemistry" Foundations of Chemistry 3: 7-32, 2001.
Within a very few years these formulae appeared in textbooks ( Turner, E. "Elements of Chemistry: Including the Recent Discoveries and Doctrines of the Science", 4th Ed. London; John Taylor 1833 ) which shows the important rôle of textbooks in stabilizing notation.
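The "sort of algebra" can be made concrete with a small sketch. It assumes the modern convention of writing the atom count inline after each symbol ( H2O, not Berzelius's own typography ) and handles only flat formulae without brackets or charges. The function names are illustrative and are not taken from any historical or Daylight source.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Turn a flat formula such as 'H2SO4' into a Counter of element counts."""
    counts = Counter()
    # One capital letter, an optional lower-case letter, then an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts

def side_total(terms):
    """Sum element counts over (coefficient, formula) pairs on one side."""
    total = Counter()
    for coefficient, formula in terms:
        for symbol, n in parse_formula(formula).items():
            total[symbol] += coefficient * n
    return total

def is_balanced(reactants, products):
    """An equation balances when both sides hold the same atoms."""
    return side_total(reactants) == side_total(products)

# 2 H2 + O2 -> 2 H2O
balanced = is_balanced([(2, "H2"), (1, "O2")], [(2, "H2O")])
# H2 + O2 -> H2O has one oxygen too many on the left
unbalanced = is_balanced([(1, "H2"), (1, "O2")], [(1, "H2O")])
```

Because the symbols form a fixed vocabulary, the formula can be treated as plain text and still be counted mechanically, which is the property the later abbreviations such as Me- and Ph- lost.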

As the 19th century progressed and organic chemistry grew, it became clear that there were frequently occurring groups in molecules, which, to a first approximation, were invariant in their properties. These were represented as shorthand, first in a Berzelian way as CH3-, C6H5- etc. and then in a more convenient way as Me- and Ph- etc. There was, however, no limit to the number of these abbreviations, nor indeed any rules, such as Berzelius had proposed, for their construction. This defeated the whole object of being able to communicate structures, as the underlying vocabulary was undefined.

Being able to represent a molecule as a parent structure, substituted with various groups, is very appealing though. This approach was followed by people interested in the nomenclature and indexing of the rapidly growing number of compounds, in particular Beilstein, who published the first edition of his Handbuch in 1880. This introduced rigorous methods for classifying, naming and indexing compounds which brought together "related compounds". For this the Berzelian formulae were not sufficient, as it was important to know the spatial relationships between atoms. A key contributor to the development of these structures was Crum Brown in Edinburgh. Here we have his structure for phenol from 1861, which mathematically is a graph with atoms as nodes and bonds as edges.

Crum Brown Structure for phenol

It was also clear that the molecules which were being prepared were not two dimensional as the medium on which they were being portrayed was. There were two solutions.

Hofmann was one of the chemists who adopted a modelling approach. His famous lecture in 1865, at the Royal Society in London used croquet balls as the atoms and steel rods as the bonds.

Hofmann's models

To this day modelling kits tend to use the same colours as croquet balls for the atoms.

Others, such as van't Hoff, abandoned all pretence of atoms and bonds in an effort to get across the spatial arrangements of groups and the possibility of stereoisomerism.

van't Hoff's models

Emil Fischer used the alternative strategy of projection. This is restricted to certain classes of compound and also requires that various conventions about orientation etc are applied. He did this by physically flattening a rubber model of tartaric acid made by his colleague Friedländer.

Fischer projection of tartaric acids

Once you move away from linear formulae, constrained to read left to right by the text in which they are embedded, you need to provide a good deal of extra information, such as numbering of the atoms, to ensure that all readers get the same starting point for the eye movement which recognizes the structure.
For a more detailed discussion of the philosophy of this see P. Laszlo in "Tools and Modes of Representation in the Laboratory Sciences"; Klein, U. Ed. Kluwer Academic Publishers; London 2001; p52

Linear formulae continued to be used, embedded in text, but, with improvements in printing techniques and the imposition of typesetting standards and atom numbering by organizations such as Chemical Abstracts, graphical representations of structure began to predominate. There wasn't a need for a major change until the advent of computers in the middle of the last century. There were no good mechanisms in early computers to store the increasingly elaborate icons which had come to be used by one chemist to communicate with another. A simpler system was needed.

At this point Wiswesser returned to linear notations. If he could come up with a method to add structural and connectivity information to Berzelian molecular formulae, these could be handled by a computer. Wiswesser threw away some of the two-character symbols for elements such as chlorine and bromine. There were no fewer than four different symbols for nitrogen, K, N, M and Z, depending on the number of attached hydrogens and the charge. Carbon was reduced to a skeleton element, alkyl chains being replaced simply by the number of carbon atoms in the straight chain, and branches by Y and X. Benzene rings, so long the marker for the chemist to orient a view of a molecule, were reduced to a branching point with a single character, R. Somewhat obtusely, other ring systems did not suffer benzene's fate and were promoted to dominate the structure, building on the recent work of Patterson. ( Patterson A.M.; Capell, L.T. "The Ring Index" ; American Chemical Society: Washington D.C. 1940 ) So for a compound such as

6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid the WLN is L66J BMR& DSWQ IN1&1

However, it turned out to be only a shorthand way of representing the systematic name. The advantage of this was that users could make use of all the sorting and indexing technologies which had been developed over the years for names using permuted indices; the downside was that it ran into the same problems that the indexers and namers had.

Vast armies of people were kept busy coding compounds and teaching others to code them. The coder needed to know the rules to get the "right" WLN. Disputes went to a committee. There was only one valid WLN for a compound. There were no good automatic ways of generating the correct WLN from what had become the natural language of the chemist, the structure diagram. Equally, the representations could not easily be parsed left to right without backtracking.

Despite all of this Wiswesser's notation was embraced by both industry and content providers such as ISI and CAS. Substructure searching became available through the CROSSBOW program. However as soon as technology allowed the structures to be input by drawing, as had been done in the paper systems, WLN was abandoned. For whilst it was not possible to autogenerate WLN by machine, it was possible to convert it to a connection table.

The new graphics entry systems stored the information in the form of a connection table representing the underlying graph. As the structure was now represented as a graph, subgraph matches could be made using existing algorithms.
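A substructure search over such a connection table can be sketched as follows. This is a naive backtracking matcher written for illustration, not one of the specific algorithms used by any vendor. It assumes a connection table is simply a list of element symbols plus a list of (atom, atom, bond order) triples with 0-based indices, and it ignores hydrogens, aromaticity and stereochemistry.

```python
def find_subgraph(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return one mapping {query index: target index}, or None if no match."""

    def adjacency(n_atoms, bonds):
        # adj[a][b] holds the order of the bond between atoms a and b.
        adj = {i: {} for i in range(n_atoms)}
        for a, b, order in bonds:
            adj[a][b] = order
            adj[b][a] = order
        return adj

    q_adj = adjacency(len(query_atoms), query_bonds)
    t_adj = adjacency(len(target_atoms), target_bonds)

    def extend(mapping):
        q = len(mapping)  # query atoms are placed in index order
        if q == len(query_atoms):
            return dict(mapping)
        for t in range(len(target_atoms)):
            if t in mapping.values() or query_atoms[q] != target_atoms[t]:
                continue
            # Every query bond back to an already placed atom must be
            # present in the target with the same bond order.
            consistent = all(
                t_adj[mapping[n]].get(t) == order
                for n, order in q_adj[q].items()
                if n in mapping
            )
            if consistent:
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]  # backtrack and try the next target atom
        return None

    return extend({})

# Target: heavy atoms of acetic acid, CH3-C(=O)-OH
acid_atoms = ["C", "C", "O", "O"]
acid_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

carbonyl = find_subgraph(["C", "O"], [(0, 1, 2)], acid_atoms, acid_bonds)
amine = find_subgraph(["N"], [], acid_atoms, acid_bonds)
```

The carbonyl query maps onto atoms 1 and 2 of the target, while the nitrogen query finds nothing. The worst-case cost of this approach grows very quickly with molecule size, which is why practical systems add screening and smarter atom ordering.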

Here we have our compound in MDL molfile format. Note the fixed column format required by FORTRAN code used in earlier times.

6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
  -ISIS-  02110115552D

 24 26  0  0  0  0  0  0  0  0999 V2000
   -1.7931    0.6000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7984   -0.2273    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0851   -0.6459    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0786    1.0072    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3647    0.5965    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3692   -0.2315    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3450   -0.6497    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0642   -0.2368    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0648    0.5943    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3499    1.0046    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3375   -1.4750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.0542   -1.8875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0504   -2.7160    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7662   -3.1326    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4807   -2.7176    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4790   -1.8859    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7668   -1.4772    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7792    1.0125    0.0000 S   0  0  3  0  0  0  0  0  0  0  0  0
   -2.5208   -0.6375    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
   -2.5250   -1.4667    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.2417   -0.2208    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4824    1.4383    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3586    1.7304    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3625    0.4292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  3  1  0  0  0  0
 11 12  1  0  0  0  0
  5  6  1  0  0  0  0
 12 13  2  0  0  0  0
  3  6  2  0  0  0  0
 13 14  1  0  0  0  0
  6  7  1  0  0  0  0
 14 15  2  0  0  0  0
  1  2  2  0  0  0  0
 15 16  1  0  0  0  0
  7  8  2  0  0  0  0
 16 17  2  0  0  0  0
 17 12  1  0  0  0  0
  5  4  2  0  0  0  0
  9 18  1  0  0  0  0
  8  9  1  0  0  0  0
  2 19  1  0  0  0  0
  4  1  1  0  0  0  0
 19 20  1  0  0  0  0
  9 10  2  0  0  0  0
 19 21  1  0  0  0  0
 10  5  1  0  0  0  0
 18 22  2  0  0  0  0
 18 23  2  0  0  0  0
  7 11  1  0  0  0  0
 18 24  1  0  0  0  0
M  END

Clearly these connection tables have some advantages. As they map the internal program storage so closely, the computer has no problem reading them. However limitations were imposed by the computer languages.
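How directly such a file maps onto program storage can be seen in a minimal reader. The sketch below assumes the standard V2000 layout: three header lines, a counts line whose first two three-character fields are the atom and bond counts, atom lines with three ten-character coordinate fields followed by the element symbol, and bond lines with three three-character fields. The sample record ( ethanol heavy atoms, with a made-up program name in the header ) is hypothetical and much smaller than the listing above.

```python
MOLFILE = "\n".join([
    "ethanol",
    "  -SKETCH-",   # program/timestamp line; the name here is invented
    "",             # comment line, often left blank
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  2  3  1  0  0  0  0",
    "M  END",
])

def parse_molfile(text):
    """Read atoms and bonds from a V2000 molfile using fixed columns."""
    lines = text.splitlines()
    counts = lines[3]                 # fourth line holds the counts
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()  # column 31 is a separating space
        atoms.append((symbol, x, y, z))

    bonds = []
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        # Atom numbers in the file are 1-based.
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds

atoms, bonds = parse_molfile(MOLFILE)
```

Nothing here needs a tokenizer or a grammar: the reader slices by column position alone, which is also why a single stray space or blank line makes the record unreadable.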

Limitations were also imposed by the need to store different information for different purposes so each vendor came up with differing formats. Attempts were made to unite the formats ( Garavelli, J.S. (1990) Chemical Design and Automation News 5 2-5 ) but to no avail. All that happened was they became even larger to accommodate everyone's needs.

Beilstein made a foray into the line notation area too. They came up with a notation called ROSDAL which is a linear way of representing the connection table. So the ROSDAL code for the naphthalene sulphonic acid above is

1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24

Notice again that, as with WLN, carbon is treated differently from the other elements. This was designed to facilitate computer manipulation, not chemists' visual recognition. It is defined by means of a formal grammar expressed in Backus-Naur form, the meta notation commonly used in the definition of computer languages.

It was not until the invention of SMILES that we had a linear representation of structure which was valid for a paper system using manuscript or typesetting and also for a computer system using either the keyboard or graphical input. For the naphthalene sulphonic acid above, a SMILES is

c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3

This fixed vocabulary and ordering ensures that the SMILES language can be used to create a unique and universal name. The link to the graph ensures that all the required searches can be carried out, and its representation at the atom and bond level allows additive-constitutive properties to be calculated. A useful consequence of the fixed ordering in the name is that atom and bond properties, such as 3-D coordinates, can also be kept in ordered lists.
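The link between the string and the graph, and the use of ordered lists, can be illustrated with a small reader. This is a sketch covering only a subset of SMILES ( unbracketed atoms, the bond symbols - = #, branches and single-digit ring closures ), and it is not the Daylight implementation. Treating an unmarked bond between two lower-case atoms as aromatic is a simplification, and the atomic weights are ordinary standard values for the four elements that occur in the example.

```python
AROMATIC = set("bcnops")
MASS = {"C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

def parse_smiles(smiles):
    """Parse a small SMILES subset into (atoms, bonds), atoms in string order."""
    atoms, bonds = [], []
    branch_stack, ring_open = [], {}
    prev, pending = None, None   # last atom index, explicit bond symbol

    def bond_symbol(a, b, explicit):
        if explicit:
            return explicit
        both_aromatic = atoms[a] in AROMATIC and atoms[b] in AROMATIC
        return ":" if both_aromatic else "-"

    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            branch_stack.append(prev)        # remember the branch point
        elif ch == ")":
            prev = branch_stack.pop()        # return to the branch point
        elif ch in "-=#":
            pending = ch
        elif ch.isdigit():
            if ch in ring_open:              # second occurrence closes the ring
                start, opened_with = ring_open.pop(ch)
                symbol = bond_symbol(start, prev, opened_with or pending)
                bonds.append((start, prev, symbol))
            else:
                ring_open[ch] = (prev, pending)
            pending = None
        elif ch.isalpha():
            symbol = smiles[i:i + 2] if smiles[i:i + 2] in ("Cl", "Br") else ch
            i += len(symbol) - 1
            atoms.append(symbol)
            current = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, current, bond_symbol(prev, current, pending)))
            prev, pending = current, None
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += 1
    if ring_open or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds

smiles = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
atoms, bonds = parse_smiles(smiles)

# A per-atom property kept as a list in the same order as the atoms.
masses = [MASS[symbol.capitalize()] for symbol in atoms]
heavy_atom_mass = sum(masses)    # additive over atoms; hydrogens not included
```

For the naphthalene sulphonic acid this gives 24 atoms and 26 bonds, the same counts as in the molfile above, and `masses[k]` always refers to the k-th atom symbol in the string. Coordinates or any other atom property could be carried in the same way.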


