MUG '05 - Weininger - chemical information models

MUG '05 -- 19th Daylight User Group Meeting -- 9-11 Mar 2005

Chemical information models

Dave Weininger
Daylight/Metaphorics

ABSTRACT

Information modeling is at the heart of informatics. The molecular information model is seriously cool. The ability to represent pure substances as molecular structures has transformed chemistry from a science of substances into a molecular science in less than 100 years.
A basic chemical information "trick" is representing a molecule in a computer. Doing just that is useful for things like molecular structure registration.
The next step is developing the information model itself -- how to connect properties to the representations of molecular entities in a computer to do something more useful than just making lists, e.g., connect it to external information resources. For the most part, we have been stuck using the same data models that are used for registration. A number of different chemical information models will be presented with real-world examples and the prospects for chemical data integration will be discussed.

MUG '05 : Chemical information models
How can one do anything useful with a computer?

One usually needs to represent ("model") real things in a computer program using digital representations ("models").

Abstract things are easiest, e.g., numbers.

Even representing numbers is non-trivial.
Remember PL/1, with a dozen ways to represent numbers?

Useful though: IBM made a fortune using numbers to represent money.

It's not so easy representing a molecule.
Many approaches have been tried
... some good, some bad, some truly horrible
... we're pretty much stuck with all of them.

MUG '05 : Chemical information models : Preliminaries : 1

What chemical entities are useful to represent?

Molecules (valence, LCAO, ab initio/quantum models)

Reactions (same as molecular models, also dynamics)

Substances (not necessarily molecular models)

Mixtures (may or may not be molecular)

Bottles (e.g., batches)

Molecular patterns (theoretical, statistical, legal)

other ... crystals, large molecules, polymers, alloys, catalysts

MUG '05 : Chemical information models : Preliminaries : 2

What operations on chemical entities are useful?

Storage/retrieval
- Basic input and output.
- Efficiency counts. [Constant time retrieval is possible.]
- "Semantically well defined" counts even more ...

Identity
- Are two entities the same or different?

Presence
- Submixture and supermixture relationships
- Does one entity exist within another?
- "Parent", "molecular ion", "salt" concepts
- Special form: reactant > agent > product relationship
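
The presence relationships above can be sketched with plain string handling. This is only an illustration: it assumes every component is already written as a canonical SMILES (so that string equality stands in for identity), it relies on "." as the SMILES component separator and ">" as the reaction SMILES separator, and the function names are made up for this sketch. The example reaction is written without stereochemistry.

```python
def components(smiles: str) -> list[str]:
    """Split a dot-disconnected SMILES into its component SMILES."""
    return [part for part in smiles.split(".") if part]


def contains(mixture: str, component: str) -> bool:
    """Submixture test by exact string match; assumes canonical SMILES."""
    return component in components(mixture)


def split_reaction(reaction: str) -> dict[str, list[str]]:
    """Split 'reactants>agents>products' into three component lists."""
    reactants, agents, products = reaction.split(">")
    return {
        "reactants": components(reactants),
        "agents": components(agents),
        "products": components(products),
    }


dopamine = "NCCc1ccc(O)c(O)c1"
dopamine_hcl = dopamine + ".Cl"  # salt written as two components
decarboxylation = "NC(Cc1ccc(O)c(O)c1)C(=O)O>>" + dopamine + ".O=C=O"

print(contains(dopamine_hcl, dopamine))
print(split_reaction(decarboxylation)["products"])
```

A real system would compare canonicalized structures rather than raw strings; the point here is only that "parent", "salt" and reaction roles fall out of the same component-level view.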

Substructure/Superstructure relationships
- Classic substructure search (find superstructures of query):
  open (H-replacement) and closed (specify positions) versions
- Superstructure search (find substructures of query):
  useful for things like synthesis planning.

Similarity
- There are many possible measures of similarity.
- Tanimoto coefficient between fingerprints approximates the proportion of common substructure.
- Tversky index is a generalization of the Tanimoto coefficient.

Transformational relationships
- Tautomerism
- Formal charge representation

Multivariate relationships
- Clustering
- Discrimination
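
As a rough sketch of the similarity measures listed above, fingerprints can be modeled as sets of "on" bit positions. The bit positions below are invented for illustration, and returning 1.0 for two empty fingerprints is a convention chosen here, not something the slides specify.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Shared bits divided by bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def tversky(a: set[int], b: set[int], alpha: float, beta: float) -> float:
    """Tversky index; alpha and beta weight the bits unique to a and to b."""
    common = len(a & b)
    denominator = alpha * len(a - b) + beta * len(b - a) + common
    return common / denominator if denominator else 1.0


query = {1, 2, 3, 4}   # hypothetical fingerprint bits
target = {3, 4, 5}

print(tanimoto(query, target))
print(tversky(query, target, alpha=1.0, beta=1.0))  # same value as Tanimoto
print(tversky(query, target, alpha=1.0, beta=0.0))  # ignores bits unique to target
```

With alpha = beta = 1 the Tversky index reduces to Tanimoto; unequal weights make the measure asymmetric, which gives it a substructure-like or superstructure-like flavor.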

MUG '05 : Chemical information models : Preliminaries : 3

How can we represent chemical entities?

Name (IUPAC, index name, common name, etc.)

Number (e.g., CAS Number, Corporate registration number)

Picture (e.g., WIMP)

Properties (e.g., ECN)

Connection table (e.g., SDF, MOL, MOL2, etc.)

Linear notation (WLN, SMILES, ROSDAL)

Conformation (PDB, etc.)

More advanced models (Gaussian, etc.)

Derived/reduced representations (fingerprints, consensus FP, COMFA)

Various representations have different advantages

For instance:

SMILES are semantically well-defined representations of specific valence models for molecules and reactions.

Names are more useful for representing things without a useful valence model, e.g., "Turpentine" or "Unknown 123".

MUG '05 : Chemical information models : Properties

Direct association of properties with chemical entities

Direct properties of molecular structure(s)

[These] are the [property] values of [these molecular structures].
Simplest possible chemical information model: molecular structure is the identifier, properties are connected to it. Perfect for the clean, idealistic world of non-overlapping molecular chemistry. Very powerful but not comprehensive and not very useful IRL.
Examples: TSCA (Toxic Substances Control Act), Empath metabolic pathways, primary tables in the CRC Handbook, Chemist's Companion, etc.

Direct properties of hierarchical molecular structure(s)
[These] are the [property] values of [these specific kinds of] [generic molecular structures].
Entities are identified by molecular structure level-of-detail hierarchy. Properties are connected to entities at the appropriate level. Clean and idealistic yet more powerful and more useful than above.
Examples: MedChem masterfile (generic/isomers/ionic forms), QSAR (sets of molecules/molecular patterns/compounds)
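
One way to picture the hierarchical model is a chain of nodes from specific to generic, with each property stored at the level where it is actually true. The sketch below is an assumption about how such a hierarchy could be coded, not a description of the MedChem masterfile; the class and field names are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A structure at one level of detail, with a link to its more generic parent."""
    smiles: str
    parent: Optional["Node"] = None
    properties: dict[str, str] = field(default_factory=dict)

    def get(self, name: str) -> Optional[str]:
        """Look up a property here, then at progressively more generic levels."""
        node = self
        while node is not None:
            if name in node.properties:
                return node.properties[name]
            node = node.parent
        return None


# Generic (no stereochemistry) level holds what all isomers share.
generic = Node("NC(C)C(=O)O", properties={"formula": "C3H7NO2"})
# Isomer level holds what is specific to one stereoisomer.
l_isomer = Node("N[C@@H](C)C(=O)O", parent=generic,
                properties={"name": "L-alanine"})

print(l_isomer.get("formula"))  # found on the generic parent
print(generic.get("name"))      # isomer-level property is not visible from above
```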

MUG '05 : Chemical information models : Properties, cont.

Associate properties with arbitrary molecular identifiers

Properties of an arbitrary molecular identifier
The [entity with this registration number] has [this molecular structure] and has [these] [property] values.
Common model for traditional chemical registries. All possible molecular entities are represented by registration numbers; properties are assigned to these entities. Requires "god-like" (omniscient) structure identification and discrimination methods ... which IRL become unstable over time when used by normal human beings. Other problems include poor behavior with incomplete structural knowledge and the need to develop a religious "group or split" dogma. OK for closed, static, short-term delivery of homogeneous data.
Examples: CAS, MACCS, WDI, some registration systems
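
The "group or split" problem can be made concrete with a toy registry in which the identity rule is a pluggable policy. Everything here is hypothetical: the `Registry` class, the two policies, and especially `largest_component`, which picks the longest SMILES component as a crude stand-in for a real "parent" perception method.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable


@dataclass
class Entry:
    structure: str
    properties: dict[str, str] = field(default_factory=dict)


class Registry:
    """Arbitrary registration numbers; identity is decided by a policy."""

    def __init__(self, same_entity: Callable[[str, str], bool]):
        self._same_entity = same_entity  # the "group or split" rule
        self._entries: dict[int, Entry] = {}
        self._numbers = count(1)

    def register(self, structure: str) -> int:
        for number, entry in self._entries.items():
            if self._same_entity(entry.structure, structure):
                return number  # grouped with an existing entity
        number = next(self._numbers)
        self._entries[number] = Entry(structure)
        return number  # split off as a new entity


def largest_component(smiles: str) -> str:
    """Crude 'parent' guess: the longest dot-separated component."""
    return max(smiles.split("."), key=len)


splitter = Registry(lambda a, b: a == b)
grouper = Registry(lambda a, b: largest_component(a) == largest_component(b))

for registry in (splitter, grouper):
    print(registry.register("NCCc1ccc(O)c(O)c1"),
          registry.register("NCCc1ccc(O)c(O)c1.Cl"))
```

The same two inputs yield two entities under one policy and one entity under the other, so every property attached to a registration number silently depends on which rule was in force when it was assigned.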

Properties of an arbitrary molecular set identifier
The [entity with this registration number] contains [these molecular structures] and has [these] [property] values.

Similar to the above systems, but for a multiplicity of molecules. The problem with god-like systems is even worse than for discrete entities.
Example: Most "chemical" USPTO Patents (with legal caveat)

MUG '05 : Chemical information models : Properties, cont.

Associate properties with arbitrary identifiers

Property of arbitrary identifier
The [entity with this registration number] has [these] [property] values.
[This molecular identifier] is associated with [this entity].
Often used for the chemical portion of a larger non-molecular database system. Entities and property-associations are not necessarily molecular. In such systems there are usually no requirements for uniqueness, exclusiveness, nor comprehensiveness. This is the weakest chemical information model.

Property of arbitrary set identifier
The [entity with this registration number] contains [these molecular identifiers] and has [these] [property] values.
Often used when database entities are mixtures of things which are not necessarily molecular, or when non-molecular components are essential to the definition of a database entity.
Examples: FDA Orange Book (NDAs), MSDS collections
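
A sketch of the set-identifier model, under the assumption that each entity is simply a list of ingredients, some with a molecular identifier and some without. The entity IDs and ingredient combinations are invented and are not taken from the Orange Book. Building an inverted index over the molecular components is one plausible way to add molecular access on top of such a weak primary model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ingredient:
    name: str
    smiles: Optional[str] = None  # None when there is no useful valence model


# Hypothetical entities: arbitrary IDs mapped to sets of ingredients.
entities: dict[str, list[Ingredient]] = {
    "ENTITY-1": [Ingredient("dopamine hydrochloride", "NCCc1ccc(O)c(O)c1.Cl")],
    "ENTITY-2": [Ingredient("acetaminophen", "CC(=O)Nc1ccc(O)cc1"),
                 Ingredient("turpentine")],
}


def build_molecular_index(
    entities: dict[str, list[Ingredient]],
) -> dict[str, set[str]]:
    """Map each molecular component to the entities that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for entity_id, ingredients in entities.items():
        for ingredient in ingredients:
            if ingredient.smiles is None:
                continue  # non-molecular ingredient stays name-only
            for component in ingredient.smiles.split("."):
                index[component].add(entity_id)
    return dict(index)


index = build_molecular_index(entities)
print(index["NCCc1ccc(O)c(O)c1"])
```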

MUG '05 : Chemical information models : Properties, cont.

Reverse association of molecules with propertied entities

Molecular elucidation
The [entity with this registration name or number] has [these] [property] values. [These molecular structures] are associated with [this entity].
This has a similar "shape" to the "Property of arbitrary identifier" model (#5) above, but in this case, molecular structure-property associations are derived from the existence of specific molecular structures in database entities.
Examples: TCM Traditional Chinese Medicines, sample databases at analytical labs

Document databases
This [database entity is a document] which states that [these molecular structures] have [these] [property] values.
Document databases are special databases which normally contain non-molecular primary entities (e.g., journal articles, patents) which in turn reference molecular entities (e.g., compounds, reactions). They normally can be "inverted" to form a chemical information data set of one of the kinds mentioned above.
Example: Spresi, MSDS
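
The "inversion" of a document database can be sketched as a regrouping of statements by structure. The document IDs and statements below are invented, and the tuple layout (structure, property, value) is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical documents, each stating properties of some structures.
documents: dict[str, list[tuple[str, str, str]]] = {
    "DOC-1": [("NCCc1ccc(O)c(O)c1", "formula", "C8H11NO2")],
    "DOC-2": [("NCCc1ccc(O)c(O)c1", "name", "dopamine"),
              ("CC(=O)Nc1ccc(O)cc1", "name", "acetaminophen")],
}


def invert(
    documents: dict[str, list[tuple[str, str, str]]],
) -> dict[str, list[tuple[str, str, str]]]:
    """Regroup document statements by structure, keeping the source document."""
    by_structure: dict[str, list[tuple[str, str, str]]] = defaultdict(list)
    for doc_id, statements in documents.items():
        for smiles, prop, value in statements:
            by_structure[smiles].append((prop, value, doc_id))
    return dict(by_structure)


for smiles, facts in invert(documents).items():
    print(smiles, facts)
```

Keeping the document ID on every inverted fact preserves the original semantics: the database asserts that a document states the value, not that the value is true.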

MUG '05 : Chemical information models : Example 1

Example: WDI - Derwent's World Drug Index

"Traditional" cheminfo database with arbitrary molecular identifier

Entity identifier is Derwent External Registry Name (DXRN)
- DXRN is alphabetic equivalent of Registration Number
- DXRNs are structural (as able)
- DXRNs are comprehensive/adaptive rather than rigorous
- DXRNs represent Derwent's view of chemistry

Molecular properties are connected to DXRN

Many other names
Pharmacology
Other identifiers, e.g., CAS Numbers

WDI examples: dopamine

Primary page for DXRN "DOPAMINE": local, cabinet
WDI entries with same parent as dopamine: local, cabinet
WDI entries with structures similar to dopamine: local, cabinet
WDI entries with dopamine substructures (similar): local, cabinet
WDI entries with dopamine substructures (distant): local, cabinet
zoom morphine/dopamine substructure: local, cabinet

MUG '05 : Chemical information models : Example 2

Example: Empath - Metabolic Pathways

"Non-traditional" database of molecular entities and their properties

Molecular entity identifiers are SMILES of molecules and reactions
- Empath is actually a great big multicompartment reaction scheme
- Empath home page: local, cabinet
- Empath map: local, cabinet
- "SMILES of life": local, cabinet

Molecular entities are extended to biochemistry

Metabolic steps are represented as normal, stoichiometrically correct chemical reactions
Biochemical roles such as enzymes, cofactors, regulators are recorded
All objects are also identified by a 2-D chart location

Empath examples: dopamine

Empath objects containing text "dopamine": local, cabinet
[Note that dopamine appears as a compound, reactant and product.]
Dopamine is a compound on the metabolic pathway chart: local, cabinet
Decarboxylation of L-Dihydroxy-phenylalanine to Dopamine: local, cabinet
Stoichiometry of this reaction: local, cabinet

MUG '05 : Chemical information models : Example 3

Example: Orange - FDA Orange Book (NDAs)

Database of arbitrary set identifiers and their properties

Database entities are NDAs.
- "Ingredients" may or may not be molecular
- Entity is never completely defined by molecular identity; also:

Primary data model is very weak

Even establishing an identity relationship is difficult.
Bioequivalence established via TEC (Therapeutic Equivalence Class) study, which is locally interpreted (e.g., by state).
Contains many inconsistencies due to historical realities.
Similar to patent data in many ways (high value, low rigor).

Even so, this pragmatic database is very useful

Molecular indexing greatly improves accessibility and utility.
Special similarity classes are used to retain FDA/NDA semantics.

Orange examples: dopamine

Orange objects containing text "dopamine": local, cabinet
[Note that dopamine appears in trade names and as an ingredient.]
"Dopamine HCL" as a trade name: local, cabinet
"Dopamine hydrochloride" as an ingredient: local, cabinet
An NDA for Dopamine HCL: local, cabinet
Oxycodone and Acetaminophen: local, cabinet [Note similarity measure]
Cortisporin: local, cabinet [Note strength units.]
NDA applicants by frequency: local, cabinet

MUG '05 : Chemical information models : Summary

Concluding thoughts about chemical information models

Molecular information models
- provide a very powerful tool for information organization
- may be "flat" for fixed molecule-property relationships
- may also represent more complex relationships
- allow manipulation of relationships as primary data

Most current chemical information systems
- use a very "flat" molecular information model,
- for very limited purposes (e.g., registration, reagent tracking),
- which generally works pretty well.

Many chemical information resources are not simple, e.g.,
- database entities may be mixtures
- molecular entities may have different roles in different contexts
- structure-property relationships themselves are data entities

Integrative access to diverse forms of chemical information is possible.
- representing chemical information with appropriate data models increases its utility and value
- crossing between data models is not intrinsically difficult
- machines today are powerful enough to "do it the hard way" if needed
