Information modeling is at the heart of informatics. The molecular information model is seriously cool. The ability to represent pure substances as molecular structures has transformed chemistry from a science of substances into a molecular science in less than 100 years.
A basic chemical information "trick" is representing a molecule in a computer. Doing just that is useful for things like molecular structure registration.
The next step is developing the information model itself: how to connect properties to the representations of molecular entities in a computer in order to do something more useful than just making lists, e.g., connecting them to external information resources. For the most part, we have been stuck using the same data models that are used for registration. A number of different chemical information models will be presented with real-world examples, and the prospects for chemical data integration will be discussed.
Many approaches have been tried
... some good, some bad, some truly horrible
... we're pretty much stuck with all of them.
[These] are the [property] values of [these molecular structures].
Simplest possible chemical information model: molecular structure is the identifier, properties are connected to it. Perfect for the clean, idealistic world of non-overlapping molecular chemistry. Very powerful but not comprehensive and not very useful IRL.
Examples: TSCA (Toxic Substances Control Act), Empath (metabolic pathways), primary tables in CRC Handbook, Chemist's Companion, etc.
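As a rough illustration of this first model, the structure itself can serve as the primary key. The sketch below assumes each structure is written as an already-canonical SMILES string (an assumption; the text does not name a notation), and the function names and property values are hypothetical.

```python
from collections import defaultdict

# Structure-keyed store: the structure string itself is the identifier.
# Assumption: every structure is an already-canonical SMILES string,
# so identical molecules always produce identical keys.
properties = defaultdict(dict)

def set_property(structure, name, value):
    """Attach a property value directly to a molecular structure."""
    properties[structure][name] = value

def get_property(structure, name):
    """Return the property value, or None if structure or property is unknown."""
    return properties.get(structure, {}).get(name)

# Illustrative values only.
set_property("CCO", "name", "ethanol")
set_property("CCO", "boiling_point_C", 78.4)
set_property("c1ccccc1", "name", "benzene")

print(get_property("CCO", "name"))
```

The model works only as long as one structure means exactly one entity, which is why it suits clean, non-overlapping data and little else.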
[These] are the [property] values of [these specific kinds of] [generic molecular structures].
Entities are identified by molecular structure level-of-detail hierarchy. Properties are connected to entities at the appropriate level. Clean and idealistic yet more powerful and more useful than above.
Examples: MedChem masterfile (generic/isomers/ionic forms), QSAR (sets of molecules/molecular patterns/compounds)
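One way to picture the level-of-detail hierarchy is a tree in which each more specific form points to its more generic parent. The sketch below is an assumption-laden illustration: the `Node` class, the SMILES notation, and the rule that a lookup falls back to the parent level are all choices made here, not something the systems named above are known to do.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One level in a hypothetical structure hierarchy (generic -> specific)."""
    structure: str
    parent: Optional["Node"] = None
    properties: dict = field(default_factory=dict)

    def lookup(self, name):
        """Find a property at this level, else at the nearest more generic level."""
        node = self
        while node is not None:
            if name in node.properties:
                return node.properties[name]
            node = node.parent
        return None

# Generic (parent) form and one ionic form; values are illustrative.
acid = Node("CC(=O)O", properties={"family_name": "acetic acid"})
acetate = Node("CC(=O)[O-]", parent=acid, properties={"formal_charge": -1})

print(acetate.lookup("family_name"))
print(acid.lookup("formal_charge"))
```

Here the family name lives at the generic level while the formal charge lives only on the ionic form, so each property sits at the level where it is actually true.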
The [entity with this registration number] has [this molecular structure] and has [these] [property] values.
Common model for traditional chemical registries. All possible molecular entities are represented by registration numbers; properties are assigned to these entities. Requires "god-like" (omniscient) structure identification and discrimination methods ... which IRL become unstable over time when used by normal human beings. Other problems include poor behavior with incomplete structural knowledge and the need to develop a religious "group or split" dogma. OK for closed, static, short-term delivery of homogeneous data.
Examples: CAS, MACCS, WDI, some registration systems
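The "group or split" problem can be seen in a minimal sketch of the registry model, where the registration number is the key and the structure is demoted to an attribute. The registration numbers, record layout, and the decision to split the two entries below are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """A registry entry: the structure is an attribute, not the key."""
    structure: str
    properties: dict = field(default_factory=dict)

# Hypothetical registry. A registrar chose to "split" two substances whose
# recorded structure is identical because stereochemistry was not captured.
registry = {
    "REG-0001": Record("CC(O)C(=O)O", {"name": "lactic acid (stereo unspecified)"}),
    "REG-0002": Record("CC(O)C(=O)O", {"name": "lactic acid, racemic"}),
}

def reg_numbers_for(structure):
    """Return every registration number whose record carries this structure."""
    return sorted(reg for reg, rec in registry.items() if rec.structure == structure)

print(reg_numbers_for("CC(O)C(=O)O"))
```

Nothing in the data model enforces one structure per registration number; that rule lives entirely in the registrar's judgment, which is where the instability comes from.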
The [entity with this registration number] contains [these molecular structures] and has [these] [property] values.
Similar to the systems above, but for entities that contain a multiplicity of molecules. The problem with god-like systems is even worse here than for discrete entities.
Example: Most "chemical" USPTO Patents (with legal caveat)
The [entity with this registration number] has [these] [property] values. [This molecular identifier] is associated with [this entity].
Often used for the chemical portion of a larger non-molecular database system. Entities and property-associations are not necessarily molecular. In such systems there are usually no requirements for uniqueness, exclusiveness, or comprehensiveness. This is the weakest chemical information model.
The [entity with this registration number] contains [these molecular identifiers] and has [these] [property] values.
Often used when database entities are mixtures of things which are not necessarily molecular, or when non-molecular components are essential to the definition of a database entity.
Examples: FDA Orange Book (NDAs), MSDS collections
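A small sketch of this mixture-style model, under the assumption that a component identifier may be either a SMILES string or a free-text name for a non-molecular ingredient. The product keys and contents are invented for illustration and do not come from any real collection.

```python
# Hypothetical entities: each contains a list of component identifiers,
# some molecular (SMILES) and some not, plus entity-level properties.
entities = {
    "PRODUCT-001": {
        "components": ["CC(=O)Oc1ccccc1C(=O)O", "corn starch"],
        "properties": {"dosage_form": "tablet"},
    },
    "PRODUCT-002": {
        "components": ["CC(=O)Oc1ccccc1C(=O)O", "gelatin capsule shell"],
        "properties": {"dosage_form": "capsule"},
    },
}

def entities_containing(identifier):
    """Return the keys of all entities that list this component identifier."""
    return sorted(
        key for key, entity in entities.items()
        if identifier in entity["components"]
    )

print(entities_containing("CC(=O)Oc1ccccc1C(=O)O"))
```

Note that the properties belong to the product, not to any molecule in it; the identifiers only say what the entity contains.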
The [entity with this registration name or number] has [these] [property] values. [These molecular structures] are associated with [this entity].
This has a similar "shape" to #5 above, but in this case, molecular structure-property associations are derived from the existence of specific molecular structures in database entities.
Examples: TCM (Traditional Chinese Medicines), sample databases at analytical labs
This [database entity is a document] which states that [these molecular structures] have [these] [property] values.
Document databases are special databases which normally contain non-molecular primary entities (e.g., journal articles, patents) which in turn reference molecular entities (e.g., compounds, reactions). They normally can be "inverted" to form a chemical information data set of one of the kinds mentioned above.
Examples: Spresi, MSDS
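The "inversion" of a document database can be sketched as re-keying document-level claims by structure, which yields a data set shaped like the simplest structure-keyed model. The document layout, field names, and values below are hypothetical, and SMILES strings are assumed as the structure notation.

```python
from collections import defaultdict

# Hypothetical document database: each document states that certain
# structures have certain property values.
documents = [
    {"id": "doc-1", "claims": [("CCO", "name", "ethanol"),
                               ("CCO", "boiling_point_C", 78.4)]},
    {"id": "doc-2", "claims": [("CCO", "boiling_point_C", 78.3),
                               ("c1ccccc1", "name", "benzene")]},
]

def invert(docs):
    """Re-key claims by structure, keeping the source document as provenance."""
    index = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        for structure, prop, value in doc["claims"]:
            index[structure][prop].append((value, doc["id"]))
    return index

index = invert(documents)
print(index["CCO"]["boiling_point_C"])
```

Keeping the document id beside each value preserves the fact that two documents may report different values for the same structure and property.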