MUG '97   Example database creation: AAA database

Sub-documents: aaa_dts.tdt, Makefile, aaa_orig.tdt, makelog

Database AAA -- alpha amino acid properties

In this example we create a very small database of data about amino acids per se, called AAA. Some new server features are illustrated in the context of creating a database "from scratch". This example can also serve as template for creating databases.

Strategy

Probably the most important step in designing a database is to establish a clear idea of what it will be used for and it's role in an information processing environment. All other design decisions follow. (Corollary: when the goals or role of a database change even slightly, the design needs to be re-evaluated.)

The main goal of database AAA is to serve as an example of how to build databases for MUG '97. The number of data and kinds of data will be kept to a minimum. The database should contain a only a small amount of easily-understood data. Data will be keyed to structures, including stereochemical information as available. Robust structure-based retrieval is desired (e.g., query as molecular amino acid or carboxylate). Stereochemical information will be recorded in the structures. Although there is no need here to be concerned about speed/space, indirect referencing will be used whenever appropriate for the purposes of illustration. AAA should be suitable for use in all possible current and future Thor and Merlin clients, including CD-ROMS, so we well choose file names which are valid under the (very restrictive) ISO-9660 rules. AAA should be expandable to other data about amino acids as they become available. AAA will not be clustered (due to the small number of structures), but that might change in the future.

Raw data

The following table appears in the "Handbook of Chemistry and Physics" (CRC, 53, C-744). We'll build the database AAA from this data.

Solubilities of the Amino Acids in Grams per 100 Grams of Organic Solvent.
Amino acid Entry Solvent Grams amino
acid per 100
grams solvent
Temp, C Ref No.
Alanine 1 Ethanol 0.0087 25.0 1
L-aspartate 2 Ethanol 0.000196 25.0 2
L-glutamate 3 Ethanol 0.000347 25.0 2
4 Ethanol 0.0056 44.93 2
Glycine 5 Acetone 0.000291 25.0 4
6 Butanol 0.000892 25.0 4
7 Ethanol 0.0037 25.0 1
8 Formamide 0.558 25.0 4
9 Methanol 0.0407 25.0 4
L-allo-isoleucine 10 Ethanol 0.13 20 5
11 Ethanol 0.19 79 5
L-isoleucine 12 Ethanol 0.09 20 5
13 Ethanol 0.13 79 5
L-leucine 14 Ethanol 0.217 25.0 1
L-proline 15 Ethanol 1.5 19 6
Valine 16 Ethanol 0.0136 0.03 3
17 Ethanol 0.019 25.0 1
References
1. Cohn, E. J., McMeekin, T. L., Edsall, J. T., and Weare, J. H., J. Am. Chem. Soc., 58, 2270 (1934)
2. McMeekin, T., L., Cohn, E. J., and Weare, J. H., J. Am. Chem. Soc., 103, 2270 (1933)
3. Dunn, M. S. and Rose, F. J., J. Biol. Chem., 124, 300 (1938)
4. McMeekin, T., L., Cohn, E. J., and Weeare, J. H., J. Am. Chem. Soc, 58, 2173 (1936)
5. Abderhalden, E., and Zeissctw., Z. physiol. Chem., 196, 121 (1931)
6. Kapfhammer, J., and Eck, R., Z. physiol. Chem., 170, 294 (1927)

Create datatypes to match data and goals

Most Thor databases use a number of "canned" datatypes, but almost all databases need to have at least some datatypes which are specific to that particular database. In this case we need to invent datatypes to represent solubility data (SOL) and to identify the source of the data ($DID). We also need to decide how each kind of data will be normalized to acheive our database goals.

Select an identifier: $DID

One of the fundamental principles of Thor is to record data as it is known, including: what data are known, what chemical entities data are known to be about, and where the data came from. Except in the case of small, temporary, special-purpose databases, it is almost always important to do a good job on this part. In this case, the original source of the data is given by reference but our actual source is the CRC compilation. Q: What do we know the data "to be about"? A: The entry in the CRC table. This database is being designed to be hold many such data sets. So we'll invent a "Data Source ID" datatype and store the (referenced) data under that. This is analogous to the usual case of storing data under a corporate registration number.

The Merlin system provides a spreadsheet-like view of a Thor database. The "normal" view where each row represents a single datatree (i.e., information about a generic chemical structure). This view may be augmented by separating subtree data for "distinguished identifiers" into their own rows, by setting the "pool inclusion flag" (_P) to `!'. Doing so doesn't affect how the data is stored, only how it is viewed. For purposes of this example, we will do so for the $DID identifier.

Here is the definition of the $DID datatype as a datatree:


  $D<"$DID">
  _V<Data source ID>
  _B<Data ID>
  _N<WHITE0 UPCASE>
  _P<!>
  _S<Identifies source of data>
  _D<Identifies precise source of data (compilation and entry)>
  _M<Lookup, Common>
  _O<Daylight Chemical Information Systems, Inc.>
  |

Select data representation: SOL (Solubility) datatype

The other main decision is how to represent solubility data. Here, we record solubility value itself *exactly* as it is appears in the table (in old-fashioned units and with the same number of significant digits). If one of our goals was to use this database for further processing, it might make sense to create an additional "derived" datatype which is recorded in other units (e.g., molality). Note that since we are storing all data under a $DID identifier, it is clear where to store such derived data and from which original datum it was derived.

Reference data will be stored as indirect data. It's more efficient and reliable to do so, since such data contains a lot of repeat values. For this example, the reference field is split into two fields, "Author(s)" and "Citation". This isn't really necessary, but it makes a good example here.

Solvent data would normally also be stored as indirect data, but for the purposes of this example, we will store it as the named SMILES of the solvent using the "USMILESANY" normalization. This will allows us to display and search the solvent field as structures.

The other field, "Temp., C" in the table, might be called "Temperature of measurement", but here we call it "at Temp, C" for brevity. It is a simple NUMERIC field.

Putting this all together, this is the definition of the SOL datatype as a datatree:


  $D<SOL>
  _V<"Solubility;Solvent;at Temp, C;Author(s);Citation">
  _B<"Sol;Sol/solv;Sol/temp;Sol/auth;Sol/ref">
  _N<"NUMERIC;USMILESANY;NUMERIC;INDIRECT $IAUTH;INDIRECT $ICITE;">
  _P<"*;*;*;*;*">
  _S<Solubility, grams per 100 grams of solvent>
  _D<"Solubility, grams / 100 grams of solvent;Solvent SMILES;Temp, C;Authors;Citation">
  _M<Physical Property, Measured, Common>
  _O<Daylight Chemical Information Systems, Inc.>
  |

Structural information

One of the design critera is to retain stereochemical information in the structures as available. The easiest way to do this is to enter the structures as "Isomer" data (the non-identifier ISM datatype), give it the ASMILES normalization (indicating that it contains an absolute SMILES of the database structure), and let Thor do the work of generating a unique SMILES datatree root.

In this case, structures are obtained from the name in the table for a given entry, we will connect ISM data to the same $DID identifier as the data.

Since the database design requires structural and similarity searching, we will need to add fingerprints to the datatrees. The fingerprint program generates fingerprints only for $SMI-rooted datatrees. Although the above approach of loading ISM's and letting Thor generate the $SMI root is very convenient and accurate, it means that we don't have $SMI-rooted datatrees in the input data for fingerprinting. We might generate $SMI roots in some other way, but this introduces another possible source of error. A better way would be to get the Thor system to generate and fingerprint SMILES in an automatic fashion -- as will be shown, this can be done quite simply.

Names

On one hand, we would like to keep the names exactly as they are transcribed from the table (e.g., if we ever need to correct the structure assignment for "L-allo-isoleucine", it would be nice to have the exact name as it orignally appeared). On the other hand, we'd like to be able to index names without worrying about case, spaces, and punctuation. The easiest way to do this in Thor is to create two different datatypes; one for "original" names (PCN, "Local name") and another for "lookup" names which have such distinctions removed ($NAM, "Name"). Thor provides a very convenient way do this via the AUTOGEN normalization. Specifying the normalization AUTOGEN $NAM for the PCN datatype causes the PCN data to be automatically re-entered as $NAM data (with $NAM's normalizations).

Note: If a database has many kinds of names, it's a good idea to crossreference all of them as $NAM's, so users can look up "Names" without also specifying the kind of name (IUPAC, USAN, INN, Trade Name, Common Name, etc.) This isn't the case here; there's only one kind of name in this example.

Collect all datatypes in a .tdt file

It is useful to collect all datatypes for a given database in a single TDT file (as opposed to loading them from different sources). Such files are usually named after the datatypes database, In this case, the datatypes tdt file is named AAA_DTS and the datatypes tdt file is named aaa_dts.tdt.

Create a database-building program

It's possible to create and load Thor databases from the interactive program sthorman or interactively via the thorfilters programs thormake and thorload. However, it's almost always a much better idea to build a reusable (and modifiable) program to do the job.

The most common way to do this is to write a sh(1) script and this approach works very well. A more powerful method is to describe the rules for building the database as a make(1) specification. We'll use make in this example and put the rules in a Makefile.

We have to pick a name for the database(s) at this point. In previous years, we might have used a name for the primary database such as "aminoacids" with "aminoacids_datatypes" and "aminoacids_indirect" for the auxiliary databases. Such choices would be just fine. But we've been experimenting with CD-ROM databases lately and the goals call for ISO-9660 compatibility, so we're limited to "8.3" names. We'll use the names "AAA", "AAA_DTS", and "AAA_IN" with the database extension ".TDB".

The Makefile delcares dependencies such that the three databases are created and loaded in a specific order. The datatypes database (AAA_DTS) must be created and loaded first, then the indirect database (AAA_IN) must be created next, then the primary database is created and loaded. All databases are created with thormake and loaded with thorload. The only unusual thing about the database loading is that indirect data is automatically generated (via the "+GENERATE_INDIRECT" option).

The database-building strategy used here allows the database to be created and fingerprinted from non-$SMI-rooted datatrees:

  1. Create the datatypes database AAA_DTS and load aaa_dts.tdt
  2. Create the empty indirect database AAA_IN
  3. Create a temporary database AAA_TMP and load aaa_orig.tdt
  4. List the (now $SMI-rooted) contents of AAA_TMP to aaa_smi.tdt
  5. Destroy the temporary database
  6. Add fingerprints to aaa_smi.tdt, save as aaa_fps.tdt
  7. Create the database AAA and load aaa_fps.tdt as RAW_DATA

This Makefile has only one "automatic" target, the database AAA.TDB. Typing either "make all" (or just "make") will create the entire database (or bring it up to date). Two other convenience targets are are provided: "clean" to remove the database and "cleanall" which also removes the temporary files.

Convert the data to datatrees

It might seem strange that creating datatrees is not done earlier, but it's usually best to wait until you know how you are going to use data before spending time doing conversion. At this point we know exactly how we want the data, from datatype definitions to the required file name -- aaa_orig.tdt.

Note that all datatrees are $DID-rooted and that the only structural data are in ISM datatypes, some of which have chiral specifications and some do not not. Remember that the ISM datatype has been given the ASMILES normalization so ISM's will end up on the correct Thor page keyed to their unique SMILES.

Build the database

This is the easy part: just type "make", and the database "builds itself". For a small database such as in this example, this is also the fast part (the whole process takes about 8 seconds on a minimal workstation). This makelog shows what happens.

Check the database

Open the database in xvthor, look up data by various identifiers, and check that data is organized in the manner you expected. In this case, the database datatrees look quite different from those input from aaa_orig.tdt, e.g,:

Note that the $DID-rooted datatree is itself rooted at a generic SMILES, a lookup "Name" has been properly derived from the "Local name". Also, "Graph" ID, Fingerprint, and Timestamp data have been generated and added to the datatree.

Open the database in xvmerlin, create columns of various fields, perform structure searching (check that fingerprints are OK) and try a few string and range searches. The following xvmerlin screen shows the AAA database with rows selected which have solubility data measured at 25 C in aliphatic alcholic solvents (SMARTS "[CX4]-[OH]"):

Back up the database

Other than saving your original data files (always a good idea!), there are are two basic ways to back up Thor databases: as binary databases and as an ASCII TDT "list" of the database contents. Binary Thor database files use a robust, architecture-independent format and many people choose to back up just these files, e.g., for this example:

   $ cd /home2/daylight/newdb
   $ tar c AAA*
Some people are more comfortable saving TDT files because they are stored as simple ASCII text. If you choose to do so, be sure to save tdt files for all the databases, e.g.,

   $ thorlist ACS_DTS > ACS_DTS.TDT
   $ thorlist ACS     > ACS.TDT
   $ thorlist ACS_IN  > ACS_IN.TDT
To avoid indirect data dependency, you might choose to save the TDT's with indirect data "expanded", e.g.,

   $ thorlist -KEEP_INDIRECT_DATA FALSE ACS > ACS_X.TDT
Finally, for the purposes of database backup only, you can use the the low-level program thordump to dump all the trees in a from the primary thor database (.DP) directly. Be sure that the database is not loaded writable by any thorserver (check for a .LCK file) and use thordump like thorlist:

   $ cd /home2/daylight/newdb
   $ thordump ACS_DTS.DP > ACS_DTS.TDT
   $ thordump ACS.DP     > ACS.TDT
   $ thordump ACS_IN.DP  > ACS_IN.TDT

Check database access security

The Thor system can be run at various levels of access security, from completely public to very restricted. To verify that you will be permitting the desired database access, you need to check three separate things.
  1. File system permissions

    The basis of Thor security is that the only way anyone except the Thor database administrator can get access to data in a Thor database is via thorserver. For these purposes, the administrator can not be the "root" super-user but another pseudo-user, conventionally named "thor". Check that only "thor" has access to the database files. (There's no sense in maintaining elaborate access control if unauthorized users can read the database files directly.)

  2. Server access

    Thor and Merlin servers control access via two separate mechanisms: user and host authorization. You can authorize access from specific machines or to specific (possibly passworded) users or (in "restricted hosts" mode) only to specific users when they are working on specific machines. Server access is controlled via sthorman. See the Daylight Administration Manual, Section 8. Security for more information.

  3. Thor database passwords

    Three passwords were supplied when the database was created, for read, write and executive privileges. By convention, completely unsecured databases have no read password and the passwords "write" and "exec". "Public" databases typically have no read password but do have passwords for write and executive access. Databases with restricted access should have a read password which is known only to authorized users. You can use sthorman or thorchange to change database passwords as needed (the executive password will be needed to do so.)


Daylight Chemical Information Systems, Inc.
info@daylight.com