17. THOR Toolkit: THOR Datatrees


Previous chapters discussed those aspects of the THOR Toolkit that it shares with the Merlin Toolkit: servers and security, databases and datatypes. In this chapter, we cover the THOR-specific capabilities of the THOR Toolkit.

17.1 THOR Streams

Streams are heavily used throughout THOR, in a way closely analogous to their use in molecule objects. Before moving on to the details of datatree, dataitem and datafield objects, we will spend a few words discussing streams; methods for using the other THOR objects will be more apparent once the role of streams is clear.

When working with molecule objects, one usually needs to access the constituent parts (atoms, bonds, and cycles). The function dt_stream() is used for this purpose; for example, to get the atoms of a molecule, one invokes dt_stream(mol, TYP_ATOM). All of the constituent parts of a molecule are accessed this way; there is no other mechanism.

THOR objects' constituent parts are accessed the same way. For example, datafield objects are the constituent parts of a dataitem; a stream created by dt_stream(di, TYP_DATAFIELD) will contain all of the datafields in that dataitem.

Below is a description of the behavior of dt_stream() when applied to THOR objects. Although some of these object types have not yet been formally introduced, they are all presented here for completeness. If this is the first time you are reading through this material, you should skim through it just to get the general idea, then return later for a more thorough reading.

dt_stream(Handle thor_ob, int typeval) => stream
A stream over a THOR object contains the constituent parts that are of the specified type. The parameter thor_ob becomes the stream's base object (see dt_base()). If the base object is modified, the stream is deallocated and its handle is revoked.

The stream's contents will depend on the type of thor_ob; in the following, assume that datafield is an object of TYP_DATAFIELD, dataitem is TYP_DATAITEM, and so forth:

dt_stream(datafield, typeval)
Returns NULL_OB for any typeval; there are no constituent parts to a datafield object.

dt_stream(dataitem, typeval)
Returns a stream of the datafields in the dataitem when typeval is TYP_DATAFIELD; NULL_OB for any other value of typeval.

dt_stream(datatree, typeval)
If typeval is TYP_DATATREE, returns a stream containing all of the subtrees in the TDT object. Each subtree object is itself a TDT to which dt_stream() can be applied.

If typeval is TYP_DATAITEM, returns a stream containing all of the dataitems attached directly to the TDT object (i.e. those data associated with the root identifier, but not dataitems that are part of subtree objects). The first dataitem object in the stream is always the root identifier itself; subsequent dataitem objects are data about the root identifier.

If typeval is TYP_ANY, returns a stream containing both the dataitem objects and the subtree objects.

Any other value of typeval will return NULL_OB.

dt_stream(datatype, typeval)
Returns NULL_OB; there are no constituent parts to a datatype object.

dt_stream(database, typeval)
(See the advice below regarding this use of dt_stream())

If typeval is TYP_DATATREE, returns a stream of all TDTs in the database. Such streams are unusual in that each object returned is created as it is needed. And unlike other types of streams, if a stream of TDT objects over a database is reset, it will re-create the objects if they have been deallocated. This allows you to get an object from the stream, operate on it, then discard it before moving to the next object. Without this behavior, it would be impossible to use streams over a database, as the number and sizes of the objects are often quite large.

If typeval is TYP_STRING, returns a stream of string objects containing the lexical representation of all TDTs in the database (indirect data are not expanded). Like streams of datatree objects, the string objects are created "on demand"; you must take care to deallocate them as you go. Furthermore, if you reset a stream of string objects over a database, you will get different objects the second time through. Each call to dt_next() causes a new string object to be created. No other stream in the Daylight Toolkit behaves this way (i.e. streams ordinarily return the same objects the second time through).

If typeval is TYP_DATATYPE, returns a stream of all datatype objects defined for the database. This type of stream is quite ordinary compared with the previous two, and behaves like "normal" streams in the Daylight Toolkit.

If typeval is TYP_DATABASE, returns a stream of all database objects that have been opened by this client on the server. Any other value of typeval will return NULL_OB.
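The rules above can be condensed into a small lookup. The following is only a sketch: the enum values are hypothetical stand-ins for the Toolkit's TYP_* constants, and the function merely restates which combinations are described above as yielding a stream rather than NULL_OB. It does not call the Toolkit.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the Toolkit's TYP_* constants; the real
   values live in the Toolkit headers and are not reproduced here. */
enum kind {
    K_DATAFIELD, K_DATAITEM, K_DATATREE, K_DATATYPE,
    K_DATABASE, K_STRING, K_ANY
};

/* Returns true when dt_stream(base, typeval) is described above as
   yielding a stream, false when it is described as returning NULL_OB. */
bool stream_exists(enum kind base, enum kind typeval)
{
    switch (base) {
    case K_DATAITEM:
        return typeval == K_DATAFIELD;
    case K_DATATREE:
        return typeval == K_DATATREE || typeval == K_DATAITEM ||
               typeval == K_ANY;
    case K_DATABASE:
        return typeval == K_DATATREE || typeval == K_STRING ||
               typeval == K_DATATYPE || typeval == K_DATABASE;
    default:
        /* Datafield and datatype objects have no constituent parts;
           other base kinds are not covered by the text above. */
        return false;
    }
}
```

For example, stream_exists(K_DATATREE, K_ANY) is true, while stream_exists(K_DATAFIELD, K_ANY) is false.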

The details and subtleties of the use of streams will become more apparent as we describe each THOR object type and the functions that operate on it. For now, simply keep in mind that streams over THOR objects work in close parallel to their use in the SMILES Toolkit: they return the constituent parts of THOR objects.

You may have noticed that a stream can be formed that contains every TDT object in a database; this might tempt you to use this functionality to turn THOR into a searching system rather than its usual use as a look-up system (e.g. "Read through the database and find thus-and-such..."). This, in general, is a bad idea.

THOR's power comes from its ability to handle ambiguous identifiers, to look up TDTs very quickly, and its use of SMILES and hash tables. It is not designed to search through the entire database as one might a relational database. Although THOR streams certainly give you the power to do exactly this, it would be a poor use of THOR. Daylight's Merlin Toolkit is designed for searching; use it instead of THOR for such tasks.

The purpose of a stream that contains all the TDTs in a database is to provide a way to dump the database's entire contents; dt_stream() is ideally suited to this use.

17.2 Datatree Objects

In this section, we get to the heart of the matter: data storage and retrieval via TDT objects. The actual data stored in a database is accessible through these objects.

17.2.1 Creating Datatree Objects

The primary way to create a TDT object is to retrieve it from a database using dt_thor_tdtget(). There is no way to create a TDT "from scratch", i.e. there is no such function as dt_alloc_tdt(). Instead, dt_thor_tdtget() has a parameter that will cause a new TDT to be created if it can't be found in the database. The idea is that you don't want to allocate an empty datatree for an identifier that already exists in the database, as you would very likely overwrite the existing data. By using dt_thor_tdtget() to create new TDTs, you are forced to examine the database first, hence are less likely to lose data.

SMILES is usually the root (main topic) of a TDT, but there is no requirement that this be the case. If you have data for a non-SMILES identifier and don't know the structure of the molecule (or the identifier is not for a molecule, e.g. "pine tar"), you can create a TDT with a non-SMILES root identifier.

THOR goes to considerable trouble to create a SMILES root for a TDT whenever possible. For example, if you create a TDT with a $CAS number as its root and a $SMI subtree, THOR will recognize the $SMI and will invert the tree, making the $SMI part the root. Similarly, if you create a TDT with absolute SMILES in it but no unique SMILES, THOR will convert the absolute SMILES to a unique SMILES and put that at the tree's root. If you do store a non-SMILES-rooted TDT, then later store a SMILES-rooted TDT with the former identifier as a subtopic, THOR will automatically merge the data from the former TDT into the new SMILES-rooted TDT.

THOR does not allow a non-SMILES-rooted TDT to have any subtopics (subtrees). The idea is that only structure (e.g. SMILES) is a valid main topic, as it is the only non-arbitrary identifier that works in this role. You can create such a TDT, and (as described above) THOR will try to find a SMILES somewhere in it, and rearrange the TDT to put a SMILES "on top". But if the attempt to rearrange doesn't result in a SMILES-rooted TDT, the TDT can have no subtrees, only data about the root (non-SMILES) identifier. Note that this error may not be detected until you try to write the TDT to its database (see dt_thor_tdtput(), below).

There is no way to move a TDT directly from one database to another; to do so, you must fetch a TDT from one database, create a TDT with the same identifier in the second database, then copy information from the former to the latter. There are several reasons for this restriction, the most important being that datatypes are a property of a database, so moving a TDT from one database to another could change the meaning of the data (in the case where the datatypes are defined differently).

A second function used to create TDT objects is dt_thor_str2tdt(), which will convert a string (lexical) representation of a TDT into an object. Some of the protection afforded by dt_thor_tdtget() is bypassed by dt_thor_str2tdt(): You can choose to merge the data in the string with existing data or to ignore data in the database. The latter strategy is useful when re-loading a database from a "dump" (e.g. when there is certain to be no conflict between incoming data and existing data) as it is significantly faster, but should not be used in general. Some additional protection is provided by the "timestamp" mechanism described below.

If record locking is enabled, fetching a "writeable" TDT also locks the TDT: it gives the client who fetched the TDT exclusive write access until the TDT is explicitly unlocked. For a complete discussion, see the section on record locking in the Databases chapter.

17.2.2 Destroying Datatrees and Datatree Objects

There are two distinct operations that, at first glance, might seem similar. Both get rid of TDTs, but in very different ways:

dt_dealloc(tdt) deallocates a TDT object from the client program's memory, but does not affect the database at all (in particular, the TDT object can be re-created by re-reading it from the database).

dt_thor_tdtremove(tdt) removes a TDT from the database, but does not affect a TDT object that might represent that same TDT (in particular, if the TDT object exists, it is unaffected and can be written back to the database even though it was deleted from the database).

It is important to remember that these operations are entirely separate.

17.2.3 The Datatree Memory

Every TDT object that you create becomes a child object of its database (all functions that create TDTs have a database as their first parameter). Thus, THOR remembers each TDT object until it is explicitly deallocated or the database is closed.

Furthermore, THOR goes to some trouble to prevent the creation of two TDT objects with the same root identifier. If, for example, you try to read a TDT from a database using dt_thor_tdtget(), THOR will first check all of the child objects of the database to see if it already has that TDT; if so, you will get the existing TDT rather than a new one. The reason for this behavior is, once again, to prevent the accidental loss of data. You rarely want two TDT objects that represent the same actual TDT, as this would inevitably lead to one's modifications overwriting the other's.

If you have a TDT object that you suspect needs "refreshing" (i.e. to be re-read from the database) because it is out of date (see "Timestamps" below), it is necessary to deallocate the TDT before invoking dt_thor_tdtget(). As long as your copy of a particular TDT object exists, THOR will not re-read it from the database.
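The "one object per root identifier" behavior can be pictured with a toy registry. This is a sketch only: every name below (toy_db, toy_tdt, toy_get, toy_dealloc) is hypothetical, a "version" integer stands in for the data in the database, and nothing here reflects how the THOR client is actually implemented.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_OBJECTS 16

/* Toy stand-in for a TDT object held in client memory. */
struct toy_tdt {
    char root_id[64];
    int  version;      /* version of the data when it was read */
};

/* Toy stand-in for a database and its child TDT objects. */
struct toy_db {
    int current_version;                   /* bumped by other writers */
    struct toy_tdt *children[MAX_OBJECTS]; /* objects not yet deallocated */
};

/* Return the existing object for root_id if there is one; otherwise
   "read" a fresh copy from the database. */
struct toy_tdt *toy_get(struct toy_db *db, const char *root_id)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_OBJECTS; i++) {
        if (db->children[i] == NULL) {
            if (free_slot < 0)
                free_slot = i;
        } else if (strcmp(db->children[i]->root_id, root_id) == 0) {
            return db->children[i];   /* no re-read takes place */
        }
    }
    if (free_slot < 0)
        return NULL;
    struct toy_tdt *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    strncpy(t->root_id, root_id, sizeof t->root_id - 1);
    t->root_id[sizeof t->root_id - 1] = '\0';
    t->version = db->current_version;
    db->children[free_slot] = t;
    return t;
}

/* Forget the object so that the next toy_get() re-reads the data. */
void toy_dealloc(struct toy_db *db, struct toy_tdt *t)
{
    for (int i = 0; i < MAX_OBJECTS; i++) {
        if (db->children[i] == t)
            db->children[i] = NULL;
    }
    free(t);
}
```

In this model a second toy_get() for the same identifier hands back the same, possibly stale, object; only after toy_dealloc() does the next toy_get() pick up the current version.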

The function dt_thor_str2tdt() allows you to circumvent some of the protection afforded by dt_thor_tdtget(). Using it, you can create a TDT for an identifier that already has been fetched from the database, resulting in two objects for the same main topic. Clearly this is a situation to be avoided.

17.2.4 Writing TDTs to a Database

The functions dt_thor_tdtput() and dt_thor_tdtput_raw() write a TDT or string to a database, respectively. Note that writing doesn't happen automatically when you modify a TDT object. That is, modifying a TDT object has no effect on the original data in the database: you can discard the TDT object if you like and the original data will be unaltered. Only when you invoke one of the two aforementioned functions is the database actually altered.

A write operation can be affected by record locking. For a complete discussion, see the section on record locking in the Databases chapter.

17.2.5 Timestamps

The first time you write a TDT to a database, a timestamp is automatically added to it by the THOR server. On each subsequent write of a TDT, the server checks its timestamp to ensure that it agrees with the one in the database. If the two match, all is well. If they don't, the assumption is that some problem has occurred. Possible causes for this situation include:

  • Two clients could have retrieved the same TDT and modified it. The first client wrote it back successfully, causing the timestamp to change; the second client encountered an error as it attempted to write its version of the TDT.

  • One client managed to create two versions of the same TDT. This is always due to the use of dt_thor_str2tdt(); the pitfalls of this function are outlined in its description below.

  • A client used dt_thor_str2tdt() to create a TDT for an identifier that was already present in the database. TDTs created this way have a bogus timestamp of "eons ago", so they always appear to be out of date compared to existing data.

Timestamps are intended to serve as a partial protection against inadvertently overwriting data. The only way to circumvent the protection they provide is with a "forced write" of data (see dt_thor_tdtput()).
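The first cause listed above (two clients modifying the same TDT) can be traced with a toy model of the check. This is a sketch under simplifying assumptions: the structures and functions are hypothetical, a TDT is reduced to a single integer value, and the timestamp comparison is plain equality; the server's actual bookkeeping is not described here.

```c
/* Toy model: one record in the database and a client's copy of it. */
struct toy_record { long stamp; int value; };
struct toy_copy   { long stamp; int value; };

/* Client reads the record, remembering the timestamp it saw. */
struct toy_copy toy_read(const struct toy_record *rec)
{
    struct toy_copy c = { rec->stamp, rec->value };
    return c;
}

/* Server-side write check: succeed (1) only when the copy's timestamp
   matches the stored one, then stamp both with the new time; otherwise
   report an out-of-date copy (0) and leave the record untouched. */
int toy_write(struct toy_record *rec, struct toy_copy *c, long now)
{
    if (c->stamp != rec->stamp)
        return 0;
    rec->value = c->value;
    rec->stamp = now;
    c->stamp = now;
    return 1;
}
```

If two copies are read from the same record and both are modified, the first toy_write() succeeds and changes the stored timestamp, so the second toy_write() returns 0 and the first client's change survives.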

17.2.6 Merging Datatrees

As noted above, THOR detects an error when two clients attempt to modify a single TDT simultaneously. Timestamps will normally prevent one client from unknowingly erasing the changes made by the other client. But merely preventing the second client from writing its data doesn't solve the problem, since presumably the second client's data are important too.

To solve this problem, and to provide a mechanism for merging two databases into one, THOR provides a TDT merge operation. "Merging" is the process of identifying the set of unique dataitems from two TDTs and producing a single TDT from that unique set.

There are several points at which you can merge datatrees. First, when creating a TDT from its lexical (string) representation (dt_thor_str2tdt()), you can choose to merge the data from the string with any existing data as the TDT is created. Second, a TDT can be merged as it is being written to the database (dt_thor_tdtput()). And third, two TDTs can be merged into one at any time (dt_thor_tdtmerge()).

Merging is mostly useful when adding new dataitems to a TDT; it has unpredictable behavior when dataitems are deleted or modified. Consider the following examples:

  • Two clients simultaneously read a TDT, and each adds one new dataitem to it. If these clients use the "merge" operation as they write the data, the resulting TDT will contain all of the original data (which was common to both clients' versions of the TDT) plus the two new dataitems (one from each TDT). It will be exactly as though one client had added both new dataitems, which is what we desired.

  • Two clients simultaneously read a TDT, and each deletes a different dataitem. The first client writes its TDT out without trouble. But when the second client writes (with merging) its version of the TDT, the TDT will be restored to its original form, since the dataitem each client deleted still appears in the other's version of the TDT (i.e. the union of the two smaller sets is the original set). This is not what we desired.

  • A single client reads a TDT, modifies one datafield, then uses the "merge" operation when writing it back to the database. Since the modified dataitem is no longer identical to the original, both versions of the data (the original and the modified) appear in the database. This is probably not what we desired (when modifying a dataitem, it is usually because it is wrong, and we want to overwrite the original data rather than merging the old with the new).

These examples should illustrate both the uses and the pitfalls of the merge operation. To briefly summarize, merging is primarily useful when you are adding data to a database, or when you are merging data from two or more databases into a single database. It is almost never useful when you are changing or deleting data.
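The three examples can be reproduced with a toy merge that treats each dataitem as an opaque string and keeps the unique ones. This is a sketch, not the THOR merge algorithm: toy_merge and its helper are hypothetical names, and whole-string comparison is an assumption standing in for whatever notion of "identical dataitem" the server uses.

```c
#include <stddef.h>
#include <string.h>

/* Is item already present in set[0..n)? */
static int contains(const char *const *set, size_t n, const char *item)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i], item) == 0)
            return 1;
    return 0;
}

/* Toy merge: copy the unique dataitems of a and b into out (capacity
   cap) and return how many were stored. Dataitems are compared as
   whole strings, so a modified item counts as a different item. */
size_t toy_merge(const char *const *a, size_t na,
                 const char *const *b, size_t nb,
                 const char **out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < na && n < cap; i++)
        if (!contains(out, n, a[i]))
            out[n++] = a[i];
    for (size_t i = 0; i < nb && n < cap; i++)
        if (!contains(out, n, b[i]))
            out[n++] = b[i];
    return n;
}
```

Merging {root, X, Y} with {root, X, Z} gives four items, the desired result of two additions. Merging {root, Y} with {root, X}, where each side deleted a different item, restores all three original items. Merging {root, X} with {root, X'} keeps both X and its modified form X'.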

17.2.7 Cross-Referencing

One of THOR's most important capabilities is that it can know a compound by many names (identifiers), and it can retrieve the compound using any identifier that is known. THOR achieves this by a cross-referencing mechanism: The identifier for each subtree of a SMILES-rooted TDT is stored in a secondary cross-reference database. Given any known identifier, the SMILES for all TDTs in which that identifier appears can be retrieved with a single access to the secondary database. (Recall from the THOR section of the Daylight Theory Manual that a particular identifier may appear in several TDTs).

Cross-referencing is done automatically by the THOR server; you cannot directly create cross references. Each time you write a TDT to its database, the server examines the TDT, extracts all of its identifiers, and creates a cross-reference entry to the SMILES that is the root of the TDT for each one. Note that since non-SMILES-rooted TDTs aren't allowed to have subtrees, there is never a need to cross-reference an identifier to anything but a SMILES (for example, there will never be a cross-reference between a CAS number and a Wiswesser Line Notation).

The function dt_thor_xrefget() is the mechanism by which cross-reference information is retrieved. It returns a sequence of string objects, each containing a SMILES.

17.2.8 Functions on TDT Objects

dt_thor_tdtget(Handle parent, Handle dt, string id, boolean writeable, RETURN boolean isnew) => tdt
Gets or creates a THOR Data Tree (TDT) object from the object parent. The TDT's root identifier will be the identifier/datatype represented by id and dt, respectively, with id standardized according to the specifications in the datatype object dt.

The "parent" object can be either a database, or can be the root of a TDT. In the former case, the TDT is retrieved from the database and is a "root" TDT. In the latter case, a subtree is retrieved from an existing TDT object.

The parameter writeable indicates whether modifications to the TDT object are to be allowed. If the database is open read-only, then writeable must be FALSE. For a database open with "write" permission, you can choose to retrieve a TDT as "read-only". When record-locking is in effect (see Record Locking), this also controls whether a record is locked or not: Any TDT that is retrieved with writeable TRUE is automatically locked for exclusive access.

The parameter writeable also controls whether a new TDT can be created: If writeable is FALSE and the requested TDT is not in the parent, no TDT object is created and the function returns NULL_OB. If writeable is TRUE and the TDT doesn't exist, a new TDT object is created representing a TDT that is not yet in the database.
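The interaction between writeable and the presence of the TDT can be summarized as a small decision function. This is a sketch: the names are hypothetical, and two points are assumptions rather than documented behavior, namely that a writeable request against a read-only database yields NULL_OB, and that isnew is TRUE exactly when a new TDT object is created.

```c
#include <stdbool.h>

enum toy_outcome {
    TOY_EXISTING,   /* existing TDT returned (isnew presumably FALSE) */
    TOY_CREATED,    /* new, not-yet-stored TDT (isnew presumably TRUE) */
    TOY_NULL        /* NULL_OB returned */
};

enum toy_outcome toy_tdtget(bool db_writable, bool found, bool writeable)
{
    /* Assumption: asking for a writeable TDT from a read-only database
       is treated as a failure; the text only says writeable "must be
       FALSE" in that case. */
    if (writeable && !db_writable)
        return TOY_NULL;
    if (found)
        return TOY_EXISTING;
    /* Not found: only a writeable request creates a new TDT object. */
    return writeable ? TOY_CREATED : TOY_NULL;
}
```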

dt_thor_xrefget(Handle database, Handle datatype, string id, RETURN int iserror) => sequence
Gets a cross-reference sequence from a database, or NULL_OB if the identifier doesn't appear in any SMILES-rooted TDT or if an error is detected. If NULL_OB is returned, the return parameter iserror is TRUE if it was due to an error.

A cross-reference sequence contains a set of string objects. The first string object in the sequence contains the original identifier that you asked about. The second through last string objects each contain the SMILES of a TDT in which the specified identifier appears. For example, a request with the datatype object for "$NAM" and id "dichloroethene" might yield a sequence with three string objects, respectively containing "dichloroethene", "ClC=CCl", and "ClC(Cl)=C".
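Once the strings have been copied out of the sequence, the layout described above is easy to separate. The sketch below assumes the strings are already available as ordinary C strings; toy_xref and toy_split_xref are hypothetical names, and the Toolkit calls needed to walk the sequence are deliberately not shown.

```c
#include <stddef.h>

/* View of a cross-reference sequence after its strings have been
   copied out of the Toolkit objects (the copying is not shown). */
struct toy_xref {
    const char *identifier;      /* first string in the sequence */
    const char *const *smiles;   /* second through last strings */
    size_t nsmiles;
};

/* Split seq[0..n) into identifier and SMILES list. Returns 0 on
   success, -1 if an argument is NULL or the sequence is empty. */
int toy_split_xref(const char *const *seq, size_t n, struct toy_xref *out)
{
    if (seq == NULL || out == NULL || n == 0)
        return -1;
    out->identifier = seq[0];
    out->smiles = seq + 1;
    out->nsmiles = n - 1;
    return 0;
}
```

Applied to the three strings of the "dichloroethene" example, this gives one identifier and two SMILES.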

dt_thor_tdtput(Handle tdt, boolean merge) => integer
Writes a TDT to its database (see the THOR-specific description of dt_parent(), below); modifies the timestamp dataitem to reflect the current time.

If the THOR server detects that the TDT is out of date (its timestamp is older than that of the same TDT in the database), then the parameter merge indicates what is to be done:

  • If merge is FALSE, the write simply fails. Note that this function's return value (described below) clearly distinguishes between out-of-date timestamps and other failures, so it is possible to recover gracefully from these unlikely collisions between clients.

  • If merge is TRUE, the data from the database are merged with the data in TDT, the timestamp of TDT is changed to that of the data from the database, and the TDT is written to the database.

The function returns the following values:

1 == Successful write: The TDT object was written to the database, and no problems were detected.

0 == Out-of-date timestamp: The timestamp of the TDT was out of date and merge was FALSE, or another client wrote the TDT as the merge was in progress. The TDT is not stored.

-1 == Error: Some problem was detected (invalid or revoked handle, database is a virtual database, database is NULL_OB, error communicating with the server, etc.). The TDT is probably not stored.
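Because the three return values call for different reactions, a caller will usually branch on them. The following sketch shows one possible policy; the enum names and the suggested follow-up actions are hypothetical and are not prescribed by the Toolkit.

```c
/* Hypothetical names for the three documented return values. */
enum put_result { PUT_OK = 1, PUT_STALE = 0, PUT_ERROR = -1 };

/* Hypothetical follow-up actions a client might choose. */
enum put_action {
    ACT_DONE,              /* nothing more to do */
    ACT_RETRY_WRITE,       /* another client wrote during the merge */
    ACT_REFRESH_OR_MERGE,  /* copy is stale: re-read it or merge */
    ACT_REPORT_ERROR       /* do not assume the TDT was stored */
};

/* Map the return value of a write, plus the merge flag that was
   passed, to a follow-up action. Unknown values count as errors. */
enum put_action next_step_after_put(int rc, int merge_requested)
{
    switch (rc) {
    case PUT_OK:
        return ACT_DONE;
    case PUT_STALE:
        return merge_requested ? ACT_RETRY_WRITE : ACT_REFRESH_OR_MERGE;
    default:
        return ACT_REPORT_ERROR;
    }
}
```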

This function can also have some rather dramatic effects on the structure of the TDT itself, including changing the very datatype and identifier of TDT's root. See the manual page for details.

dt_thor_tdtremove(Handle tdt) => integer
Permanently removes (erases) a TDT from the database. Does not affect the TDT object (i.e. it only affects data in the database, not the TDT object itself).

dt_thor_tdt2str(Handle tdt, boolean expand) => string
Converts the TDT object into its lexical (string) representation. If expand is FALSE, indirect references in the datatree are not expanded; the string representation will contain the "raw" indirect reference identifier. If expand is TRUE, indirect references are expanded.

dt_thor_str2tdt(Handle database, string tdtstr, boolean merge) => tdt
Converts tdtstr, the lexical (string) representation of a TDT, into a TDT object associated with database.

If merge is FALSE, a TDT object representing only the data from tdtstr will be created. Note that without the merge operation, it is possible to create a TDT object with a root identifier and datatype that "conflicts" with one in the database or with an existing TDT object (i.e. the same root identifier/datatype but with different data).

If merge is TRUE, a TDT object representing data from the database, merged with data from the string, is created.

dt_thor_tdtmerge(Handle tdt1, Handle tdt2) => tdt1
Merge the dataitems from the TDT object tdt2 into the TDT object tdt1; deallocate the object tdt2. When the merge is complete, tdt1 contains the set of all unique dataitems from both tdt1 and tdt2, and its timestamp is the newer of tdt1 or tdt2.

dt_thor_tdtput_raw(Handle database, string tdtstr)
Puts a TDT string directly into the database without creating a TDT object, without any normalization, and without any of the safeguards that go with normalizations. Very fast and very dangerous.

This function is designed to be used only for re-loading data that was dumped (see dt_stream(db, TYP_STRING)) from an existing THOR database that was created using the exact same release of the software.

dt_thor_tdtrevise(Handle tdt) => boolean ok
Revises a TDT in preparation for storing it in a THOR database. This function is normally only used internally by dt_thor_tdtput(), but can be called directly if needed. See the description of dt_thor_tdtput() for an explanation of the changes that this causes.

17.3 Dataitem and Datafield Objects

Previous sections described how to open a database and how to get and store a TDT; this section describes how to retrieve and modify the actual data contained in a TDT.

Most of the functions that access and modify a TDT's contents are of the "polymorphic" variety, so there are only a few new functions to be introduced here. In particular, you access the dataitems within a TDT and the datafields within a dataitem using dt_stream(), you access and change the value of a datafield using the same functions that work on string objects, and you find the "verbose tags" (the definitions for each datafield) using dt_tag(), dt_name(), dt_briefname(), dt_summary() and dt_description(). See the datatypes chapter for more information.

The following C code fragment illustrates this idea:

     /* The outer loop is over all dataitems in the TDT.  Note that
        it skips subtrees and their dataitems; we will only see the
        root identifier (1st dataitem) and the data attached to it.
        The inner loop is over the datafields of each dataitem. */

     char *field, *label;
     dt_Handle di_stream, di, df_stream, df;

     di_stream = dt_stream(tdt, TYP_DATAITEM);
     while ( (di = dt_next(di_stream)) != NULL_OB) {
         df_stream = dt_stream(di, TYP_DATAFIELD);
         while ( (df = dt_next(df_stream)) != NULL_OB) {
             field = dt_stringvalue(df);
             label = dt_info(df, strlen("name"), "name");
             /* do something, like print the datafield's label and value */
         }
     }

17.3.1 Functions on Dataitems and Datafields

dt_thor_alloc_dataitem(Handle tdt, Handle datatype) => Handle dataitem
Allocates a new dataitem object in tdt. The dataitem is created with the number and types of fields specified by datatype; the datafields initially contain empty strings (note that these empty strings are not the invalid string -- they are valid strings with no characters in them). Returns a handle to the dataitem, or NULL_OB if a problem is detected.

dt_datatype(Handle dataitem) => Handle datatype
Returns the datatype object associated with dataitem.

dt_dfnorm(Handle datafield, integer norm) => boolean isnorm
Tests the datafield against the given normalization; returns TRUE if that is one of the datafield's normalizations. Normalizations are defined in the Toolkit's header files; for example:

     dt_dfnorm(df, DY_THOR_BINARY) => TRUE if df is binary data
     dt_dfnorm(df, DY_THOR_READONLY) => TRUE if df is read-only

dt_dfnormdata(Handle datafield, integer norm) => string normdata
If a normalization has extra data (e.g. REGEXP, INDIR, SMILES_NTUPLE), returns a string describing that data. For example, if a datatype is defined as:
     $D_V_N<;INDIRECT $I;>|
Then the call dt_dfnormdata(df, DY_THOR_INDIRECT) would return "$I" when df was the second datafield of a CLOGP dataitem.

dt_thor_raw_datafield(Handle datafield) => string fieldvalue
Returns the "raw" value of a datafield. A datafield's contents are normally accessed using dt_stringvalue(), which automatically expands indirect data. This function is like dt_stringvalue() except that it returns indirect datafields without expanding them. For datafields that are not indirect, it is identical to dt_stringvalue(datafield).

dt_thor_moveitem(Handle moveh, Handle afterh) => boolean ok
Moves the object moveh from one place to another in the TDT; in particular, to after the object afterh. There is no way to move an object to before the first object in a TDT, as this would replace the TDT's identifier, nor is it possible to move the first object in a TDT (its root identifier) somewhere else. The objects moveh and afterh must have the same root TDT (although one or the other may be part of a subtree; see below).

There are four possible combinations of object types for objects moveh and afterh (where "==" is shorthand for "is of type"):

1. moveh == TYP_DATAITEM, afterh == TYP_DATAITEM
This move is always legal; the dataitem moveh is moved so that it follows afterh. The two dataitems may be in different subtrees or the root tree to begin with; moveh ends up in the subtree of afterh (it is re-parented).

2. moveh == TYP_DATATREE, afterh == TYP_DATATREE
This is legal if both moveh and afterh are subtrees; moveh is moved so that it follows afterh.

3. moveh == TYP_DATATREE, afterh == TYP_DATAITEM
If afterh is a dataitem of the root of the tree, moveh becomes the first subtree of the TDT. If afterh is a dataitem in a subtree, moveh is moved to after the subtree that afterh is part of.

4. moveh == TYP_DATAITEM, afterh == TYP_DATATREE
This combination is never legal; it doesn't make sense.

Returns TRUE if the move is completed successfully; FALSE if errors are detected. If the move is illegal, the TDT is unaffected. If the move is legal but fails, the TDT will probably be corrupt (such failures are generally caused by corrupt TDTs being passed in; this almost never happens).
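The four combinations, together with the rule that the root identifier cannot be moved, can be written as a legality check. This is a sketch with hypothetical types: it does not verify that both objects share the same root TDT, and it assumes that the root TDT itself can never be the object being moved, which the text states explicitly only for combination 2.

```c
#include <stdbool.h>

/* Hypothetical description of an object passed to dt_thor_moveitem(). */
enum toy_kind { TOY_DATAITEM, TOY_DATATREE };

struct toy_obj {
    enum toy_kind kind;
    bool is_root;   /* root identifier dataitem, or the root TDT itself */
};

bool toy_move_is_legal(struct toy_obj moveh, struct toy_obj afterh)
{
    /* The root identifier (and, by assumption, the root tree) cannot
       be moved. */
    if (moveh.is_root)
        return false;
    if (moveh.kind == TOY_DATAITEM)
        /* Combination 1 is always legal; combination 4 never is. */
        return afterh.kind == TOY_DATAITEM;
    if (afterh.kind == TOY_DATATREE)
        /* Combination 2: both objects must be subtrees. */
        return !afterh.is_root;
    /* Combination 3: a subtree may be placed after any dataitem. */
    return true;
}
```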
