TDTs- THOR Data Trees

TDT is the standard Daylight format for THOR Databases.

TDT format comprises Datatypes, Dataitems, Datafields, and Datatrees

Datatypes are ASCII tags preceding data and are of two types:

Identifiers are data tags that begin with "$" and have other pure data tags associated with them

Pure data tags

The main "root" of a Datatree is traditionally the $SMI identifier, with a USMILES as its data

Dataitems associated with a given data tag appear between <>

Datafields are sub-entries of a Dataitem, separated by ";"

A Datatree is one complete entry in the database for a root identifier; Datatrees are separated by a "|"
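The pieces above can be illustrated with a minimal parsing sketch. It is not Daylight's parser: the function name `parse_tdt` is hypothetical, and it assumes that double quotes protect delimiter characters inside data (as the quoted reaction SMILES elsewhere in these notes suggests) and that quotes never need escaping. The sample `$NAM` data is made up for illustration.

```python
def parse_tdt(text):
    """Parse TDT text into a list of datatrees.

    Each datatree is a list of (tag, [datafield, ...]) tuples.
    Assumption: double quotes protect '<', '>', ';' and '|' inside data.
    """
    trees, items = [], []
    tag, fields, buf = [], None, []
    in_quotes = False
    for ch in text:
        if fields is None:              # outside <...>: reading a tag or "|"
            if ch == "<":
                fields, buf = [], []
            elif ch == "|":             # end of one datatree
                trees.append(items)
                items = []
            elif not ch.isspace():
                tag.append(ch)
        elif ch == '"':
            in_quotes = not in_quotes   # quotes themselves are dropped
        elif in_quotes:
            buf.append(ch)
        elif ch == ";":                 # datafield separator
            fields.append("".join(buf))
            buf = []
        elif ch == ">":                 # end of the dataitem
            fields.append("".join(buf))
            items.append(("".join(tag), fields))
            tag, fields = [], None
        else:
            buf.append(ch)
    if items:                           # tolerate a missing final "|"
        trees.append(items)
    return trees


sample = '$SMI<CCO>\n$NAM<ethanol;common>\n|\n$SMI<"BrCC=C>>ICC=C">\n|'
for tree in parse_tdt(sample):
    print(tree)
```

Each datatree comes back as a list of dataitems, with the root identifier first and each dataitem already split into its datafields.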

EXAMPLE TDT

Datatype Definitions- Gives meaning to data stored in THOR Databases

Daylight has a standard set of "reserved datatypes"

$DY_ROOT/data/datatypes/std_datatypes.dcis.tdt

Example Datatype Definition for $SMI

$D<$SMI>
_V<SMILES>

$D<tag>	Defines the internal tag of a new datatype.
_V<vtag[;vtag...]>	The verbose tags. This datatype is the only required part of a datatype definition. It serves two purposes: First, it defines the number of datafields in the datatype being defined. Second, it provides the "verbose tags" (human-readable labels) for the datafields of the datatype being defined.
_B<btag[;btag...]>	The brief tags. A short name for the datatype suitable for labelling buttons and putting in "pull-down menus".
_N<ntype[;ntype...]>	The normalizations of each datafield in the datatype.
*_P<[][;[]...]>*	The Merlin-pool-inclusion flag. Each non-zero-length field is loaded into Merlin's in-memory pool when Merlin opens the pool. The _P<!> inclusion flag creates a row in Merlin from each subtree rooted by the identifier datatype to which it applies. For example, if you set this flag in the $NAM datatype definition, Merlin would create a row for the $SMI (which is standard), but it would also create a row for each $NAM.
_S<summary>	One-line summary of the datatype's meaning and use.
_D<description>	Long description of the datatype's meaning and use.
_M<set>	Set membership of the datatype. For administration.
_C	Comment. You can put anything you like in this datatype.
_O	The "owner" of the datatype. You can put anything you like in this datatype.
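As a rough illustration of the two roles of _V<>, the sketch below takes the verbose tags from the JA definition shown later in these notes, counts the datafields they imply, and uses them to label the fields of the JA dataitem from the same example. The helper names are hypothetical and this is not part of any Daylight toolkit.

```python
def verbose_tags(v_data):
    """Split the data of a _V<> dataitem into its verbose tags."""
    return v_data.strip('"').split(";")


def label_fields(v_data, item_data):
    """Pair each datafield of a dataitem with its verbose tag."""
    tags = verbose_tags(v_data)
    fields = item_data.split(";")
    if len(fields) != len(tags):
        raise ValueError("dataitem has %d fields, definition expects %d"
                         % (len(fields), len(tags)))
    return dict(zip(tags, fields))


ja_v = ('"Journal article;Author(s);Institution;Citation;'
        'Keywords;Language;Year;Document ID"')
ja_item = "130505;111512;;134485;5~3297~2;ENG;1986;06X0203 87"

print(len(verbose_tags(ja_v)))        # number of datafields in JA
print(label_fields(ja_v, ja_item))
```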

STANDARDIZATIONS/NORMALIZATIONS

USMILES -- unique SMILES

ASMILES -- absolute SMILES

USMILESANY -- unique SMILES, unrelated to root
ASMILESANY

WHITE0 -- zap spaces

WHITE1 -- zap 2 or more spaces

WHITE2 -- zap 3 or more spaces

UPCASE -- convert to uppercase

DOWNCASE -- convert to lowercase

NOPUNCT -- zap punctuation

CASNUM -- inserts hyphens and verifies checksum
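A sketch of what a CASNUM-style normalization might do, using the publicly documented CAS Registry Number check-digit rule (the last digit equals the weighted sum of the preceding digits, weighted 1, 2, 3, ... from the right, modulo 10). The function name is hypothetical, and raising an error on a bad checksum is an assumption about behaviour, not a description of what Thor does.

```python
def normalize_casnum(raw):
    """Return a CAS number as NNNN-NN-N after verifying its check digit."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if not 5 <= len(digits) <= 10:
        raise ValueError("a CAS number has 5 to 10 digits: %r" % raw)
    body, check = digits[:-1], int(digits[-1])
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    if total % 10 != check:
        raise ValueError("checksum mismatch in %r" % raw)
    return "%s-%s-%s" % (digits[:-3], digits[-3:-1], digits[-1])


print(normalize_casnum("7732185"))   # water
print(normalize_casnum("64-17-5"))   # ethanol, already hyphenated
```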

INTEGER16 -- numeric data
INTEGER32
REAL32
REAL64

These normalizations indicate that the datafield is a 16- or 32-bit integer or a 32- or 64-bit real number, respectively. Thor does no actual range checking to verify that the datafield meets the specification. However, the Merlin system can often use these specifications to allocate storage more efficiently, and to greatly increase the speed of certain searching and/or sorting operations.

BINARY -- binary data

SMILES_NTUPLE -- SMILES-order n-tuple data
AUTOGEN -- Generate a new dataitem using this dataitem's contents

MAKERXNMOL is a variation of the AUTOGEN normalization. It takes a comma-separated list of three datatype tags as a second field. Each time a dataitem is entered for this field, it is parsed into its dot-separated component parts. A new dataitem is created for each component within the datatree. The datatype tag for each component depends upon the role of the component in the reaction.

For example, consider a database in which the ISM<> datatype is defined with the "MAKERXNMOL $RMOL,$AMOL,$PMOL" normalization and the three datatypes $RMOL<>, $AMOL<>, and $PMOL<> each have the USMILES_ANY normalization. If the following datatree is entered:

      $SMI<"BrCC=C>>ICC=C">
      $RNO<12345>
      ISM<"BrCC=C>CC(=O)C.CCC(=O)C>ICC=C">
      |

the datatree actually stored in Thor, after normalization, would be as follows:

      $SMI<"BrCC=C>>ICC=C">
      $RNO<12345>
      ISM<"BrCC=C>CC(=O)C.CCC(=O)C>ICC=C">
      $RMOL<BrCC=C>
      $AMOL<CC(=O)C>
      $AMOL<CCC(=O)C>
      $PMOL<ICC=C>
      |
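The splitting step in this example can be sketched as follows. The function name is hypothetical, and the sketch only splits the reaction SMILES by role and by component; it does not canonicalize the components, which the USMILES_ANY normalization on the target datatypes would do in a real database.

```python
def make_rxn_mol(reaction_smiles, tags=("$RMOL", "$AMOL", "$PMOL")):
    """Split a reaction SMILES into (tag, component) dataitems by role.

    The three tags are applied to reactants, agents, and products in turn.
    """
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected reactants>agents>products: %r"
                         % reaction_smiles)
    items = []
    for tag, part in zip(tags, parts):
        for component in part.split("."):
            if component:               # an empty role yields no dataitems
                items.append((tag, component))
    return items


for tag, smiles in make_rxn_mol("BrCC=C>CC(=O)C.CCC(=O)C>ICC=C"):
    print("%s<%s>" % (tag, smiles))
```

Run on the ISM<> data above, this prints one $RMOL, two $AMOL, and one $PMOL dataitem, in the same order as the stored datatree.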

PART_NTUPLE -- Component-order n-tuple data

This normalization is of particular use with the FPP datatype, which consists of a set of N fingerprints corresponding to N dot-disconnected SMILES representing a mixture and/or library. As with SMILES_NTUPLE, an integer argument is required, indicating the number of data per part.

GRAPH -- convert SMILES to GRAPH

MAKEGRAPH -- produce a GRAPH subtree

THOR uses the concept of a molecule's graph to allow retrieval of structures that might be tautomers, isomers, or otherwise an inexact match to a particular SMILES. One of the problems in representing molecules in a computer is that we must choose one valence model as the preferred representation, but there are many valid valence models. The graph of a molecule is an information-deficient representation that removes most valence-model information, allowing greater flexibility in retrieving data.

A molecule's graph is created by removing all isotopic, charge, and bond information from it. All bonds are set to "single", all charges are set to zero, and each atom's hydrogen count is set to the normal lowest valence consistent with its bond configuration. Having removed all of this information, the resulting "molecule" is used to generate a unique SMILES; this is the graph's identifier.
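To make the idea concrete, here is a deliberately crude, string-level sketch. It is not how THOR computes graphs: it handles only SMILES written with the organic subset (no bracket atoms, so no isotopes or charges to strip), drops explicit bond symbols, and rewrites aromatic atoms as aliphatic ones. It also skips the final step of regenerating a unique SMILES, which needs a real chemistry toolkit. The function name is hypothetical.

```python
import re

# Organic-subset atoms, aromatic atoms, bond symbols, and ring/branch syntax.
_TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:/\\]|[0-9%().]")
_BONDS = set("-=#:/\\")
_AROMATIC = set("bcnops")


def crude_graph_string(smiles):
    """Drop bond orders and aromaticity from a simple SMILES string."""
    out = []
    pos = 0
    while pos < len(smiles):
        match = _TOKEN.match(smiles, pos)
        if match is None:               # e.g. a bracket atom such as [NH4+]
            raise ValueError("unsupported SMILES syntax at %d in %r"
                             % (pos, smiles))
        token = match.group()
        if token in _BONDS:
            pass                        # every bond becomes an implicit single bond
        elif token in _AROMATIC:
            out.append(token.upper())   # aromatic atom -> plain atom
        else:
            out.append(token)
        pos = match.end()
    return "".join(out)


print(crude_graph_string("Br.CCCN(CCC)C1CCc2cc(Cl)c(O)cc2C1"))
```

For the $SMI in the SPRESI example later in these notes, the output happens to coincide with the stored $GRF<> identifier; in general the result would still have to be re-canonicalized.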

D3D -- compute 3D hash

The D3D standardization is designed primarily for use with the $D3D datatype. It has two effects: First, it is equivalent to specifying "SMILES_NTUPLE 3" for the 3D data. Second, it causes a "hash code" to be generated from the 3D data; this hash code is stored in the datafield two positions earlier in the dataitem.

INDIRECT

Indirect data are data that are stored separately from the regular data in a database. A field thus marked will contain an indirect reference rather than the actual data of interest. When the TDT is retrieved from the database, an indirect-reference expansion takes place: The indirect reference is looked up in an auxiliary database, and the expansion data replaces the original indirect reference.

EXAMPLE:

spresi95demo_datatypes.tdt ->

$D<"$ISC">
_V<"indirect/citation;citation">

$D<"JA">
_V<"Journal article;Author(s);Institution;Citation;Keywords;Language;Year;Document ID">

$SMI<Br.CCCN(CCC)C1CCc2cc(Cl)c(O)cc2C1>
CL<26439;19;;0.0066>
FP<e4.02.JEEdR5dQk2EZB.e.YU8Gl6DW326Okrv.uEYE43,AAEN.AX01YU3X2e2Ve7E6Iw45.O0177xQ7YInY34...1;2048;194;512;173;1;S95>
TS<199701100502.52>
$GRF<Br.CCCN(CCC)C1CCC2CC(Cl)C(O)CC2C1>
$SNO<1402584-201>
JA<130505;111512;;134485;5~3297~2;ENG;1986;06X0203 87>
SMP<227.00;227.00;;;;;methanol/diethyl ether;;06X0203 87>
|

spresi95_indirect.tdt ->

$ISC<134485;J. MED. CHEM., 29,(1986) N 9, 1615-1627>
|
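The expansion step can be sketched with a plain dictionary standing in for the auxiliary database. Which JA datafield carries the INDIRECT normalization is not shown in the definitions above; the sketch assumes it is the fourth ("Citation") field, since its value 134485 matches the $ISC identifier. The function name is hypothetical.

```python
def expand_indirect(fields, field_index, indirect_db):
    """Replace the indirect reference in one datafield by its expansion.

    indirect_db maps an indirect reference to its expansion data.
    Unknown references are left untouched.
    """
    expanded = list(fields)
    reference = expanded[field_index]
    if reference in indirect_db:
        expanded[field_index] = indirect_db[reference]
    return expanded


# Stand-in for spresi95_indirect.tdt: $ISC<reference;citation>
indirect_db = {"134485": "J. MED. CHEM., 29,(1986) N 9, 1615-1627"}

ja_fields = "130505;111512;;134485;5~3297~2;ENG;1986;06X0203 87".split(";")
print(expand_indirect(ja_fields, 3, indirect_db))
```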

Daylight Chemical Information Systems Inc.

$SMI<OCC>2D<1,2,3,4,5,6>|	(original SMILES)
$SMI<CCO>2D<5,6,3,4,1,2>|	(unique SMILES)
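The reordering in these two lines can be reproduced with a short sketch: when the SMILES is rewritten in unique order, per-atom data must be permuted the same way, keeping each atom's n values together. The function name and the explicit atom mapping are assumptions for illustration; in Thor the mapping would come from the SMILES canonicalization itself.

```python
def reorder_ntuple(values, new_to_old, tuple_size):
    """Reorder per-atom n-tuple data to follow a new atom order.

    new_to_old[i] is the original index of the atom that ends up at
    position i; tuple_size is the number of values stored per atom.
    """
    if len(values) != len(new_to_old) * tuple_size:
        raise ValueError("expected %d values, got %d"
                         % (len(new_to_old) * tuple_size, len(values)))
    reordered = []
    for old in new_to_old:
        reordered.extend(values[old * tuple_size:(old + 1) * tuple_size])
    return reordered


# OCC -> CCO: the atoms appear in reverse order, two values per atom.
print(reorder_ntuple([1, 2, 3, 4, 5, 6], [2, 1, 0], 2))
```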