TDTs -- THOR Data Trees

TDT is the standard Daylight format for THOR Databases.
  • The TDT format is composed of Datatypes, Dataitems, Datafields, and Datatrees.
  • STANDARDIZATIONS/NORMALIZATIONS

  • USMILES -- unique SMILES
  • Interpret the field as a SMILES and convert it to the unique SMILES.
  • ASMILES -- absolute SMILES
  • Interpret the field as an isomeric SMILES (contains isomeric and/or isotopic information); generate the unique isomeric SMILES.
  • USMILESANY -- unique SMILES, unrelated to root
  • ASMILESANY -- absolute SMILES, unrelated to root
  • Like USMILESANY, except the field is interpreted as an absolute SMILES (contains isomeric and/or isotopic information).
  • WHITE0 -- zap spaces
  • Remove all "whitespace" (spaces, tabs, newlines, and carriage-returns) from a field.
  • WHITE1 -- zap 2 or more spaces
  • Convert all occurrences of two or more whitespace characters into a single space.
  • WHITE2 -- zap 3 or more spaces
  • Convert all occurrences of three or more whitespace characters into two spaces.
  • UPCASE -- convert to upcase
  • Translate the characters a-z to their uppercase equivalents A-Z.
  • DOWNCASE -- convert to downcase
  • Translate the characters A-Z to their lowercase equivalents a-z.
  • NOPUNCT -- zap punctuation
  • Remove all punctuation. Punctuation characters are all non-alphanumeric characters.
  • CASNUM -- insert hyphens and verify checksum
  • Chemical Abstracts numbers have the form NNNN-NN-N, where the last digit is a checksum. The CASNUM standardization inserts the hyphens and verifies that the checksum is correct.
  • INDIRECT -- indirect data field
  • The datafield is an indirect reference; on reading, the reference is replaced by its expansion. This is discussed in more depth in the section below entitled Indirect Data.
  • INTEGER16 -- numeric data
    INTEGER32
    REAL32
    REAL64
  • These normalizations indicate that the datafield is a 16- or 32-bit integer or a 32- or 64-bit real number, respectively. THOR does no range checking to verify that the datafield meets the specification. However, the Merlin system can often use these specifications to allocate storage more efficiently, and to greatly increase the speed of certain searching and/or sorting operations.
  • BINARY -- binary data
  • The datafield is binary; that is, it contains arbitrary 8-bit integers. Such datafields are invisibly converted by THOR to remove characters that would otherwise confuse the THOR server and clients. THOR and Merlin clients also use this when formatting a TDT for display.
  • SMILES_NTUPLE -- SMILES-order n-tuple data
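  • It is often necessary to store data about individual atoms in a molecule, that is, data that have a one-to-one correspondence with the atomic symbols in a SMILES string.
  • When the SMILES is converted to its unique form, the per-atom data are reordered to match the new atom order: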
    $SMI<OCC>2D<1,2,3,4,5,6>| (original SMILES)
    $SMI<CCO>2D<5,6,3,4,1,2>| (unique SMILES)
  • AUTOGEN -- generate a new dataitem using this dataitem's contents
  • Takes the tag of a second datatype. Each time a dataitem is entered, a new dataitem is created of the type specified by the tag, and then that datatype's normalizations are applied.
  • MAKERXNMOL -- generate component molecules for a reaction
  • PART_NTUPLE -- component-order n-tuple data
  • GRAPH -- convert SMILES to GRAPH
  • MAKEGRAPH -- produce a GRAPH subtree
  • THOR uses the concept of a molecule's graph to allow retrieval of structures that might be tautomers, isomers, or otherwise an inexact match to a particular SMILES. One of the problems in representing molecules in a computer is that we must choose one valence model as the preferred representation, but there are many valid valence models. The graph of a molecule is an information-deficient representation that removes most valence-model information, allowing greater flexibility in retrieving data.
  • D3D -- compute 3D hash
  • INDIRECT -- example of indirect data:
        $ISC<134485;J. MED. CHEM., 29,(1986) N 9, 1615-1627>
        | 
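
The text-cleaning normalizations listed above (WHITE0, WHITE1, WHITE2, UPCASE, DOWNCASE, NOPUNCT) can be sketched in a few lines of Python. This is only an illustration of the rules as worded in the list, not THOR's own code: the function names are hypothetical, "whitespace" is taken to mean exactly space, tab, newline, and carriage-return, and NOPUNCT is read literally as removing every non-alphanumeric ASCII character (which includes spaces).

```python
import re
import string

# Assumption: "whitespace" means exactly the four characters named in the text.
_WS = r"[ \t\n\r]"

def white0(field):
    """WHITE0: remove all whitespace."""
    return re.sub(_WS + "+", "", field)

def white1(field):
    """WHITE1: turn each run of two or more whitespace characters into one space.
    A single tab or newline is left alone, following the wording literally."""
    return re.sub(_WS + "{2,}", " ", field)

def white2(field):
    """WHITE2: turn each run of three or more whitespace characters into two spaces."""
    return re.sub(_WS + "{3,}", "  ", field)

def upcase(field):
    """UPCASE: translate only a-z to A-Z; other characters are untouched."""
    return field.translate(str.maketrans(string.ascii_lowercase, string.ascii_uppercase))

def downcase(field):
    """DOWNCASE: translate only A-Z to a-z; other characters are untouched."""
    return field.translate(str.maketrans(string.ascii_uppercase, string.ascii_lowercase))

def nopunct(field):
    """NOPUNCT: remove every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", field)
```

For instance, `white1("a  b \t c d")` collapses both multi-character runs but keeps the single space before `d`.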

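
The CASNUM standardization described above can be illustrated with the published CAS check-digit rule: multiply each digit before the check digit by its position counted from the right (starting at 1), sum the products, and take the result modulo 10. The sketch below is an assumption about how such a check might look, not THOR's implementation; the function name `casnum` and the choice to raise `ValueError` on a bad checksum are hypothetical.

```python
import re

def casnum(field):
    """Insert hyphens into a CAS number and verify its check digit.

    Accepts the number with or without hyphens. Returns the hyphenated
    form, or raises ValueError (a hypothetical choice) if it is invalid.
    """
    digits = field.strip().replace("-", "")
    # A CAS number has 2-7 digits, then 2 digits, then 1 check digit.
    if not re.fullmatch(r"[0-9]{5,10}", digits):
        raise ValueError("not a CAS number: %r" % field)

    body, check = digits[:-1], int(digits[-1])
    # Weight each body digit by its position counted from the right.
    total = sum(int(d) * i for i, d in enumerate(reversed(body), start=1))
    if total % 10 != check:
        raise ValueError("bad CAS checksum: %r" % field)

    return "%s-%s-%s" % (digits[:-3], digits[-3:-1], digits[-1])
```

With water's number, `casnum("7732185")` gives `"7732-18-5"`: the weighted sum of 7, 7, 3, 2, 1, 8 is 105, and 105 mod 10 is the check digit 5.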
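
The SMILES_NTUPLE example above ($SMI<OCC> becoming $SMI<CCO>, with the 2D data changing from 1,2,3,4,5,6 to 5,6,3,4,1,2) amounts to permuting per-atom groups of values. The sketch below shows only that reordering step. It assumes the atom permutation is already known (in practice it would come from the SMILES canonicalizer, which is not shown), and that the 2D field carries two values per atom, as the example suggests. The name `reorder_ntuple` and its parameters are hypothetical.

```python
def reorder_ntuple(values, new_order, per_atom):
    """Reorder per-atom data to follow a new atom order.

    values    -- flat list of data, per_atom entries for each atom,
                 in the atom order of the original SMILES
    new_order -- new_order[i] is the 0-based position, in the original
                 SMILES, of the atom that is i-th in the unique SMILES
    per_atom  -- number of values stored for each atom
    """
    if len(values) != per_atom * len(new_order):
        raise ValueError("data length does not match atom count")

    reordered = []
    for old_index in new_order:
        start = old_index * per_atom
        reordered.extend(values[start:start + per_atom])
    return reordered


# OCC -> CCO reverses the three atoms, so the permutation is [2, 1, 0].
print(reorder_ntuple([1, 2, 3, 4, 5, 6], [2, 1, 0], per_atom=2))
# prints [5, 6, 3, 4, 1, 2]
```

Keeping the permutation as an explicit argument separates the data shuffle from the chemistry: any routine that reports the atom order of the unique SMILES could supply it.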
    Daylight Chemical Information Systems Inc.