MUG '97: Database building from text, John Bradshaw

MUG '97 -- 11th Annual Daylight User Group Meeting -- 26 February 1997

Database building from text

John Bradshaw
GlaxoWellcome, Stevenage, Herts SG1 2NY, UK

The database was built from the "Handbook of Enzyme Inhibitors", H. Zollner, VCH, 1993.
Use the EC number (EC#), which is a true classifier, like the Dewey Decimal system for books.
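To illustrate why the EC# works as a classifier, here is a minimal Python sketch that splits an EC number into its four hierarchical levels, much as a Dewey number can be read left to right from broad to specific. The names `ECNumber`, `parse_ec` and `ancestors` are hypothetical helpers invented for this illustration; they are not part of any Daylight tool.

```python
from typing import List, NamedTuple

# The six top-level EC classes in use in 1997 (a seventh was added later).
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


class ECNumber(NamedTuple):
    enzyme_class: int
    subclass: int
    sub_subclass: int
    serial: int


def parse_ec(text: str) -> ECNumber:
    """Parse a fully specified EC number such as 'EC 3.4.21.5'.

    Partial numbers such as '3.4.21.-' are not handled in this sketch
    and raise ValueError.
    """
    text = text.strip()
    if text.upper().startswith("EC"):
        text = text[2:]
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated fields: %r" % text)
    return ECNumber(*(int(p) for p in parts))


def ancestors(ec: ECNumber) -> List[str]:
    """Return the classification path from broadest to most specific."""
    fields = [str(x) for x in ec]
    return [".".join(fields[:n]) for n in range(1, 5)]


ec = parse_ec("EC 3.4.21.5")
print(EC_CLASSES[ec.enzyme_class], ancestors(ec))
```

Because every prefix of the number is itself a meaningful category, enzymes can be grouped at any depth simply by comparing prefixes.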
Use sources such as SWISSPROT and Brookhaven to find enzymes for which there is a 3D structure.
- Abstract the Brookhaven code.
- Experimented with storing GIFs of the binding site.
Manually look up information on these EC#s and create trees which are rooted in $INH<>.
Load them into thor, which merges all the trees correctly.
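As a rough picture of this step, the sketch below writes simplified, TDT-like records rooted in $INH<> from (inhibitor name, EC#) pairs, and combines rows that share a root, which is loosely what the merge on loading achieves. Everything here is an assumption made for illustration: the `EC<>` subordinate tag and the function `build_inh_records` are hypothetical, the example rows are illustrative rather than taken from the handbook, and real TDT quoting rules are not handled.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def build_inh_records(rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Group (inhibitor name, EC number) pairs into one record per name.

    Output lines look like '$INH<NAME>EC<a.b.c.d>|'. The EC<> tag is a
    hypothetical datatype chosen for this sketch only.
    """
    merged = defaultdict(list)
    for name, ec in rows:
        key = name.strip().upper()       # normalise the root identifier
        if "<" in key or ">" in key:
            raise ValueError("angle brackets need quoting, not handled here")
        if ec not in merged[key]:        # same root: merge, skip duplicates
            merged[key].append(ec)
    return [
        "$INH<%s>%s|" % (name, "".join("EC<%s>" % ec for ec in ecs))
        for name, ecs in merged.items()
    ]


# Illustrative rows only (not from the handbook).
rows = [
    ("Aspirin", "1.14.99.1"),
    ("Captopril", "3.4.15.1"),
    ("aspirin", "1.14.99.1"),   # duplicate entry, folded into the first
]
for line in build_inh_records(rows):
    print(line)
```

In practice the merge is done by thor itself; grouping beforehand as above only shows what "one tree per root" means.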
Use nam2smi (contributed) to create trees $SMI<>$INH<>| and merge in thor.
- Caveat (posh name for bug): pre-4.5 thor does not merge on all ambiguous names. If there is a many-$SMI-to-one-$INH relationship, the merge needs to add the subtree to all the new $SMI roots.
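The desired merge behaviour can be sketched in Python as follows. This is not how thor or nam2smi is implemented; it is a minimal model, under the assumption that each $INH tree is a name plus a list of data strings, showing that when one name maps to several SMILES the whole subtree is copied onto every new $SMI root, and that names with no structure stay rooted in $INH<>. The function `reroot_on_smiles` and the `NOTE<>` payloads are hypothetical placeholders.

```python
from typing import Dict, List, Tuple


def reroot_on_smiles(
    inh_trees: Dict[str, List[str]],
    name_to_smiles: Dict[str, List[str]],
) -> Tuple[Dict[str, dict], Dict[str, List[str]]]:
    """Re-root $INH trees on $SMI wherever a structure is known.

    Returns (smi_rooted, inh_rooted). An ambiguous name, one that maps
    to several SMILES, has its subtree added to every matching root.
    """
    smi_rooted: Dict[str, dict] = {}
    inh_rooted: Dict[str, List[str]] = {}
    for name, data in inh_trees.items():
        structures = name_to_smiles.get(name, [])
        if not structures:
            inh_rooted[name] = list(data)   # no structure: stays $INH-rooted
            continue
        for smi in structures:
            node = smi_rooted.setdefault(smi, {"names": [], "data": []})
            if name not in node["names"]:
                node["names"].append(name)
            for item in data:
                if item not in node["data"]:
                    node["data"].append(item)
    return smi_rooted, inh_rooted


# Placeholder subtrees; the NOTE<> payloads carry no real data.
trees = {
    "ASPIRIN": ["NOTE<subtree A>"],
    "LACTIC ACID": ["NOTE<subtree B>"],   # ambiguous: two stereoisomers
    "FATTY ACIDS": ["NOTE<subtree C>"],   # generic, no single structure
}
lookup = {
    "ASPIRIN": ["CC(=O)Oc1ccccc1C(=O)O"],
    "LACTIC ACID": ["C[C@H](O)C(=O)O", "C[C@@H](O)C(=O)O"],
}
smi_rooted, inh_rooted = reroot_on_smiles(trees, lookup)
for smi, node in smi_rooted.items():
    print("$SMI<%s>%s%s|" % (
        smi,
        "".join("$INH<%s>" % n for n in node["names"]),
        "".join(node["data"]),
    ))
for name, data in inh_rooted.items():
    print("$INH<%s>%s|" % (name, "".join(data)))
```

The ambiguous name ends up under both $SMI roots with identical subtrees, which is the behaviour the caveat asks for.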
The resulting roots are now
- $SMI<> with a real structure
- $INH<> because
  1. No structure defined
    - OK
  2. Generic structure e.g. FATTY ACIDS
    - Can be dealt with by *
  3. Can't find name
- Other sources can be prohibitively expensive or impossible to use.
- Many online facilities such as SciFinder are not geared to handle lists.
- Even if they were, it costs $5 per connection table.
Could also add the reaction/transform as data about the enzyme.
dt_wish()
- Virtual databases
  - nam2smi would work better
  - Effectively merge in-house data in a maintainable fashion.
- Information Extraction tools for building databases from literature sources.
- Large, preferably public, database of chemical synonyms