Making Tables from Trees |
EuroMUG'01 -- 13-14 Sept 2001 -- Cambridge, UK
|
Jeremy Yang |
"Making Tables from Trees"
Building DayCart/Oracle tables from Daylight/Thor databases
Jeremy Yang
Daylight CIS
Abstract: The different data models and datatypes of Thor vs.
Oracle should guide the interconversion of datasets between these two
DBMS's. These issues will be described and related to an example
conversion of WDI to DayCart-ready format.
Daylight data structure
- Each server -> multiple DBs of related content.
- Each DB -> N SMILES-rooted Thor Data Trees (other primary IDs also ok).
- Each TDT -> multiple xref-ID-rooted subtrees.
- Subtrees may correspond with isomers.
- Each ID may have associated dataitems.
- Each dataitem may consist of multiple datafields.
- Each field is typed (text, numeric, binary, etc.).
Example TDT
RDB data structure
- Each server -> multiple schema.
- Each server -> multiple tables.
- Related content (i.e., one "DB") typically contained in multiple tables.
- Each table 2D array of distinct rows of typed columns.
- Primary key may uniquely identify each row.
- "Database" typically means an Oracle instance w/ data.
Considerations for read-only compound databases (e.g. WDI, ACD)
- Data curators (Derwent, MDLIS Inc.) have already defined chemical
entities and corresponding primary IDs (DXRN, ACD#).
- Read-only so transaction-controls are not needed.
Supporting ALLNAMES queries
- WDI has many name datatypes (Derwent name, trade name, international
name, approved name, etc.)
- Search-all-names query is desireable (like Thor $NAM feature).
- Oracle VIEW defined for ALLNAMES queries.
CREATE VIEW WDI_ALLNAMES AS (
SELECT DXRN,DERWENT_NAME "NAME" FROM WDI_DRN
UNION SELECT DXRN,US_ADOPTED_NAME FROM WDI_USAN
UNION SELECT DXRN,NAM FROM WDI_NAM
UNION SELECT DXRN,APPR_NAME FROM WDI_APN
UNION SELECT DXRN,TRADENAME FROM WDI_TN
);
Conversion plan / table design
- One main table using existing primary key.
- For each datatype where not more than one occurs per primary id
create a column in the main table. For N-field datatypes
create N columns.
- For each datatype where at least two occurances correspond with
a single primary id value, create an auxillary table using
the main table primary key as foreign key. Both columns are
be non-unique.
WDI TDT summary
WDI tables
ACD TDT summary
ACD tables
Demos:
Open issues
- Performance tuning: indices, table-clustering
- Normalization, constraints, and triggers for supportability and
consistency
- Organizing tables by schema objects. Standardizing table,
column, schema names.
- Datatypes: user-defined, objects (e.g., multi-field types may
map to one complex RDB type instead of many simple)
- DBs without existing primary key. Use SMILES?
- Thor indirect references.
- Defining standard, efficient queries.
- Transaction processing for read/write DBs.
- Security
- Parent structures (not a conversion issue really)
- DBs which have chemical structures but are not essentially
compound databases (e.g., NCI bio + chem datasets).
- Application issues: parsing SQL output, handling CLOBs, binary
data, atom-N-tuple data, etc.
Thanks!
- Jack Delany - DayCart, Oracle know-how
- Mick Kappler - ACD conversion and web apps
EuroMUG'01 -- 13-14 Sept 2001 -- Cambridge, UK
Daylight Chemical Information Systems Inc.
support@daylight.com