MUG '02 - Weininger

MUG '02 -- 16th Daylight User Group Meeting -- 26 Feb - 1 Mar 2002

fedora
Dave Weininger, Daylight CIS

To be covered:

What is fedora, and why bother?
Why federate rather than unify?
HTTP-orientation
Information models
Making it work: a modern delivery mechanism
Demonstration: dayflash
Example servers
To everything, there is a season: fedora's place

1. What is fedora, and why bother?

What
- Federation of Research Assets
- Web-like database system using HTTP communication
- Language is the unifying communications element
- The molecular model provides integration of chemical information

2. What is fedora, and why bother?

Why
- Integration of disparate information is necessary, difficult
  Researchers are limited by fixed data models
- Computers get better, access to information gets more difficult
  Integrating a new type of information is harder now than in 1990
- We're not even reinventing the wheel, we're duplicating it!
  Encapsulate data once, then use it everywhere
- Researchers need to try out new things cheaply
  Special-purpose applications don't serve researchers well

3. Why federate rather than unify?

Theoretically
- Information is best represented by a native data model.
- Each server does the best job representing its data, regardless of application.
- Integration with other information sources is done by each server using local, native data model.
- Servers don't have to know about each other's conventions.
- Cheap to try new things, due to minimal side effects.

Practically
- Each resource can be supported/tested/administered separately.
- Resource modularization allows efficient, specific deployment.
- Scale independence: isolated laptop to collaborative web.
- Potential for zero-administration implementation.
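
A minimal sketch of the federation idea above, assuming each server is reachable through a single lookup keyed by a SMILES string. The server functions, their records, and the name `federated_lookup` are hypothetical illustrations, not part of fedora; plain functions stand in for independent HTTP services.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for independent servers. Each keeps its own native
# data model; the only thing they share is the query language (SMILES here).
# Keys are matched as exact strings; no canonicalization is attempted.

def structure_server(smiles: str) -> List[str]:
    # Native model: a flat structure -> names table (illustrative records).
    names = {"CCO": ["ethanol"], "CC(=O)O": ["acetic acid"]}
    return names.get(smiles, [])

def pathway_server(smiles: str) -> List[str]:
    # Native model: pathway steps that list their participating compounds.
    steps = [
        {"step": "step-1", "compounds": {"CCO", "CC=O"}},
        {"step": "step-2", "compounds": {"CC=O", "CC(=O)O"}},
    ]
    return [s["step"] for s in steps if smiles in s["compounds"]]

def federated_lookup(
    smiles: str, servers: Dict[str, Callable[[str], List[str]]]
) -> Dict[str, List[str]]:
    """Ask every server the same question; a failing server affects only itself."""
    results: Dict[str, List[str]] = {}
    for name, lookup in servers.items():
        try:
            results[name] = lookup(smiles)
        except Exception as exc:
            results[name] = [f"unavailable: {exc}"]
    return results

SERVERS = {"names": structure_server, "pathways": pathway_server}
```

Calling `federated_lookup("CCO", SERVERS)` collects one answer per server. Adding or removing a server changes only the `SERVERS` table, which is the "minimal side effects" property the slide refers to.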

4. HTTP-orientation

Advantages
- HTTP was specifically designed for collaborative computing
- HTTP is now used for most information exchange on this planet
- Flexible: can transport all kinds of data (not just HTML)
- Scalable: from laptop to workgroup to corporate LAN to WWW
- Universal: virtually everyone already knows how to browse HTTP

Disadvantages
- Stateless model is harder to secure
- Real-time applications are limited
  e.g., molecular editing, 3-D display
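
To illustrate the "not just HTML" point, here is a small sketch using only the Python standard library. It serves one record in two non-HTML formats and fetches it back over HTTP. The resource paths, the record, and the helper name `fetch` are assumptions made for this example and say nothing about how fedora servers are actually addressed.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical record, offered as JSON and as a SMILES-style text line.
RECORD = {"smiles": "CCO", "name": "ethanol"}

class DataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/record.json":
            body = json.dumps(RECORD).encode("utf-8")
            ctype = "application/json"
        elif self.path == "/record.smi":
            body = f"{RECORD['smiles']} {RECORD['name']}\n".encode("utf-8")
            ctype = "text/plain; charset=utf-8"
        else:
            self.send_error(404, "unknown resource")
            return
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def fetch(path: str):
    """Start a throwaway server on a free local port, fetch one path, shut down."""
    server = HTTPServer(("127.0.0.1", 0), DataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.getheader("Content-Type"), resp.read().decode("utf-8")
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
```

Each request is independent of the previous one, which is the stateless model listed under the disadvantages: any notion of a session would have to be carried explicitly in the request.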

5. Information models

Here are some fedora database synopses, selected to illustrate the wide variety of information models which are "native" to various kinds of data.

logpstar - observed hydrophobicity as log(P_o/w)
- 11,053 structures, names, observed data of one kind
- "simplest" chemical information model is still non-trivial
- find exact/generic structure, similar structures, find by name

wdi - World Drug Index, pharmacology of named drugs
- 67,059 formal entries containing ~800,000 fields
- 7 pharmacology types (100K items), 12 name types (400K items)
- discrete structures, combination preps and unknown structures
- full complement of structure/name/data searching methods needed

tcm - Traditional Chinese Medicine: structures, indications and effects
- biological entries (sources of Chinese medicinals)
- chemical entries (structures observed in Chinese medicinals)
- information is represented in English, Chinese (Pinyin) and SMILES
- underlying information model is Traditional Chinese Medicine

park - annotated photographic archive of medicinal sources
- photographs of medicinal sources which are mostly biological
- provides best identification available for such entities
- complements Latin/English/Chinese names which are not reliable
- annotations searchable in multiple languages

dcm - Dictionary of Chinese Medicine, encyclopedic
- 220,000+ fields in 4 languages (English, Chinese, Pinyin, Latin)
- underlying information model is Traditional Chinese Medicine
- identifiers are expressed as traditional Chinese characters
- add'l info, e.g., western medical concepts, acupuncture channels
- intrinsic queries are multilingual-multiconceptual

zi4 - Chinese character service
- Chinese character image and mapping services
- Unicode, Traditional- and Simplified-Chinese characters
- Traditional to Simplified translation
- pragmatic utility handles non-trivial requirement

planet - Protein-Ligand Association NETwork
- each data object is a protein-ligand association, i.e., a relationship between one or more proteins and one or more ligands
- e.g., observed binding, computed docking
- large/small molecule search/similarity methods are implemented
- robust in the face of relatively poor oxidation state information typical of crystallographic data
- big and fast

pathos - metabolic pathway chart
- models a modern metabolic pathway chart
- agents, cofactors, compounds, diseases, enzymes, landmarks, notes, pathways, regulators, steps
- represents unified metabolism (plant, unicellular and animal)
- a natural index: most drugs operate in this realm
- massive image representing a massive reaction schema
- integrated name/structure/similarity/functionality searching

ecbook - Enzyme Commission codebook
- EC code index is a discrete model of enzyme functionality
- EC codes are dynamic with time
- many-to-many relationship required due to multifunctional enzymes
- primarily a crossreference and index to other databases

qsar - Quantitative Structure-Activity Relationships
- Each primary entity is an observed relationship between molecular structure and biological activity (6500+) or physical property (7600+)
- Raw data and references are included (13,000+ authors!)
- Search by both component (data) and relationship (QSAR) properties
- Supports relationships of QSAR relationships (comparative QSAR)
- However, other applications might be just as important
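
As one concrete instance of a "native" information model, the many-to-many relationship noted for ecbook above (one enzyme may carry several EC codes, and one EC code may cover several enzymes) can be sketched as a two-way index. The class name and the placeholder labels are hypothetical; this is not ecbook's actual storage scheme.

```python
from collections import defaultdict
from typing import List

class ManyToManyIndex:
    """Two-way index: enzyme -> EC codes and EC code -> enzymes."""

    def __init__(self) -> None:
        self._codes_by_enzyme = defaultdict(set)
        self._enzymes_by_code = defaultdict(set)

    def link(self, enzyme: str, ec_code: str) -> None:
        # Record the association in both directions so either side can be queried.
        self._codes_by_enzyme[enzyme].add(ec_code)
        self._enzymes_by_code[ec_code].add(enzyme)

    def codes_for(self, enzyme: str) -> List[str]:
        return sorted(self._codes_by_enzyme.get(enzyme, ()))

    def enzymes_for(self, ec_code: str) -> List[str]:
        return sorted(self._enzymes_by_code.get(ec_code, ()))

# Placeholder labels only; no real enzymes or EC codes are implied.
index = ManyToManyIndex()
index.link("enzyme-A", "EC-x")
index.link("enzyme-A", "EC-y")  # a multifunctional enzyme: two codes
index.link("enzyme-B", "EC-x")  # a second enzyme sharing a code
```

A one-to-many table keyed by enzyme alone could answer `codes_for` but not `enzymes_for` without a full scan, which is why the relationship is stored in both directions here.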

6. Making it work: a modern delivery mechanism

The goal: deliver large amounts of disparate information in a manner which is robust, reliable, integrated, and operationally trivial.

"Computers get better, access to information gets more difficult." -- if we are to succeed, this needs to be reversed.

Continuing improvements in computer technology make it possible to simplify information access; let's use them.

HTTP-oriented services form a logical beginning: how do we implement a complete solution for research purposes?
One attractive solution: embedded HTTP servers on flash memory devices.
- good capacity, excellent random-access performance
- promises zero-administration solution
- embedded hardware licensing can simplify installation
- provide all required resources: admin privileges not needed
- robust, no moving parts, 10-year persistence
- main disadvantage: this is not the way we already do things!
This has been implemented using USB flash-memory-based drives.
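
The self-configuring, read-only behaviour described above can be approximated with the Python standard library. This is a sketch under stated assumptions, not the dayflash implementation: `start_readonly_server`, `fetch_text`, and the idea of pointing the server at a directory on the device are all hypothetical.

```python
import functools
import threading
from http.client import HTTPConnection
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

def start_readonly_server(data_dir, host="127.0.0.1"):
    """Serve data_dir over HTTP on an OS-chosen port and return the server.

    SimpleHTTPRequestHandler answers only GET and HEAD, so nothing is written
    to the served directory. The default host is loopback; binding to
    "0.0.0.0" instead would expose the data to the network.
    """
    root = Path(data_dir).resolve()
    if not root.is_dir():
        raise NotADirectoryError(str(root))
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(root))
    server = ThreadingHTTPServer((host, 0), handler)  # port 0: nothing to configure
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def fetch_text(server, path):
    """Fetch one path from a running server; return (status, body text)."""
    host, port = server.server_address[:2]
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read().decode("utf-8", errors="replace")
    finally:
        conn.close()
```

The caller reads the chosen port from `server.server_address` and stops the service with `server.shutdown()` followed by `server.server_close()`. Letting the operating system pick the port avoids both a configuration file and the need for elevated privileges to bind a low-numbered port.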

7. Demonstration: dayflash

I plan on plugging a dayflash device into a USB port on a laptop: it should come up serving multiple data sources to the network with no fuss, no bother.

Notice:
- Plug-n-play: self configuring, zero-admin
- Truly minimal interaction, no command line
- Nothing written to disk: no traces, no priv's needed
- Delivers high performance network services

8. Example servers

Available at MUG'02
- tcm472x: http://mainline.daylight.com:8888/
  wdidemo, tcm, dcm, park, zi4, logpstar
- mug02: http://janus.daylight.com/
  above, plus full wdi, qsar, imagine, pathos, and some more
- meta4x: http://meta4x.daylight.com:23456/

9. To everything, there is a season: fedora's place

Excellent for wide dissemination of complex data sources
Good for research; not perfect but better than what we have so far
Does not compete with RDBs for data with known/simple data models
Limited number of databases to be released with Daylight 4.81
HTTP toolkit also to be released with Daylight 4.81

Thanks are due to a large number of people and companies who have contributed (and are still contributing) to this effort in various ways:

Apple ... for still being there
Derwent ... for WDI
Jack Delany and Yosi Taitz, Daylight ... for making room for this work
Nigel Wiseman and Bob Felt, Paradigm ... for DCM
Peter Nielsen, Daylight and Bill Milne, Ashgate ... for TCM
Dawn Abriel ... for medical expertise
Ragu Bharadwaj, Daylight ... for JavaGrins and OS-X enthusiasm
Roger Sayle, Metaphorics ... for rasmol and inspiration
Scott Dixon, Metaphorics ... for planet
Xinjian Yan, NIST ... for TCM
Zhou Jiaju, Chinese Academy of Sciences ... for TCM
Al Leo and Corwin Hansch, BioByte ... for c/logp and QSAR

Thank you for your time and interest.