MUG '02
-- 16th Daylight User Group Meeting -- 26 Feb - 1 Mar 2002
fedora
Dave Weininger, Daylight CIS
To be covered:
- What is fedora, and why bother?
- Why federate rather than unify?
- HTTP-orientation
- Information models
- Making it work: a modern delivery mechanism
- Demonstration: dayflash
- Example servers
- To everything, there is a season: fedora's place
1. What is fedora, and why bother?
- What
  - Federation of Research Assets
  - Web-like database system using HTTP communication
  - Language is the unifying communications element
  - The molecular model provides integration of chemical information
 
2. What is fedora, and why bother?
- Why
  - Integration of disparate information is necessary, difficult
    - Researchers are limited by fixed data models
  - Computers get better, access to information gets more difficult
    - Integrating a new type of information is harder now than in 1990
  - We're not even reinventing the wheel, we're duplicating it!
    - Encapsulate data once, then use it everywhere
  - Researchers need to try out new things cheaply
    - Special-purpose applications don't serve researchers well
 
3. Why federate rather than unify?
- Theoretically
  - Information is best represented by a native data model.
  - Each server does the best job representing its data, regardless of application.
  - Integration with other information sources is done by each server using its local, native data model.
  - Servers don't have to know about each other's conventions.
  - Cheap to try new things, due to minimal side effects.
 
- Practically
  - Each resource can be supported/tested/administered separately.
  - Resource modularization allows efficient, specific deployment.
  - Scale independence: isolated laptop to collaborative web.
  - Potential for zero-administration implementation.
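
The federation idea above can be illustrated with a minimal sketch. Everything in it is hypothetical: the class names `NameIndex` and `PropertyTable`, the function `federated_search`, and the sample data (including the numeric value) are invented for illustration and are not part of fedora. The point is only that each resource keeps its own native data model and interprets a plain-text query in its own way, while the caller needs to know nothing about either model.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Resource(Protocol):
    """Anything that can answer a text query using its own data model."""
    name: str

    def search(self, query: str) -> list[str]: ...


@dataclass
class NameIndex:
    """Hypothetical resource whose native model maps names to SMILES."""
    name: str
    entries: dict[str, str] = field(default_factory=dict)

    def search(self, query: str) -> list[str]:
        # Interprets the query as a case-insensitive name fragment.
        q = query.lower()
        return [f"{n}: {s}" for n, s in self.entries.items() if q in n.lower()]


@dataclass
class PropertyTable:
    """Hypothetical resource whose native model is rows of (SMILES, value)."""
    name: str
    rows: list[tuple[str, float]] = field(default_factory=list)

    def search(self, query: str) -> list[str]:
        # Interprets the same query as an exact SMILES string.
        return [f"{s} = {v}" for s, v in self.rows if s == query]


def federated_search(resources: list[Resource], query: str) -> dict[str, list[str]]:
    """Send one query to every resource; keep the answers grouped by source."""
    return {r.name: r.search(query) for r in resources}
```

A new resource can be added to the list passed to `federated_search` without touching the existing ones, which is the "cheap to try new things" property in miniature.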
 
4. HTTP-orientation
- Advantages
  - HTTP was specifically designed for collaborative computing
  - HTTP is now used for most information exchange on this planet
  - Flexible: can transport all kinds of data (not just HTML)
  - Scalable: from laptop to workgroup to corporate LAN to WWW
  - Universal: virtually everyone already knows how to browse HTTP
 
- Disadvantages
  - Stateless model is harder to secure
  - Real-time applications are limited
    - e.g., molecular editing, 3-D display
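
To make the "not just HTML" point concrete, here is a minimal sketch using only the Python standard library. It is not fedora's server: the handler name `SmilesHandler`, the helper `fetch_once`, and the two-line payload are assumptions made for illustration. The server returns plain-text SMILES lines, and the client reads the `Content-Type` header to learn what it received.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical payload: one SMILES string and a name per line.
SMILES_LINES = "CCO ethanol\nc1ccccc1 benzene\n"


class SmilesHandler(BaseHTTPRequestHandler):
    """Answers every GET with the SMILES payload as plain text."""

    def do_GET(self):
        body = SMILES_LINES.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging in this sketch


def fetch_once() -> tuple[str, str]:
    """Start a throwaway local server, fetch once, return (type, text)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), SmilesHandler)  # port 0: OS picks
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Bypass any proxy settings so the request stays on this machine.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = f"http://127.0.0.1:{server.server_port}/"
        with opener.open(url) as resp:
            return resp.headers.get_content_type(), resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

Each request here is independent of any earlier one, which also shows the stateless model listed under the disadvantages.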
 
5. Information models
Here are some fedora database synopses, selected to illustrate the wide
variety of information models which are "native" to various kinds of data.
- logpstar
  - observed hydrophobicity as log(Po/w)
    - 11,053 structures, names, observed data of one kind
    - "simplest" chemical information model is still non-trivial
    - find exact/generic structure, similar structures, find by name
 
- wdi
  - World Drug Index, pharmacology of named drugs
    - 67,059 formal entries containing ~800,000 fields
    - 7 pharmacology types (100K items), 12 name types (400K items)
    - discrete structures, combination preps and unknown structures
    - full complement of structure/name/data searching methods needed
  
 
- tcm
  - Traditional Chinese Medicine: structures, indications and effects
    - biological entries (sources of Chinese medicinals)
    - chemical entries (structures observed in Chinese medicinals)
    - information is represented in English, Chinese (Pinyin) and SMILES
    - underlying information model is Traditional Chinese Medicine
 
- park
  - annotated photographic archive of medicinal sources
    - photographs of medicinal sources which are mostly biological
    - provides best identification available for such entities
    - complements Latin/English/Chinese names which are not reliable
    - annotations searchable in multiple languages
 
- dcm
  - Dictionary of Chinese Medicine, encyclopedic
    - 220,000+ fields in 4 languages (English, Chinese, Pinyin, Latin)
    - underlying information model is Traditional Chinese Medicine
    - identifiers are expressed as traditional Chinese characters
    - additional information, e.g., western medical concepts, acupuncture channels
    - intrinsic queries are multilingual-multiconceptual
 
- zi4
  - Chinese character service
    - Chinese character image and mapping services
    - Unicode, Traditional- and Simplified-Chinese characters
    - Traditional to Simplified translation
    - pragmatic utility handles non-trivial requirement
 
- planet
  - Protein-Ligand Association NETwork
    - each data object is a protein-ligand association, i.e., a relationship between one or more proteins and one or more ligands
    - e.g., observed binding, computed docking
    - large/small molecule search/similarity methods are implemented
    - robust in the face of the relatively poor oxidation state information typical of crystallographic data
    - big and fast
 
- pathos
  - metabolic pathway chart
    - models a modern metabolic pathway chart
    - agents, cofactors, compounds, diseases, enzymes, landmarks, notes, pathways, regulators, steps
    - represents unified metabolism (plant, unicellular and animal)
    - a natural index: most drugs operate in this realm
    - massive image representing a massive reaction schema
    - integrated name/structure/similarity/functionality searching
 
- ecbook
  - Enzyme Commission codebook
    - EC code index is a discrete model of enzyme functionality
    - EC codes are dynamic with time
    - many-to-many relationship required due to multifunctional enzymes
    - primarily a crossreference and index to other databases
 
- qsar
  - Quantitative Structure-Activity Relationships
    - Each primary entity is an observed relationship between molecular structure and biological activity (6500+) or physical property (7600+)
    - Raw data and references are included (13,000+ authors!)
    - Search by both component (data) and relationship (QSAR) properties
    - Supports relationships of QSAR relationships (comparative QSAR)
    - However, other applications might be just as important
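
As one illustration of how a "native" model shapes a database, the many-to-many relationship noted for ecbook could be sketched as a pair of mirrored indexes. This is a hypothetical structure, not ecbook's implementation: the class name `CrossReference`, its methods, and the enzyme names are invented, and the enzyme-to-code pairings in the usage are placeholders rather than biological facts.

```python
from collections import defaultdict


class CrossReference:
    """Bidirectional many-to-many index between enzymes and EC codes."""

    def __init__(self) -> None:
        self._codes_by_enzyme: dict[str, set[str]] = defaultdict(set)
        self._enzymes_by_code: dict[str, set[str]] = defaultdict(set)

    def link(self, enzyme: str, ec_code: str) -> None:
        """Record that an enzyme carries a code; store both directions."""
        self._codes_by_enzyme[enzyme].add(ec_code)
        self._enzymes_by_code[ec_code].add(enzyme)

    def codes_for(self, enzyme: str) -> list[str]:
        """All codes assigned to one enzyme (several if multifunctional)."""
        return sorted(self._codes_by_enzyme.get(enzyme, ()))

    def enzymes_for(self, ec_code: str) -> list[str]:
        """All enzymes sharing one code."""
        return sorted(self._enzymes_by_code.get(ec_code, ()))


# Placeholder data: pairings are invented for illustration only.
xref = CrossReference()
xref.link("enzyme_A", "EC 1.1.1.1")
xref.link("enzyme_A", "EC 2.7.1.1")  # a multifunctional enzyme has two codes
xref.link("enzyme_B", "EC 1.1.1.1")  # two enzymes can share one code
```

Because both directions are stored, the same structure serves as a crossreference from a code to entries in other databases and back again.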
 
6. Making it work: a modern delivery mechanism
The goal: deliver large amounts of disparate information in a manner which is robust, reliable, integrated, and operationally trivial.
- "Computers get better, access to information gets more difficult." If we are to succeed, this needs to be reversed.
- Continuing improvements in computer technology make it possible to simplify information access; let's use them.
- HTTP-oriented services form a logical beginning: how do we implement a complete solution for research purposes?
- One attractive solution: embedded HTTP servers on flash memory devices.
  - good capacity, excellent random-access performance
  - promises a zero-administration solution
  - embedded hardware licensing can simplify installation
  - provides all required resources: admin privileges not needed
  - robust, no moving parts, 10-year persistence
  - main disadvantage: this is not the way we already do things!
  
 
- This has been implemented using USB flash-memory-based drives.
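
The zero-administration idea can be sketched with the Python standard library. This is not the dayflash software: the function `serve_read_only` and the class `QuietHandler` are hypothetical, and the sketch binds to the loopback address for safety, whereas a real device would serve the network. It assumes the data already sits in a directory (for example, a mounted flash drive), asks the operating system for a free unprivileged port, and answers only read requests.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class QuietHandler(SimpleHTTPRequestHandler):
    """Static-file handler; the base class implements only GET and HEAD."""

    def log_message(self, format, *args):
        pass  # write no logs: nothing is recorded on the host


def serve_read_only(directory: str) -> ThreadingHTTPServer:
    """Serve files under `directory` over HTTP without any configuration.

    Port 0 lets the operating system choose a free port, so no admin
    privileges and no port setting are needed. The caller reads the
    chosen port from `server.server_port`.
    """
    handler = functools.partial(QuietHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_read_only(path)` returns immediately with a running server; `server.shutdown()` followed by `server.server_close()` stops it.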
7. Demonstration: dayflash
I plan on plugging a dayflash device into a USB port on a laptop:
it should come up serving multiple data sources to the network with
no fuss, no bother.
- Notice:
  - Plug-n-play: self-configuring, zero-admin
  - Truly minimal interaction, no command line
  - Nothing written to disk: no traces, no privileges needed
  - Delivers high-performance network services
 
8. Example servers
- Available at MUG'02
  - tcm472x: http://mainline.daylight.com:8888/
    - wdidemo, tcm, dcm, park, zi4, logpstar
  - mug02: http://janus.daylight.com/
    - above, plus full wdi, qsar, imagine, pathos, and some more
  - meta4x: http://meta4x.daylight.com:23456/
    - planet, pathos, ecbook, wdidemo
 
9. To everything, there is a season: fedora's place
- Excellent for wide dissemination of complex data sources
- Good for research; not perfect but better than what we have so far
- Does not compete with relational databases for data with known/simple data models
- Limited number of databases to be released with Daylight 4.81
- HTTP toolkit also to be released with Daylight 4.81
Thanks are due to a large number of people and companies who have contributed
(and are still contributing) to this effort in various ways:
- Apple ... for still being there
- Derwent ... for WDI
- Jack Delany and Yosi Taitz, Daylight ... for making room for this work
- Nigel Wiseman and Bob Felt, Paradigm ... for DCM
- Peter Nielsen, Daylight and Bill Milne, Ashgate ... for TCM
- Dawn Abriel ... for medical expertise
- Ragu Bharadwaj, Daylight ... for JavaGrins and OS-X enthusiasm
- Roger Sayle, Metaphorics ... for rasmol and inspiration
- Scott Dixon, Metaphorics ... for planet
- Xinjian Yan, NIST ... for TCM
- Zhou Jiaju, Chinese Academy of Sciences ... for TCM
- Al Leo and Corwin Hansch, BioByte ... for c/logp and QSAR
Thank you for your time and interest.