The online.daylight.com experiment

Dave Weininger, Daylight CIS
Euromug 2001, Cambridge

Question:

What does it take to deploy chemical information sources on WWW?

Considerations:


online.daylight.com: content

Servers deliver very heterogeneous data.

A variety of implementation architectures are used, partly due to design constraints, partly for experimentation. Data are highly connected by logical links, many of which represent chemical relationships (i.e., identity or similarity).

Conventional chemical information

pow (Po/w)
pow serves information about n-octanol/water partitioning, including CLOGP (computed) and LogPstar (measured) values from BioByte, Inc. CLOGP computation is done by an external program object (clogptalk). LogPstar data are obtained from the same Thor datatrees used in the medchem00 database. This server demonstrates the implementation of mixed compute and data services as a single HTTP-oriented resource, as well as the ability to deliver encapsulated computational resources (program objects). [Experimental. This is the only server in this set which is not implemented locally, due to the lack of an OS-X Fortran compiler.]

quasar (quantitative structure-activity relationships)
quasar serves information about Quantitative Structure-Activity Relationships. It is implemented as an interface to a live Thor database (BioByte's QSAR database, bbq). quasar's primary data objects are observed relationships between molecular structure and biological activities (6500+) or physicochemical properties (7600+) for a given set of molecules. quasar is designed to serve the original data, relationships between the QSAR sets of molecules, primary QSAR relationships, and relationships between them (comparative QSAR). quasar is one of the earliest fedora services (implemented before the HTTP toolkit). This project is currently "on hold" with no firm release plans. Consequently, the quasar server has not been fully developed and it does not establish data relationships with other fedora servers. [Experimental.]

wdi (world drug index)
The World Drug Index by Derwent, a large database of pharmaceutical information about named drugs including registered drugs and trial preparations. The wdi server uses the same data (same datatrees) as the production Thor database. The demo version available here contains under 5000 of the 65000+ entries in wdi003. wdi establishes data relationships with tcm and planet (as available). [Of the servers in this set, wdi has the most complete interface and documentation.]

savvy
savvy is a small, experimental server which provides molecular surface area and volume computations. This HTTP server emerged as a tool (rather than a product) which was useful in development of a computational algorithm. Currently, it only provides molecular volume calculation. But it's not without interest: it includes an implementation of the recent analytical solution to the general triple spherical intersection problem. It's not hooked up to anything else yet, but it should be ;-) [Experimental.]
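
The triple-intersection solution that savvy implements is not reproduced here. As a simpler illustration of the same kind of analytical overlap term, the sketch below uses the standard closed-form volume of the lens shared by two overlapping spheres to get the volume of a two-sphere union. The function names are hypothetical and are not part of savvy.

```python
import math

def sphere_volume(r):
    """Volume of a sphere of radius r."""
    return 4.0 / 3.0 * math.pi * r ** 3

def lens_volume(r1, r2, d):
    """Volume shared by two spheres of radii r1 and r2 whose centers are d apart."""
    if d >= r1 + r2:
        return 0.0                          # spheres do not overlap
    if d <= abs(r1 - r2):
        return sphere_volume(min(r1, r2))   # smaller sphere lies inside the larger
    # Standard closed form for the sphere-sphere intersection (lens) volume.
    return (math.pi * (r1 + r2 - d) ** 2
            * (d * d + 2.0 * d * (r1 + r2) - 3.0 * (r1 - r2) ** 2)
            / (12.0 * d))

def union_volume(r1, r2, d):
    """Volume of the union of two spheres, by inclusion-exclusion."""
    return sphere_volume(r1) + sphere_volume(r2) - lens_volume(r1, r2, d)
```

For a molecule with more than two atoms, inclusion-exclusion also needs triple (and higher) overlap terms, which is where an analytical triple-intersection solution becomes useful.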

Chinese medicine

tcm (traditional chinese medicines)
An electronic version of the book Traditional Chinese Medicines by Yan, Zhou, Xie and Milne, as published by Ashgate. The tcm server uses the same data (same datatrees) as the production Thor database tcm00. tcm's basic trick is a marriage of two kinds of data objects: biological entries (i.e., the sources of Chinese medicines) and chemical entries (structures observed in those sources). Most tcm data are in English and SMILES, with Latin and Pinyin data used for crossreferencing. tcm is loosely coupled to dcm and park, and establishes data relationships with wdi and planet.

zi4 (chinese character server)
zi4 is a non-interactive utility which provides various kinds of Chinese character image and mapping services to other servers. It knows about Unicode, traditional Chinese characters (BIG5), simplified Chinese characters (GB), and Pinyin tone symbols. It also knows about specialized Chinese characters used in medicine and pharmacology (this is not yet complete). zi4's main trick is delivering Chinese characters in a way which is useful on any web browser (GIF89a's). It also implements a neat algorithm for fast and reliable translation of Traditional Chinese to Simplified Chinese. [Someday, though not soon, we will all use Unicode and zi4 will not be needed.]

dcm (Dictionary of Chinese Medicine)
An electronic version of the book A Practical Dictionary of Chinese Medicine by Wiseman and Feng, as published by Paradigm. The dcm server processes data in a specialized format used for multilingual typesetting (CTEX). dcm is more like an encyclopedia than a dictionary. dcm maintains a large number (220,000+) of searchable fields in four languages (English, traditional Chinese, Pinyin, and Latin) and a huge number (many millions) of derived relationships. dcm is tightly coupled to zi4 which allows Chinese readers the option of displaying traditional or simplified characters.

park (photo archive)
park is an HTTP-based archive for annotated images such as photographs. Its intended purpose here is to provide images of the sources of Chinese medicines (i.e., photos of plants and animals). There isn't much content yet, due to the [surprising] difficulty in obtaining good photographs of the correct species. [Experimental.]

Biochemistry/bioinformatics/biocomplexes

ecbook (Enzyme Commission code book)
Metaphorics' ecbook provides convenient searching of enzymes by functionality and name. ecbook was developed as an internal tool to aid pathos data entry, but it has been so useful that it might end up as a product in its own right. [Experimental.]

pathos (metabolic pathway server)
Metaphorics' pathos server provides a model of the metabolic pathway chart. It is intended for both exploration of metabolism and as an index to data from other sources. Compounds, cofactors, enzymes, agents, regulators, steps and pathways are maintained as "live objects" (e.g., steps and pathways are represented by Daylight reaction objects). Landmarks, genetic diseases and other notes are also integrated. This server is a proof-of-principle; only 5-10% of data entry is complete. [Experimental.]

planet (protein-ligand association network)
Each primary data object in Metaphorics' planet server represents a protein-ligand association, i.e., one or more proteins and one or more ligands, with an association such as an observed binding or computed docking. A number of search methods are provided for exploring the relationships of proteins, ligands and their complexes. E.g., a novel method is used to evaluate ligand similarity which operates well with the poor oxidation state information typical of crystallographic data. planet is designed for exploratory data analysis of archived results rather than as a compute service (e.g., for docking). Currently, the planet dataset consists of ~500 observed complexes from the PDB. planet establishes data relationships with ecbook, pathos, tcm and wdi (as available). [Alpha.]

Utilities

fedora (federation of research assets)
The front door to the current set of fedora services. In privileged installations, the fedora server lives at port 80 (the default http service) so it can be accessed with just a machine name, e.g.,
http://online.daylight.com/
fedora will refer requests to other servers, e.g., the full URL of the tcm home page is
http://online.daylight.com:26551/tcm/index.html
but it can also be accessed via
http://online.daylight.com:80/fedora/tcm/index.html
or just:
http://online.daylight.com/tcm
Starting with beta-5, fedora also provides a login service.
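
The equivalence of the URL forms listed above for the tcm home page can be sketched as a small lookup. This is only an illustration of the mapping, not fedora's actual referral code: the registry dictionary and the function name are hypothetical, and the only port taken from the text is 26551 for tcm.

```python
from urllib.parse import urlsplit

# Hypothetical registry of server name -> port; only tcm's port appears in the text.
SERVER_PORTS = {"tcm": 26551}

def direct_url(fedora_url, default_page="index.html"):
    """Map a URL addressed to the fedora front door to the server's own URL.

    Returns None when the path does not name a known server.
    """
    parts = urlsplit(fedora_url)
    segments = [s for s in parts.path.split("/") if s]
    if segments and segments[0] == "fedora":
        segments = segments[1:]             # drop the optional /fedora prefix
    if not segments or segments[0] not in SERVER_PORTS:
        return None
    name, rest = segments[0], segments[1:]
    page = "/".join(rest) if rest else default_page
    return f"http://{parts.hostname}:{SERVER_PORTS[name]}/{name}/{page}"
```

Under these assumptions, both the short form and the /fedora/ form resolve to the same full tcm URL, while a path naming an unregistered server yields None.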

gold and dlog (server and daemon journal, daemon log)
gold is an HTTP server which maintains log files for fedora servers and daemons. dlog is an HTTP client program which submits log requests to gold. (Functionally, they are sort of mirror images of each other, in case you hadn't guessed from their names.) Both are utilities, mainly designed for use by system administrators rather than end-users. [beta]

grind (grins daemon)
grind provides services which are needed to disseminate Daylight's JavaGrins molecular editor applet. grind does its work behind the scenes, e.g., when JavaGrins generates canonical SMILES or shows menus of templates, there's some grinding going on. grind simulates a Java RMI server: in practice it obviates the need for a JRE on a server delivering JavaGrins. It is the oddball member of the fedora server family -- it doesn't use the HTTP toolkit (it talks Java RMI) -- although it does respond to HTTP requests for monitoring. Our plans are to migrate JavaGrins to use HTTP communication in future Daylight releases. In fact, dayutilserver 4.72 is the very same program as grind, but with gold logging disabled. [beta]

imagine (image engine)
The imagine server is primarily a non-interactive utility which provides various kinds of image generation, layouts and conversions. Its services are normally accessed via URIs. In principle, it is designed to be a central resource for invented images. This imagine service is currently limited to generating molecular depictions and is used by other servers such as pow and savvy. [Experimental.]

sandman (server and daemon manager)
sandman is the server which runs all other fedora servers. Its main job is to start the other servers and keep them running. sandman logs its activity with gold. sandman's HTTP user interface is limited to a monitoring page and documentation for system administrators. [beta]
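
The "start the other servers and keep them running" job can be sketched as a single supervision pass that is called repeatedly. This is a minimal illustration under stated assumptions, not sandman's implementation: the function name, the command table and the polling approach are all hypothetical, and logging to gold is omitted.

```python
import subprocess
import sys

def ensure_running(commands, running):
    """Start every command whose process is missing or has exited.

    commands: dict of server name -> argv list (hypothetical configuration).
    running:  dict of server name -> subprocess.Popen, updated in place.
    Returns the names that were started or restarted during this pass.
    """
    restarted = []
    for name, argv in commands.items():
        proc = running.get(name)
        if proc is None or proc.poll() is not None:   # never started, or exited
            running[name] = subprocess.Popen(argv)
            restarted.append(name)
    return restarted
```

A manager built this way would call ensure_running in a loop with a short sleep between passes, and would report each restart to its log.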

testhttp (http-toolkit test program)
The testhttp server is an example program which demonstrates basic functionality of Daylight's HTTP Toolkit. When this toolkit is released (v4.81) it is likely to contain source code for a program very much like testhttp, for the benefit of those programmers who will be using the toolkit. [beta]

online.daylight.com: fedora strategy

Use resources to simplify human and machine data access.


online.daylight.com: access

Restricting access is tricky. Total access IRL is trickier.

Fedora provides a "service domain"
Fedora also provides a user/password registration
"Total access" means "serve everyone"
Robot-proofing

online.daylight.com: scalability

Identical software should operate at all deployment scales.


online.daylight.com: reliability

Should meet the highest standards of reliability and performance.


online.daylight.com: hardware

Hardware selection is a mission-specific choice.


online.daylight.com: experience

What did we learn?