David Weininger
Daylight Chemical Information Systems, Inc.
Santa Fe, New Mexico
One of the most powerful Web concepts is the Common Gateway Interface (CGI), a mechanism to deliver computational results and a meta-GUI to an end user over the net. Given CGI-capable servers and clients, the "Web" becomes what we used to think of as a "computer" and computers become more like abstract "computational resources".
The Daylight Common Gateway Interface (DCGI) is a chemical information interface which operates using Web technology. Such technology can be used on an isolated machine, in a secure network, or in the global WWW environment. The Web provides a great framework and many general purpose utilities but was not designed to deliver chemical information interactively and is deficient in this respect. DCGI provides tools, utilities, and meta-interfaces required to successfully deliver chemical information in the Web environment.
In theory, user-interface programs should have become less precious since maintaining data integrity is now a server responsibility. In fact, this simplification was more than offset by the added complexity required to create a reliable, event-driven GUI. Such systems remained expensive largely due to the added effort required to build complex client interfaces.
This architecture should result in major improvements in the availability and cost of chemical information access. Since the entire information delivery interface is provided "free" to the developer, a system based on a "zero-cost seat" should be possible. Remaining issues include access control, high-volume data delivery and handling the special requirements of chemical information.
E.g., Netscape, MacWeb, HotJava, Mosaic, etc.
E.g., from NCSA, CERN, Netscape, Sun, etc.
E.g., thorserver, merlinserver, spresi95, wdi95, etc.
Each DCGI release comes with a list of "published" URLs. Published URLs are supported, i.e., they're documented and you can count on them being there and working in future releases. The /dayhtml/published.html is an example.
/dayhtml/doc includes all man(1) pages (with hyperlinked see-also's), all user guides, and the administrator, theory, and programmer reference manuals. Hard copy documentation is still available, but why lug it around?
Such CGI programs are normal Daylight Toolkit programs which exchange information in HTML and other WWW formats.
E.g., the word GLYMIDINE is linked to a DCGI program which samples a conformation of the molecule (using rubicon) and downloads it to rasmol via the HTML browser.
As always, the tools we use at Daylight are available as supported products. The DCGI Toolkit provides tools which allow you to add chemical information to your own HTML files and to create custom HTML interfaces. E.g., the smi2gif program makes it simple to add on-the-fly chemical structural depictions to an HTML file:
Compared to any special-purpose chemical information end-user interface, HTML browsers are amazingly capable, reliable and inexpensive. There is no mystery to this. A program such as Netscape is beta tested by literally millions of users and its development cost is spread out widely. No chemical information interface will ever have this advantage -- there just aren't that many people interested in chemistry. (The same is true for HTTP servers such as httpd.)
There is a growing niche for public chemical information software. Much of this is oriented to format conversion and interoperability, but some chemical freeware has established itself as a standard in sophisticated areas (e.g. rasmol). The DCGI approach is more of an "organization between collaborating programs" than a "monolithic single-vendor system" and it makes it very easy to take advantage of such software.
There is something magical about HTML browsers. Despite the fact that they aren't actually easier to use or set up than conventional GUI interfaces, an enormous number of people have embraced them. The fact that they are freely available and generally useful seems to empower otherwise "computer-timid" users. The same user who is unwilling to spend 10 minutes reading a manual for a special purpose interface will gladly spend an hour setting up an HTML helper program on their personal computer. The reason, of course, is that the perceived value of an HTML browser is extremely high, since it opens up a whole world of information exchange. In any event, this phenomenon is wonderful from the user training point of view.
Hardware needed to implement the DCGI interface system includes a capable Unix box to host the servers, user workstations or personal computers which run Netscape, and an IP network to tie them together. In most cases, all the hardware needed is already in place.
Chemical companies in general do not provide adequate chemical information access to most of their employees who need such information. On one extreme, database managers and drug designers get excellent access. On the other extreme, temporary employees in the stockroom typically don't get any (although they probably have an equally valid need). The reality is that providing enough information access to optimize the productivity of each employee is very expensive due to the per-seat cost. It's simple math, e.g., $5000 x 1000 users is $5 million per year.
Any time a chemical information systems vendor produces a commercial program to be run by an end-user, the development must be paid for and the product must be supported. All extant chemical information systems work this way and all of them are expensive. By using HTML browsers, however, the chemical information interfaces can be implemented on a server and the theoretical need for a per-seat cost disappears.
The Web technology underlying the DCGI system fundamentally evolved to provide widely-distributed information exchange. By design, the resultant DCGI system scales up smoothly from a single isolated system with a single user to a global information service with 1000's of users.
The same arguments that apply to interface servers and clients apply to secure HTML servers (SHTML). No security system can claim absolute perfection, but it's reassuring to use a system which is under constant use and attack by a lot of clever people. To date, no chemical information system has been subject to such stringent testing.
The "collaborating programs" approach used by DCGI simplifies prototyping and developing chemical information interfaces (compared to using other GUIs). The DCGI system allows most task-oriented programs to reference each other, so such interfaces can be used as tools for other "programs" without relinking or recompiling.
However, the rest of the task in producing a production quality program is not equally simplified. DCGI doesn't dramatically reduce the effort needed to define a problem, establish an acceptable interface, document it, and respond to conflicting user requirements.
Most of the advantages of DCGI derive from the widespread use and development of the HTML protocol. Such an enormous amount of information is being delivered via the World Wide Web that we are assured of the survival of this technology for many years. Our current interfaces (e.g., xvthor, xvmerlin) have grown so complex that extending them has become a significant effort. The flexibilities inherent in the DCGI system are barely tapped.
Computational methods used in chemical information systems have historically lagged behind those used in the mainstream of computer science by about a decade. With DCGI, we're closing the technology gap to 2-3 years. This may make some people nervous, and for some good reasons. For instance, system administrators will need to learn new skills and to manage httpd servers and internal web networks. The newness of this system is ameliorated by its omnipresence (there are a lot of people in the same boat).
Secure HTML (SHTML) provides point-to-point security in the form of encrypted authorization and communication. This all-or-nothing approach to security was historically used in heavy-handed government and military systems while most chemical information systems have used lightweight authorization-only schemes. Although it's nice to provide real security, the result might not be entirely welcome. E.g., authorization passwords would need to be maintained for all users of a secure system, even those with access only to "internal public" data.
Like all Web users, DCGI users are expected to understand something about the role of various information formats. For instance, Web users need to know basically what a GIF file is used for. A similar case in the DCGI system would be the SMILES format. Just as there are GIF creators and viewers, there are SMILES editors and viewers. In either case, users need to know when it's appropriate to use such formats and programs. The underlying Web methodology makes it difficult to protect users from such issues.
To achieve a supportable system in the long run, DCGI is very restrictive compared to the anarchy of the WWW. It's not OK to make arbitrary HTML references to anything in the system. DCGI is set up as a series of autonomous packages which can operate however they like internally but can only reference external entry points (URLs) which are "published". There is no question that DCGI needs such restrictions for stability and supportability. In imposing them, it gives up some of the good aspects of the "footloose" character of the Web.
DCGI uses more computational overhead and network bandwidth than a properly installed X-based system. In fact, almost none of our customers use the Daylight X-system in the way that it was designed, so this might be a moot point.
Existing 4.x databases and database servers are unchanged. All database operations are implemented as network transactions. Databases and database servers communicate only via the network (using IPX) and form a well-defined package that can be modified without affecting system operation.
DCGI programs communicate via the network: on one side they access databases (via Daylight Thor and Merlin Toolkits), on the other they talk to the HTTP server. Meta-interfaces are implemented by such CGI programs. Note that use of the Toolkits has been moved from the client side to server side in this architecture.
All traffic control is managed by an HTTP server, including access control, delivery of the meta-interface, and transaction security. The system is designed to run on "vanilla" public domain servers (e.g., NCSA's httpd).
The end-user interface is an HTML browser such as Mosaic, Netscape, MacWeb, etc. The design user-interface protocol is HTML-3.x (HTML-2.0 with tables). The browser used for initial development is Netscape 1.1N because it is very capable, reasonably reliable, and widely available.
Client-side helper programs are an important part of the system. Mainstream Web helper programs will be used whenever possible (e.g., GIF, JPEG, MPEG viewers). Helper programs specific to chemical information will also be used, either via specific languages (e.g., SMILES from molecular editors) or via a general chemical object-exchange mechanism (e.g., CEX to rasmol or other modeling packages).
Given a suitable task, HTML is an excellent protocol for writing "zerofaces" (minimalistic user interfaces). For the World Drug Index, the basic task is delivering pharmaceutical information given one of a large number of identifiers (structure, preferred name, trade name, INN, USAN, CAS number, manufacturer's id, etc.)
The wdi HTML interface has only one entry field in which a user can enter any identifier. When SUBMIT or RETURN is selected, all data corresponding to that identifier is produced.
The user does not need to know anything about the way the data is stored or retrieved, e.g., the host name, service, database name or version, what kind of identifier is entered, whether the structure is known, etc.
For instance, here are the wdi entries for verapamil, No-Doz, RU-486, pectin and CC(=O)NO.
Visual context switching can be confusing. In the wdi interface this is kept to an absolute minimum by making the query and data pages appear to be the same "page".
If an identifier is ambiguous, a graphical index is provided with links to the appropriate page section. For instance: dristan and udolac.
Since WDI is used by non-chemists, "SMILES exposure" is kept to a minimum. SMILES are only used for structural entry, and then only as an alternative to Grins. On the other hand, chemists may use WDI to find the structure given a trade name, in which case obtaining the SMILES may be important. This is handled by linking the depictions to a SMILES entry page.
Pages with data are entitled "WDI: identifier" and may be saved as a dynamic bookmark. When this bookmark is invoked (or page is reloaded), the page will reflect the current state of the database.
This requires a bit of invisible trickiness since one cannot normally save the results of a FORM entry as a bookmark. The "trick" is done by using a FORM-handling CGI program wdiform which causes the httpd server to instruct the browser to re-invoke the original wdi CGI script with an argument containing new instructions (in this case, the hex-encoded identifier). Believe it or not, these kinds of twisted relationships are normal on the Web.
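The redirection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual wdiform program: it assumes the form handler answers with a CGI "Location:" header, that identifiers are hex-encoded byte by byte, and that the target script lives at /daycgi/wdi (as in the published examples). The function names are hypothetical.

```python
def hex_encode(identifier: str) -> str:
    """Hex-encode an identifier so it can be passed as a URL argument."""
    # Assumption: identifiers are encoded byte-by-byte as two hex digits each.
    return identifier.encode("utf-8").hex()


def redirect_response(identifier: str, script_url: str = "/daycgi/wdi") -> str:
    """Return CGI output that sends the browser to a bookmarkable URL.

    A CGI program that emits only a Location header asks the HTTP server
    to redirect the browser, which then re-invokes the target script with
    the hex-encoded identifier as its argument.
    """
    target = f"{script_url}?{hex_encode(identifier)}"
    # CGI headers are terminated by a blank line; no body is sent.
    return f"Location: {target}\n\n"


if __name__ == "__main__":
    # What a form handler would print after reading the submitted identifier.
    print(redirect_response("DIAZEPAM"), end="")
```

Because the browser ends up at a plain URL rather than at the result of a FORM submission, the resulting page can be saved as a bookmark and reloaded later.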
The "bookmarkability" trick establishes an absolute URL address for every possible wdi page. Given that, pages can be referenced in a normal HTML file, e.g.
<A HREF="/daycgi/wdi?4449415a4550414d">DIAZEPAM</A>
--------  ------ --- ----------------  -------- ---
reference alias   |  "DIAZEPAM" (hex)  hot-text end
          CGI program name

Appears in an HTML page as: DIAZEPAM
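As a rough illustration of how such references could be generated and read back, here is a small Python sketch. The helper names (wdi_link, identifier_from_query) are hypothetical and are not part of DCGI; the sketch assumes the argument after "?" is simply the hex encoding of the identifier's bytes, as in the DIAZEPAM example.

```python
import html


def wdi_link(identifier: str, alias: str = "/daycgi", program: str = "wdi") -> str:
    """Build an HTML anchor that references the page for one identifier."""
    hex_arg = identifier.encode("utf-8").hex()
    # The visible hot-text is escaped so that characters such as < or &
    # in an identifier cannot break the surrounding HTML.
    return f'<A HREF="{alias}/{program}?{hex_arg}">{html.escape(identifier)}</A>'


def identifier_from_query(query_string: str) -> str:
    """Recover the identifier from the hex-encoded URL argument."""
    return bytes.fromhex(query_string).decode("utf-8")


if __name__ == "__main__":
    print(wdi_link("DIAZEPAM"))
    print(identifier_from_query("4449415a4550414d"))
```

Hex encoding keeps characters that are awkward in URLs (parentheses, "=", "#" in SMILES, for example) out of the reference entirely.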
One nice feature of HTML browsers is that you can print anything you see. Color depictions of structures are also nice, but unfortunately don't print well on most monochrome laser printers. To resolve this, a hot-link to a "black on paper" version appears at the top of each page. (This bit of HTML magic causes black-on-white GIF-89a images to be produced with a transparent background.)
Like the interface to wdi, the acd interface implements a single-line query which can be any identifier (SMILES, CAS numbers, ACD names, ACD numbers, catalog names or catalog numbers). With ACD, the most commonly used ID is the catalog number, which can be extremely ambiguous (e.g., "11"). Delivering all possible data associated with an identifier would sometimes produce an unreasonable amount of data (e.g., supplier information for all sulfuric acid products).
A three-page interface is therefore used:
The core of the interface is a table containing all product entries sorted by specific price. If the specified structure is ambiguous, an index is provided to multiple tables on the page (e.g. for various isotopes). Supplier codens are hot-linked to ordering information (e.g., supplier name, address, phone number, etc.) Product table pages may be bookmarked, referenced, and printed, e.g., 2-AMINOHEPTANE.
The first option is "language orientation". Spresi is a truly international database. This option limits the output to include only references to journals published in desired languages.
The second option controls the number of similar structures retrieved. It is set to 10 by default but may be decreased to 1 (exact match or most similar structure only) or increased to as much as 100.
Referencing may be done at the query level, e.g., Savant query for diphenyl sulfide, or at the report level, e.g., Papers on synthesis of 3-iodo-toluene.
The report page allows the user to automatically re-enter savant with a structure retrieved by the last search.
Savant looks for journal article (JA) and patent (PAT) data which contain the keyword preparation. The spresi95preps database is a subset of spresi95 which only contains structures with journal articles which have this keyword and therefore works just as well as the whole database for this purpose, while saving the server a lot of space (200 vs 500 MB). Savant also works with spresi95demo, but not very well, since it only contains 1410 structures.
It would be reasonable to create Savant-like interfaces for other purposes, e.g., for analytical papers, patents, etc.
Hyperthor allows you to access Thor data by specifying the host, service, database, datatype, and identifier. The fact that the user needs to know the names of these things to specify them is a certain disadvantage. Advantages include generality (works with all Thor databases), flexibility (for the user), and efficiency (for the computer). Default values can be set to minimize how much a user needs to know to get in to the system.
Hyperthor's datatree display should be clear to users of Daylight's 4.x systems. If enabled, 2D data will be drawn as stored in a database, 3-D data can be downloaded to helper programs, and data survey tables are offered when more than one dataitem exists for a given datatype. Server-side behavior is controlled by options in the DCGI environment, e.g., one can control which types of data are displayed and how. Client-side behavior is controlled by browser options (e.g., .mailcap and .mime.types control which helper programs get invoked). One could set up a default CGI shell script for each database (e.g., hyperacd, hyperwdi).
The version of Hyperthor shown at Euromug(s) is a beta-level program. The main issue currently outstanding is how to bookmark and reference hyperthor pages while allowing large amounts of data to be processed (e.g., the current version does not limit data size length). Also needed are datatype-specific options which control helper program invocation from the server side (via mime typing).
MCL is an English-language interface to the Merlin search engine. The 4.42 version of MCL is nearly unchanged from previous releases except that it will optionally write output in HTML (mcl -h). The wizard user generates an MCL program with an HTML-GUI which is then run with HTML output. If the display is done with tables, the output looks a lot like something you might see in xvmerlin.
The Wizard program specification page allows the user to write an MCL program using an HTML graphical interface (menus, text fields, etc.) The interface is very powerful although somewhat rough looking.
The version of Wizard shown at Euromug(s) is an alpha-level program. The main issue currently outstanding is how to bookmark and reference wizard programs without creating a maintenance headache.
In putting together the Daylight HTML interfaces, we needed to develop quite a few CGI-specific widgets, gadgets and dohickies. As with virtually all other tools that we use internally at Daylight, we intend to offer these as a supported product, the DCGI toolkit. Unlike the oop-ish toolkits that we currently offer, the DCGI toolkit will be a mixture of C-object libraries, scripts, and programs.
The biggest problem in building all but the simplest WWW interface is that it is based on a stateless model (no persistent client context). The DCGI system provides a reliable mechanism for managing a client context based on hidden FORM entries.
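To make the idea of a client context carried in hidden FORM entries concrete, here is a minimal Python sketch. It is not the DCGI mechanism itself, whose details are not described here: the function names and the context keys ("database", "smiles") are assumptions chosen for illustration.

```python
import html
from urllib.parse import parse_qsl, urlencode


def hidden_fields(context: dict[str, str]) -> str:
    """Render a client context as hidden INPUT elements for an HTML FORM."""
    return "\n".join(
        f'<INPUT TYPE="HIDDEN" NAME="{html.escape(name)}" '
        f'VALUE="{html.escape(value)}">'
        for name, value in context.items()
    )


def restore_context(form_body: str) -> dict[str, str]:
    """Rebuild the context from the URL-encoded body of a submitted FORM."""
    return dict(parse_qsl(form_body, keep_blank_values=True))


if __name__ == "__main__":
    context = {"database": "wdi95", "smiles": "CC(=O)NO"}
    # The server writes the context into the page it delivers...
    print(hidden_fields(context))
    # ...and the browser sends it back with the next submission
    # (simulated here with urlencode).
    submitted = urlencode(context)
    print(restore_context(submitted))
```

Since every page carries the full context back to the server, no per-user state has to be kept on the server between requests.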
smi2gif is a CGI program which accepts a SMILES as a hex-encoded argument, generates a structural depiction, and produces a GIF file suitable for HTML display. Control of color mode and image size are provided. cxt2depict does the same job, but operates on context variables which makes it more powerful than smi2gif: it handles input of unlimited length and will use specified coordinates (if provided).
Grins (Graphical input of SMILES) is an HTML-based molecular editor which is specifically designed for structure specification. The version distributed with v4.42 is a lowest-common-denominator editor, working with any (HTML 2.0) browser. Grins is intended to be called from an HTML form and will return the SMILES of the user-specified structure to a URL (as a hex-encoded argument) or via a POST-ing (as a HIDDEN INPUT). The file testgrins.html provides minimal examples of how to invoke Grins.
A persistent irritation when working with GIFs is that the algorithm which is nearly universally used for image compression is patented by Unisys. Everybody seems to use it anyway and Unisys doesn't seem to enforce their patent rights ... but ignoring it isn't a good solution for commercial software. The DCGI system includes a novel GIF generator which operates using a different compression algorithm (Sayles/Knuth). This method is only marginally slower than the patented algorithm and produces identical GIF files. A graphics library and graphics-object-to-GIF functions are provided with the DCGI system.
Download a conformation from your program to an HTML browser as a CEX object for use with rasmol and other modeling programs. Can operate either from a Daylight conformation object or SMILES and (X,Y,Z) coordinates.
Producing a reliable HTML interface involves keeping track of a large number of details. The DCGI system provides a number of utilities to solve such problems (or at least make them manageable). These include: an error-handling mechanism, functions which allow arbitrary text to be written in HTML safely, hex-encoding deduction and conversion, scratch file management, and others.
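Two of these utilities are easy to picture with a short Python sketch. The code below is a guess at what "writing arbitrary text safely" and "hex-encoding deduction" might involve, not a description of the DCGI functions: the names are hypothetical, and the deduction rule (even length, hex digits only, decodes to printable text) is an assumed heuristic that cannot resolve every ambiguous input.

```python
import html
import string

HEX_DIGITS = set(string.hexdigits)


def html_safe(text: str) -> str:
    """Escape arbitrary text so it can be written into an HTML page."""
    return html.escape(text)


def looks_hex_encoded(text: str) -> bool:
    """Heuristic: non-empty, even length, and made up only of hex digits."""
    return bool(text) and len(text) % 2 == 0 and all(c in HEX_DIGITS for c in text)


def decode_if_hex(text: str) -> str:
    """Return the decoded string if text appears hex-encoded, else text itself."""
    if not looks_hex_encoded(text):
        return text
    try:
        decoded = bytes.fromhex(text).decode("utf-8")
    except UnicodeDecodeError:
        return text  # e.g. the SMILES "CC" is valid hex but not valid text
    # Reject decodings that contain control characters (e.g. "11").
    return decoded if decoded.isprintable() else text


if __name__ == "__main__":
    for arg in ("4449415a4550414d", "CC(=O)NO", "CC", "11"):
        print(arg, "->", html_safe(decode_if_hex(arg)))
```

A short argument can still be both a plausible identifier and a plausible hex string, so a production system would need an explicit convention rather than relying on a heuristic like this one.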
The version of Grins available in MUG '96 is the 4.42 version, suitable for input of generic organic structures only. There is only one control panel in this version:
The initial version of the 4.5x Grins prototype is very similar:
Additional capabilities (chirality and reaction specification) are provided by a second panel which is available by selecting the "More" control:
If reaction specification is prohibited the panel looks like this:
(Are the functions of the smiley faces clear? Can anyone suggest better icons?)
It is tempting to write a version of Grins in Java. Maybe we'll try it out and see. In any event, the basic HTML-Grins capabilities will be available in DCGI.
Installing a Daylight HTML-based chemical information system is intrinsically more complicated than systems based on dedicated clients and servers. The main reason is that instead of one place for installation to take place, there are three: the Daylight software, the HTTP server, and the HTML clients. Fortunately, most of our customers are already running internal HTTP servers and HTML clients, so it shouldn't be too much of a hassle. In any event we will need to develop an installation scheme which smooths out this process.
As described above, it is possible to operate a secure DCGI system by using secure point-to-point protocols, i.e., encrypting all communication between the HTTP server and the HTML client. While secure, the current methods of achieving this require an "all-or-nothing" approach to security. Is this acceptable?
The idea of a server-based chemical information system with zero-cost seats is a new one in many respects. Extant methods used for licensing software (number of seats, number of users, number of concurrent users) just don't reflect the reality of what's going on.
Our inclination is to simplify our licensing scheme in a way which is consistent with what's actually happening. We are proposing to consolidate the ever-increasing number of product components into a few functionally-defined packages and license them in "small", "medium", and "unlimited" versions based on annual usage. For instance, a "database server" package would include all servers and supporting software; average lookup usage limits might be 200/day (small) and 2000/day (medium); a log would be kept for accounting purposes but no usage restrictions would be enforced (except an annual review of usage when renewing the license). You only pay for what we provide: internal database services. There would be no limits on the number of users (what's that to us?), the number of seats (now equipped with non-Daylight software), or CPU usage per se (after all, it's your CPU!).
Given that the DCGI system uses existing Thor databases and servers, and that capable HTML components are readily available, the technical transition should be relatively simple. The main challenges are establishing reliable installation and maintenance protocols.
Physical transition is simplified by the fact that end-users can use any machine that runs Netscape. The same is true on the server side: any machine that can run the Daylight servers can run an httpd server. The only glitch would be if a company didn't have an IP network installed (almost unheard of these days).
We expect that user training will be equally straight-forward. It seems like everybody and their hairdresser knows how to run Netscape and surf the Web these days. Such skills apply directly to DCGI.
The biggest change will be for people writing custom programs. Although writing HTML interfaces is much simpler than doing so with other GUIs, it's done very differently and has a steep learning curve.
Finally, if we do take the plunge and change our licensing scheme, we will be committed to making the change as smooth and as fair to our customers as possible.