dayblob '99

Dayblob: an ORDBMS-oriented chemical information package

Dave Weininger, Daylight

What is dayblob?

Dayblob is a robust encapsulation of high-performance chemical structure processing algorithms which is designed for use within ORDB/RDB systems.
The goal of dayblob is to support all types of chemical applications which might be reasonably implemented within an RDB environment, such as:
- structure registration
- inventory control
- laboratory information management
- electronic laboratory notebooks
- chemical/biological data integration
- experimental design
- exploratory data analysis
The basic "trick" required of dayblob is to repackage object oriented chemical information algorithms so they are suitable for use within real-life RDB environments without giving up flexibility, stability, capability or performance.
A novel architecture was developed to meet these requirements in which structure processing is done within a single chunk of persistent storage (in RDB terms, a binary large object or BLOB). When implemented on a capable server, the entire BLOB will be memory-resident and the subsystem should (in principle) deliver extremely high performance.
A dayblob-based Data Cartridge has been successfully implemented within Oracle 8i and appears to meet the design criteria in all respects.

Design criteria

Functionality
Targeted functionality was determined by three factors. A group of companies working with Oracle produced a desired-functionality document. This was expanded to form a covering set of functions which could support expected applications. This set was further expanded to accommodate ORDBMS requirements. The functionality of dayblob deals exclusively with molecular structure (as described in Capabilities, below).
Robustness and stability
One of the highest priorities was producing a package which has an extremely high level of robustness. For one thing, we (Daylight) had to prove to Oracle that we could build a trouble-free component. For another, in this sort of collaborative product, it's in everyone's interest to require as little component-level support as possible. Our specific goal is that it could run unchanged for 10 years. This target required development of a new object interface.
Expandability
In addition to robustness, it was required that the interface support changes on both sides (dayblob and Oracle Cartridge) ... possibly dramatic changes. On the dayblob side, such changes might include new searches, new object types (e.g., large molecules, combinatorial libraries) and multithreading support. On the Oracle side, it is likely that the communication protocol will change in the 8.2 server (shared memory objects). In any case, it is certain that both Oracle and Daylight are moving towards exploitation of 64-bit architectures, but in different time-frames. The new object interface is designed to allow such changes while maintaining backward-compatible behavior.
Performance
High-performance molecular structure processing is of very high priority to everyone involved with this project. This is awkward, because real-world RDB systems come with terrible overhead penalties. To get around this problem, a novel BLOB-based architecture was developed and implemented. When used on a capable server, resulting performance is extremely high.
Flexibility
The Chemistry Cartridge was deliberately not designed to support a specific set of applications. Target applications are those to be developed by Oracle, Daylight, end-users and/or third parties. The capabilities of dayblob are defined in terms of chemical objects. It is expected that a common mapping onto SQL syntax will be adopted and that this will be suitable for use with expected applications (e.g., registration, inventory control) and also with those that we haven't thought of yet.
Capacity
This project is designed to produce an industrial-strength database system which is suitable for use with all chemical data sets, e.g., 10's of millions of structures, millions of reactions, strange and huge molecules, etc. However, sheer capacity was considered a secondary design criterion. For practical reasons, the current system is designed with reasonable limits (rather than without them), i.e., a few million molecules and reactions on current server hardware.
Invisibility
dayblob was designed as pure component software and not to be "Daylight-centric" ... at least, as much as possible in a SMILES-based chemical information system. This package doesn't know about datatypes, IDs, datatrees, or histlists, and it doesn't use the Daylight option manager, license manager or error handler. In fact, it does no I/O at all except via the BLOB. Ideally, users of a Chemistry Cartridge application shouldn't need to know that they're using Daylight Software, except that it goes real fast and, of course, they're smilin'.

Interfaces

Our ambitious requirement for stability required development of a new object interface. We aspire to provide a component that not only continues to work correctly as machine architectures and programming languages evolve, but may be asked to do its work on a different machine type than the controlling RDB server. Neither Daylight's Toolkit interface (dt) nor existing RDB API interfaces (e.g., OCI) handle this problem well.

A specialized object interface was developed for this project (the "db" interface). It is largely modelled on Daylight's "dt" object interface but is an entirely independent entity. Many of the same types of object are supported ... e.g., integers, strings, objects, sequences of objects ... but using neutral types allows us to change either side without changing our interface. For instance, one side of the partnership can move from pure 32-bit to LP64 without disrupting the other.
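To illustrate why neutral types help, here is a minimal sketch of passing integers and strings across a boundary in a form that does not depend on either side's native word size. It is not the db interface itself (its definitions are not reproduced here); the function names and the choice of a signed 64-bit big-endian encoding are assumptions made for the example.

```python
import struct

# Hypothetical neutral form: every integer crosses the interface as a
# signed 64-bit big-endian value, whatever each side's native word size is.
_NEUTRAL_INT = struct.Struct(">q")


def encode_int(value: int) -> bytes:
    """Pack an integer into the neutral 8-byte form."""
    return _NEUTRAL_INT.pack(value)


def decode_int(data: bytes) -> int:
    """Unpack an integer from the neutral 8-byte form."""
    return _NEUTRAL_INT.unpack(data)[0]


def encode_string(text: str) -> bytes:
    """Length-prefixed UTF-8 string; the length uses the neutral integer form."""
    raw = text.encode("utf-8")
    return encode_int(len(raw)) + raw


def decode_string(data: bytes) -> str:
    """Inverse of encode_string."""
    length = decode_int(data[:_NEUTRAL_INT.size])
    start = _NEUTRAL_INT.size
    return data[start:start + length].decode("utf-8")
```

Because the encoded form is fixed, a 32-bit caller and an LP64 callee in this sketch would exchange exactly the same bytes.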

Support for some RDB-specific objects is provided. Row IDs are essential to the RDB data model and are treated as short variable length binary objects (BOBs). BLOBs ("binary large objects") are complex objects which are accessed via contexts (CXTs) and object locators (LOCs).

Architecture

The architecture of a high-performance database component normally involves a number of uncomfortable compromises. For flexibility, we need to handle things like Row IDs which vary in size. For performance, we need big chunks of fast, randomly-accessible storage (i.e., memory) which are nonetheless persistent and secure. For stability, we need a reliable method for accessing data which permits efficient caching (and which is likely to evolve over time). And in some cases (such as Oracle), we can't get access to the server's process space, no way, no how.

In a traditional RDB environment, such requirements are mutually contradictory. However, the novel approach of operating entirely within a BLOB (suggested by Sam Defazio and implemented by Cathy Trezza) appears to meet these requirements. The idea is that a component operates entirely within a single chunk of persistent storage (the BLOB). BLOB storage is managed by the RDB server via an ORDBMS interface such as OCI. From the database's POV, the BLOB is just a bunch of bits in a table and the software component is an extensible indexing method which uses these bits. From the component's POV, it can do anything it likes within the BLOB, e.g., run a fast specialized database of its own. Cool.
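The idea of a component that keeps all of its state inside one chunk of bytes can be sketched as follows. This is a toy illustration only, not dayblob's internal layout (which is not described here): the class name, the header format and the length-prefixed records are all assumptions.

```python
import struct

HEADER = struct.Struct(">II")  # assumed header: (record count, next free offset)
LENGTH = struct.Struct(">I")   # assumed per-record length prefix


class BlobStore:
    """Toy record store whose entire state lives inside one byte buffer."""

    def __init__(self, blob: bytearray):
        if len(blob) < HEADER.size:
            raise ValueError("blob too small for header")
        self.blob = blob
        _, free = HEADER.unpack_from(blob, 0)
        if free == 0:  # fresh, zero-filled blob: write an empty header
            HEADER.pack_into(blob, 0, 0, HEADER.size)

    def append(self, record: bytes) -> int:
        """Store a record and return its offset within the blob."""
        count, free = HEADER.unpack_from(self.blob, 0)
        end = free + LENGTH.size + len(record)
        if end > len(self.blob):
            raise MemoryError("blob is full")
        LENGTH.pack_into(self.blob, free, len(record))
        self.blob[free + LENGTH.size:end] = record
        HEADER.pack_into(self.blob, 0, count + 1, end)
        return free

    def read(self, offset: int) -> bytes:
        """Return the record stored at the given offset."""
        (length,) = LENGTH.unpack_from(self.blob, offset)
        start = offset + LENGTH.size
        return bytes(self.blob[start:start + length])

    def count(self) -> int:
        """Number of records, read from the header inside the blob."""
        return HEADER.unpack_from(self.blob, 0)[0]
```

Since nothing is held outside the buffer, whoever persists the bytes (in the text above, the RDB server) also persists the component's state; reopening a `BlobStore` over the same bytes recovers every record.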

Aside from the calling program, dayblob requires two external utilities: one to define and manage "Row IDs" and one to manage the BLOB. BLOB management is done via contexts (CXTs) and object locators (LOCs). BLOBs are accessed in discrete readonly or read/write segments to enable efficient caching.
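A rough sketch of why segment-wise access helps caching is given below. It does not model CXTs or LOCs; the class, its method names and the fixed segment size are hypothetical, and a plain `bytearray` stands in for the BLOB manager.

```python
class SegmentCache:
    """Cache fixed-size segments of a backing store (stand-in for a BLOB manager)."""

    def __init__(self, backing: bytearray, segment_size: int = 16):
        self.backing = backing
        self.segment_size = segment_size
        self.cache = {}     # segment index -> cached bytearray copy
        self.dirty = set()  # segments modified through read/write access
        self.fetches = 0    # number of trips to the backing store

    def _load(self, index: int) -> bytearray:
        if index not in self.cache:
            start = index * self.segment_size
            segment = self.backing[start:start + self.segment_size]
            self.cache[index] = bytearray(segment)
            self.fetches += 1
        return self.cache[index]

    def read(self, index: int) -> bytes:
        """Readonly access: the segment is never marked dirty."""
        return bytes(self._load(index))

    def write(self, index: int, offset: int, data: bytes) -> None:
        """Read/write access: change the cached copy and mark it for flushing."""
        segment = self._load(index)
        if offset + len(data) > len(segment):
            raise ValueError("write crosses segment boundary")
        segment[offset:offset + len(data)] = data
        self.dirty.add(index)

    def flush(self) -> int:
        """Write dirty segments back; return how many were written."""
        written = len(self.dirty)
        for index in sorted(self.dirty):
            start = index * self.segment_size
            segment = self.cache[index]
            self.backing[start:start + len(segment)] = segment
        self.dirty.clear()
        return written
```

In this sketch, repeated reads of a segment cost one fetch, and only segments opened for writing ever need to be sent back.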

Capabilities

Most of the structure-oriented database capabilities in Daylight Release 4.62 are delivered, including:

management of multi-component molecules and reactions
support for isotopic and isomeric features
constant-time retrieval of structures and tautomers
selection by substructure
SMARTS pattern matching
structural similarity and nearest-neighbor selection
utility functions, e.g., calculation of canonical molecular formula, atom counts, molecular weight and format conversions
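As a small illustration of the similarity and nearest-neighbor capability listed above, the sketch below scores bit-string fingerprints with the Tanimoto coefficient, a measure commonly used for fingerprint similarity. The text does not say which measure dayblob applies, so treat that choice, the function names and the toy fingerprints as assumptions.

```python
from typing import Dict, List, Tuple


def popcount(bits: int) -> int:
    """Number of set bits in a fingerprint stored as a Python integer."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient: bits in common divided by bits set in either."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return popcount(a & b) / union


def nearest_neighbors(query: int, fingerprints: Dict[str, int],
                      k: int) -> List[Tuple[str, float]]:
    """Return the k most similar entries as (row id, score), best first."""
    scored = [(row_id, tanimoto(query, fp)) for row_id, fp in fingerprints.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

For example, `nearest_neighbors(0b1100, {"r1": 0b1100, "r2": 0b0110, "r3": 0b1101}, 2)` ranks `r1` first (identical fingerprint) and `r3` second.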

Performance

The dayblob package was tested against the standard Daylight test suite used to evaluate merlinserver performance. Detailed results of this test are available.

For example, 243 superstructures of the cefalosporin-g1 moiety are known to exist among the 37037 structures in the test database wdi93. This search was done on the same machine using dayblob and merlinserver, then the results were examined for correctness and relative speed. In this case, dayblob completed the search correctly in 0.68 seconds, which is 85% of the time used by merlinserver461 (using 1 CPU only).
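The figures quoted above imply a merlinserver time that is not stated directly. A short worked calculation, using only the numbers in the paragraph (the helper names are ours):

```python
def baseline_seconds(measured_seconds: float, fraction_of_baseline: float) -> float:
    """If a run took `fraction_of_baseline` of the baseline time, recover the baseline."""
    return measured_seconds / fraction_of_baseline


def hit_fraction(hits: int, total: int) -> float:
    """Fraction of the database returned by the search."""
    return hits / total


merlin_seconds = baseline_seconds(0.68, 0.85)  # dayblob took 85% of merlinserver's time
fraction = hit_fraction(243, 37037)            # 243 hits among 37037 structures

print(f"implied merlinserver time: {merlin_seconds:.2f} s")
print(f"hits as a share of the database: {fraction:.2%}")
```

That is, merlinserver461 took about 0.80 seconds for the same search, and the answer set is well under 1% of wdi93.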

On a current high-end workstation (Sun Ultra60), dayblob runs typical structural searches at about the same speed as the current merlinserver461 using 1 CPU (30% slower to 15% faster, depending on the query). When pathological cases are included, it is 2.5x slower.

Limitations

Limitations of dayblob v46107 and v46208 include:

Some merlin functionality is missing
- Tversky similarity searching is not available.
- Non-structural searching (e.g., ARES) is not supported by dayblob.
- Sorting is not supported by dayblob (e.g., CAS and MF sorts).
- Non-structural data are not supported by dayblob (e.g., names).
SMILES length limit is 20000 characters.
A fixed-size buffer is currently used for BLOB-probing, reducing the number of required blob-manager interactions and increasing performance. This limitation could be changed or removed (with a speed penalty). Current RDB implementations impose a more restrictive limit (2000, 4000) which may be removed in future RDB versions.
Another consequence of the current buffering method is that dayblob always uses 20000 bytes more than is theoretically required (i.e., even after a "crunch"). No longer an issue in 46208.
Row IDs are limited to 32 bytes
Row IDs in RDBs are more complex than they appear! They not only identify the row in a table, but also things like the table in a database, the database in a server and possibly even the database on a filesystem and the server on a network of servers. The largest binary Row IDs are now 14 bytes (we think) but there's a possibility that this might increase in the future. Dayblob views Row IDs as variable length binary objects so there's no theoretical limitation on Row ID size per se. In practice, 32B chunks are used for communication with the RDB.
Large database (re)loads are slow
Entries are made in dayblob whenever data is INSERTed into a SMILES column in an appropriately configured table. All indexing is done at INSERT time (or equivalent) including fingerprinting, which is a relatively slow operation (1000/min). In theory, this only has to be done once per entry, but there is no way of loading precomputed fingerprints into dayblob (e.g., for a database reload). Indexing is ~2X faster in 46208.
No intrinsic multiprocessor support
Dayblob is not multithreaded nor does it use program objects. However, Oracle implements parallelization at the session level.
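The Row ID limitation above comes from carrying a variable-length binary object in a fixed 32-byte chunk. The sketch below shows one way such a scheme can work; the actual chunk layout is not described in this document, so the zero padding and the separately carried length are assumptions, as are the function names.

```python
from typing import Tuple

CHUNK_SIZE = 32  # chunk size quoted in the text


def pack_row_id(row_id: bytes) -> Tuple[int, bytes]:
    """Return (length, zero-padded 32-byte chunk) for a variable-length Row ID."""
    if len(row_id) > CHUNK_SIZE:
        raise ValueError("Row ID longer than 32 bytes")
    return len(row_id), row_id.ljust(CHUNK_SIZE, b"\x00")


def unpack_row_id(length: int, chunk: bytes) -> bytes:
    """Recover the original Row ID from its length and padded chunk."""
    if len(chunk) != CHUNK_SIZE or not 0 <= length <= CHUNK_SIZE:
        raise ValueError("malformed Row ID chunk")
    return chunk[:length]
```

A 14-byte Row ID fits with room to spare; anything over 32 bytes is rejected, which mirrors the practical limit described above.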

Limitations of the prototype cartridge include:

SMILES length limit is 2000 (4000) characters.
The use of VARCHAR for SMILES-containing columns limits SMILES to 2000 (4000) characters. This limit could be removed by representing SMILES as LOBs (or CLOBs) at the cost of additional overhead.
Communication of large answer sets can be slow
Dayblob does not operate in the process-space of the RDB server, so communication speed between them is an issue. In practice, this presents a problem only for communication of pathologically large answer sets (possibly resulting in millions of FETCH operations). The current prototype cartridge does a creditable job of packing and buffering such data transfers, but even so, this approach is terribly slow compared to object mechanisms used by merlinserver. This problem should be resolved by future RDB versions which are expected to support shared memory object passing (e.g., Oracle 8.2).
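The effect of packing and buffering on the number of transfers can be sketched as follows. This is a generic batching illustration, not the prototype cartridge's code; the function names and batch sizes are hypothetical.

```python
from typing import Iterable, Iterator, List


def batched(row_ids: Iterable[bytes], batch_size: int) -> Iterator[List[bytes]]:
    """Group an answer set into buffers of at most batch_size Row IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[bytes] = []
    for row_id in row_ids:
        batch.append(row_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, partially filled buffer
        yield batch


def transfers_needed(n_hits: int, batch_size: int) -> int:
    """Number of transfers required to move n_hits answers (ceiling division)."""
    return -(-n_hits // batch_size)
```

With one answer per transfer, a million hits need a million round trips; with 1000 per buffer, `transfers_needed(1_000_000, 1000)` gives 1000. The per-transfer overhead still applies, which is consistent with the slowness described above.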

Changes in v46208

Dayblob v46208 is a revision of v46107 to meet these goals:

All functionality of the v46107 prototype is retained.
dayblob v46208 is based on Daylight v4.62 code.
This results in faster indexing and many minor improvements.
Structures may be accessed at multiple levels of abstraction.
Although RowID orientation is retained, structures are internally indexed by absolute, unique, tautomer and graph SMILES. The v46208 API has been extended to allow such indexing to be used.
Code rewritten to production standard.
Prototype "shortcuts" have been replaced by production constructs, e.g., representation of internal objects, "buffer slop" elimination, multiple-index compression.
API redesigned to support optimization.
Dependence of functional interfaces on query contexts (which was present in the prototype) has been removed.
v46208 API is available here.
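Indexing at multiple levels of abstraction, as described in the list above, can be pictured as one lookup table per level, each mapping a key to the Row IDs that share it. The sketch below is only a picture of that idea, not the v46208 API: the class is hypothetical, and the keys are supplied by the caller as opaque placeholder strings because deriving absolute, unique, tautomer and graph SMILES requires a chemistry toolkit.

```python
from collections import defaultdict
from typing import Dict, Set

LEVELS = ("absolute", "unique", "tautomer", "graph")


class MultiLevelIndex:
    """One key -> Row ID set mapping per level of abstraction."""

    def __init__(self) -> None:
        self.indexes = {level: defaultdict(set) for level in LEVELS}

    def insert(self, row_id: str, keys: Dict[str, str]) -> None:
        """Register a row under its key at every level."""
        missing = [level for level in LEVELS if level not in keys]
        if missing:
            raise KeyError(f"missing keys for levels: {missing}")
        for level in LEVELS:
            self.indexes[level][keys[level]].add(row_id)

    def lookup(self, level: str, key: str) -> Set[str]:
        """Return the Row IDs sharing this key at the given level."""
        return set(self.indexes[level].get(key, ()))


# Placeholder keys: two stereoisomers differ at the absolute level
# but share keys at the coarser levels.
index = MultiLevelIndex()
index.insert("row1", {"absolute": "A1", "unique": "U", "tautomer": "T", "graph": "G"})
index.insert("row2", {"absolute": "A2", "unique": "U", "tautomer": "T", "graph": "G"})
```

Here `index.lookup("absolute", "A1")` finds only `row1`, while `index.lookup("unique", "U")` finds both rows.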

Chemistry Cartridge demos

A number of CGIs demonstrating the Chemistry Cartridge prototype are available here and here.

The real experiment: collaboration

As much as anything else, the dayblob project has been an experiment in collaborative work between members of two very different companies.

Oracle is a very large company (8000x larger than Daylight) which produces general purpose database systems and which derives most of its income from support, training and consulting services.

To make this project succeed, each of us had to move towards the other's philosophy to some extent.

For Oracle, this meant stepping down a bit from the "one size fits all" throne, modifying the RDB interface to allow high-performance processing by a fundamentally different object model, engaging in development which was outside their prescribed beta program, not to mention having to deal with a bunch of free-thinkers ... not easy processes in a large company.

The fact that this project is working attests to the willingness of the members to overcome many obstacles which were encountered along the way. Outside the development environment, one rarely hears about such things.

Cathy Trezza implemented a Cartridge based on a dayblob package written by Dave Weininger. During large loads, it would occasionally crash and burn. From Cathy's POV, dayblob was corrupting some of the pointers into the BLOB. From Dave's POV, the dayblob seemed squeaky clean in all simulations, and his friend "purify" agreed. Since we don't get to see each other's code, it's very easy to start pointing fingers in such a situation. But instead, we wrote and exchanged heavy-duty debugging versions and resolved the problem in a couple of miserable days of debugging. (In this case, the problem was caused by a misunderstanding of object property ownership.) It's often much easier to throw up one's hands and blame someone else than it is to dig in together and share the hard work required to resolve a problem. The results are much different, however.

With the dayblob-based Chemistry Cartridge, we have shown that it is possible to produce a product in which each side does what it does best, i.e., Oracle develops the RDB side, Daylight the cheminfo algorithms. Contrary to historical precedent, it is not necessary for one company to own or control the other's realm. This speaks worlds for the force of will and the level of respect among the participants. Speaking for myself, I am as pleased by this aspect of the project as I am by any aspect of technical excellence in the product.

This is an ongoing project, and an ongoing process. Although we are "over the hump", there is much more work to be done. The two main areas of effort remaining have to do with applications (Who defines them?, Which one first?, Who produces them?, etc.) and business models (Whose product is this, anyway?) We have every reason to expect that these issues will be addressed with the same level of creativity and cooperation that has characterized this project thus far.

Acknowledgements

Sandeepan Banerjee, Oracle
Roger Critchlow, Daylight
Sam Defazio, Oracle
Jack Delany, Daylight
Bob Gouslin, Oracle
Steve Hagan, Oracle
Mike Mansell, Oracle
Norah McCuish, Daylight
Johnny Petersen, Oracle
Yosef Taitz, Daylight
Cathy Trezza, Oracle
Markus Visscher, Oracle
Dave Weininger, Daylight
and the Novartis "Chemical Datawarehouse" people

Daylight Chemical Information Systems, Inc.
info@daylight.com
