Jack Delany
Introduction:
This whitepaper/presentation will describe Oracle Cartridge technology, the current Daylight/Oracle cartridge, and our current development and future plans for the cartridge.
Oracle Cartridge Technology:
An Oracle Data Cartridge is a bundled set of tools which extends Oracle clients and servers with new capabilities. Typically, a single data cartridge deals with a particular 'information domain'. Some examples of current data cartridges include: image processing, spatial processing, audio processing.
A cartridge consists of a number of different components which work together to provide complete database capabilities within the information domain. The components include:
A cartridge extends the server. The new capabilites are tightly integrated with the database engine. The Cartridge interface specification provides interfaces to the query parser, optimizer, indexing engine, etc. Each of these subsystems of the server learns about the cartridge capabilities through these interfaces. The cartridge can use SQL, PL/SQL, Java, C, etc. to implement the functions.
From Oracle's point of view, the cartridge idea allows third-party organizations to expand the capabilities of the Oracle database server in a modular, supportable fashion.
Design Goals:
The key design goals for the cartridge are:
The main point here is that we recognize that in addition to the differences in the database itself (the servers, databases, tables, indexes), the members of the Oracle user community (the developers, database administrators, users) have very different perspectives than those which we are used to dealing with.
In order for the product to be successful, the Daylight cartridge must behave in a predictable, understandable way. The paradigms we use within our cartridge must match those which Oracle developers and DBAs already use and understand.
So, what is the Daylight Cartridge? First, what it isn't. It isn't a full-functioned toolkit interface / environment. It is a set of chemical utilities (normalizations, data conversions, comparison functions) which provide the missing pieces required to manage chemical data in an Oracle database.
Architecture:
The Daylight toolkit interfaces with the Oracle server via callouts to the "extproc" utility. This utility provides a RPC-like mechanism for performing C-language function calls. Daylight toolkit code is wrapped inside this RPC layer for each of the defined cartridge functions.
The main concern about this architecture is the efficiency of the RPC mechanism, and the potential that extproc will be a performance bottleneck. Simple tests indicate that this isn't a problem.
The three important communication bandwidths between the Oracle Server and Extproc are as follows:
Each of these data throughput and round-trip limits represents a design constraint which must be considered in the overall cartridge design. The design described in this whitepaper is what we consider to be the optimum tradeoff between resource efficiency, search performance, and the ability to maintain the full transaction concurrency model within Oracle for all cartridge data.
Cartridge Specification:
As indicated earlier, an Oracle Cartridge typically consists of three sets of functionalities: object type definitions, packaged functions, and indexing / data access tools. In turn, each of these three areas will be discussed with respect to the Daylight cartridge implementation.
Object Types:
The daylight cartridge does not define any new Oracle object types!!!
Daylight Object | External Language |
---|---|
molecule | SMILES |
reaction | SMILES |
pattern | SMARTS |
transform | SMIRKS |
fingerprint | Encoded ASCII |
depiction | 2D coordinate list |
conformation | 3D coordinate list |
Note that our external languages are all very expressive and well-behaved. That is, they are compact, with high information density. They are all printable ASCII strings. They all have well defined syntax and semantics.
When needed, the cartridge instantiates internal Daylight objects within the cartridge to perform a specific task (eg. calculate a molecular weight from a SMILES). The interfaces between Oracle and Daylight always pass objects as their external string representations.
Implications of this "Object Model":
Since our objects are represented as strings within the RDBMS, any Oracle, Informix, Sybase, JDBC, ODBC, CORBA, etc., etc., etc., client, server, middleware, application layer, etc., etc., etc. can handle these "objects" in their external form.
Only the endpoints of communication must understand what the objects mean: the Daylight Oracle cartridge provides the server-side endpoint, and a front end user interface provides the client-side endpoint. Otherwise, the objects effectively "tunnel" through the middle layers.
This does not preclude the design of an Oracle-specific object layer on top of the cartridge system (eg. ODBC); we simply don't require one, and don't dictate which model, if any, you use.
PL/SQL Functions/Operators
General:
Molecule/Reaction:
Fingerprint:
Comparison Operators (optional indexes apply)
For a detailed discussion of the PL/SQL functions, SQL operators, and
extensible indexes for the cartridge, see the official cartridge
Timing Tests:
The following table lists some representative search times for the blob-based
indexes. The databases are:
The tests are on three different machines: Red is a Sun Ultra 60 (2x360MHz),
with 768 MB of real memory. Xmas is a Dell laptop (700MHz PentiumIII) with
512 MB memory, and green is a Origin 200 (2x270MHz R12000) with 2GB memory.
In all cases the Oracle installation is a vanilla 8.1.5 install, with the only
changes being the increase of db_block_buffers and log_buffers parameters from
default values in init.ora. All database and program files reside on a single
disk partition (no RAID). All transactions use full rollback (size of
rollback segments was increased from default values), but not archivelog.
Query Performance (seconds elapsed):
Availability:
The cartridge for Solaris, SGI, and Linux is available as part of 4.72 now.
We support Releases 8.1.5, 8.1.6, 8.1.7 under Solaris 7 (Solaris 8 has
been tested but not officially supported yet), Oracle 64-bit releases under
Irix 6.5.8+ (m-releases only), and Linux under Red Hat 6.2.
Support the standard edition of 8i on all platforms. Longer term,
parallelization and partitioned indexes may cause us to revise this position.
Ongoing Efforts:
Oracle's VARCHAR2 datatype definition is somewhat confusing. PL/SQL
(the language which Oracle uses for procedural operations) defines the maximum
VARCHAR2 as 32767 characters. SQL and the database define the maximum
VARCHAR2 as 4000 characters. Since the cartridge indexes are invoked from SQL
against a database column, the maximum VARCHAR2 in the indexes is 4000
characters.
In 4.72 all the PL/SQL functions support VARCHAR2 strings up to 32767
characters. You can use them today provided you don't try to store the
longer VARCHAR2 strings in a VARCHAR2 datatype in the database. You'll need
to use a CLOB or LONG type.
If you want (in preparation for 4.8), you can write wrappers around the
cartridge functions to support longer SMILES. An example like below will
have the same interface and behavior as the proposed 4.8 interface, it will
simply be limited to 32767 characters:
In 4.8, we'll support CLOB datatypes everywhere we currently use VARCHAR2.
fcontains(s1 IN VARCHAR2, s2 IN VARCHAR2) => NUMBER
fcontains(s1 IN VARCHAR2, s2 IN CLOB) => NUMBER
fcontains(s1 IN CLOB, s2 IN VARCHAR2) => NUMBER
fcontains(s1 IN CLOB, s2 IN CLOB) => NUMBER
All operator combinations supported (eg. all four contains permutations).
Prototype 4.8 version running CLOB indexes on NCI:
SQL> desc ncib;
Name Null? Type
-------------------------- -------- ---------------
SMI CLOB
NSC NOT NULL NUMBER(6)
CAS_RN NOT NULL VARCHAR2(11)
CL_1 NUMBER(5)
CL_2 NUMBER(3)
CL_3 NUMBER(5,4)
SQL> create index ncii on nci(smi)
indextype is c$dcischem.ddblob;
Index created.
Elapsed: 00:12:56.46
SQL> create index nciib on ncib(smi)
indextype is c$dcischem.ddblob;
Index created.
Elapsed: 00:14:13.28
The following timings are on a Sun Ultra 60 (2x360MHz) with 768 MB of real
memory. It's running Oracle 8.1.7 Enterprise Edition. The table is the
126705 structures from NCI95.
Query Performance (seconds elapsed):
Why?
The cartridge is in hand on NT. Works really well.
Caveats:
The following tables compare times for index creation and search on identical
hardware running either Linux Red Hat 6.2 with Oracle 8.1.6 or Windows 2000
Professional with Oracle 8.1.6. The hardware is a 800 Mhz PentiumIII-based
Dell workstation with 256 MB memory. Again, the table is 126705 structures
from NCI95.
Query Performance:
(* - Includes fingerprint generation time)
Index type
Column
red
xmas
green
nci
exact
2:10
3:16
2:36
nci
graph
7:10
8:21
10:20
nci
blob
12:51*
9:37*
13:17*
rxn
exact
1:44
2:46
2:12
rxn
role
14:21
34:51
25:05
rxn
blob(smi)
20:34*
14:41*
21:57*
rxn
blob(fp)
00:47
01:47
01:03
savant_smi
exact
22:01
32:30
22:54
savant_smi
graph
1:12:59
1:25:08
1:33:38
savant_smi
blob(smi)
2:07:00*
1:39:55*
2:15:58*
savant_smi
blob(fp)
5:26
12:18
7:07
(* - invalid query, ** - Disk I/O observed)
Table name
Query
Hits
Red
Xmas
Green
nci
exact(smi, 'c1ccccc1') = 1
1
0.07
0.12
0.12
nci
graph(smi, 'c1ccccc1') = 1
3
0.17
0.17
0.21
nci
contains(smi, 'OC(=O)C1') = 1
0*
0.12
0.16
0.21
nci
contains(smi, 'OC(=O)CS') = 1
507
1.48
1.83
2.03
nci
matches(smi, '[OH]C(=O)CS') = 1
279
1.39
1.81
1.90
nci
fingertest(smi, 'OC(=O)CS') = 1
1237
0.54
1.10
0.76
nci
tanimoto(smi, 'OC(=O)CS') > 0.8
7
0.81
1.15
0.84
nci
tversky(smi, 'OC(=O)CS', 0.5, 0.5) > 0.8
56
0.80
1.26
0.87
rxn
reactant(smiles, 'Sc1ccccc1') = 1
282
1.45
4.30
0.83
rxn
product(smiles, 'Oc1ccccc1') = 1
13
0.15
1.51
0.14
rxn
contains(smiles, 'OC(=O)CS') = 1
1136
9.2
6.8
11.4
rxn
contains(smiles, '>>OC(=O)CS') = 1
739
4.3
3.9
5.6
rxn
contains(smiles, 'OC(=O)CS>>') = 1
782
8.2
6.6
10.5
rxn
matches(smiles, '>>[OH]C(=O)CS') = 1
117
3.8
3.6
5.1
rxn
tanimoto(smiles, 'OC(=O)CCl>>OC(=O)CS') > 0.5
68
0.60
1.24
0.92
savant_smi
exact(smi, 'Oc1ccccc1') = 1
15
0.10
0.38
0.09
savant_smi
graph(smi, 'Oc1ccccc1') = 1
51
0.16
0.61
0.15
savant_smi
contains(smi, 'O1C(=O)CCS1') = 1
2
4.0
8.9
8.9
savant_smi
matches(smi, '[OH]C(=O)CS') = 1
1575
14.0
19.7
28.3
savant_smi
tanimoto(smi, 'OC(=O)CS') > 0.8
7
5.0
9.6
9.6
savant_smi
tanimoto(fp, 'OC(=O)CCS') > 0.8
24
4.0
4.3
4.4
savant_smi
fingertest(fp, 'OC(=O)CCS') = 1
18620
4.0
4.4
4.9
component('c1ccccc1.C1CCCCC1', 'C1CCCCC1') => TRUE
create function csmi2cansmi(smiles IN CLOB, type in NUMBER,
cansmi IN OUT NOCOPY CLOB) return number
as
smi varchar2(32767);
rc varchar2(32767);
slen number;
rlen number;
begin
slen := dbms_lob.getlength(smiles);
dbms_lob.read(smiles, slen, 1, smi);
rc := ddpackage.fsmi2cansmi(smi, type);
if (rc is NULL) then
return ODCIConst.Error;
end if;
rlen := length(rc);
dbms_lob.write(retlob, rlen, 1, rc);
dbms_lob.trim(retlob, rlen);
return ODCIConst.Success;
exception
when others then
raise;
end;
fsmi2cansmi(smi IN VARCHAR2, type IN NUMBER) => VARCHAR2
fsmi2cansmi(smi IN CLOB, type IN NUMBER, can IN OUT CLOB) => NUMBER
fsmi2amw(smi IN VARCHAR2) => NUMBER
fsmi2amw(smi IN CLOB) => NUMBER
SQL> desc nci;
Name Null? Type
-------------------------- -------- ---------------
SMI NOT NULL VARCHAR2(428)
NSC NOT NULL NUMBER(6)
CAS_RN NOT NULL VARCHAR2(11)
CL_1 NUMBER(5)
CL_2 NUMBER(3)
CL_3 NUMBER(5,4)
Index type
VARCHAR2
CLOB
exact
2:10
3:08
graph
7:10
10:54
blob
12:57
14:13
Table name
Query
Hits
VARCHAR2
CLOB
nci
contains(smi, 'OC(=O)CS') = 1
507
1.27
1.29
nci
matches(smi, '[OH]C(=O)CS') = 1
279
1.31
1.34
nci
fingertest(smi, 'OC(=O)CS') = 1
1237
0.41
0.43
nci
tanimoto(smi, 'OC(=O)CS') > 0.75
16
0.48
0.48
Index type
NT
Linux
exact
1:25
3:16
graph
5:12
8:21
blob
3:50
9:37
(*** ms is milliseconds, *.** is seconds)
Table name
Query
Hits
NT
Linux
nci
exact(smi, 'c1ccccc1') = 1
1
9 ms
7 ms
nci
graph(smi, 'c1ccccc1') = 1
3
10 ms
9 ms
nci
tautomer(smi, 'c1ccccc1') = 1
1
9 ms
8 ms
nci
fingertest(smi, 'C1CC1') = 1
4358
0.40
0.47
nci
contains(smi, 'C1CC1') = 1
863
2.96
3.01
nci
contains(smi, 'OC(=O)CS') = 1
507
0.65
0.77
nci
matches(smi, '[OH]C(=O)CS') = 1
279
0.67
0.80
nci
tanimoto(smi, '[OH]C(=O)CS') > 0.75
16
0.38
0.46
nci
tversky(smi, 'c1ccccc1', 0.5, 0.5) > 0.5
1465
0.44
0.56
nci
contains(smi, 'NCCc1cc(O)c(O)cc1') = 1
829
0.84
0.97
nci
tanimoto(smi, 'NCCc1cc(O)c(O)cc1') > 0.75
77
0.42
0.48