Database of diverse compounds intended for screening
Well-formed data, e.g., isomers, salt data
Plated samples are available from InterBioscreen
Freely available as Thor database
Computation
10 by 40,000 by 100 trials each makes 40,000,000 dockings
Done on 240/256 R12000 processor SGI/Cray supercomputer
CEX used for communication between programs
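The job size quoted above is simple multiplication. A minimal sketch follows, using only the figures on this slide. The variable names are illustrative, and the per-CPU share assumes an even split across the 240 processors, which the slide does not state.

```python
# Figures as quoted on the slide; names are illustrative only.
n_proteins = 10
n_molecules = 40_000   # nominal library size, as quoted
n_trials = 100         # docking trials per protein/molecule pair
n_cpus = 240           # processors used on the 256-CPU machine

total_dockings = n_proteins * n_molecules * n_trials
dockings_per_cpu = total_dockings / n_cpus  # assumes an even split

print(f"{total_dockings:,} dockings in total")
print(f"about {dockings_per_cpu:,.0f} dockings per CPU")
```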
Informatics
Results as CEX objects converted to Thor datatrees
Consensus pick: ~400,000 conformations
Thor: $D3D-rooted subtrees contain scores, rms, etc.
Merlin: SMILES/conformation rows, 3-D data not in pool
Rasmol: widely used, free visualization program, local "expert"
Why?
How many structures/dockings does it take to generate interesting
results? Is 40,000 not enough, enough, more than enough?
Are some proteins very specific and some promiscuous,
or more-or-less the same?
Are some small molecules very specific and some promiscuous,
or more-or-less the same?
Do current (faddish?) scoring functions produce sensible results?
Is RMS a sensible gold standard?
Does consensus scoring make sense?
Does DockIt/docking eliminate obvious losers?
Does DockIt/docking find known binders?
Can "normal" informatics be wrapped around docking data?
Are generic communication (CEX, XML) protocols suitable for
production work?
Is DockIt ready for prime-time?
Do you need DockIt in your shop?
When? How?
Project started less than a month ago, done in a couple weeks!
Started with a close-to-production CEX-talking DockIt,
production Daylight small-molecule database and DBMS software, etc.
Would have taken a long time on our machines (45 days / many months).
Even using local compute resources (NCGR), it would have taken ~9 days, and we would
have been hard-pressed to get it onto CD-ROM by MUG.
SGI generously provided time on a brand new (~1 week old) 256 R12000
processor machine. Compute time using ~240 of the CPUs was less than 24 hours.
Access to the machine needed to be via private SGI network,
so Scott camped out at the SGI sales office in Portland and got it done.
The result set was ~54 GB. Getting the results backed out and
transferred was as difficult and time-consuming as everything else combined.
Reduction of the data to Thor/Merlin database format was relatively
easy: done in a day.
CD-ROMs are available now.
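The figures above imply rough per-docking bounds. This sketch is back-of-envelope arithmetic only. It treats "less than 24 hours" as an upper limit, assumes "~54 GB" means decimal gigabytes, and takes the nominal 40,000,000 dockings as the job size.

```python
total_dockings = 10 * 40_000 * 100  # 40,000,000, from the job description
cpus = 240                          # "~240 of the CPUs"
wall_hours = 24                     # "less than 24 hours", so an upper bound
result_bytes = 54e9                 # "~54 GB", assuming decimal gigabytes

cpu_seconds = cpus * wall_hours * 3600
cpu_days = cpu_seconds / 86_400
sec_per_docking = cpu_seconds / total_dockings
bytes_per_docking = result_bytes / total_dockings

print(f"at most {cpu_days:.0f} CPU-days in total")
print(f"at most {sec_per_docking:.2f} CPU-seconds per docking")
print(f"about {bytes_per_docking:.0f} bytes of results per docking")
```

These are bounds on this particular machine, not measurements of DockIt itself.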
Results
Database contains:
10 proteins as bound to ligands
10 ligands as bound to proteins
lowest RMS docking of 10 ligands to their proteins
consensus docking of 38,123 small molecules to each of 10 proteins
SMILES-rooted data:
name
fingerprint
heavy atom count
conformation-rooted data:
RMS for bindings
DOCK, PMF and PLP scores for dockings
docking consensus score (and ranks)
docked-to and bound-to IDs
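The RMS value stored for each conformation compares a docked pose with the observed bound pose. A minimal sketch of such a calculation follows. It assumes both poses list the same atoms in the same order and share the protein's coordinate frame, so no superposition or symmetry handling is done. The function name and coordinates are hypothetical, not DockIt's implementation.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Illustrative coordinates: the docked pose is shifted 1 unit along z.
bound = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(bound, docked))
```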
m4x10x40k database contents:
38,143 SMILES-rooted datatrees,
10 are proteins,
10 are observed ligands,
the rest (38,123) are from bioscr99sc.
381,113 conformation-rooted subtrees:
20 are bindings,
the rest (381,093) are dockings,
381,103 sets of DOCK, PMF, PLP & consensus scores
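The breakdown above can be checked with a few lines of arithmetic. This sketch only confirms that the listed parts sum to the stated totals. The dictionary keys are labels chosen for illustration.

```python
# Counts copied from the database contents listing.
smiles_trees = {"proteins": 10, "observed ligands": 10, "bioscr99sc": 38_123}
conformation_subtrees = {"bindings": 20, "dockings": 381_093}

smiles_total = sum(smiles_trees.values())
conformation_total = sum(conformation_subtrees.values())

print(f"{smiles_total:,} SMILES-rooted datatrees")
print(f"{conformation_total:,} conformation-rooted subtrees")
```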
m4x10x40k database sizes:
  size, MB   description
  205.9      m4x10x40k.DP, biggest file
  235.0      all database files
   65.3      merlin pool (3,620,250 dataitems)
   72.9      merlin total, including overhead
Preliminary conclusions
Docking informatics is definitely doable.
However, this represents a big change from how things
are currently done.
Handling tens of GB of data is work, but not "rocket science".
Modern disks have enough capacity.
Daylight DBMS works fine with proteins.
Proteins are just big molecules!
(Most are not really so big, at that.)
Almost no changes were needed to the Thor/Merlin system.
CEX makes such informatics possible.
CEX could be faster
XML might work if it came of age
Pushing around 40,000,000 PDBs would not have been reasonable
DockIt is ready for production
DockIt definitely cranks.
Scale up is not a factor:
large-scale performance is within a few percent of predictions.
Can expect to complete large jobs w/o crashes, blowups, etc.
Automated sphere generation (sphinx) and refinement (sublime)
are 90% but not 100% ... yet.
We still have the choice between best (with H's)
or fast (without H's), but not "best and fast" at the same time.
Lots of interesting dockings crop up
Could these strange and wonderful scaffolds really work?
The current crop of scoring functions isn't wonderful.
Between-molecule scoring comparisons are non-intuitive.
The three scoring functions provided with DockIt are very
different from each other.
In so far as they represent the current world-range in scoring,
these three are about as good as one can do.
Between-molecule consensus scoring has practical (rather than
theoretical) merit.
Conformational consensus scoring seems to work well
(e.g., among 100 docked conformations of the same molecule).
However, that's not what most people want to know.
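For reference, conformational consensus of the kind mentioned above can be sketched as a rank-sum over the three scoring functions. This is one common approach, not necessarily what DockIt does. The score values are made up, and treating lower scores as better for all three functions is an assumption.

```python
def ranks(values):
    """Rank values ascending (1 = lowest); ties share the lowest rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def consensus_ranks(score_table):
    """Sum each conformation's rank over every scoring function.

    score_table maps a scoring-function name to one score per conformation.
    Lower scores are assumed to be better for every function.
    """
    if not score_table:
        raise ValueError("no scores given")
    n = len(next(iter(score_table.values())))
    totals = [0] * n
    for scores in score_table.values():
        if len(scores) != n:
            raise ValueError("every function must score every conformation")
        for i, rank in enumerate(ranks(scores)):
            totals[i] += rank
    return totals

# Hypothetical scores for three conformations of a single molecule.
scores = {
    "DOCK": [-30.0, -25.0, -28.0],
    "PMF": [-80.0, -95.0, -90.0],
    "PLP": [-60.0, -55.0, -70.0],
}
totals = consensus_ranks(scores)
best = min(range(len(totals)), key=totals.__getitem__)
print(totals, "best conformation index:", best)
```

Ranking within one molecule avoids comparing raw scores across scoring functions, which differ in scale.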