MUG '00 - Weininger

10x40k Crunch
Scott Dixon
Jeff Blaney
Dave Weininger

What?

40,000 small molecules docked to 10 proteins using DockIt with results stored in a standard Thor/Merlin database.
DockIt: fast distance-geometry-based docking
- Robust program designed for production work
- Analytical DOCK solution
- Fastest distance-geometry-based docking program yet
- Three scoring functions: DOCK, PMF, PLP
- New Metaphorics product
10 proteins
- One protein, 1hpv, was selected "on purpose"
- Others selected randomly from ~150 known protein-ligand structures
- 1add 1ela 1hpv 1hyt 1rbp 1rds 1swd 2cbr 3cla 4hvp
- Small & large, simple & complex, well-known & obscure, one pair
- All observed bound to ligands (which were also docked)
40,000 small molecules from bioscr99sc
- Database of diverse compounds intended for screening
- Well-formed data, e.g., isomers, salt data
- Plated samples are available from InterBioscreen
- Freely available as Thor database
Computation
- 10 by 40,000 by 100 trials each makes 40,000,000 dockings
- Done on ~240 of the 256 R12000 processors of an SGI/Cray supercomputer
- CEX used for communication between programs
Informatics
- Results as CEX objects converted to Thor datatrees
- Consensus pick: ~400,000 conformations
- Thor: $D3D-rooted subtrees contain scores, rms, etc.
- Merlin: SMILES/conformation rows, 3-D data not in pool
- Rasmol: widely used, free visualization program, local "expert"

Why?

How many structures/dockings does it take to generate interesting results? Is 40,000 not enough, enough, more than enough?
Are some proteins very specific and some promiscuous, or more-or-less the same?
Are some small molecules very specific and some promiscuous, or more-or-less the same?
Do current (faddish?) scoring functions produce sensible results?
Is RMS a sensible gold standard?
Does consensus scoring make sense?
Does DockIt/docking eliminate obvious losers?
Does DockIt/docking find known binders?
Can "normal" informatics be wrapped around docking data?
Are generic communication (CEX, XML) protocols suitable for production work?
Is DockIt ready for prime-time?
Do you need DockIt in your shop?
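
The RMS question above concerns how far a docked pose sits from the observed (crystal) ligand position. The talk does not say exactly how RMS was computed, so the following is only a minimal sketch under stated assumptions: both poses are in the protein's coordinate frame (no superposition), atoms are listed in the same order, and symmetry-equivalent atoms are ignored. The function name and the coordinates are hypothetical.

```python
import math

def rmsd(pose, reference):
    """Root-mean-square deviation between two equally ordered (x, y, z) lists.

    Assumption: no superposition is done, because a docked pose and the
    crystal ligand already share the protein's coordinate frame.
    """
    if len(pose) != len(reference) or not pose:
        raise ValueError("coordinate lists must be non-empty and the same length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(pose, reference):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(pose))

# Hypothetical three-atom ligand: crystal pose, and a docked pose
# shifted by 1 Angstrom along x.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked = [(x + 1.0, y, z) for x, y, z in crystal]

print(rmsd(docked, crystal))   # rigid 1 A shift gives an RMSD of 1.0
print(rmsd(crystal, crystal))  # identical poses give 0.0
```

A single number like this hides which atoms are misplaced, which is one reason to ask whether it is a sensible gold standard.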

When? How?

Project started less than a month ago, done in a couple weeks!
Started with a close-to-production CEX-talking DockIt, production Daylight small-molecule database and DBMS software, etc.
Would have taken a long time on our machines (45 days / many months). Even using local compute resources (NCGR) would have taken ~9 days, and we would have been hard-pressed to get it onto CD-ROM by MUG.
SGI generously provided time on a brand new (~1 week old) 256-processor R12000 machine. Compute time using ~240 of the CPUs was less than 24 hours.
Access to the machine needed to be via private SGI network, so Scott camped out at the SGI sales office in Portland and got it done.
The result set was ~54 GB. Getting the results backed out and transferred was as difficult and time-consuming as everything else combined.
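
The figures quoted so far allow a rough back-of-envelope check of cost per docking. This sketch uses only numbers from the talk (10 proteins, 40,000 molecules, 100 trials, ~240 CPUs, under 24 hours, ~54 GB). Because the run took less than 24 hours, the CPU time is an upper bound, and reading "GB" as 10^9 bytes is an assumption.

```python
# Workload: every molecule docked to every protein, 100 trials each.
proteins = 10
molecules = 40_000
trials = 100
dockings = proteins * molecules * trials

# Compute budget: ~240 CPUs for at most 24 hours (upper bound).
cpus = 240
wall_hours_max = 24
cpu_seconds_max = cpus * wall_hours_max * 3600
cpu_seconds_per_docking = cpu_seconds_max / dockings

# Raw output volume, assuming decimal gigabytes.
result_bytes = 54e9
bytes_per_docking = result_bytes / dockings

print(f"dockings:               {dockings:,}")
print(f"CPU-seconds per docking: at most {cpu_seconds_per_docking:.4f}")
print(f"bytes per docking:       about {bytes_per_docking:.0f}")
```

Under these assumptions each trial docking cost at most about half a CPU-second and produced roughly 1.35 KB of raw output.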
Reduction of the data to Thor/Merlin database format was relatively easy: done in a day.
CD-ROMs are available now.

Results

Database contains:
- 10 proteins as bound to ligands
- 10 ligands as bound to proteins
- lowest RMS docking of 10 ligands to their proteins
- consensus docking of 38,123 small molecules to each of 10 proteins
- SMILES-rooted data:
  - name
  - fingerprint
  - heavy atom count
- conformation-rooted data:
  - RMS for bindings
  - DOCK, PMF and PLP scores for dockings
  - docking consensus score (and ranks)
  - docked-to and bound-to IDs
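
The talk does not define how the docking consensus score is derived from the DOCK, PMF and PLP scores. One common approach, shown here purely as an illustration and not as DockIt's method, is to rank the poses under each scoring function and sum the ranks. The sketch assumes that a lower score is better for all three functions; the function names and score values are hypothetical.

```python
def ranks(scores):
    """Return 1-based ranks, lowest (assumed best) score first.

    Ties keep their input order, since sorted() is stable.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def consensus(dock, pmf, plp):
    """Sum the per-function ranks of each pose; a smaller sum is better."""
    return [sum(r) for r in zip(ranks(dock), ranks(pmf), ranks(plp))]

# Hypothetical scores for four poses of one molecule against one protein.
dock = [-32.1, -28.4, -35.0, -20.2]
pmf = [-110.0, -95.5, -101.3, -60.7]
plp = [-75.2, -80.1, -78.9, -40.0]

totals = consensus(dock, pmf, plp)
best = min(range(len(totals)), key=lambda i: totals[i])
print(totals)  # [6, 7, 5, 12]
print(best)    # 2: the third pose has the smallest rank sum
```

Rank sums avoid having to put three differently scaled scores on a common scale, at the cost of discarding how large the score differences are.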
m4x10x40k database contents:
- 38,143 SMILES-rooted datatrees: 10 are proteins, 10 are observed ligands, the rest (38,123) are from bioscr99sc.
- 381,113 conformation-rooted subtrees: 20 are bindings, the rest (381,093) are dockings.
- 381,103 sets of DOCK, PMF, PLP & consensus scores
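
The layout described above, SMILES-rooted datatrees that each carry conformation-rooted subtrees, can be pictured with a small in-memory model. This is only an illustrative sketch: it is not the Thor API or its datatype names, all class and field names are hypothetical, and the fingerprint field is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Conformation:
    """One conformation-rooted subtree: a binding or a docking."""
    docked_to: Optional[str] = None   # protein ID for a docking
    bound_to: Optional[str] = None    # partner ID for an observed binding
    rms: Optional[float] = None
    dock: Optional[float] = None
    pmf: Optional[float] = None
    plp: Optional[float] = None
    consensus: Optional[float] = None

@dataclass
class Datatree:
    """One SMILES-rooted datatree with its conformations."""
    smiles: str
    name: str
    heavy_atoms: int
    conformations: List[Conformation] = field(default_factory=list)

def dockings_to(tree: Datatree, protein_id: str) -> List[Conformation]:
    """Select the conformations of a molecule docked to one protein."""
    return [c for c in tree.conformations if c.docked_to == protein_id]

# Hypothetical entry with made-up scores, docked to two of the ten proteins.
tree = Datatree(
    smiles="CCO",
    name="example molecule (hypothetical)",
    heavy_atoms=3,
    conformations=[
        Conformation(docked_to="1hpv", dock=-12.0, pmf=-30.5, plp=-20.1),
        Conformation(docked_to="4hvp", dock=-11.2, pmf=-28.9, plp=-19.4),
    ],
)
print(len(dockings_to(tree, "1hpv")))
```

The counts listed above are consistent with this picture: 10 + 10 + 38,123 = 38,143 datatrees, and 20 + 381,093 = 381,113 conformation subtrees.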

m4x10x40k database sizes:

size, MB	description
205.9	m4x10x40k.DP, biggest file
235.0	all database files
65.3	merlin pool (3,620,250 dataitems)
72.9	merlin total, including overhead

Preliminary conclusions

Docking informatics is definitely doable.
- However, this represents a big change from how things are currently done.
Handling tens of GB of data is work, but not "rocket science".
- Modern disks have enough capacity.
Daylight DBMS works fine with proteins.
CEX makes such informatics possible.
DockIt is ready for production.
Lots of interesting dockings crop up.
The current crop of scoring functions isn't wonderful.

Acknowledgements