Daycart Lab 2 - With hints

Daylight Summer School 2002, June 5-7, Santa Fe, NM

DayCart TCM Interface and Worksheet -- WITH HINTS

The Daycart TCM interface is a simple CGI interface which uses a combination of normal CGI and Oracle Call Interface code to provide a web interface to the Traditional Chinese Medicines data. The interface can be accessed HERE. The web interface shows, in the source for each HTML page, the SQL query or queries which were used to retrieve the data. You will query the TCM data from the Web interface and then will perform the same queries from SQLPlus and get the results.

Starting SQLPLUS:

telnet day0, login mug/coffee sqlplus mug/coffee

The main table of interest is: tcm01_mol. It has chemical structures. The other four tables (tcm01_plant, tcm01_plant_mol, tcm01_ref, tcm01_mol_ref) contain data about plants and references which are associated with the molecules. You can get information about the columns in the tables with the 'describe' command. For example:

describe tcm01_mol;

TCM Interface Queries

Perform a structural query for Dopamine (NCCc1cc(O)c(O)cc1). Note that the Web interface returns three records. Now, perform the same query using SQLPlus. Again, you'll find the actual SQL query embedded in the source to the html ("View...Page Source" from Netscape).
You can either draw a structure using JavaGrins or cut-and-paste the SMILES. After viewing the Web results, try the SQLPlus query on day0.
The SQL command is:
```
select pcn, smi, tcmm from tcm01_mol
  where exact(smi, smi2cansmi('NCCc1cc(O)c(O)cc1', 0)) = 1
```
Note that the SQLPlus query returns the TCMM field, but this isn't displayed in the web page. Where is it used?

Follow one of the links and view the query which is used. Note that the tcmm field is used as a query value to retrieve additional molecular data:
```
select smi, ism, pcn, mf, mw, src, rem, cas, tcmm from tcm01_mol
  where tcmm = 'M1831'
```
Perform a structural query for the following SMILES: NCCc1cc(S)ccc1, chosen because it is not in the database. Note that the CGI attempts to find exact matches, then substructures, and failing both of those, will return similar structures. Perform the similarity search using SQLPlus. Again, you'll find the actual SQL query embedded in the source to the html.

The SQL command is:
```
select pcn, smi, tcmm, tanimoto(smi, 'NCCc1ccc(S)cc1') from tcm01_mol
   where tanimoto(smi, 'NCCc1ccc(S)cc1') > 0.6
   order by tanimoto(smi, 'NCCc1ccc(S)cc1') desc
```
Note that Oracle, as part of the SQL query, is used to order the results. Try the query without the "order by ..." clause.
Modify the SQL query to display the molecular weight of each hit in the results.
The SQL command is:
```
select pcn, smi, tcmm, tanimoto(smi, 'NCCc1ccc(S)cc1'), smi2amw(smi)
   from tcm01_mol where tanimoto(smi, 'NCCc1ccc(S)cc1') > 0.6
   order by tanimoto(smi, 'NCCc1ccc(S)cc1') desc
```
Modify the SQL query to only display the hits with a molecular weight of greater than 150.
The SQL command is:
```
select pcn, smi, tcmm, tanimoto(smi, 'NCCc1ccc(S)cc1'), smi2amw(smi)
   from tcm01_mol where tanimoto(smi, 'NCCc1ccc(S)cc1') > 0.6
   and smi2amw(smi) > 150
   order by tanimoto(smi, 'NCCc1ccc(S)cc1') desc
```

Oracle Data Manipulation

Create a demo table as a subset of tcm01_mol. Use your initials as part of the name (eg. tcm_jjd) so it doesn't collide with a classmates. Only include the first 100 records from tcm01_mol.
We'll use a variation the 'CREATE TABLE' command, which allows you to get the data from a select statement. The 'rownum' pseudocolumn allows us to only get the first 100 rows.
```
create table tcm_jjd as 
  select * from tcm01_mol
    where rownum <= 100;
```
Add a column for average molecular weight. Populate the new table with calculated values.

We use the 'ALTER TABLE' command to add the column, then use the Daycart function smi2amw(smi) to add the data.
```
alter table tcm_jjd add new_mw number;
desc tcm_jjd;
```
To populate the columns:
```
update tcm_jjd 
  set new_mw=smi2amw(smi);
```
Validate the calculated molecular weight (new_mw) against the stored molecular weight (mw). Look for a discrepancy greater than 1.0.

We use a select query. The SQL function "ABS()" returns the absolute value of a number. It might come in handy.
```
select smi, mw, new_mw from tcm_jjd
  where abs(mw - new_mw) > 1.0;
```
Note that we can also do the validation by calculating the molecular weight on the fly:
```
select smi, mw, smi2amw(smi) from tcm_jjd
  where abs(mw - smi2amw(smi)) > 1.0;
```

Multi-Table Queries

Find out how many structures are common between the nci table and the tcm01_mol table.

We can use a regular join query. Since the SMILES in the two tables have been canonicalized and are stored as simple VARCHAR2 data, the following simple query will work:
```
select count(1) from tcm01_mol, nci
  where tcm01_mol.smi = nci.smi;
```
Refine the query and find all structures in both nci and tcm01_mol that contain the phenyl ethylamine substructure.

There are several ways to answer this query. A nested query will work:
```
select smi from tcm01_mol
  where contains(smi, 'NCCc1ccccc1') = 1
  and smi in (select smi from nci)
```
The above query finds all SMILES which match the contains() query, and are also found in the list of SMILES in NCI.
An intersection query will also work.
```
select smi from tcm01_mol
  where contains(smi, 'NCCc1ccccc1') = 1
intersect
select smi from nci
```
Note that the two queries above give different answers. In fact, the INTERSECT set operation removes duplicates, so the second query has less answers. The query below gives the same answers as the INTERSECT-based query.
```
select distinct(smi) from tcm01_mol
  where contains(smi, 'NCCc1ccccc1') = 1
  and smi in (select smi from nci)
```
Both the nci and tcm01_mol tables have a CAS Registry number column. Use the matching SMILES to validate the CAS registry numbers. That is, for all the structures in common between the two tables, how many have CAS registry number discrepancies?

Building on top of the previous queries we can find all rows with common SMILES and then further prune based on CAS registry number. First, simply print out both CAS registry number columns with the common SMILES.
```
select nci.smi, tcm01_mol.cas, nci.cas_rn from tcm01_mol, nci
  where tcm01_mol.smi = nci.smi;
```
Next, lets skip the NULL cas numbers...
```
select nci.smi, tcm01_mol.cas, nci.cas_rn from tcm01_mol, nci
  where tcm01_mol.smi = nci.smi
  and nci.cas_rn is not null
  and tcm01_mol.cas is not null
```
Finally, add the clause to compare the two cas registry number columns:
```
select nci.smi, tcm01_mol.cas, nci.cas_rn from tcm01_mol, nci
  where tcm01_mol.smi = nci.smi
  and nci.cas_rn is not null
  and tcm01_mol.cas is not null
  and nci.cas_rn != tcm01_mol.cas
```

Daylight Chemical Information Systems Inc.
support@daylight.com