Daylight v4.9
Release Date: 1 February 2008

Name

jarpat - perform Jarvis-Patrick clustering

Unix Synopsis

jarpat [options] in.tdt [out.tdt]

Description

jarpat(1) clusters items based on the occurrence of shared near neighbors. Input is a .tdt file (either "list" or "dump" format) containing "Nearest neighbors" (NN) data, typically produced by nearneighbors(1). Input is copied to output with a "Cluster" (CL) data item inserted after each "nearest neighbors" item. Its output is intended to be processed by other programs such as showclusters(1) and listclusters(1).

The original Jarvis-Patrick algorithm specifies that each item in a cluster must meet two criteria with at least one other member of that cluster based on lists of "near" nearest neighbors:

1. each item must be in each other's list
2. "need" neighbors must be common to both lists

Parameters "need" and "near" determine the nature of the resultant clustering and may be adjusted to meet clustering goals. For instance, specifying "jarpat -p 6/10" requires that 6 of the 10 nearest neighbors be in common for two items to cluster.
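The two criteria can be written as a small predicate. The following Python sketch is illustrative only and is not jarpat's implementation; the function name jp_linked and the dictionary layout of the neighbor lists are assumptions made for the example.

```python
def jp_linked(a, b, nn, need, near):
    """Return True if items a and b satisfy both Jarvis-Patrick criteria.

    nn is a hypothetical mapping from each item to its neighbor list,
    best neighbor first. Only the first `near` entries are considered.
    """
    list_a = nn[a][:near]
    list_b = nn[b][:near]
    # Criterion 1: each item must be in the other's list.
    mutual = b in list_a and a in list_b
    # Criterion 2: at least `need` neighbors must be common to both lists.
    shared = len(set(list_a) & set(list_b))
    return mutual and shared >= need


# Toy neighbor lists (each list starts with the item itself).
nn = {
    "A": ["A", "B", "C", "D"],
    "B": ["B", "A", "C", "E"],
    "E": ["E", "F", "G", "B"],
}

print(jp_linked("A", "B", nn, need=3, near=4))  # A, B, C are shared
print(jp_linked("A", "E", nn, need=1, near=4))  # E is not in A's list
```

In this toy data, A and B link at 3/4 because they appear in each other's lists and share three entries, while A and E fail the first criterion regardless of "need".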

The default cluster method used by jarpat(1) is as per the original algorithm. For some purposes, relaxing the first Jarvis-Patrick criterion leads to better clustering (fewer singletons, better linkage, more complete clusters). jarpat(1) provides two options which relax the first criterion. The -e option removes the first criterion completely, resulting in the most complete clustering at the expense of doing an exhaustive search (warning: this can be CPU-intensive). The -f option partially relaxes the first criterion, requiring only that one item is in the other's list (instead of requiring that this is true both ways). Clustering using the -f option produces results which approximate -e results, but which can be computed even faster than the original method.
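The effect of relaxing the first criterion can be seen on a toy data set. The Python sketch below is an assumption-laden illustration, not jarpat's code: the function jp_cluster, its mode names, and the use of union-find to merge linked items are choices made for the example. It compares every pair of items in all three modes, so it shows only which pairs are allowed to link, not the relative running times described above.

```python
from itertools import combinations


def jp_cluster(nn, need, near, mode="standard"):
    """Group items whose neighbor lists satisfy the (possibly relaxed) criteria.

    nn is a hypothetical mapping from item to neighbor list, best first.
    mode is "standard", "fast" (one-way membership) or "exhaustive"
    (no membership requirement).
    """
    lists = {item: set(neigh[:near]) for item, neigh in nn.items()}
    parent = {item: item for item in nn}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(nn, 2):
        a_in_b, b_in_a = a in lists[b], b in lists[a]
        if mode == "standard":
            first = a_in_b and b_in_a
        elif mode == "fast":
            first = a_in_b or b_in_a
        elif mode == "exhaustive":
            first = True
        else:
            raise ValueError("unknown mode: %s" % mode)
        # The second criterion is the same in every mode.
        if first and len(lists[a] & lists[b]) >= need:
            parent[find(a)] = find(b)

    clusters = {}
    for item in nn:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(c) for c in clusters.values())


nn = {
    "A": ["A", "B", "C"],
    "B": ["B", "A", "D"],
    "C": ["C", "B", "D"],
    "D": ["D", "F", "G"],
    "E": ["E", "F", "G"],
}

print(jp_cluster(nn, 2, 3, "standard"))    # [['A', 'B'], ['C'], ['D'], ['E']]
print(jp_cluster(nn, 2, 3, "fast"))        # [['A', 'B', 'C'], ['D'], ['E']]
print(jp_cluster(nn, 2, 3, "exhaustive"))  # [['A', 'B', 'C'], ['D', 'E']]
```

Each relaxation leaves fewer singletons: C joins only when one-way membership is accepted, and D and E join only when membership is not required at all.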

Options

-JP_TYPE EXHAUSTIVE
Do an exhaustive search for cluster members by fully relaxing the first Jarvis-Patrick criterion (don't require cluster members to be in each other's lists). [-e]
-JP_TYPE FAST
Do a fast search for cluster members by half relaxing the first Jarvis-Patrick criterion (require that only one cluster member be in another's list). [-f]
-JP_RUNID runid
Identify this run by `runid' in $CLG and CL output data. If not specified, the runid field of CL data will be omitted and the first field of $CLG will be set to "na". It is important to specify a short, unique name for the run if the results are to be stored in a database with data from other clustering runs. [-id]
-NNID nnid
Only use nearest neighbor lists identified by `nnid', rather than the first one encountered in the tree. This is chiefly useful for testing: in normal use, only one set of nearest neighbors is ever generated. [-in]
-RECORD_COUNT count
Initially allocate memory for `count' structures. Ideally, `count' should be equal to or slightly greater than the number of structures to be input. If more than `count' structures are encountered while reading input, memory will be reallocated as needed, resulting in a performance penalty and a possible "out of memory" error. The default is 10000. [-m]
-JP_WRITE_NNDATA TRUE
Copy "Nearest neighbors" (NN) data to output (not done by default). [-nn]
-JP_NEED need -JP_NEAR near
Specify the adjustable parameters for Jarvis-Patrick clustering; require that <need> of <near> nearest neighbors be in common. Since nearneighbors(1) defines the nearest neighbor of an item to be itself, and the standard Jarvis-Patrick algorithm requires items to be in each other's lists to cluster, 10/16 (the default values) really means that 9 of 15 other neighbors must be in common. [-p]
-RESCUE_SIZE sos
Rescue singletons by putting them in the cluster which contains the plurality, but not less than <sos>, of their <near> nearest neighbors (where <near> is as per option -p). Singletons are not rescued by default.
-SINGLETON_FILE file
Write singleton trees to the specified file, excluding NN and CL data. Such files are typically used for reclustering singletons. By default, singletons are not written to a separate file.
-NN_BEST_THRESHOLD val
Treat any structure whose best neighbor is less similar than `val' as a singleton. By default, no such threshold is applied.
-COMPARISON [DISTANCE|SIMILARITY]
Controls relative goodness of similarity comparisons for tie-handling. SIMILARITY means that higher values are better; DISTANCE means that lower values are better. This option is used only in conjunction with -NN_BEST_THRESHOLD; otherwise it is ignored. (Default: SIMILARITY)
-JP_USE_ALL_TIES
Count all ties in the neighbor list as part of the NEED for clustering. The default behavior is to count only any common ties which can be part of the first NEAR neighbors in the list.
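Two of the options above, -NN_BEST_THRESHOLD (with -COMPARISON) and -RESCUE_SIZE, describe simple decision rules. A minimal Python sketch of those rules follows. It is not jarpat's code: the function names, the `cluster_of' mapping, and the treatment of values exactly equal to the threshold and of exact ties between clusters are assumptions, since the option descriptions do not specify them.

```python
from collections import Counter


def forced_singleton(best_value, val, comparison="SIMILARITY"):
    """Return True if the best neighbor is 'less similar' than val.

    Assumption: a value exactly equal to val does not force a singleton.
    """
    if comparison == "SIMILARITY":   # higher values are better
        return best_value < val
    if comparison == "DISTANCE":     # lower values are better
        return best_value > val
    raise ValueError("comparison must be SIMILARITY or DISTANCE")


def rescue_cluster(neighbors, cluster_of, sos, near):
    """Return the cluster a singleton would be rescued into, or None.

    neighbors  : the singleton's neighbor list, best first
    cluster_of : hypothetical mapping from item to cluster id
                 (None or missing for items that are not in a cluster)
    """
    votes = Counter(
        cluster_of[n] for n in neighbors[:near]
        if cluster_of.get(n) is not None
    )
    if not votes:
        return None
    ranked = votes.most_common(2)
    best, count = ranked[0]
    # Assumption: an exact tie for first place means there is no plurality.
    if len(ranked) > 1 and ranked[1][1] == count:
        return None
    # The winning cluster must hold at least `sos` of the neighbors.
    return best if count >= sos else None


cluster_of = {"a": 1, "b": 1, "c": 2, "d": None, "s": None}
neighbors = ["s", "a", "b", "c", "d"]
print(rescue_cluster(neighbors, cluster_of, sos=2, near=5))
print(forced_singleton(0.62, 0.70, "SIMILARITY"))
```

Here the singleton "s" has two neighbors in cluster 1 and one in cluster 2, so it would be rescued into cluster 1 with a rescue size of 2 but left alone with a rescue size of 3.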

Return Value

Returns 0 to its environment on success, or 1 on error, in which case a diagnostic message is printed:

jarpat: input file not specified
An input file was not specified on the command line.
jarpat: can't open input file
The input file specified on the command line does not exist or is not readable.
jarpat: can't open output file
The output file specified on the command line can't be created for writing.
jarpat: bad value for option xx xxxxxxx
A non-numeric value was specified for an integer-valued option.
jarpat: <need> is greater than <near>.
The value given for "need" was greater than the value given for "near" when requiring "need" of "near" nearest neighbors.
jarpat: unknown option encountered: xxx
An invalid option was specified on the command line.
jarpat: out of memory
The program was not able to allocate enough virtual memory to run the specified problem. Use the -m option to specify the number of datatrees to be input and limit the nearest neighbor list sizes with the -p option. If it still fails, your computer doesn't have enough virtual memory to process the input file.
jarpat: note, x of x input trees contain valid NN data
This (non-fatal) message appears if not all input trees contain nearest neighbor lists, and is intended to let the user know how much work is actually being done (trees without NN data are ignored in the clustering). The number of trees with nearest neighbors is typically a few fewer than the total.
jarpat: no trees with valid nearest neighbor lists were found
No valid nearest neighbor lists were found, either because no NN data were input or because their "Run ID" didn't match that specified with the -in option.

Examples

jarpat(1) is typically used in concert with other programs for doing cluster analysis on chemical structures.

Starting from a small (less than 10000 structures) .smi file containing SMILES and CAS Numbers, perform a cluster analysis and display results:

$ smi2tdt -t '$CAS' smicas.smi > smicas.tdt
$ fingerprint -id 7DEC smicas.tdt > fingers.tdt
$ nearneighbors -NNID 7DEC fingers.tdt > neighbors.tdt
$ jpscan neighbors.tdt > /dev/printer
$ jarpat -JP_NEED 8 -JP_NEAR 14 -NNID 7DEC neighbors.tdt > clusters.tdt
$ showclusters -h -q -v clusters.tdt | more
$ showclusters -h -q -x clusters.tdt > /dev/printer

Repeat the above clustering, but instead of clustering with parameters 8/14, do the primary clustering with 10/16 then save, examine, and recluster singletons at 5/10:

$ jarpat -JP_RUNID I -SINGLETON_FILE sing.tdt neighbors.tdt > clusters.tdt
$ showclusters -h -q -v clusters.tdt > /dev/printer
$ nearneighbors -NNID II sing.tdt > sing_nn.tdt
$ jpscan sing_nn.tdt > /dev/printer
$ jarpat -JP_NEED 5 -JP_NEAR 10 -NNID II sing_nn.tdt > sing_cl.tdt
$ showclusters -h -q -v sing_cl.tdt | more
$ showclusters -h -q -x sing_cl.tdt > /dev/printer

Make a .smi file containing the SMILES and CAS Numbers of cluster centroids from both of the above clusterings.

$ listclusters -d '$CAS' -s -x clusters.tdt > cents.smi
$ listclusters -d '$CAS' -s -x sing_cl.tdt >> cents.smi
$ cat cents.smi > /dev/printer

Starting from a .tdt file containing 80,000 SMILES with CAS Numbers, perform a 6/10 cluster analysis and display cluster centroids by CAS Number.

$ fingerprint -id 4OCT smicas.tdt > fp.tdt
$ nearneighbors -RECORD_COUNT 80000 -NNID 4OCT fp.tdt > nn.tdt
$ jarpat -JP_TYPE FAST -JP_NEED 6 -JP_NEAR 10 -RECORD_COUNT 80000 nn.tdt > cl.tdt
$ showclusters -d '$CAS' -x cl.tdt | more

Prepare a .tdt file to be used to load the above results in a Thor database.

$ listclusters -x cl.tdt > clusterdata.tdt

Files

$DY_ROOT/bin/jarpat

Daylight License

programs: cluster

Related Topics

fingerprint(1) jpscan(1) listclusters(1) mergeneighbors(1) nearneighbors(1) showclusters(1) licensing(5)

Daylight Theory Manual

R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors", IEEE Trans Comp, Vol C22, No 11, pp 1025-1034, 1973.

P. Willett, "Similarity and Clustering in Chemical Information Systems", Research Studies Press, Letchworth, Hertfordshire, 1987.

Bugs

None known.