This program takes a class descriptor on stdin and looks up all members of that class in a merlin pool derived from the thor database named on the command line.
Each of these class members is used as a target to sort the database by similarity to the target.
For each target, the rank sum of all the other members of the class is calculated. Using a Mann-Whitney statistic, a test is carried out for each target in turn to see whether it retrieves the class from the population in the same way that the original classification did.
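The rank-sum step can be sketched in Python as follows. This is only an illustrative sketch, not the program's own code: the function name, the dictionary of similarities and the compound identifiers are hypothetical, larger similarity values are assumed to mean "more similar" (as with TANIMOTO), ranks are counted from 0 at the target itself, and ties are simply left in input order.

```python
def rank_sum_for_target(similarities, class_members, target):
    """Sum the ranks of the other class members for one target.

    similarities  : dict of compound id -> similarity to the target
                    (hypothetical input; larger means more similar)
    class_members : set of compound ids belonging to the class
    target        : compound id used as the query
    """
    # The target is placed at rank 0; everything else follows,
    # most similar first. sorted() is stable, so ties keep input order.
    others = [c for c in similarities if c != target]
    ranked = [target] + sorted(others, key=lambda c: similarities[c], reverse=True)

    # Only the *other* members of the class contribute to the rank sum.
    return sum(
        rank
        for rank, compound in enumerate(ranked)
        if compound in class_members and compound != target
    )
```

A class whose members cluster at the top of the ranked list gives a small rank sum; a class scattered through the list gives a large one.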
The test is one-sided, as we are testing not only that the two groups are separated but also that the separation is towards the top of the ranked list. Ranks are counted from 0, which is the target itself. Values in excess of 1.65 are significant at the 95% level and values in excess of 2.33 are significant at the 99% level.
Formulae

For each of m targets in a population of size N:

    N  = m + n
    Tm = targets[i].rank_sum

    Mann-Whitney U        = m*n + m*(m+1)/2 - Tm
    Mean mu               = m*n / 2
    Standard error sigma  = sqrt( m*n*(N + 1)/12 )

    z-score = ( U - mu ) / sigma
            = ( ( m*n + m*(m+1) )/2 - Tm ) / sigma

Let

    const = ( m*n + m*(m+1) )/2 = m*(N+1)/2

For the purists, this is the formulation of the Wilcoxon rank-sum test, as const is the expected rank-sum of a random sample of size m, i.e. the mean of the rank-sums.

So for each target in the class:

    z-score = ( const - Tm ) / sigma

As this is a single-tailed test, the critical values are 1.65 ( alpha = 0.05 ) and 2.33 ( alpha = 0.01 ). A z value which exceeds the critical value indicates that, with the given level of confidence, the target is separating the population into two classes.

Program zeemer.

Calling syntax:

    $ zeemer [options] databasename

Options:-

    -tag <tag>            the thor database tag of the datatype containing
                          the activity/class.
    -field <field>        the field of the activity/class datatype to use.
                          Default tag is "AC" and default field is "1",
                          i.e. the first.
    -coeff <coefficient>  the type of similarity measure to be used.
                          Valid values are TANIMOTO, EUCLID or TVERSKY.
                          Default is TANIMOTO.
    -alpha <value>        Tversky coefficient; values need to be in the
                          range 0.0 to 1.0.
    -beta <value>         Tversky coefficient; values need to be in the
                          range 0.0 to 1.0.
    -tdt                  prints the output in tdt format. Default is a
                          table.
                          $SMI<SMILES>
                          RANK<target;no_of_hits;sum of ranks;z-score>
                          |
    -zscore <value>       critical value of z-score used to filter
                          compounds. Useful values are 1.65 and 2.33. If
                          the value is exceeded, it means you are 95% or
                          99% sure that the separation into two classes
                          has not occurred by chance. When zscore is not
                          0.0, the program acts as a simple filter and
                          prints out a file of SMILES.
    -header               prints a header on the csv file. Default is to
                          omit it.
    -sort                 sorts the output by the rank sum. Default is to
                          leave it in input order. Unsorted output is
                          useful for comparing data from different
                          measures, as the output can be joined. See
                          join(). Sorting is not useful for tdt output.
    -dos                  produces output with cr/lf at the end of each
                          data line, to facilitate transfer to a Windows
                          statistics/graphics package.

Database specification format is:

    database%basepw@host:service:user%user-hostpw

The database name must be specified (other fields are optional).
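The z-score from the Formulae section can be written out as a short Python sketch. The function names are illustrative only and are not taken from the program. The values m, n and the rank sum are taken as given; how the program itself counts the target within m is not settled by the text above.

```python
import math


def z_score(m, n, rank_sum):
    """z-score for one target, following the Formulae section.

    m        : number of targets (class size)
    n        : number of remaining compounds, so that N = m + n
    rank_sum : Tm, the rank sum obtained for this target
    """
    big_n = m + n
    sigma = math.sqrt(m * n * (big_n + 1) / 12.0)   # standard error
    const = m * (big_n + 1) / 2.0                   # mean rank sum
    return (const - rank_sum) / sigma


def significance(z):
    """Map a z-score onto the one-tailed critical values 1.65 and 2.33."""
    if z > 2.33:
        return "99%"
    if z > 1.65:
        return "95%"
    return "not significant"
```

Because const - Tm equals U - mu, this gives the same value as computing the Mann-Whitney U first and then subtracting its mean; the single-constant form just saves a step per target.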
This program reads an activity type on stdin and returns a comma-delimited table in which each row gives a SMILES, the sum of the ranks of the other compounds in the same class, and the z-score, unless the -tdt option is set.
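As an illustration of working with the comma-delimited table downstream, the following Python sketch keeps only the SMILES whose z-score exceeds a critical value, much as the -zscore option is described as doing. The column layout is an assumption: the SMILES is taken to be the first column and the z-score the last. The function name and the sample rows in the usage note are hypothetical, not real program output.

```python
import csv
import io


def filter_by_zscore(csv_text, critical, has_header=False):
    """Return the SMILES from rows whose z-score exceeds `critical`.

    Assumes (not confirmed by the documentation) that the first column
    is the SMILES and the last column is the z-score.
    """
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)          # skip the line written by -header
    hits = []
    for row in reader:
        if len(row) < 3:            # ignore blank or malformed lines
            continue
        smiles, z = row[0], float(row[-1])
        if z > critical:
            hits.append(smiles)
    return hits
```

For example, `filter_by_zscore("CCO,12,2.50\nCCN,80,0.40\n", 1.65)` keeps only the first row's SMILES, since only its z-score is above the 95% critical value.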