~/Similarity_Toolbox/EMUG05/zeemer

This program takes a class descriptor on stdin and looks up all members of that class in a merlin pool derived from the thor database named on the command line.

Each of these class members is used as a target to sort the database by similarity to the target.

For each target the rank sum of all the other members of the class is calculated. Using a Mann-Whitney statistic, a test is carried out for each target in turn to see if it retrieves the class from the population in the same way that the original classification did.
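
The ranking step described above can be sketched in Python as follows. This is only an illustration, not the zeemer source: the function name, the dictionary of similarities and the handling of ties (left in input order) are all assumptions.

```python
def rank_sum_for_target(similarities, class_members, target):
    """Sum the ranks of the other class members for one target.

    similarities : dict of compound id -> similarity to the target
                   (higher means more similar), target included.
    class_members: set of compound ids in the class, target included.
    """
    # Sort most similar first; the target itself always takes rank 0.
    others = sorted(
        (c for c in similarities if c != target),
        key=lambda c: similarities[c],
        reverse=True,
    )
    ranked = [target] + others
    # Add up the ranks of every class member except the target.
    return sum(
        rank
        for rank, compound in enumerate(ranked)
        if compound in class_members and compound != target
    )


# Made-up data: "t" is the target, "a" and "b" share its class.
sims = {"t": 1.0, "a": 0.9, "x": 0.8, "b": 0.7, "y": 0.2}
total = rank_sum_for_target(sims, {"t", "a", "b"}, "t")  # ranks 1 and 3
```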

The test is one-sided as we are testing not only that the two groups are separated but also that the separation is towards the top of the ranked list. Ranks start at 0, which is the target itself. Values in excess of 1.65 are significant at the 95% level and values in excess of 2.33 are significant at the 99% level.
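
As a cross-check, the one-sided critical values quoted above can be recovered from the standard normal distribution. The sketch below uses Python's standard library; the function name is illustrative only.

```python
from statistics import NormalDist


def one_sided_critical_value(confidence):
    """z value that a standard normal variable exceeds with
    probability 1 - confidence (upper tail only)."""
    return NormalDist().inv_cdf(confidence)


z95 = one_sided_critical_value(0.95)  # about 1.645, quoted as 1.65
z99 = one_sided_critical_value(0.99)  # about 2.326, quoted as 2.33
```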

 Formulae
        For each of m targets in a population of size N 
                               N = m + n  
      			                   
                              Tm = targets[i].rank_sum

           Mann-Whitney        U = m*n + m*(m+1)/2 - Tm

           Mean               mu = m*n / 2

           Standard error  sigma = sqrt( m*n*(N + 1)/12 )

                         z-score = ( U - mu ) / sigma
                                 = ( (m*n + m*(m+1))/2 - Tm ) / sigma

           Let
                           const = ( m*n + m*(m+1))/2
                                 = m*(N+1)/2

       For the purists, this is the formulation of the Wilcoxon
       rank-sum test: const is the expected rank-sum of a random sample
       of size m, i.e. the mean of the rank-sums.
                
       So for each target in class
                         z-score  = ( const - Tm ) /sigma

       As this is a single tailed test the critical values are
       1.65 ( alpha = 0.05 ) and 2.33 ( alpha = 0.01 )

       A z value which exceeds the critical value indicates that,
       with the given level of confidence, the target is separating
       the population into two classes.
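
The formulae above translate directly into a few lines of Python. This is a minimal sketch that follows the symbols m, n, N, Tm, const and sigma as defined in this section; the function name and the numbers in the example are made up for illustration and are not taken from zeemer itself.

```python
import math


def zeemer_z_score(m, n, rank_sum):
    """z-score = (const - Tm) / sigma, using the formulae above."""
    N = m + n
    const = m * (N + 1) / 2                   # expected rank sum
    sigma = math.sqrt(m * n * (N + 1) / 12)   # standard error
    return (const - rank_sum) / sigma


# Illustrative values only: m = 5, n = 95 and a rank sum of 15.
z = zeemer_z_score(5, 95, 15)
significant_99 = z > 2.33
```

A rank sum equal to const gives a z-score of zero, and smaller rank sums (class members nearer the top of the list) give larger positive z-scores.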


Program zeemer.

Calling syntax: $ zeemer [options] databasename

Options:-
        -tag  <tag>
                the thor database tag of the datatype containing the activity/class
        -field <field>
                the field of the activity/class datatype to use
                Default tag is "AC" and default field "1", i.e. the first.
        -coeff <coefficient>
                the type of similarity measure to be used.
                Valid values are TANIMOTO, EUCLID or TVERSKY.
                Default is TANIMOTO.
        -alpha <value>
                Tversky alpha coefficient. Must be in the range 0.0 to 1.0.
        -beta <value>
                Tversky beta coefficient. Must be in the range 0.0 to 1.0.
        -tdt
                prints the output in tdt format. Default is a table.
                Each record has the form:
                $SMI<SMILES>
                RANK<target;no_of_hits;sum of ranks;z-score>
                |
        -zscore <value>
                Critical value of z-score used to filter compounds. Useful values
                are 1.65 and 2.33. If the value is exceeded you can be 95% or 99%
                confident, respectively, that the separation into two classes has
                not occurred by chance. When zscore is non-zero the program acts
                as a simple filter and prints out a file of SMILES.
        -header
                prints a header on the csv file. Default is to omit it.
        -sort
                sorts the output by the rank sum. Default is leave in input order.
                Unsorted is useful for comparing data from different measures as the
                output can be joined. See join().
                Sorting is not useful for tdt output.
        -dos
                produces output with cr/lf at the end of each data line to facilitate
                transfer to a Windows statistics/graphics package
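
For readers unfamiliar with the -coeff, -alpha and -beta options, the sketch below shows the usual textbook definition of the Tversky index on sets of fingerprint bits. It is an assumption that zeemer applies alpha and beta in this way; the function name and the treatment of an empty denominator are illustrative choices.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky index between two sets of 'on' fingerprint bits.

    alpha weights bits found only in a, beta weights bits found only
    in b. With alpha = beta = 1 this reduces to the Tanimoto
    coefficient.
    """
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denom = alpha * only_a + beta * only_b + common
    # Assumed convention: return 0.0 when there is nothing to compare.
    return common / denom if denom else 0.0


fp1 = {1, 2, 3, 4}
fp2 = {3, 4, 5}
tanimoto = tversky(fp1, fp2)                # 2 / (2 + 1 + 2)
asymmetric = tversky(fp1, fp2, 1.0, 0.0)    # 2 / (2 + 0 + 2)
```

Unequal alpha and beta make the measure asymmetric, so swapping the two fingerprints can change the value.
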
Database specification format is: database%basepw@host:service:user%user-hostpw
The database name must be specified (other fields are optional).

This program reads the activity type on stdin and, unless the -tdt option is set, writes a comma-delimited table in which each row gives a SMILES, the sum of the ranks of the other compounds in the same class, and the z-score.
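
Output written with the -tdt option could be read back with something like the Python sketch below. It assumes only the record layout shown under -tdt ($SMI, RANK with four semicolon-separated fields, and a closing "|"); the function name, the dictionary keys and the sample record are made up for illustration, and no TDT escaping rules are handled.

```python
def parse_zeemer_tdt(text):
    """Parse $SMI<...> / RANK<...> / | records into dictionaries."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "|":
            # End of record.
            if current:
                records.append(current)
            current = {}
        elif line.startswith("$SMI<") and line.endswith(">"):
            current["smiles"] = line[len("$SMI<"):-1]
        elif line.startswith("RANK<") and line.endswith(">"):
            payload = line[len("RANK<"):-1]
            # Split from the right so the target field is kept whole.
            target, hits, rank_sum, z = payload.rsplit(";", 3)
            current.update(
                target=target,
                hits=int(hits),
                rank_sum=float(rank_sum),
                z_score=float(z),
            )
    return records


# Made-up sample record, not real zeemer output.
sample = "$SMI<CCO>\nRANK<CCO;4;37;2.71>\n|\n"
rows = parse_zeemer_tdt(sample)
```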