In regard to figure 1a from Teichmann et al.

How does one set up the scoring for this kind of experiment?

pending -- I have emailed the group asking what criteria their pathologist used.

Why is the Mann-Whitney test used?

The Mann-Whitney test (AKA Wilcoxon rank-sum test) is a non-parametric (distribution-free) test used to compare two independent groups of sampled data. It is used when the values are ordinal (they can always be ranked by their value) but the intervals between values are not necessarily equal. Thus, if clinical score is such that 3>2>1, the scores can be ordered, but the difference between 3 and 2 is not necessarily the "same" as that between 2 and 1, so we would use a MW test. This test can also be used instead of the t-test when the assumptions of normality and equal variances are not met. Like many non-parametric tests, it uses the ranks of the data rather than their raw values to calculate the statistic. The clinical score data from the experiment are ordinal, so this test is appropriate.

How does one do it?

Paraphrased and modified from the Wikipedia article on Mann-Whitney.

For small samples a direct method is quick:

  1. Rank the values from both samples together in a single listing (under the null hypothesis, the two samples are drawn at random from the same population, so their ranks are interleaved at random).
  2. For ease of calculation, choose the sample whose ranks seem to be smaller. Call this "sample 1," and call the other sample "sample 2."
  3. For each observation in sample 1, count the number of observations in sample 2 that have a smaller rank (count a half for any that are equal to it). Do this only for sample 1.
  4. U is the sum of all these counts.

The theoretical maximum value of U is the product of the sample sizes for the two samples.
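The counting steps above can be sketched in a few lines of Python. This is only an illustration: the function name `mann_whitney_u` and the two score lists are invented here and are not data from the paper. Comparing raw values is equivalent to comparing ranks, so the explicit ranking step is skipped.

```python
def mann_whitney_u(sample1, sample2):
    """Direct-count U for sample1: for each value in sample1, count the
    values in sample2 that are smaller, adding one half for each tie."""
    u = 0.0
    for x in sample1:
        for y in sample2:
            if y < x:
                u += 1.0
            elif y == x:
                u += 0.5
    return u


# Hypothetical clinical scores, invented for illustration only.
control = [0, 1, 1, 2]
treated = [2, 3, 3, 4, 4]

u_control = mann_whitney_u(control, treated)
u_treated = mann_whitney_u(treated, control)

print(u_control, u_treated)
# The two U values always add up to the product of the sample sizes,
# which is also the theoretical maximum of either one.
print(u_control + u_treated == len(control) * len(treated))
```

The smaller of the two U values is the one usually compared against a table of critical values.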

Why is it appropriate to use a two-tailed test?

The Mann-Whitney test assesses whether the values in one group tend to be larger than the values in the other. With a two-tailed test, the null hypothesis is rejected if the values in the test group are either "sufficiently larger" or "sufficiently smaller" than those in the control group. If only one direction ("sufficiently smaller" OR "sufficiently larger") were relevant to the study, a one-tailed test would be appropriate.
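To make the difference between the tails concrete, here is a rough sketch that computes exact p-values for small samples by enumerating every way of splitting the pooled values into two groups. The function names and scores are hypothetical, and the two-tailed rule used here (count every split whose U lies at least as far from the midpoint n1*n2/2 as the observed U) is one common convention assumed for illustration, not necessarily what the authors' software does.

```python
from itertools import combinations


def u_statistic(a, b):
    """U for group a: pairs where a's value is larger, plus half of the ties."""
    return sum(1.0 if y < x else 0.5 if y == x else 0.0 for x in a for y in b)


def exact_p_values(sample1, sample2):
    """Return (lower-tail, upper-tail, two-tailed) exact p-values for U of sample1."""
    pooled = list(sample1) + list(sample2)
    n1, n2 = len(sample1), len(sample2)
    center = n1 * n2 / 2  # value U is centered on under the null hypothesis
    observed = u_statistic(sample1, sample2)
    lower = upper = two_tailed = total = 0
    # Under the null hypothesis every split of the pooled data is equally likely.
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        u = u_statistic(a, b)
        total += 1
        lower += u <= observed
        upper += u >= observed
        two_tailed += abs(u - center) >= abs(observed - center)
    return lower / total, upper / total, two_tailed / total


# Hypothetical clinical scores, invented for illustration only.
control = [0, 1, 1, 2]
treated = [2, 3, 3, 4, 4]

p_lower, p_upper, p_two_tailed = exact_p_values(control, treated)
print(p_lower, p_upper, p_two_tailed)
```

The two-tailed p-value counts extreme splits in both directions, so it is never smaller than the one-tailed p-value in the observed direction.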

Comment on whether it would be appropriate or not to conduct a correction for multiple-comparisons and how one would do it.

A Bonferroni correction would be appropriate for multiple comparisons. However, in this figure only two groups are sampled, which gives a single comparison, and so no correction of the p-value is needed.

A Bonferroni correction is done as follows:

1. Choose the critical one-comparison significance level α desired for rejecting the null hypothesis. Often this is α = 0.05.

2. Divide this α by the number of comparisons done. Thus, if three comparisons have been made, α_corr = 0.05/3 ≈ 0.017.

The null hypothesis for a given comparison will be rejected if its p < 0.017.
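As a small sketch of the arithmetic, the correction can be written as below. The function names and the three p-values are made up for illustration.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Corrected per-comparison threshold: alpha divided by the number of comparisons."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def reject_null(p_values, alpha=0.05):
    """For each p-value, report whether it falls below the corrected threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical p-values from three comparisons, invented for illustration only.
p_values = [0.010, 0.020, 0.300]

print(round(bonferroni_alpha(0.05, 3), 3))  # 0.017
print(reject_null(p_values))                # [True, False, False]
```

Note that 0.020 would pass an uncorrected 0.05 cutoff but does not pass the corrected one.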