Field of Science

A stats question RE T-tests and U-tests

So I'm in the midst of the "Oh fuck, must get actual numbers and graphs for publication" stage of my project. This means I must not only generate piles of data, but also make it talk, and speak the truth. Which means I get to interrogate it with statistics, mwahaha. I actually enjoy this part of the process, since you can magically convert piles of numbers into pretty p-values and sexy graphs showing how earth-shakingly significant your data is...oh, well, statistically significant anyway. That is, if your stats are being done correctly; otherwise the whole activity is a futile waste of taxpayer dollars, more so than it usually is.

So I noticed that for situations where I'd expect some sort of significance (they're bloody obviously different, but I never thought/said that because I'm a 'good scientist' and all that...), the p-values were...well, maybe a little bit too low. Like, they were kind of insane -- 10^-45? Oh come on... it would be awesome, er, boring if biological data were so clean! But both the t-test and the Mann-Whitney U test showed extreme significance, with the latter being more trustworthy in my case, or so I've been told (I have a prominent shift in distributions rather than means; that is, cells in the drug-treated case get really big, while in the mock they don't get really big.)

So I decided to test the data I know shouldn't be significant -- treatments of two wild-type ecotypes, and another case where the drug had no effect. So here's my data that SHOULDN'T be significant:

The first graph shows means and st dev errorbars, second graph shows quartile box plots of the same data (that is, no obvious shift in distribution either). Then I have t-test results from Excel, which show significance regardless of whether we assume equal or unequal variance, although F-test shows equal var. The Mann-Whitney U-test, while not being as striking as the results for the data that should be significant, is still somewhat... acceptable-ish. That is, 'significant'. But that doesn't correspond well with the data in the graphs, does it?

Would anyone know what the hell is going on here? Could the difference in sample sizes come into play? All involved data has a normal distribution, but under some conditions (not in the data above though), there is evident shift in skew in the data. I was told a U-test should sniff out differences in skew and kurtosis. I ran my data by a stats-ish guy (ok, ecologist...) a while back, and he said there's no doubt my significant data (not shown), is actually significant, but I can't trust my tests if they show 'significance' between wild types (not shown) and treatments that don't make any noticeable difference whatsoever (above).

This is really REALLY frustrating because I have a total of like 10 different lines, each treated and untreated, with massive sample sizes considering the work it takes to get the data (microscopy and measurements and all that), and I'd like to wrap up very soon with a complete graph with significant results pointed out, and finally start writing. This will be my first time writing up a part of a manuscript, so it's really exciting (and scary), but right now I've got damn stats in the way!

And I am aware Excel is not a stats program. We don't have anything else though...

I'd really appreciate any input, thanks! =D (even if it leads to rediscovering that I'm actually a huge idiot...)

UPDATE 07.02.10 2am

Ok, so Aydin recommended PAST, which turns out to be quite a nice stats program =D Thanks!

But it shows the same thing.

Actually, looking at the confidence intervals (and repeating the calculations back in Excel), the 95% CIs don't overlap, nor do the 99% ones. What's even more frustrating, is that the drug that generally causes cells to get bigger (ploidy, etc), in this case "significantly" shows smaller cells. Which is weird. And rubbish.
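
A side note on why those intervals can end up so tight: under the usual normal approximation, the half-width of a confidence interval for a mean is roughly z * sd / sqrt(n), so it shrinks as the sample grows even when the spread of the data stays exactly the same. The sketch below is Python rather than Excel or PAST, the standard deviation of 10 is a made-up number and not the measurements from this post, and `ci_half_width` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    # z is the two-sided critical value, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)


# Made-up spread (sd = 10 units); only the sample size changes between rows.
for n in (10, 30, 280, 352, 1000):
    print(n, round(ci_half_width(10.0, n), 2), round(ci_half_width(10.0, n, 0.99), 2))
```

Quadrupling n halves the interval, so two groups with hundreds of measurements each can have non-overlapping intervals even when their means sit quite close together.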

Ok, fine, this isn't really a proper control. Let's compare our wild type ecotypes -- the ultimate negative control. There's no bloody way Col-0 and Ler (ecotypes) should have different responses in this situation! Right?

Amazingly, it baaaarely scrapes by for 95% confidence! We use 99%, so we can call it non-significant, but still... it shouldn't be anywhere near barely scraping by! I mean, the damn p-values should be like 0.5 or something, no? Again, these are two WILD TYPES! Sketchy...

And, hang on... F-test comes out significant? Owww, headache!

Do I need more data then? It'll take another couple of months to double the sample sizes, especially for these ones, where there's much lower count per view, so I'd have to image waay more specimens. Grrrrr...

OMG, IT DOES HISTOGRAMS? And in a HUMANE way, unlike Excel? Aydin, I owe you for PAST!
So here's the obviously significant case:

The non-significant (histograms can be really misleading when the sample sizes differ, I find...)

And now the really weirdly 'pseudo-significant' case: (again, n = 280, 352)

I can see how there's a bit of a shift, but significant? Really???

Great. So while the phylogeny course has beaten any faith in phylogenies out of me, now goes my faith in statistics. I mean, this is the thing we're supposed to rely on to avoid introducing our own biases and judgements... but if done wrongly, it can really make a mess. And I suspect I'm not doing something right.

Or am I just being too paranoid?


  1. I'd repeat the t-test with another program. There are t-test calculators on the web. Or else, use PAST; it's easy & free:

  2. It is really important to state the sample sizes (as you say they are huge)! Tests like this have a tendency to quickly become incredibly significant if the numbers are high enough..

  3. I did specify sample sizes at all times. I thought that larger sample sizes --> lower chance of false positives, just because any accidental differences would be corrected for more and more; ie larger sample sizes should represent the 'true' curve more closely, and in this case the 'true' curves should be the same!

    PAST also yields the same results! But thanks, it looks like a really nice program! And unlike Excel, an actual STATS program! =D

    I can't understand how the 95% CI could be so small!

  4. Sorry for not seeing the sample sizes! I've been thinking for the past half hour about your problem, and since you mentioned that the variances appear to be different, according to your F-test, you maybe should try out your luck with adapted t-tests for samples with unequal variance or unequal sample sizes (or both!).
    See for example the Wikipedia articles on Student's t-test (section "Unequal sample sizes, unequal variance") or Welch's t test.

    By doing a test corrected for unequal variance, you would also make this guy happy ;):

    I don't know if PAST or Excel allow these tests. I'm a big fan of R, but I found that most biologists don't feel comfortable running scripts and prefer a visual interface.. It's really powerful though, and definitely allows you to perform the tests I described before.

    Statistics can be annoyingly complex.. Don't lose hope just yet though!

  5. Hi Psi,

    A couple of thoughts, but since I've only learnt stats 'on the fly' as I've needed it you should take it with a couple of grains of salt.

    1) I don't know if there is any biological or methodological reason for your treatments/cells to differ, but I do know that if there is a difference then your large sample sizes will almost certainly find it. You might want to think in terms of 'effect size' as well as p-value - in this case the difference is greater than you'd expect by chance but the difference in mean size falls between 0.1 and 1mm - does this matter? How does this compare with the effect size between treated and control lines?

    2) If you are testing different combinations of 10 replicates of an experiment then you will expect to see some 'significant' results just by chance (I think it was John Maynard Smith who said "Statistics was invented by biologists so they could do twenty experiments a year and publish one false result in Nature."). You might consider a correction such as Bonferroni's to deal with the problem of multiple testing.

    3) R is awesome. And it does have a GUI (R Commander) that has t-tests and ANOVA and all that good stuff just a couple of mouse clicks away. You should definitely consider it.

    As I said, just a few thoughts, sorry I can't provide The Answer.

  6. If you have very large samples, even very small differences can become statistically significant. But that doesn't mean that the differences are biologically significant.

  7. Thanks a lot for your input, guys, I really appreciate it!

    I'll ponder over this a bit later, when I'm less swamped with course-related stuff...(three exams this week o_O) I'd take data analysis issues over exams/classes any day of the week...

  8. Others have said it already, but just to put it another way: The thing of interest is effect size (e.g. difference in means), not p value. In your examples the effect size is very small, and that's what you should consider. The p value is perhaps useful (or perhaps not) if you have very small sample sizes - then you may want to check that the difference is unlikely to be due to random sampling effects.

    For large samples you practically always get a significant difference (shrug).

    And by the way, Past is much hotter than R.
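
To put the points from the comments above in concrete terms, here is a minimal sketch, in Python with only the standard library, of reporting an effect size next to the test result. It is not the analysis from the post: the data are simulated with made-up means and standard deviations (only the group sizes 280 and 352 are borrowed from the post), the function names are hypothetical, the p-value uses a normal approximation that is only reasonable for large samples, and the count of 10 comparisons in the Bonferroni line is an assumption.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


random.seed(1)
# Simulated, made-up groups: true means differ by a quarter of a standard deviation.
a = [random.gauss(100.0, 10.0) for _ in range(280)]
b = [random.gauss(102.5, 10.0) for _ in range(352)]

t, df = welch_t(a, b)
print("Welch t:", round(t, 2), "df:", round(df, 1))
print("approx. two-sided p:", approx_two_sided_p(t))
print("Cohen's d:", round(cohens_d(a, b), 2))
# Assumed: 10 comparisons at the 99% level mentioned in the post.
print("Bonferroni threshold:", bonferroni_alpha(0.01, 10))
```

The p-value answers "is there any difference at all?", while Cohen's d answers "how big is it relative to the spread?". With samples this large the first can come out small while the second stays modest, which is the situation the commenters describe.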

