So I noticed that for situations where I'd expect some sort of significance (they're bloody obviously different, but I never thought/said that because I'm a 'good scientist' and all that...), the p-values were... well, maybe a little bit too low. Like, they were kind of insane -- 10^-45? Oh come on...

So I decided to test the data I know shouldn't be significant -- treatments of two wild-type ecotypes, and another case where the drug had no effect. So here's my data that SHOULDN'T be significant:

The first graph shows means with standard-deviation error bars; the second shows quartile box plots of the same data (that is, no obvious shift in distribution either). Then I have t-test results from Excel, which show significance regardless of whether we assume equal or unequal variance, although the F-test suggests equal variances. The Mann-Whitney U-test, while not as striking as the results for the data that should be significant, is still somewhat... acceptable-ish. That is, 'significant'. But that doesn't correspond well with the data in the graphs, does it?
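For anyone who wants to cross-check Excel's F-test arithmetic by hand, here's a minimal Python sketch (standard library only). The sample values are invented for illustration, not the post's actual measurements:

```python
# Quick cross-check of Excel's F-test-for-equal-variances arithmetic,
# using only the Python standard library. Data below are made up.
from statistics import variance

a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]   # hypothetical group 1
b = [10.0, 10.4, 9.7, 10.1, 10.2, 9.8]   # hypothetical group 2

# Excel's convention puts the larger variance on top, so F >= 1.
va, vb = variance(a), variance(b)
F = max(va, vb) / min(va, vb)
dfn = (len(a) if va >= vb else len(b)) - 1   # numerator df
dfd = (len(b) if va >= vb else len(a)) - 1   # denominator df
print(F, dfn, dfd)  # compare F against the critical value for (dfn, dfd)
```

This only computes the F ratio and its degrees of freedom; Excel's FTEST additionally converts it to a two-tailed p-value.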

Would anyone know what the hell is going on here? Could the difference in sample sizes come into play? All the data involved are normally distributed, but under some conditions (not in the data above, though) there is an evident shift in skew. I was told a U-test should sniff out differences in skew and kurtosis. I ran my data by a stats-ish guy (ok, an ecologist...) a while back, and he said there's no doubt my significant data (not shown) is actually significant, but that I can't trust my tests if they show 'significance' between wild types (not shown) and treatments that don't make any noticeable difference whatsoever (above).
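In case it helps to see what the U-test is actually doing under the hood, here's a bare-bones Mann-Whitney U with the large-sample normal approximation in plain Python (no tie handling; the data are placeholders, not real measurements):

```python
# Minimal Mann-Whitney U: rank everything together, sum group-a ranks,
# then use the normal approximation for a two-sided p. No tie handling.
from math import erf, sqrt

def mann_whitney_u(a, b):
    na, nb = len(a), len(b)
    combined = sorted(a + b)
    rank = {v: i + 1 for i, v in enumerate(combined)}  # assumes no tied values
    ra = sum(rank[v] for v in a)                       # rank sum of group a
    u = ra - na * (na + 1) / 2
    mu = na * nb / 2                                   # mean of U under H0
    sigma = sqrt(na * nb * (na + nb + 1) / 12)         # sd of U under H0
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-sided, normal approx
    return u, p

u, p = mann_whitney_u([1, 3, 5, 7], [2, 4, 6, 8])
print(u, round(p, 3))
```

Real implementations (PAST, R's wilcox.test) handle ties and small samples properly; this is only the skeleton of the calculation.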

This is really REALLY frustrating because I have a total of like 10 different lines, each treated and untreated, with massive sample sizes considering the work it takes to get the data (microscopy and measurements and all that), and I'd like to wrap up very soon with a complete graph with significant results pointed out, and finally start writing. This will be my first time writing up a part of a manuscript, so it's really exciting (and scary), but right now I've got damn stats in the way!

And I am aware Excel is not a stats program. We don't have anything else though...

I'd really appreciate any input, thanks! =D (even if it leads to rediscovering that I'm actually a huge idiot...)

UPDATE 07.02.10 2am

Ok, so Aydin recommended PAST, which turns out to be quite a nice stats program =D Thanks!

But it shows the same thing.

Actually, looking at the confidence intervals (and repeating the calculations back in Excel), the 95% CIs don't overlap, nor do the 99% ones. What's even more frustrating is that the drug that generally causes cells to get bigger (ploidy, etc.) in this case "significantly" shows smaller cells. Which is weird. And rubbish.
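For what it's worth, the non-overlap check itself is simple arithmetic. Here's a sketch with invented means, SDs and ns (not the post's real numbers), using the large-sample normal approximation:

```python
# Do two large-sample confidence intervals overlap? All numbers below are
# invented placeholders, NOT the post's actual measurements.
from math import sqrt

def ci(mean, sd, n, z=1.96):         # z = 1.96 for 95%, 2.576 for 99%
    half = z * sd / sqrt(n)          # normal approximation; fine for large n
    return mean - half, mean + half

lo1, hi1 = ci(52.0, 8.0, 280)
lo2, hi2 = ci(50.0, 8.0, 352)
print(hi2 < lo1 or hi1 < lo2)        # True here: the two CIs don't overlap
```

Notice that with n around 300 the CI half-width shrinks to under one unit even with an SD of 8, so even modest mean differences produce non-overlapping intervals.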

Ok, fine, this isn't really a proper control. Let's compare our wild type ecotypes -- the ultimate negative control. There's no bloody way Col-0 and Ler (ecotypes) should have different responses in this situation! Right?

Amazingly, it baaaarely scrapes by for 95% confidence! We use 99%, so we can call it non-significant, but still... it shouldn't be anywhere near barely scraping by! I mean, the damn p-values should be like 0.5 or something, no? Again, these are two WILD TYPES! Sketchy...

And, hang on... F-test comes out significant? Owww, headache!

Do I need more data then? It'll take another couple of months to double the sample sizes, especially for these ones, where there's much lower count per view, so I'd have to image waay more specimens. Grrrrr...

OMG, IT DOES HISTOGRAMS? And in a HUMANE way, unlike Excel? Aydin, I owe you for PAST!

So here's the obviously significant case:

The non-significant case (histograms can be really misleading when the sample sizes differ, I find...):

And now the really weirdly 'pseudo-significant' case: (again, n = 280, 352)

I can see how there's a bit of a shift, but significant? Really???

Great. So while the phylogeny course has beaten any faith in phylogenies out of me, now goes my faith in statistics. I mean, this is the thing we're supposed to rely on to avoid introducing our own biases and judgements... but done wrongly, it can really make a mess. And I suspect I'm not doing something right.

Or am I just being too paranoid?

I'd repeat the t-test with another program. There are t-test calculators on the web. Or else, use PAST; it's easy & free: http://folk.uio.no/ohammer/past/

It is really important to state the sample sizes (as you say, they are huge)! Tests like this have a tendency to quickly become incredibly significant if the numbers are high enough...

I did specify sample sizes at all times. I thought that larger sample sizes --> lower chance of false positives, just because any accidental differences would be corrected for more and more; i.e. larger sample sizes should represent the 'true' curve more closely, and in this case the 'true' curves should be the same!

PAST also yields the same results! But thanks, it looks like a really nice program! And unlike Excel, an actual STATS program! =D

I can't understand how the 95% CI could be so small!

Sorry for not seeing the sample sizes! I've been thinking for the past half hour about your problem, and since you mentioned that the variances appear to be different according to your t-test, you should maybe try your luck with adapted t-tests for samples with unequal variance or unequal sample sizes (or both!).

See for example: http://en.wikipedia.org/wiki/Student's_t-test#Unequal_sample_sizes.2C_unequal_variance or http://en.wikipedia.org/wiki/Welch's_t_test .

By doing a test corrected for unequal variance, you would also make this guy happy ;): http://beheco.oxfordjournals.org/cgi/content/full/17/4/688

I don't know if PAST or Excel allows these tests. I'm a big fan of R, but I find that most biologists don't feel comfortable running scripts and prefer a visual interface... It's really powerful, though, and definitely allows you to perform the tests I described above (http://sekhon.berkeley.edu/stats/html/t.test.html).
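If it helps, the Welch statistic itself is easy to compute by hand. Here's a minimal Python sketch (stdlib only) of the t statistic plus the Welch-Satterthwaite degrees of freedom; the example data are invented:

```python
# Bare-bones Welch's t-test: the statistic and the Welch-Satterthwaite
# degrees of freedom, for cross-checking Excel's "unequal variance" t-test.
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    ma, mb = mean(a), mean(b)
    va, vb = variance(a), variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb              # squared standard error of the difference
    t = (ma - mb) / sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2**2 / ((va / na)**2 / (na - 1) + (vb / nb)**2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])  # toy data
print(round(t, 3), round(df, 2))
```

Turning t and df into a p-value needs the t distribution's CDF, which is exactly where a real stats package (R's t.test, or PAST) comes in.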

Statistics can be annoyingly complex... Don't lose hope just yet though!

Hi Psi,

A couple of thoughts, but since I've only learnt stats 'on the fly' as I've needed it, you should take them with a couple of grains of salt.

1. I don't know if there is any biological or methodological reason for your treatments/cells to differ, but I do know that if there is a difference, your large sample sizes will almost certainly find it. You might want to think in terms of 'effect size' as well as p-value: if the difference is greater than you'd expect by chance, but the difference in mean size falls between 0.1 and 1 mm, does this matter? How does this compare with the effect size between treated and control lines?

2. If you are testing different combinations of 10 replicates of an experiment, then you will expect to see some 'significant' results just by chance (I think it was John Maynard Smith who said "Statistics was invented by biologists so they could do twenty experiments a year and publish one false result in Nature."). You might consider a correction such as Bonferroni's to deal with the problem of multiple testing.

3. R is awesome. And it does have a GUI (R Commander) that puts t-tests and ANOVA and all that good stuff just a couple of mouse clicks away. You should definitely consider it.

As I said, just a few thoughts, sorry I can't provide The Answer.
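To make the Bonferroni point in point 2 concrete, here's a tiny Python sketch (the p-values are made up for illustration):

```python
# Bonferroni in a few lines: with m comparisons, test each at alpha/m
# (equivalently, multiply each p by m and cap at 1). These p-values are
# invented, not from the post.
m = 10                                   # e.g. 10 line-vs-line comparisons
pvals = [0.003, 0.02, 0.049, 0.2, 0.7]   # hypothetical raw p-values
adjusted = [round(min(p * m, 1.0), 3) for p in pvals]
significant = [p < 0.05 / m for p in pvals]
print(adjusted)      # [0.03, 0.2, 0.49, 1.0, 1.0]
print(significant)   # only p = 0.003 survives at the 0.05 family-wise level
```

Note how a "significant" raw p of 0.049 no longer clears the corrected threshold of 0.005 once ten comparisons are being made.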

If you have very large samples, even very small differences can become statistically significant. But that doesn't mean that the differences are biologically significant.
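That's easy to see numerically: hold a small true difference fixed and grow n, and the p-value collapses. A stdlib-only Python sketch using a two-sample z-test (a reasonable stand-in for the t-test at these sample sizes; the numbers are purely illustrative):

```python
# A fixed, small difference in means becomes "significant" as n grows.
# Two-sample z-test with equal n and sd; normal CDF built from erf.
from math import erf, sqrt

def two_sample_p(diff, sd, n):
    z = diff / (sd * sqrt(2.0 / n))                        # z statistic
    phi = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))            # normal CDF at |z|
    return 2.0 * (1.0 - phi)                               # two-sided p

for n in (20, 200, 2000, 20000):
    print(n, round(two_sample_p(diff=0.1, sd=1.0, n=n), 4))
```

With a 0.1-SD difference, p drifts from clearly non-significant at n = 20 down past any reasonable threshold by n = 20000, even though the effect itself never changed.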

Thanks a lot for your input, guys, I really appreciate it!

I'll ponder over this a bit later, when I'm less swamped with course-related stuff... (three exams this week o_O) I'd take data analysis issues over exams/classes any day of the week...

Others have said it already, but just to put it another way: the thing of interest is the effect size (e.g. the difference in means), not the p-value. In your examples the effect size is very small, and that's what you should consider. The p-value is perhaps useful (or perhaps not) if you have very small sample sizes -- then you may want to check that the difference is unlikely to be due to random sampling effects. For large samples you practically always get a significant difference (shrug).

And by the way, Past is much hotter than R.