So I'm in the midst of the "Oh fuck, must get actual numbers and graphs for publication" stage of my project. This means I must not only generate piles of data, but also make it talk, and speak the truth. Which means I get to interrogate it with statistics, mwahaha. I actually enjoy this part of the process, since you can magically convert piles of numbers into pretty p-values and sexy graphs showing how earth-shakingly significant your data is... oh, well, statistically significant anyway. That is, if your stats are done correctly; otherwise the whole activity is a futile waste of taxpayer dollars, more so than it usually is.
So I noticed that for situations where I'd expect some sort of significance (they're bloody obviously different, but I never thought/said that because I'm a 'good scientist' and all that...), the p-values were... well, maybe a little too extreme. Like, they were kind of insane -- 10^-45? Oh come on... it would be awesome (fine, boring) if biological data were that clean! But both the t-test and the Mann-Whitney U test showed extreme significance, with the latter being more trustworthy in my case, or so I've been told (I have a prominent shift in the whole distribution rather than just the mean; that is, cells in the drug-treated case get really big, while in the mock they don't).
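(For anyone who wants to poke at this sort of thing outside Excel, here's a rough sketch of both tests in Python with SciPy. The arrays below are made-up placeholder numbers, not my actual measurements.)

```python
# Minimal sketch: running both tests on two samples of cell sizes.
# The data here are placeholders, not the real measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mock = rng.normal(loc=50, scale=10, size=280)      # hypothetical mock-treated cell sizes
treated = rng.normal(loc=65, scale=18, size=352)   # hypothetical drug-treated cell sizes

# t-test (compares means; equal_var=False gives Welch's version)
t_stat, t_p = stats.ttest_ind(treated, mock, equal_var=False)

# Mann-Whitney U test (rank-based; picks up shifts in the whole distribution)
u_stat, u_p = stats.mannwhitneyu(treated, mock, alternative='two-sided')

print(f"Welch's t-test: t = {t_stat:.2f}, p = {t_p:.2e}")
print(f"Mann-Whitney U: U = {u_stat:.0f}, p = {u_p:.2e}")
```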
So I decided to test the data I know shouldn't be significant -- treatments of two wild-type ecotypes, and another case where the drug had no effect. So here's my data that SHOULDN'T be significant:

The first graph shows means with standard-deviation error bars; the second shows quartile box plots of the same data (that is, no obvious shift in distribution either). Then I have t-test results from Excel, which show significance regardless of whether we assume equal or unequal variance, although the F-test says the variances are equal. The Mann-Whitney U test, while not as striking as for the data that should be significant, is still somewhat... acceptable-ish. That is, 'significant'. But that doesn't correspond well with the graphs, does it?
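And here's roughly the variance-check part done the same way: an F-ratio comparison (similar in spirit to what Excel reports) plus Levene's test, which is supposed to be less fussy about normality. Again, placeholder numbers only, not my data.

```python
# Sketch of a two-sample variance check, then the matching t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mock = rng.normal(50, 10, 280)      # placeholder mock-treated sizes
treated = rng.normal(65, 18, 352)   # placeholder drug-treated sizes

def f_test(a, b):
    """Two-sided F-ratio test for equality of variances."""
    a, b = np.asarray(a), np.asarray(b)
    f = np.var(a, ddof=1) / np.var(b, ddof=1)
    dfn, dfd = len(a) - 1, len(b) - 1
    p = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    return f, p

f_stat, f_p = f_test(treated, mock)
lev_stat, lev_p = stats.levene(treated, mock)   # more robust to non-normality

# Choose pooled vs Welch's t-test based on the variance check
equal_var = f_p > 0.05
t_stat, t_p = stats.ttest_ind(treated, mock, equal_var=equal_var)
print(f"F = {f_stat:.2f} (p = {f_p:.3f}), Levene p = {lev_p:.3f}, t-test p = {t_p:.2e}")
```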
Would anyone know what the hell is going on here? Could the difference in sample sizes come into play? All the data involved are normally distributed, but under some conditions (not in the data above, though) there is an evident shift in skew. I was told a U-test should sniff out differences in skew and kurtosis. I ran my data by a stats-ish guy (ok, ecologist...) a while back, and he said there's no doubt my significant data (not shown) is actually significant, but that I can't trust my tests if they show 'significance' between wild types (not shown) and for treatments that don't make any noticeable difference whatsoever (above).
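For the normality/skew/kurtosis side of things, here's the kind of per-sample check I could run, again sketched in Python with SciPy on placeholder data rather than eyeballing it:

```python
# Per-sample sanity check: normality test plus skew and kurtosis.
# Placeholder data -- swap in the real measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=280)

w_stat, w_p = stats.shapiro(sample)   # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk: W = {w_stat:.3f}, p = {w_p:.3f}")
print(f"skew = {stats.skew(sample):.3f}, excess kurtosis = {stats.kurtosis(sample):.3f}")
```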
This is really REALLY frustrating because I have a total of like 10 different lines, each treated and untreated, with massive sample sizes considering the work it takes to get the data (microscopy and measurements and all that), and I'd like to wrap up very soon with a complete graph with the significant results pointed out, and finally start writing. This will be my first time writing up a part of a manuscript, so it's really exciting (and scary), but right now I've got damn stats in the way!
And I am aware Excel is not a stats program. We don't have anything else though...
I'd really appreciate any input, thanks! =D (even if it leads to rediscovering that I'm actually a huge idiot...)
UPDATE 07.02.10 2am
Ok, so Aydin recommended PAST, which turns out to be quite a nice stats program =D Thanks! But it shows the same thing.
Actually, looking at the confidence intervals (and repeating the calculations back in Excel), the 95% CIs don't overlap, nor do the 99% ones. What's even more frustrating is that the drug that generally causes cells to get bigger (ploidy, etc.) in this case "significantly" gives smaller cells. Which is weird. And rubbish.
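For reference, the CI calculation itself is just t-distribution intervals around each sample mean; here's a rough sketch of it in Python, on placeholder data rather than my real cell sizes:

```python
# 95% and 99% confidence intervals of the mean for two samples.
# Placeholder data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(50, 10, 280)   # stand-in for one treatment/ecotype
group_b = rng.normal(52, 10, 352)   # stand-in for the other

for name, sample in (("A", group_a), ("B", group_b)):
    mean = np.mean(sample)
    sem = stats.sem(sample)   # standard error of the mean
    for level in (0.95, 0.99):
        lo, hi = stats.t.interval(level, len(sample) - 1, loc=mean, scale=sem)
        print(f"group {name}, {int(level * 100)}% CI of the mean: ({lo:.2f}, {hi:.2f})")
```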
Ok, fine, this isn't really a proper control. Let's compare our wild-type ecotypes -- the ultimate negative control. There's no bloody way Col-0 and Ler (ecotypes) should have different responses in this situation! Right?

Amazingly, it baaaarely scrapes by for 95% confidence! We use 99%, so we can call it non-significant, but still... it shouldn't be anywhere near barely scraping by! I mean, the damn p-values should be like 0.5 or something, no? Again, these are two WILD TYPES! Sketchy...
And, hang on... the F-test comes out significant? Owww, headache!
Do I need more data then? It'll take another couple of months to double the sample sizes, especially for these ones, where there's a much lower count per view, so I'd have to image waay more specimens. Grrrrr...
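One sanity check I could do before committing months to more imaging: simulate two samples of roughly my sizes from the same distribution a few thousand times and see how often the t-test cries 'significant' (and what a small but real shift does). A rough sketch, with completely made-up means and spreads:

```python
# Sanity-check simulation: false-positive rate at these sample sizes,
# and the effect of a small real shift. All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n1, n2 = 280, 352   # roughly the sample sizes quoted above
reps = 2000

def sig_rate(offset):
    """Fraction of simulated runs where Welch's t-test gives p < 0.05."""
    hits = 0
    for _ in range(reps):
        a = rng.normal(50, 10, n1)
        b = rng.normal(50 + offset, 10, n2)
        if stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05:
            hits += 1
    return hits / reps

print(f"no real difference:   ~{sig_rate(0.0):.1%} of runs 'significant' (should be ~5%)")
print(f"small real shift (+2): ~{sig_rate(2.0):.1%} of runs 'significant'")
```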
OMG, IT DOES HISTOGRAMS? And in a HUMANE way, unlike Excel? Aydin, I owe you for PAST!
So here's the obviously significant case:

The non-significant case (histograms can be really misleading when the sample sizes differ, I find... see the plotting note after the last histogram below):

And now the really weirdly 'pseudo-significant' case (again, n = 280 and 352):

I can see how there's a bit of a shift, but significant? Really???
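About that sample-size caveat from the non-significant case above: one easy workaround is to plot densities rather than raw counts, so the n = 280 and n = 352 histograms are actually comparable. A quick matplotlib sketch, with placeholder data standing in for the real measurements:

```python
# Overlaid density histograms so unequal sample sizes don't distort the comparison.
# Placeholder data only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
a = rng.normal(50, 10, 280)   # stand-in for one ecotype / treatment
b = rng.normal(52, 11, 352)   # stand-in for the other

bins = np.histogram_bin_edges(np.concatenate([a, b]), bins=25)  # shared bin edges
plt.hist(a, bins=bins, density=True, alpha=0.5, label='sample A (n=280)')
plt.hist(b, bins=bins, density=True, alpha=0.5, label='sample B (n=352)')
plt.xlabel('cell size (arbitrary units)')
plt.ylabel('density')
plt.legend()
plt.show()
```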
Great. So while the phylogeny course has beaten any faith in phylogenies out of me, now there goes my faith in statistics. I mean, this is the thing we're supposed to rely on to avoid introducing our own biases and judgements... but if done wrong, it can really make a mess. And I suspect I'm not doing something right.
Or am I just being too paranoid?