|Don't be like Dick. Too many people are Dicks|
Unlike a presidential poll, however, there isn't just a handful of possible results (e.g. Obama, Romney, undecided). If you're looking at something like the blood pressure readings for people taking a certain medication for hypertension, you could get any conceivable result, although the probability can reasonably be considered 0 outside of a certain range. We're dealing with normal distributions, specifically two different normal distributions, one for exposure and one for no exposure, and then comparing them. That, in a nutshell, is the crux of data analysis. We want to know what the middle of the curve (i.e. the mean) for people or plots of land or crops exposed to a certain treatment tells us, compared to what the middle of the curve for those not exposed tells us.
|The normal distribution. Taken from Wikipedia|
Before we discuss this, though, let's take a closer look at the normal distribution. The blue shading shows what I mean by the probability of getting really any conceivable number. The percentages are the probabilities that any single data point falls in that range. The key takeaway is that almost anything is technically possible, so when you take only a single set of readings, you always know in the back of your mind that the results could be a total fluke. When scientists report the results of their studies, they acknowledge this by listing the confidence interval (CI) around their estimate, the same way presidential polls openly declare a margin of error (MOE). There has to be some sort of cutoff for readers to understand the impact of your results. A 95% CI or MOE gives a range of values built so that, if you were to theoretically repeat the whole study an infinite number of times, the range would capture the true mean in 95% of those repetitions. The more individual data points you take, the tighter this range gets, because you have more certainty that your curve represents reality. But there's always some uncertainty, and it gets compounded a bit when you start testing one curve against another.
|Left: High variability, thus a large CI. Right: Low variability, thus a smaller CI|
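To make the CI idea concrete, here's a small Python sketch. The blood-pressure numbers are made up for illustration, and I'm using the normal-based critical value of 1.96; a real analysis of a small sample would use a t critical value instead, but the difference is minor at these sample sizes. Notice how the interval tightens when we collect ten times as many readings:

```python
import math
import random

def confidence_interval(data, z=1.96):
    """Approximate 95% CI for the mean: mean +/- z * standard error.
    (z=1.96 is the normal critical value; small samples would use t.)"""
    n = len(data)
    mean = sum(data) / n
    # sample variance (n - 1 in the denominator)
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    se = math.sqrt(var / n)  # standard error of the mean
    return mean - z * se, mean + z * se

# Hypothetical blood-pressure readings (mmHg) for a treated group
random.seed(0)
readings = [random.gauss(128, 12) for _ in range(50)]
lo, hi = confidence_interval(readings)

# Ten times more data from the same population -> a tighter interval
more = [random.gauss(128, 12) for _ in range(500)]
lo2, hi2 = confidence_interval(more)
```

The second interval is narrower than the first purely because of the larger sample: the standard error shrinks with the square root of n.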
Science always starts with a hypothesis (not synonymous with theory! Don't even!) that needs to be tested. Sometimes your peaks are pretty close together, and sometimes they're further apart. Those of us working with statistics put the burden of proof on showing something "statistically significant": you only claim a meaningful difference in results if you've found evidence in the numbers for it. Our starting hypothesis, the null hypothesis, is always that we will find no evidence of a meaningful effect, with the assumption that the two curves (samples) come from the same population. It's similar to how you assume that the Gallup and Rasmussen polls, although they may be telling you slightly different things, represent the same electorate.
One of the big questions we are asking, of course, is, "if these two curves really did come from the same population, what is the probability of seeing a difference at least this large just by chance?" That probability is reported in studies as the "p-value": the smaller it is, the harder the difference is to explain away as a fluke. Obtaining a p-value first requires you to "fit" your data to a standard normal distribution, which for our purposes is generally more than acceptable. You can't compare two curves unless you measure them against the same standard. I'll spare you the details of how it's done; just be aware that the comparison is made fairly.
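If you're curious what that standardization looks like in practice, here's a rough sketch of one common approach, a Welch-style t statistic with a normal approximation for the p-value. The data are made up, and real analyses would use statistical software (the t distribution rather than the normal, among other refinements), but for samples of this size the two agree closely:

```python
import math
import random

def welch_t_and_p(a, b):
    """Welch's t statistic for two samples, with a two-sided p-value
    from the standard normal CDF (a reasonable stand-in for the t
    distribution when samples are largish)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    # difference in means, scaled by its standard error
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    # two-sided p-value: probability of a |t| at least this extreme
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p

# Made-up blood-pressure readings: treated group vs. untreated group
random.seed(3)
treated = [random.gauss(120, 10) for _ in range(100)]
control = [random.gauss(130, 10) for _ in range(100)]
t, p = welch_t_and_p(treated, control)
```

A 10 mmHg gap between groups of 100 yields a tiny p-value; two groups drawn from the same population would usually yield a large one.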
You can maybe get a feel for the challenges of this question from the image below. Remember, we're taking one set of readings for each treatment, and if we were to take another set just for good measure, the peak could be quite different just by chance; it could plausibly land anywhere in your confidence interval. Sometimes you see this from week to week in the presidential polls, when the 1,500 or so people who respond to the poll in one week seem to show a major swing in opinion from the week prior. News outlets tend to run with the horse race narrative, looking for the one gaffe or moment that caused this crazy swing. The Nate Silvers of the world look at the swing and say, "Simmer down, people. It's almost certainly due to randomness."
|For our studies, this is what's being analyzed! Taken from Missouri State|
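Those week-to-week swings are easy to reproduce in a quick simulation. The 50% support figure and the poll size are hypothetical, but the point holds for any values: even when the electorate never changes its mind at all, polls of 1,500 people drift a couple of points from sample to sample.

```python
import random

random.seed(42)

def run_poll(true_support=0.50, n=1500):
    """Simulate polling n random voters from a population where
    exactly true_support of them favor the candidate."""
    return sum(random.random() < true_support for _ in range(n)) / n

# Ten weekly "polls" of the exact same, unchanging electorate
weekly = [run_poll() for _ in range(10)]
swing = max(weekly) - min(weekly)
```

The spread between the best and worst week here comes entirely from sampling randomness, which is exactly what the margin of error is warning you about.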
Because of this effect, when your curves are close together, it's going to be pretty difficult to tell the two apart. You are going to have a very high probability of retaining your null hypothesis that there is no evidence of an effect, because you (hopefully) have a good amount of data, and the two groups look pretty similar. The yellow part takes the randomness of your sample into account and highlights the possibility that maybe we just caught a fluke: maybe there is a real difference out there, but by total chance we just didn't happen to catch it (a false negative).
The red shading, conversely, shows how we allowed a 5% chance of "finding" evidence of an effect purely by random chance (a false positive). In practice, it's fairly rare that you get curves so far apart that you're practically certain you found evidence of an effect. If you are reading about a study, it's because the researchers found this evidence and the media thought it would generate interest. By convention, the researchers allowed themselves the same 5% cutoff for being in error, and oftentimes the reported p-value is pretty close to that 5% cutoff.
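That 5% allowance can be checked by simulation. Here's a sketch (numbers made up throughout, using a Welch-style t statistic with a normal approximation for the p-value) that repeatedly compares two samples drawn from the same population. Since there is never a real effect, roughly 5% of the comparisons should still come out "significant":

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided p-value for a difference in means: Welch's t statistic
    with a normal approximation (close enough for samples of ~50+)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

random.seed(1)
trials = 2000
false_positives = 0
for _ in range(trials):
    # Both "treatment" and "control" drawn from the SAME population,
    # so any significant result is a false positive by construction
    a = [random.gauss(120, 10) for _ in range(50)]
    b = [random.gauss(120, 10) for _ in range(50)]
    if two_sample_p(a, b) < 0.05:
        false_positives += 1
rate = false_positives / trials
```

The observed false-positive rate hovers near 0.05, which is the red shading in the figure made literal: run enough studies of nothing, and one in twenty will look like something.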
And now you know, in excruciating detail, why any single study is just one piece of evidence to throw onto the scale and weigh, even in the best possible circumstances. But statistical uncertainty is just the very, very tip of the iceberg.