Friday, December 14, 2012

The Language

The language of science, of course, is math. In physics and chemistry, you need to learn algebra and calculus to have more than a passing knowledge of the relevant topics. For our purposes, where we are exploring the effects of environmental risk factors or treatments on health and the environment, the language is statistics. If you are not familiar with the basics, quite simply, you will easily be led astray by hype. There's no possible way to put all of the basics in a single blog post, much less a readable and interesting one, but I do think you can try to highlight what separates someone like Nate Silver from Dick Morris. So I'm totally gonna. You don't have to understand the results section of a study to correctly gauge whether what you're being told is important, but you do need to understand the concepts behind it. The two most important concepts deal with the nature of randomness: confidence intervals and errors in hypothesis testing.

Don't be like Dick. Too many people are Dicks
Virtually every study we will come across involves a rather simple, but important concept: sampling. When you think about it, it's really quite amazing how many of the recent polls in the presidential election, using only around 1500 random people, were able to be so accurate in determining the results of a voting population that ended up being over 125 million people. The assumption that a random sample accurately reflects the larger population like this is the basis for studies that link cancer to various agents, or the prevalence of a certain contaminant in Lake Michigan.
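To get a feel for why 1500 people can stand in for an entire electorate, here is a minimal sketch that fakes a thousand polls. Everything in it is an assumption made up for illustration: the "true" support of 51%, the seed, and the `simulate_poll` helper. Real pollsters also weight their samples, which this ignores.

```python
import random

def simulate_poll(true_share, sample_size, rng):
    """Ask `sample_size` random voters; return the share backing the candidate."""
    hits = sum(1 for _ in range(sample_size) if rng.random() < true_share)
    return hits / sample_size

rng = random.Random(2012)   # fixed seed so the run is repeatable
TRUE_SHARE = 0.51           # assumed "real" support in the electorate
SAMPLE_SIZE = 1500          # about the size of a typical national poll

# Run the same poll 1000 times and see how often it lands near the truth.
polls = [simulate_poll(TRUE_SHARE, SAMPLE_SIZE, rng) for _ in range(1000)]
within_3_points = sum(abs(p - TRUE_SHARE) <= 0.03 for p in polls) / len(polls)
print(f"Polls landing within 3 points of the truth: {within_3_points:.0%}")
```

Notice that the size of the electorate never appears in the code. What matters is that the sample is random and how big it is, not how big the population is.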

Unlike a presidential poll, however, there isn't just a handful of possible results (e.g. Obama, Romney, undecided). If you're looking at something like the blood pressure readings for people taking a certain medication for hypertension, you could get practically any conceivable result, although the probability can reasonably be considered 0 outside of a certain range. We're dealing with normal distributions, specifically two different normal distributions, one for exposure and one for no exposure, and then comparing them. This is the crux of data analysis in a nutshell. We want to know what the middle of the curve (i.e. the mean) for people or plots of land or crops exposed to a certain treatment tells us, compared to what the middle of the curve for those not exposed tells us.

The normal distribution. Taken from Wikipedia
Before we discuss this, though, let's take a closer look at the normal distribution. The blue shading shows what I mean by the probability of getting really any conceivable number. The percentages are the probabilities that you'll find any single data point in that range. The key takeaway is that almost anything is technically possible. So when you take only a single test, you always know in the back of your mind that these results could be a total fluke. When scientists report the results of their studies, they acknowledge this by listing the confidence intervals (CI) of their curve. It's the same thing in presidential polls, when they openly declare the margin of error (MOE) involved. There has to be some sort of cutoff for readers to understand the impact of your results. A CI or an MOE gives the range in which you are 95% confident the true mean lies, meaning the mean you would get if you could theoretically take an infinite number of readings. The more individual data points you take, the more certainty you have that your curve represents reality, so the tighter this range gets. There's always some uncertainty, though, and it gets compounded a bit when you start testing your curve against other curves.

Left: High variability, thus a large CI. Right: Low variability, thus a smaller CI
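Both effects, the spread of the data and the number of readings, can be seen in a quick sketch. This assumes made-up blood pressure readings centered on 130 and uses the normal-curve shortcut for the interval (reasonable for larger samples, rougher for small ones). The helper names `confidence_interval` and `fake_readings` are mine, not from any study.

```python
import random
from statistics import NormalDist, mean, stdev

def confidence_interval(readings, confidence=0.95):
    """Approximate CI for the mean, using the normal curve."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)      # about 1.96 for 95%
    margin = z * stdev(readings) / len(readings) ** 0.5
    center = mean(readings)
    return center - margin, center + margin

rng = random.Random(14)

def fake_readings(n, spread):
    # Hypothetical systolic blood pressure readings centered on 130
    return [rng.gauss(130, spread) for _ in range(n)]

# High variability, low variability, then high variability with more data.
for n, spread in [(25, 20), (25, 5), (400, 20)]:
    low, high = confidence_interval(fake_readings(n, spread))
    print(f"n={n:3d}, spread={spread:2d}: 95% CI = ({low:.1f}, {high:.1f}), "
          f"width {high - low:.1f}")
```

The interval shrinks when the readings vary less, and it also shrinks when you simply collect more of them.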

Science always starts with a hypothesis (not synonymous with theory! Don't even!) that needs to be tested. Sometimes your peaks are pretty close together, and sometimes they're further apart. Those of us working with statistics put the burden of proof on showing something "statistically significant", which is to say the burden is on demonstrating a meaningful difference in results and finding evidence in the numbers for it. Our starting hypothesis (the "null hypothesis") is always that we will not find evidence of a meaningful effect, with the assumption that the two curves (samples) come from the same population. It's similar to how you assume that the Gallup and Rasmussen polls, although they may be telling you slightly different things, represent the same electorate.

One of the big questions we are asking, of course, is, "if these two curves really are from the same population, what is the probability that we'd see a difference at least this big just by chance?" This is represented in studies as the "p-value". A small p-value means the difference would be a rare fluke if nothing were really going on. Obtaining a p-value first requires you to "fit" your data into a standard normal distribution, which for our purposes is generally more than acceptable. You can't compare two curves unless you measure both against the same standard. I'll spare you the details of how you do it; just be aware that the comparison is made fairly.
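For the curious, here is roughly what those spared details can look like. This is a sketch of one common approach (a large-sample z-test), not necessarily what any particular study used. The function name `two_sample_p_value` and the blood pressure numbers are invented for illustration.

```python
import random
from statistics import NormalDist, mean, variance

def two_sample_p_value(group_a, group_b):
    """Two-sided p-value for a difference in means (large-sample z-test)."""
    diff = mean(group_a) - mean(group_b)
    std_err = (variance(group_a) / len(group_a)
               + variance(group_b) / len(group_b)) ** 0.5
    z = diff / std_err   # the difference, put on the standard normal scale
    # Chance of a gap at least this big if both groups share one population
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(7)
# Made-up readings: a treated group and an untreated group, 60 people each
treated = [rng.gauss(128, 12) for _ in range(60)]
untreated = [rng.gauss(135, 12) for _ in range(60)]
print(f"p-value: {two_sample_p_value(treated, untreated):.4f}")
```

Dividing by the standard error is the "same standard" step: it converts the raw gap between the two means into units the standard normal curve understands.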

You can maybe get a feel for the challenges of this question from the image below. Remember, we're taking one set of readings for each treatment, and if we were to take another set just for good measure, the peak could be quite different just by chance. You're 95% confident the true peak is somewhere in your confidence interval, but it could be anywhere in there. Sometimes you see this from week to week in the presidential polls, when the 1500 or so people who respond to the poll in one week seem to show a major swing in opinion from the week prior. News outlets tend to run with the horse race narrative, looking for the one gaffe or moment that caused this crazy swing. The Nate Silvers of the world look at the swing and say, "Simmer down, people. It's almost certainly due to randomness."
For our studies, this is what's being analyzed! Taken from Missouri State

Because of this effect, when your curves are close together, it's going to be pretty difficult to tell the two apart. You are going to have a very high probability of sticking with your hypothesis that there is no evidence of an effect, because you (hopefully) have a good amount of data, and the two groups look pretty similar. The yellow part takes the randomness of your sample into account and highlights the possibility that maybe we just caught a fluke. Maybe there is a real difference out there but we just didn't happen to catch it, by total chance.

The red shading, conversely, shows how we allowed a 5% chance that we did actually find evidence of an effect, but only by random chance as well. In practice, it's fairly rare that you get curves that are so far apart you're practically certain that you found evidence of an effect. If you are reading about a study, it's because they found this evidence, and the media thought it was going to generate interest. By convention, the researchers allowed themselves the same 5% cutoff point of being in error, and oftentimes the reported p-value is pretty close to that 5% cutoff.
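Both kinds of error, the yellow and the red, can be simulated. The sketch below runs thousands of pretend studies under assumptions I made up (50 subjects per group, a spread of 10, a "small real effect" of 2 units) and counts how often each one crosses the 5% cutoff. The helpers `p_value` and `share_flagged` are hypothetical names, and the test is the same rough z-test approach, not a claim about how real studies are analyzed.

```python
import random
from statistics import NormalDist, mean, variance

def p_value(a, b):
    """Two-sided p-value for a difference in means (large-sample z-test)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def share_flagged(true_gap, rng, trials=2000, n=50, spread=10, cutoff=0.05):
    """Fraction of simulated studies that report p < cutoff."""
    flagged = 0
    for _ in range(trials):
        control = [rng.gauss(0, spread) for _ in range(n)]
        treated = [rng.gauss(true_gap, spread) for _ in range(n)]
        if p_value(control, treated) < cutoff:
            flagged += 1
    return flagged / trials

rng = random.Random(5)
false_alarms = share_flagged(true_gap=0, rng=rng)   # no real effect at all
detections = share_flagged(true_gap=2, rng=rng)     # small but real effect

print(f"No real effect, yet 'significant': {false_alarms:.1%}")
print(f"Small real effect, caught:         {detections:.1%}")
print(f"Small real effect, missed:         {1 - detections:.1%}")
```

With no effect at all, the share of false alarms should hover near the 5% we agreed to tolerate. With curves this close together, most of the simulated studies miss an effect that is really there.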

And now you know, in excruciating detail, why any single study is just one piece of evidence to throw onto the scale and weigh, even in the best possible circumstances. But statistical uncertainty is just the very, very tip of the iceberg.
