Welcome all seeking refuge from low carb dogma!

“To kill an error is as good a service as, and sometimes even better than, the establishing of a new truth or fact”
~ Charles Darwin (it's evolutionary baybeee!)

Wednesday, February 22, 2017

Testing Scientific Hypotheses v. Statistical Hypothesis Testing

Six plus years on.  Today's airing of Part III of Gary Taubes on LLVLC  led me to bump this post.

Original Post Date:  December 10, 2010.

"One of my goals here is to get the research community to understand that there is an alternative hypothesis that should actually be the null hypothesis — the hypothesis that requires remarkable evidence to reject. "                                             
~ Gary Taubes

"Many scientists consider Popper’s idea of falsification to be the only major improvement to the scientific method since Francis Bacon came up with the idea.
Although more complex mathematically than it appears on the surface what Popper’s falsification theory does is describes a way in which hypotheses can be stated with accuracy. Remember, an hypothesis is merely a guess or an assertion that requires testing before it can be said to be viable. Hypotheses can be stated in ways that, although they seem reasonable, can never really be tested. Popper wrote that the only way an hypothesis can be considered a valid hypothesis is if it can be falsified.
What does this mean exactly?

... It’s the same if my hypothesis were that all cats are black. All one would have to do to disprove that hypothesis is to find one white cat.  
It seem simplistic but it is important. It has changed the way that scientist state their hypotheses so that they can be falsified. That doesn’t mean that the hypotheses will be falsified, but it’s important to state them so that they can be falsified if such data comes forth.
Now let’s back up to our idea about the metabolic advantage.
Some people claim it exists while others claim that it doesn’t. What’s the truth? We know both groups can’t be correct, so one has to be wrong. The metabolic advantage either exists or it doesn’t. Let’s establish our hypothesis so that it fits with Popper’s concept of falsifiability.
If we hypothesize that there is a metabolic advantage we may have some trouble. Why? Because if we search and search and never find any evidence of a metabolic advantage, all we can say is that we haven’t found it yet. If, on the other hand, we state our hypothesis as follows: There is no metabolic advantage, then all we have to do is find one instance where there is one to disprove that hypothesis. Since that hypothesis can be falsified it is a valid hypothesis. And if we can falsify it, then it’s obverse, i.e., there is a metabolic advantage, is true. Sir Karl would approve."
~Dr. Michael Eades

I have a fair amount of experience in applying the scientific method and statistical hypothesis testing.  It seems both of these men are conflating the roles of, or perhaps better worded, the types of hypotheses involved in these two distinct applications.

A Scientific Hypothesis is formulated/worded to express what you believe to be true, based on prior observation or knowledge.  One then designs an experiment or experiments to see if the results are consistent with the hypothesis.  So, based on observations that people seem to lose more weight eating more calories on an LC diet compared with LF, the hypothesis put forth in a scientific study might go something like this:  "I propose that there is a metabolic advantage to low carbohydrate diets compared to low fat diets that allows dieters to lose more weight while consuming more calories on LC vs. LF."  I would then design a study to test this (randomized, controlled and all that jazz).  If the LC group consumed more calories but lost more weight (fat mass, to a level of statistical significance), or consumed more calories (similar types of foods, whole vs. refined, etc.) yet lost the same amount of weight (fat mass), these results would support my hypothesis.  My conclusions might go something like:  "The results of my study are consistent with the assertion that there is a metabolic advantage to low carbohydrate diets when compared with low fat diets."   In more lay-friendly summary form, something like:  "this experiment demonstrated a metabolic advantage to LC compared with LF diets".

Note: I have not "proven" a Metabolic Advantage if I get these results, but I have added to the evidence in support of my Scientific Hypothesis.

The scientific literature is full of summary and review articles and meta-analyses, and indeed even the introductions of some primary research include laundry lists of citations of previous work.  I love coming across these because it makes my own research easier!   Eades is correct that even if 100 studies are consistent with the Metabolic Advantage Scientific Hypothesis, this will never *prove* the Scientific Hypothesis, because no pile of "white swans" can rule out a black one.  But at some point, enough evidence mounts to where consensus or general acceptance is reached.  A single black swan (rather, a black duck) amongst a huge flock of white swans is interesting, but doesn't go far to discount the mountain of evidence that supports the "white swan" Scientific Hypothesis.   (Yes, it is good to explain that black bird in the context of the hypothesis; good scientists don't ignore them.  No, some LC dieter on the internet claiming to eat 5000 calories and lose weight when they were stalled at 400 lbs eating 800 high carb calories doesn't count.)

Often, various studies lead to conflicting results.  We see this a lot: some studies show a positive correlation between two variables, some show none, and some even show a negative correlation.  These situations point to something amiss in the hypothesis, a flawed study design (such as failure to control for some related variables), or a hypothesis that needs to be more context specific.   By Eades' Popper-logic, all I have to do is restate my Scientific Hypothesis in a falsifiable form:  "There is no metabolic advantage ....".   Then if the results of 100 studies are consistent with that hypothesis, we haven't proven it to be true, but if one study is inconsistent with it, that would "prove" the opposite (alternative)?   Sorry, Dr. Eades, that is NOT how science works!  In science we look at all the evidence in support of, or countering, a Scientific Hypothesis and weigh the rigor of the studies producing the evidence and the quantity of evidence on each side.  A thorough scientist will always keep that "black swan" in mind and consider its implications for the assumptions made in research, but the more the "white swan" keeps popping up, the more that "black swan" can be dismissed as a fluke, outlier, whatever.

Let's look at Popper's hypotheses a bit more.  Not only does he advocate a falsifiable hypothesis, but also that there be only two competing hypotheses.  In other words, we are dealing with complements in probability.  Either something is "A" or "not A", there's no other option.  So either there is a Metabolic Advantage or there's not.   But Taubes' "alternate hypothesis" is not the only alternative to explain obesity.  As it is, the hypothesis that carbohydrate excess causes obesity is not even independent from the assertion that calorie excess causes obesity, because carbs contain calories.  We cannot separate these variables completely.   But even if carbs and cals were independent variables, there are still any number of other Scientific Hypotheses one could formulate on the cause of obesity besides "it's the calories" v. "it's the carbs".   In this light, one can see how misguided Taubes' quoted comment above is.  It doesn't even fit the structure of statistical hypotheses.

So ... what is this "null hypothesis" of which Taubes speaks?   The one that requires remarkable evidence to reject?  This is a type of hypothesis in a statistical method known as Hypothesis Testing.

Before I go into the nitty gritty of Hypothesis Testing, let me discuss how Scientific Hypotheses and Statistical Hypotheses relate and differ.  We rarely do Hypothesis Testing on the original Scientific Hypothesis.  Rather, if we measure several parameters, we may do several Hypothesis Tests to determine if the results reach a level of statistical significance.  For example, if we hypothesize that LC is a better diet for CVD risk, we may measure several things such as LDL, HDL, C-reactive protein, and BP.  We would perform a Hypothesis Test on the results for each of those outcomes to determine if one diet or the other produced a different result.  Each Hypothesis Test would involve a Statistical Hypothesis pertaining to the parameter measured and the results.

Whenever you read the words "statistically significant", or simply "significant", describing a result in the scientific literature, this means that a statistical test has been applied to the data.  All such tests have some basis in probability.  If we flip a fair coin 20 times, we expect it to land on "heads" and "tails" an equal number of times.  We can use what is called the Binomial Probability Function to determine the probability of each of the possible outcomes of such an experiment.
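Those binomial probabilities are easy to check directly.  A minimal sketch in Python (the function name is my own; the 20-flip fair-coin setup is from the example above):

```python
from math import comb

def binom_pmf(k, n=20, p=0.5):
    """Exact probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binom_pmf(10), 4))  # 0.1762 -- the single most likely outcome
print(round(binom_pmf(7), 4))   # 0.0739 -- i.e. ~7.4%
```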

For 20 flips, this distribution puts the greatest probability on getting 10 heads (and thus 10 tails), but there's a more than 5% chance of getting 7 heads and 13 tails (~7.4%).  Intuitively, getting almost twice as many tails as heads would lead most of us to be suspicious of the integrity of that coin.  But statistically this result does not rise to the level of "unusual", because there is a sufficient probability that it could occur at random.  Which brings me to:

The "Rare Event Rule" (RER):   If, under an assumption X, the probability of the observed result is very small, then X is probably not true and the opposite of X is thus probably true.  That's a mouthful alright!  But since we use probability, there's always that word "probably" in there.  When you see things like (P<0.01) in scientific papers, they are reporting a "P-value".   Simply put, if you see P<0.01, it means that the probability of obtaining a result at least as extreme as the one observed, if chance alone were operating, is less than 1%.
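For the coin example above, the P-value attached to 7 heads is not the 7.4% chance of that exact count, but the probability of a result at least that lopsided in either direction.  A sketch (the function name is my own):

```python
from math import comb

def two_sided_p(k, n=20):
    """P-value for k heads in n fair flips: probability of a count at least
    as far from n/2 as k, in either direction (two-sided, by symmetry)."""
    lower = sum(comb(n, i) for i in range(min(k, n - k) + 1))
    return min(1.0, 2 * lower / 2**n)

print(round(two_sided_p(7), 3))  # 0.263 -- far above 0.05, so not a "rare event"
```

So even though 7 heads and 13 tails looks suspicious, a fair coin produces a result at least that lopsided roughly a quarter of the time.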

In hypothesis testing, we apply the RER to test a pair of competing hypotheses.  We call these Statistical Hypotheses the Null Hypothesis (H0) and the Alternate Hypothesis (Ha).  These are complementary, therefore either the null is true or the alternate is true, and there's no other option.  One can see already how this would necessarily differ from the formulation of Scientific Hypotheses.

Let's use a comparison of weight loss between two groups to go through hypothesis testing.  Two groups follow different diets for a period of time, and at the end of the study we find the mean (average) weight loss for each group.  Group A lost 12.3 pounds while Group B lost 10.1 pounds.  Clearly there was a difference between the two groups in the amount of weight lost, but is it statistically significant?  We would perform a Hypothesis Test on the data to answer this question.

Regardless of what we're trying to demonstrate, the Hypothesis Testing process involves stating a null hypothesis that contains the condition of equality (the equal sign in its mathematical statement).

The null hypothesis is assumed to be true.   This is simply part of the process and is agnostic as to whether it actually is or not.   
We then use some statistical method to determine the probability of obtaining results at least as extreme as ours if H0 were true.

  • If this probability (the P-value) is sufficiently small (less than 5% or 1% are common levels), then applying the RER, H0 is probably not true and therefore the alternate, Ha, is probably true.  We say that we "reject the null hypothesis" and this result therefore supports (not proves!) its opposite:  the alternate hypothesis.  
  • If the probability is not sufficiently small, we "fail to reject" the null.  This does not support the null, because, say, an 8% chance of seeing such results under the null is not exactly compelling evidence for it, but it doesn't reject it either.  (We'll see wording such as "trending towards but did not reach a level of statistical significance" when this occurs.)
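The two bullets above amount to a simple decision rule.  A minimal sketch (the function name and the 5% default level are my own choices):

```python
def decide(p_value, alpha=0.05):
    """Apply the Rare Event Rule at significance level alpha."""
    # We either reject H0 (supporting Ha) or fail to reject it;
    # we never "accept" or "prove" the null.
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.003))  # reject H0
print(decide(0.08))   # fail to reject H0 ("trending" but not significant)
```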

So in the weight loss study, if μA is the average weight loss of Group A, and μB is the average weight loss of Group B, we would formulate our hypotheses as follows:       H0:   μA = μB            Ha:   μA ≠ μB 

Note that either the weight losses are equal or they are not; there is no other possibility.  Also, even though we are looking to demonstrate that the weight losses are different, our null hypothesis is always the equality, it is always the one "tested", and we actually hope to reject it so that the alternate is supported.   There are nuances to setting up these hypotheses: if we want to show μA > μB, then the competing hypothesis would be μA ≤ μB and:        H0:   μA ≤ μB            Ha:   μA > μB  ... however, testing the simple equality is by far the most common approach.   There are slight differences in the math that I don't wish to get into, but note how the equality is in the null and, again, we're not testing μA > μB, something that might very well be what is stated in, or would support, the Scientific Hypothesis of our study.

Next we apply some probability-based statistical method to "test" H0, and if we reject it, then we would say the data support the alternate.  The wording might go something like:  "the weight loss was significantly different between the two groups", or "the difference in weight lost between the groups was statistically significant".  If we set up our hypotheses with the >, we would say that "the weight loss of Group A was significantly greater than that of Group B" or "the greater weight loss of Group A compared to Group B was statistically significant" ... or some variation thereof.
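To make this concrete, here is a sketch of one such probability-based method, an exact permutation test, applied to the weight-loss comparison.  The individual data points are invented for illustration, chosen so the group means match the 12.3 lb and 10.1 lb from the example above:

```python
from itertools import combinations
from statistics import mean

# Hypothetical weight losses (lbs); invented numbers, means of 12.3 and 10.1.
group_a = [12.3, 12.8, 11.8, 12.6, 12.0]   # say, the LC arm
group_b = [10.1, 9.8, 10.5, 9.9, 10.2]     # say, the LF arm

pooled = group_a + group_b
observed = abs(mean(group_a) - mean(group_b))

# Exact permutation test of H0: muA = muB.  Enumerate every way of splitting
# the 10 subjects into two groups of 5 and count how often the difference in
# mean weight loss is at least as extreme as the one we observed.
extreme = total = 0
for idx in combinations(range(len(pooled)), len(group_a)):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if abs(mean(a) - mean(b)) >= observed - 1e-12:
        extreme += 1

p_value = extreme / total  # 2/252 here: well below 0.05, so we reject H0
print(round(p_value, 4))
```

With these (deliberately well-separated) made-up numbers, only 2 of the 252 possible splits produce a difference as large as the observed one, so the difference would be reported as statistically significant.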

Before moving on, let's put it all together for our weight loss study.  We make the Scientific Hypothesis that there's a metabolic advantage to LC diets, design and conduct a solid study, then perform Hypothesis Testing on the weight loss results.  If our data pan out such that we reject the null hypothesis (essentially, that there's no difference in weight loss), this would mean Group A lost significantly more weight than Group B, and this result (assuming A = LC) would support our Scientific Hypothesis.

Once we get the hang of Hypothesis Testing, it becomes evident that we want to formulate our Statistical Hypotheses so as to reject the null thereby supporting the alternate.  If we're outright seeking to refute a claim, then we would have that claim be the null (e.g. contain the affirmative, =) and if we're looking to support a claim, then we would word things such that it becomes the alternate hypothesis in our Hypothesis Test.  But it is a mistake to think that we should apply this same logic or these linguistic gymnastics to Scientific Hypotheses.

Statistics lesson over ... hope I didn't bore you to death with that!  But I guess if you've made it this far there was something of interest there.   This whole "alternate hypothesis" thing has been hanging in the back of my mind for quite some time now because the terminology applies to a process that differs from the scientific method as a whole.

Some concluding remarks:

In science, as in statistics, we almost never really prove anything.  Indeed, many physical "laws" tend to fall apart at subatomic levels yet are called "laws" because they've been "proven" over and over in most circumstances.   Even when "broken", they can still be useful and hold up so long as the context is appropriate.  But if you ever take a course in statistics, you'll quickly learn that we should avoid words such as "proven" in favor of phrases like "the data support the claim that ...".  The field is littered with double negatives as well.  For example, if we have a subject with a BMI in the lowest 2.5% or highest 2.5% of the population, this meets the statistical rule-of-thumb 95%/5% definition of UNusual; that BMI falls in the extreme 5% of the population.  However, if a subject presents with a BMI in the lowest 5% or highest 5% of the population, but outside those extreme 2.5% tails, we would describe it as "not unusual".  This individual would still lie in the combined extreme 10% of the population, so we would hardly call their BMI a usual one.  

So going back to Taubes' current goals, I would ask:

Why should researchers be tasked with presuming your hypothesis to be true and needing to find remarkable evidence to reject it?  
For starters, it's not even an appropriate alternate hypothesis by Statistical Hypothesis/Hypothesis Testing standards.  But how about we take energy balance to be the null hypothesis and put the onus on you to find remarkable evidence to reject that?  Some may say that GCBC does just that.  It doesn't even come close.  A case could be made that you found several of those white swans in support of your hypothesis (I contend most of these are not really white swans at all, but ...), or seeming black swans for CICO ... but according to your friend Dr. Mike, you haven't gone about this in a popper, I mean proper, fashion.  As to the MA, Dr. Eades, your black swan .... err ... mutant rodent ... does not a "proof" make for your Scientific Hypothesis on Metabolic Advantage.

Mostly I hope the takeaway message of this post will be that there are different types of hypotheses and we need to discuss their study and implications in the proper context.  To present Statistical Hypothesis Testing arguments in support of Scientific Hypotheses is either wrongheaded or outright misleading.  Since neither of these men has ever actually conducted scientific research, or seemingly has any firsthand experience applying statistical analyses, I could give them a pass on this and chalk it up to insufficient knowledge of the field.   Which leads to the question of, yet again, why so many look to these two for "expert" opinions on anything.


Sanjeev said...

and the p value is not what most scientists who use it think


the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result

Wish I could get the full text (even though I probably would not understand it)

Sanjeev said...

this article (there are related others on that site) introduced me to the idea that p-value is oversold

& I'm hoping repeated reading will someday help me make sense of it.

I came upon this researching the article that made the rounds "lies, statistics & medical studies"
