A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Saturday, November 11, 2017

The Statisticians' Fallacy

If I ever make a follow-up to my current MOOC, I will call it ‘Improving Your Statistical Questions’. The more I learn about how people use statistics, the more I believe the main problem is not how people interpret the numbers they get from statistical tests. The real issue is which statistical questions researchers ask of their data.

Our statistics education turns a blind eye to training people how to ask a good question. After a brief explanation of what a mean is, and a pit-stop at the normal distribution, we jump through as many tests as we can fit in the number of weeks we are teaching. We are training students to perform tests, but not to ask questions.

There are many reasons for this lack of attention to training people how to ask a good question. But here I want to focus on one reason, which I’ve dubbed the Statisticians' Fallacy: Statisticians who tell you ‘what you really want to know’, instead of explaining how to ask one specific kind of question of your data.

Let me provide some examples of the Statisticians' Fallacy. In the next quotes, pay attention to the use of the word ‘want’. Cohen (1994) in his ‘The earth is round (p < .05)’ writes:


Colquhoun (2017) writes:


Or we can look at Cumming (2013):


Or Bayarri, Benjamin, Berger, and Sellke (2016):


Now, you might have noticed that these four statements by statisticians of ‘what we want’ are all different. One says 'we want' to know the posterior probability that our hypothesis is true, another says 'we want' to know the false positive report probability, yet another says 'we want' effect sizes and their confidence intervals, and the last says 'we want' the strength of evidence in the data.

Now you might want to know all these things, you might want to know some of these things, and you might want to know yet other things. I have no clue what you want to know (and after teaching thousands of researchers over the last 5 years, I’m pretty sure often you don't really have a clue what you want either - you've never been trained to thoroughly ask this question). But what I think I know is that statisticians don’t know what you want to know. They might think some questions are interesting enough to ask. They might argue that certain questions follow logically from a specific philosophy of science. But the idea that there is always a single thing ‘we want’ is not true. If it were, statisticians would not have been criticizing what other statisticians say ‘we want’ for the last 50 years. Telling people 'what you want to know' instead of teaching people to ask themselves what they want to know will just get us another two decades of mindless statistics.

I am not writing this to stop statisticians from criticizing each other (I like to focus on easier goals in my life, such as world peace). But after reading many statements like the ones I’ve cited above, I have distilled my main take-home message in a bathroom tile:



There are many, often complementary, questions you can ask of your data, or when performing lines of research. Now I am not going to tell you what you want. But what I want is that we stop teaching researchers there is only a single thing they want to know. There is no room for the Statisticians' Fallacy in our education. I do not think it is useful to tell researchers what they want to know. But I think it’s a good idea to teach them about all the possible questions they can ask.


Further Reading:
Thanks to Carol Nickerson who, after reading this blog, pointed me to David Hand's Deconstructing Statistical Questions, which is an excellent article on the same topic - highly recommended.

Monday, October 16, 2017

Science-Wise False Discovery Rate Does Not Explain the Prevalence of Bad Science

This article explores the statistical concept of science-wise false discovery rate (SWFDR). Some authors use SWFDR and its complement, positive predictive value, to argue that most (or, at least, many) published scientific results must be wrong unless most hypotheses are a priori true. I disagree. While SWFDR is valid statistically, the real cause of bad science is “Publish or Perish”.

Introduction

Is science broken? A lot of people seem to think so, including some esteemed statisticians. One line of reasoning uses the concepts of false discovery rate and its complement, positive predictive value, to argue that most (or, at least, many) published scientific results must be wrong unless most hypotheses are a priori true.

The false discovery rate (FDR) is the probability that a significant p-value indicates a false positive, or equivalently, the proportion of significant p-values that correspond to results without a real effect. The complement, positive predictive value (\(PPV=1-FDR\)) is the probability that a significant p-value indicates a true positive, or equivalently, the proportion of significant p-values that correspond to results with real effects.
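To make the definition concrete, here is a minimal sketch of the standard arithmetic behind FDR and PPV. It assumes every test is run at one fixed significance level and that all true effects are detected with the same power. It is written in Python for illustration, and the function and argument names (`theoretical_fdr`, `prop_true`, `power`, `sig_level`) are hypothetical, not taken from any of the cited papers.

```python
def theoretical_fdr(prop_true: float, power: float, sig_level: float = 0.05) -> float:
    """Expected proportion of significant results that are false positives.

    Assumes a fixed significance level and a single power value for all
    real effects (simplifying assumptions for this sketch).
    """
    false_pos = sig_level * (1.0 - prop_true)  # no real effect, yet significant
    true_pos = power * prop_true               # real effect, and significant
    return false_pos / (false_pos + true_pos)


fdr = theoretical_fdr(prop_true=0.1, power=0.8)
ppv = 1.0 - fdr
print(f"FDR = {fdr:.3f}, PPV = {ppv:.3f}")  # FDR = 0.360, PPV = 0.640
```

With only 10% of hypotheses true, 80% power, and a 0.05 significance level, about 36% of significant results are false positives, which is the kind of number the SWFDR argument rests on.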

I became interested in this topic after reading Felix Schönbrodt’s blog post, “What’s the probability that a significant p-value indicates a true effect?” and playing with his ShinyApp. Schönbrodt’s post led me to David Colquhoun’s paper, “An investigation of the false discovery rate and the misinterpretation of p-values” and blog posts by Daniel Lakens, “How can p = 0.05 lead to wrong conclusions 30% of the time with a 5% Type 1 error rate?” and Will Gervais, “Power Consequences”.

The term science-wise false discovery rate (SWFDR) is from Leah Jager and Jeffrey Leek’s paper, “An estimate of the science-wise false discovery rate and application to the top medical literature”. Earlier work includes Sholom Wacholder et al.’s 2004 paper “Assessing the Probability That a Positive Report is False: An Approach for Molecular Epidemiology Studies” and John Ioannidis’s 2005 paper, “Why most published research findings are false”.

Scenario

Being a programmer and not a statistician, I decided to write some R code to explore this topic on simulated data.

The program simulates a large number of problem instances representing published results, some of which are true and some false. The instances are very simple: I generate two groups of random numbers and use the t-test to assess the difference between their means. One group (the control group or simply group0) comes from a standard normal distribution with \(mean=0\). The other group (the treatment group or simply group1) is a little more involved:

  • for true instances, I take numbers from a normal distribution with mean d (\(d>0\)) and standard deviation 1;
  • for false instances, I use the same distribution as group0.

The parameter d is the effect size, aka Cohen’s d.

I use the t-test to compare the means of the groups and produce a p-value assessing whether both groups come from the same distribution.

The program does this thousands of times (drawing different random numbers each time, of course), collects the resulting p-values, and computes the FDR. The program repeats the procedure for a range of assumptions to determine the conditions under which most positive results are wrong.

For true instances, we expect the difference in means to be approximately d and for false ones to be approximately 0, but due to the vagaries of random sampling, this may not be so. If the actual difference in means is far from the expected value, the t-test may get it wrong, declaring a false instance to be positive and a true one to be negative. The goal is to see how often we get the wrong answer across a range of assumptions.
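The procedure described above can be sketched roughly as follows. This is not the program used for this post, which is written in R; it is a standard-library Python illustration built on several assumptions: a pooled-variance (Student) t-test, a fixed count of true instances rather than a random draw, and a p-value computed by numerically integrating the t density. All function names are hypothetical.

```python
import math
import random
import statistics


def t_two_sided_p(t, df):
    """Two-sided p-value for a t statistic, via Simpson's rule on the t density."""
    t = abs(t)
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    steps = 2 * max(100, math.ceil(20 * t))  # even number of intervals, width <= 0.025
    h = t / steps
    area = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3  # approximates P(0 < T < |t|)
    return min(1.0, max(0.0, 1.0 - 2.0 * area))


def t_test_pvalue(x, y):
    """Pooled-variance two-sample t-test (an assumption; Welch's test is another option)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)) / df
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return t_two_sided_p((statistics.fmean(y) - statistics.fmean(x)) / se, df)


def simulate_fdr(prop_true, d, n=16, m=10_000, sig_level=0.05, seed=1):
    """Empirical FDR: share of significant instances that have no real effect."""
    rng = random.Random(seed)
    n_true = round(prop_true * m)  # assumption: fixed count of true instances
    true_pos = false_pos = 0
    for i in range(m):
        is_true = i < n_true
        group0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        group1 = [rng.gauss(d if is_true else 0.0, 1.0) for _ in range(n)]
        if t_test_pvalue(group0, group1) <= sig_level:
            if is_true:
                true_pos += 1
            else:
                false_pos += 1
    positives = true_pos + false_pos
    return false_pos / positives if positives else float("nan")


# Small m keeps this demo quick; the results are noisier than with m=1e4.
for prop_true in (0.1, 0.5, 0.9):
    print(prop_true, round(simulate_fdr(prop_true, d=0.5, m=1000), 3))
```

Lowering `prop_true` or `d` in this sketch should push the empirical FDR up, since fewer of the significant results then come from true instances.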

Nomenclature

To reduce confusion, I will be obsessively consistent in my terminology.

  • An instance is a single run of the simulation procedure.
  • The terms positive and negative refer to the results of the t-test. A positive instance is one for which the t-test reports a significant p-value; a negative instance is the opposite. Obviously the distinction between positive and negative depends on the chosen significance level.
  • true and false refer to the correct answers. A true instance is one where the treatment group (group1) is drawn from a distribution with \(mean=d\) (\(d>0\)). A false instance is the opposite: an instance where group1 is drawn from a distribution with \(mean=0\).
  • empirical refers to results calculated from the simulated data, as opposed to theoretical which means results calculated using standard formulas.
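The nomenclature can be illustrated with a small tally over made-up data. This is a Python sketch with hypothetical names and invented p-values, not output from the simulation. Note that, following the definitions above, the first word of each label describes the instance (real effect or not) and the second describes the test result, so a "true negative" here is a real effect that the test missed.

```python
from collections import Counter


def tally(pvalues, is_true, sig_level=0.05):
    """Count instances by (truth, test result) at a given significance level."""
    counts = Counter()
    for p, truth in zip(pvalues, is_true):
        result = "positive" if p <= sig_level else "negative"
        counts[("true" if truth else "false", result)] += 1
    return counts


def empirical_fdr(counts):
    """False positives as a share of all positives."""
    positives = counts[("true", "positive")] + counts[("false", "positive")]
    return counts[("false", "positive")] / positives if positives else float("nan")


# Invented p-values and ground truth, for illustration only.
pvalues = [0.001, 0.03, 0.04, 0.20, 0.008, 0.60]
is_true = [True, True, False, True, True, False]

for cutoff in (0.05, 0.01):
    counts = tally(pvalues, is_true, cutoff)
    print(cutoff, dict(counts), "empirical FDR =", empirical_fdr(counts))
```

On this toy data the empirical FDR is 0.25 at a cutoff of 0.05 and 0.0 at a cutoff of 0.01, which shows how the positive/negative split moves with the significance level while the true/false labels stay fixed.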

The simulation parameters are

| parameter | meaning | default |
|-----------|---------|---------|
| prop.true | fraction of cases where there is a real effect | seq(.1,.9,by=.2) |
| m | number of iterations | 1e4 |
| n | sample size | 16 |
| d | standardized effect size (aka Cohen’s d) | c(.25,.50,.75,1,2) |
| pwr | power. if set, the program adjusts d to achieve power | NA |
| sig.level | significance level for power calculations when pwr is set | 0.05 |
| pval.plot | p-values for which we plot results | c(.001,.01,.03,.05,.1) |
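As a rough illustration of how these defaults combine, the sketch below mirrors the parameters in a Python dataclass and expands them into the grid of scenarios to simulate. The class and method names are hypothetical. In particular, the way `pwr` is turned into an effect size uses a normal approximation to the two-sample t-test, which is an assumption of this sketch and gives a slightly smaller d than an exact t-based power calculation would.

```python
from dataclasses import dataclass
from itertools import product
from math import sqrt
from statistics import NormalDist
from typing import Optional, Tuple


@dataclass
class SimParams:
    prop_true: Tuple[float, ...] = (0.1, 0.3, 0.5, 0.7, 0.9)      # seq(.1,.9,by=.2)
    m: int = 10_000                                               # iterations
    n: int = 16                                                   # sample size per group
    d: Tuple[float, ...] = (0.25, 0.50, 0.75, 1.0, 2.0)           # Cohen's d values
    pwr: Optional[float] = None                                   # NA in the table
    sig_level: float = 0.05                                       # used only when pwr is set
    pval_plot: Tuple[float, ...] = (0.001, 0.01, 0.03, 0.05, 0.1)

    def effect_sizes(self) -> Tuple[float, ...]:
        """Return d as given, or the single d implied by pwr (normal approximation)."""
        if self.pwr is None:
            return self.d
        z = NormalDist()
        z_sum = z.inv_cdf(1 - self.sig_level / 2) + z.inv_cdf(self.pwr)
        return (z_sum * sqrt(2 / self.n),)

    def grid(self):
        """All (prop_true, d) combinations to simulate."""
        return list(product(self.prop_true, self.effect_sizes()))


print(len(SimParams().grid()))       # 25 scenarios with the defaults
print(SimParams(pwr=0.8).grid()[0])  # d is now derived from the requested power
```

Setting `pwr` replaces the five default effect sizes with a single derived one, so the grid shrinks from 25 scenarios to 5.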

Results

The simulation procedure with default parameters produces four graphs similar to the ones below.