Editor’s note: this article is by Vitaly Feldman, a research scientist at IBM Research-Almaden.
From discovering new particles and running
clinical studies to predicting election results and evaluating credit scores,
scientific progress and industrial innovation increasingly rely on statistical
data analysis. While incredibly useful, data analysis is also notoriously easy to
misuse, even when the analyst has the best of intentions. Problems stemming
from such misuse can be costly
and contribute to a wider concern about the reproducibility of research
findings, most notably in medical research. The issue is hotly debated
in the scientific community and has attracted a
lot of public attention in recent years.
In recent research with a team of computer scientists from
industry and academia, I made progress in understanding and addressing some of
the ways in which data analysis can go wrong. Our work, Preserving Validity in Adaptive Data Analysis, published this week in Science, deals with important, but fairly technical and subtle,
statistical issues that arise in the practice of data analysis. I’ll outline
them below. For the less patient, here is also an elevator-pitch description
(minus the elevator) of our work in the video below.
The main difficulty in obtaining valid results is that it is hard
to tell whether a pattern observed in the data represents a true relationship that
holds in the real world, or is merely a coincidental artifact of the randomness
in the process of data collection. The standard way of expressing confidence in a result of an analysis, such as a certain trait being correlated with a disease, is the p-value.
Informally, a p-value
for a certain result measures the probability of obtaining the result in the
absence of any actual relationship (referred to as the null hypothesis). A small p-value
indicates high confidence in the validity of the result, with 0.05 commonly accepted
as sufficient to declare a finding statistically significant.
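To make the definition concrete, here is a small simulation sketch of my own (in Python with NumPy, not part of our paper) that estimates a p-value directly from this definition: it counts how often a correlation at least as strong as a hypothetical observed one would arise between two completely unrelated variables. The sample size, observed correlation, and variable names here are illustrative assumptions.

```python
import numpy as np

# Estimate a p-value directly from its definition: how often would a
# correlation at least as strong as a (hypothetical) observed one, say
# r = 0.25 with n = 100 samples, appear if the two variables were in fact
# completely unrelated (the null hypothesis)?
rng = np.random.default_rng(0)
n, observed_r, trials = 100, 0.25, 100_000

exceed = 0
for _ in range(trials):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)              # independent of x by construction
    if abs(np.corrcoef(x, y)[0, 1]) >= observed_r:
        exceed += 1

print("simulated p-value:", exceed / trials)  # roughly 0.012 for these numbers
```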
The guarantee that a p-value
provides has a critical caveat, however: it applies only if the analysis procedure
was chosen before the data was examined. At the same time, the practice of data
analysis goes well beyond using a predetermined analysis. New analyses are
chosen on the basis of data exploration and previous analyses of a dataset, and
are performed on the same dataset. While useful, such adaptive analysis invalidates the standard p-value computations. Using incorrect p-values is likely to lead to false discoveries. An analyst might
conclude, for example, that a certain diet increases the risk of diabetes, when
in reality it has no effect at all.
The mistakes caused by misapplying standard p-value computations can easily be observed using any basic
data analysis tool (such as MS Excel). For example, let’s create a fake dataset
for a study of the effect of 50 different foods on the academic performance of
students.
For each student, the data will include a consumption level for
each of the 50 foods and an (academic) grade. Let’s create a dataset of 100 students by choosing all the values randomly and independently from the
normal (or Gaussian) distribution. Clearly, in the true data distribution we
used, the food consumption levels and the students’ grades are completely unrelated. A
natural first step in our analysis of this dataset is to identify which foods have
the highest (positive or negative) correlation with the grade. Below is an
example outcome of this step in which I highlighted in red three foods with
particularly strong correlations in the data.
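For readers who prefer code to a spreadsheet, here is a sketch of the same setup in Python with NumPy; the variable names, random seed, and exact numbers are my own, and the three "strongest" foods will differ from run to run.

```python
import numpy as np

# The fake study: 100 students, 50 foods, and a grade, all drawn independently
# from the normal distribution, so no real relationship exists anywhere.
rng = np.random.default_rng(1)
n_students, n_foods = 100, 50
foods = rng.standard_normal((n_students, n_foods))
grade = rng.standard_normal(n_students)

# Correlation of each food with the grade; pick the three strongest.
corrs = np.array([np.corrcoef(foods[:, j], grade)[0, 1] for j in range(n_foods)])
top3 = np.argsort(-np.abs(corrs))[:3]
print("strongest correlations:", corrs[top3])  # typically around 0.2-0.3 in absolute value
```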
A common next step would be to use least-squares linear
regression to check whether a simple linear combination of the three strongly correlated
foods can predict the grade. It turns out that a little combination goes a long
way: we discover that a linear combination of the three selected foods can
explain a significant fraction of variance in the grade (plotted below). The
regression analysis also reports that the p-value
of this result is 0.00009, meaning that the probability of this happening purely
by chance is less than 1 in 10,000.
Recall that no relationship exists in the true data distribution,
so this discovery is clearly false. This spurious effect is known to experts as
Freedman’s paradox. It arises since the variables (foods) used in the
regression were chosen using the data itself.
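The following sketch, again my own Python rather than the computation behind the figures above, runs the least-squares regression on the three selected foods and computes the standard F-test p-value as if the foods had been chosen in advance. Even though there is nothing to discover, the naive p-value typically comes out far below 0.05.

```python
import numpy as np
from scipy import stats

# Recreate the fake dataset and the selection step from the previous sketch.
rng = np.random.default_rng(1)
foods = rng.standard_normal((100, 50))
grade = rng.standard_normal(100)
corrs = np.array([np.corrcoef(foods[:, j], grade)[0, 1] for j in range(50)])
top3 = np.argsort(-np.abs(corrs))[:3]

# Least-squares regression of the grade on the three selected foods, with the
# usual F-test p-value computed as if the foods had been fixed in advance.
X = np.column_stack([np.ones(100), foods[:, top3]])
beta, _, _, _ = np.linalg.lstsq(X, grade, rcond=None)
resid = grade - X @ beta
r2 = 1 - resid @ resid / np.sum((grade - grade.mean()) ** 2)
f_stat = (r2 / 3) / ((1 - r2) / (100 - 3 - 1))
p_value = stats.f.sf(f_stat, 3, 100 - 3 - 1)
print("R^2:", round(r2, 3), "naive p-value:", p_value)  # typically well below 0.05
```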
Despite the fundamental nature of adaptivity in data analysis, little
work has been done to understand and mitigate its effects on the validity of
results. The only known safe approach is to use a separate
holdout dataset to validate any
finding obtained via adaptive analysis. Such an approach is standard in machine
learning: a dataset is split into training and validation data, with the
training set used for learning a predictor, and the validation (holdout) set
used to estimate the accuracy of that predictor.
Because the predictor is independent of the holdout dataset, such
an estimate is a valid estimate of the true prediction accuracy. In practice,
however, the holdout dataset is rarely used only once, and the predictor often
depends on the holdout data in a complicated way. Such dependence invalidates
the estimates of accuracy based on the holdout set, as the predictor may be overfitting to the holdout set.
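As a toy illustration of that failure mode (a sketch of my own, not an experiment from the paper), suppose an analyst scores many useless predictors on the same holdout set and then reports the best score:

```python
import numpy as np

# 1,000 "predictors" that are nothing but random guesses, scored on a holdout
# set of 500 labels.  Reporting the holdout accuracy of the best-scoring one
# overstates its true accuracy, which is exactly 50 percent for every guesser.
rng = np.random.default_rng(4)
labels = rng.choice([-1, 1], size=500)
guesses = rng.choice([-1, 1], size=(1000, 500))

holdout_accuracy = (guesses == labels).mean(axis=1)
print("best holdout accuracy:", holdout_accuracy.max())  # typically around 0.57
```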
I’ve been working on approaches for dealing with adaptivity in
machine learning for the past two years. During a chance conversation with Aaron Roth, a professor at Penn, while at a big
data workshop,
I found out that he and several colleagues were also working on an approach to this
problem. That conversation turned into a fruitful collaboration with Microsoft's Cynthia Dwork, Google's Moritz Hardt, University of Toronto's Toni Pitassi, Samsung's Omer Reingold, and Aaron.
We found that challenges of adaptivity can be addressed using
techniques developed for privacy-preserving data analysis. These techniques
rely on the notion of differential privacy, which guarantees that the data
analysis is not too sensitive to the data of any single individual. We rigorously
demonstrated that ensuring differential privacy of an analysis also guarantees
that the findings will be statistically valid. We then also developed
additional approaches to the problem based on a new way to measure how much
information an analysis reveals about a dataset.
The Thresholdout Algorithm
Using our new approach we designed an algorithm, called Thresholdout,
that allows an analyst to reuse the holdout set of data for validating a large
number of results, even when those results are produced by an adaptive
analysis. The Thresholdout algorithm is very easy to implement and is based on
two key ideas.
First, the validation should not reveal any information about the
holdout dataset if the analyst does not overfit to the training set.
Second, adding a small amount of noise to any validation
result can prevent the analyst from overfitting to the holdout set.
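To make these two ideas concrete, here is a simplified sketch of how Thresholdout answers a single validation query, written in Python with NumPy. The function name, threshold, and noise values here are illustrative choices of mine, and the budget accounting and exact noise distribution analyzed in the paper are omitted, so treat this as an outline rather than the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def thresholdout(train_vals, holdout_vals, threshold=0.04, noise_scale=0.01):
    """Answer one validation query on the reusable holdout (simplified sketch).

    train_vals / holdout_vals hold the quantity being validated (e.g. 0/1
    prediction-correctness indicators of a candidate model) evaluated on each
    training / holdout point.  If the two averages already agree to within a
    noisy threshold, answer with the training average and reveal nothing about
    the holdout set; otherwise answer with a noise-perturbed holdout average.
    """
    train_avg = float(np.mean(train_vals))
    holdout_avg = float(np.mean(holdout_vals))
    if abs(train_avg - holdout_avg) > threshold + rng.laplace(0, noise_scale):
        return holdout_avg + rng.laplace(0, noise_scale)
    return train_avg
```

In a typical use, the analyst never looks at the holdout set directly and only sees the value returned by this routine for each candidate result being validated.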
To illustrate the benefits of using our approach, we showed how it
prevents overfitting in a setting inspired by Freedman’s paradox. In this
experiment, the analyst wants to build an algorithm that can accurately classify
data points into two classes, given a dataset of correctly labeled points. The
analyst first finds a set of variables that have the largest correlation with
the class. However, to avoid spurious correlations, the analyst validates the
correlations on the holdout set and uses only those variables whose correlation
on the holdout set agrees with the correlation on the training set. The analyst then creates a
simple linear threshold classifier on the selected variables.
We tested this procedure on
a dataset of 20,000 points in which the values of 10,000 attributes are drawn
independently from the normal distribution, and the class is chosen uniformly
at random. There is no correlation between the data points and their class labels,
and no classifier can achieve true accuracy better than 50 percent. Nevertheless,
reusing a standard holdout set leads to a reported accuracy of more than 63 ± 0.4 percent
(when selecting 500 out of 10,000 variables) on both the training set and the holdout
set.
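The sketch below reproduces the flavor of this experiment at a smaller scale; the parameter choices and code are my own, so the numbers will not match the 63 percent figure above, but the selection step consults the holdout set and both the training and the reported holdout accuracy drift above the true 50 percent.

```python
import numpy as np

# Smaller-scale sketch: labels are independent of the data, so the true
# accuracy of any classifier is 50 percent, yet letting variable selection
# consult the holdout set inflates both the training and holdout accuracy.
rng = np.random.default_rng(3)
n, d, k = 2000, 1000, 50                  # points per set, attributes, selected
X_train, X_hold = rng.standard_normal((2, n, d))
y_train, y_hold = rng.choice([-1, 1], size=(2, n))

# (Unnormalized) correlation of each attribute with the label on each set.
c_train = X_train.T @ y_train / n
c_hold = X_hold.T @ y_hold / n

# Keep the k attributes most correlated on the training set, but only those
# whose correlation sign the holdout set also shows; this is the step that
# leaks holdout information into the classifier.
top = np.argsort(-np.abs(c_train))[:k]
selected = top[np.sign(c_train[top]) == np.sign(c_hold[top])]

# Simple linear threshold classifier over the selected attributes.
w = np.sign(c_train[selected])

def accuracy(X, y):
    return float(np.mean(np.sign(X[:, selected] @ w) == y))

print("training accuracy:", accuracy(X_train, y_train))  # noticeably above 0.5
print("holdout accuracy: ", accuracy(X_hold, y_hold))    # also above 0.5
```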
We then executed the same procedure with our Thresholdout
algorithm for holdout reuse. Thresholdout prevents the procedure from
overfitting to the holdout set and gives a valid estimate of the classifier’s
accuracy, close to 50 percent. In the plot below we show the accuracy on the
training set and the holdout set (the green line) obtained for various numbers
of selected variables averaged over 100 independent executions. For comparison,
the plot also includes the accuracy of the classifier on another fresh data set
of 10,000 points which serves as the gold standard for validation.
Beyond this illustration, the reusable holdout gives the analyst a
simple, general, and principled method to perform multiple validation steps
where previously the only known safe approach was to collect a fresh holdout
set each time the quantity being validated depends on the outcomes of previous validations. We
are now looking forward to exploring the practical applications of this
technique.