push theory to experiment with the wisdom of crowds.
years ago, IBM Research scientist Dr. Gustavo Stolovitzky’s team was looking
for a way to better understand the accuracy of the biological results yielded by
the network reconstruction algorithms they were developing at IBM. In other
words, how could Stolovitzky improve the evaluation of their reverse
engineering efforts to better understand and maybe help to solve biomedical
challenges such as cancer?
More generally, all computational biologists want a clear-cut evaluation of the
models they use to analyze and eventually represent biological systems. Are
their techniques working? How do their techniques compare with other techniques?
Stolovitzky and collaborator Dr. Andrea Califano, now the director of Columbia University’s Initiative in
Systems Biology, decided to organize the DREAM (Dialogue on Reverse Engineering
Assessment and Methods) Project to crowd source the analysis of high-throughput
data (now so pervasive in biological research) to address important challenges
taking submissions for DREAM7 challenges, Stolovitzky and colleague Dr. Pablo
Meyer Rojas* discuss the goals of the project and how to submit responses to this year’s challenges.
How did the DREAM Project start?
Stolovitzky: The explosion of genomics has created the need to organize and
structure the data produced to generate a coherent biological picture. DREAM was
created in order to foster concerted efforts by computational and experimental
biologists to understand the limitations of the models built from these
high-throughput data sets.
I conceived the DREAM project as a way to understand the accuracy of the
biological results yielded by the network reconstruction algorithms (reverse engineering)
we were developing at IBM. It captured a need in the community that was, so to
speak, in the air.
My long-time collaborator (and former IBMer) Andrea Califano and I organized the first meeting with the New York Academy of Sciences in 2006.
After that, the project was launched as a series of annual challenges that
culminate in the DREAM conference.
What is the project's overall goal?
In the context of the current avalanche of genomic data, DREAM's goal is to
objectively assess and enhance the quality of data-based modeling of biological
systems. For example, if we know what the results of a particular analysis
should be (because we have what we call the “ground truth” contained in
unpublished information, not yet available to the community at large) then we
can test the community to assess how close to the ground truth the results are.
This approach has many useful outcomes.
- It can find the best analytical method for a given problem, because all the methods are pitted against each other on the same data set, and under the same evaluation scheme.
- It enables a dialogue in the community about why an analytical tool may
yield good or bad results.
- It fosters a synergy between theoretical, computational and experimental
scientists – all of whom look at the same data from different perspectives to
achieve the great goal of understanding biology.
- It can help garner evidence for or against a hypothesis because, if nobody
in the community can solve a given problem predicated on a hypothesis, then the
underlying hypothesis may be wrong. Conversely, if at least one member of the
community solves it, then the hypothesis can be considered verified.
- The outcomes of DREAM have the potential to complement peer-reviewed
research, and increase the confidence of the scientific community in biological
models and algorithm reliability.
DREAM states that its “main
objective is to catalyze the interaction between experiment and theory in the area
of cellular network inference and quantitative model building in systems
biology.” Please elaborate on this.
Pablo Meyer Rojas: The goal of systems biology is to understand the biological whole
as more than the sum of the individual parts. In order to do this, we need to
build comprehensive context-specific models of biological processes at the
cellular or organism level, based on data inherent to the system under study.
We say that the models need to be quantitative because the ultimate goal of
systems biology is to describe the behavior of biological systems based on
precise measurements, and predict the response of those systems to
perturbations, such as
disturbances caused by disease.
These models are based on the construction of cell-maps from data describing the
interactions of DNA, mRNA, proteins, drugs, etc. Networks are a succinct way to
represent these interactions, and are the scaffolding from which to build the
mathematical models that quantitatively implement our understanding of the
How are the
GS: It has been said that a wise man's question
contains half the answer.
With DREAM we try to pose relevant and important
questions (the project’s challenges) about biological problems, whose answers
should be found through the analysis of complex biological data. For example, how
can we predict the survival of a cancer patient based on genomics data extracted
from the patient's tumor? Or, what is the therapeutic effect of a drug on a
cell, given that we know the effect of the same drug on other cells?
Another important consideration is that we need to
know the answer to a challenge to assess the predictions. Therefore the availability of
unpublished data that can be used as ground truth to evaluate the submissions –
and the willingness of the data producers to share their unpublished data – is
Why use crowd
PMR: In order to tap the wisdom of crowds, we need the
crowds! Crowd sourcing is an effective way to reach participants from a
diverse set of communities, and so to get a broad spectrum of
methods for solving a problem.
Suppose you have a tough question for which you need
an answer. You may not know the answer, and your immediate friends may not know
the answer, either. But what if you could put that same question to your whole
neighborhood, town, province, country or planet?
It is a bit like the “ask the audience” life-line in
the game show “Who Wants to be a Millionaire”.
It is possible that someone who has the expertise
happens to know the answer. But to find that person we need to tap the crowds.
In the case of systems biology, crowd sourcing the solution of a challenge
allows us to search among many
different methodologies used to analyze the bio-data,
and find the one that produces the most accurate predictions. The more
participants we get, the more likely it is that if a solution exists, we will
How is a "best
answer" for each challenge chosen, and who chooses?
GS: Before the challenges are made public, individuals
involved in the organization of a challenge (including the people who generated
the data) get together and decide on a scoring method based on a few different
metrics. Participants are then informed of how their entries will be evaluated.
Once the challenge is finished, predictions are
evaluated and scores are published, along with all of the scoring methods. Only the names of the best performers are
revealed, but each participant is informed of his or her own score.
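As a rough illustration of scoring entries against a withheld ground truth, here is a minimal Python sketch. Everything in it is an assumption made for the example: the team names, the toy data, the two metrics (area under the ROC curve and average precision), and the rule of picking the highest mean across metrics. The interview does not say which metrics or ranking rule DREAM actually uses.

```python
from statistics import mean

def auroc(truth, scores):
    """Chance that a random true item outscores a random false one (ties count half)."""
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(truth, scores):
    """Mean of the precision values measured at each true item in the ranked list."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if truth[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions)

def score_submissions(truth, submissions):
    """Return {team: {metric: value}} for every submission."""
    return {
        team: {
            "auroc": auroc(truth, preds),
            "avg_precision": average_precision(truth, preds),
        }
        for team, preds in submissions.items()
    }

def best_performer(scores):
    """Hypothetical ranking rule: highest mean across all metrics."""
    return max(scores, key=lambda team: mean(scores[team].values()))

# Toy, made-up data: 1 = the item is true in the withheld ground truth.
ground_truth = [1, 0, 1, 0, 0, 1]
submissions = {
    "team_a": [0.9, 0.2, 0.8, 0.1, 0.3, 0.7],  # confidence per item
    "team_b": [0.6, 0.7, 0.5, 0.2, 0.1, 0.4],
}

scores = score_submissions(ground_truth, submissions)
winner = best_performer(scores)
```

Because the metrics are fixed before any entry is scored, every submission is judged on the same data and under the same scheme, which is the property the answer above emphasizes.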
Something interesting we discovered is that when we
aggregate the predictions of the community, the resulting aggregate solution tends
to be the best answer. This gives new meaning to the concept of the wisdom of crowds.
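One simple way to picture the aggregation effect is to average each item's score across all submissions and evaluate the average like any other entry. The sketch below is only an illustration under stated assumptions: the team names and numbers are invented, averaging raw scores is just one possible aggregation rule (the interview does not say how DREAM combines predictions), and the toy data were chosen so that each team errs on a different item.

```python
def auroc(truth, scores):
    """Chance that a random true item outscores a random false one (ties count half)."""
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aggregate(submissions):
    """Average the per-item scores across all teams (an assumed aggregation rule)."""
    per_item = zip(*submissions.values())
    return [sum(item) / len(item) for item in per_item]

# Toy, made-up data: each team misjudges a different true item.
ground_truth = [1, 1, 1, 0, 0, 0]
submissions = {
    "team_a": [0.9, 0.8, 0.2, 0.3, 0.1, 0.4],
    "team_b": [0.9, 0.3, 0.8, 0.1, 0.4, 0.2],
    "team_c": [0.2, 0.9, 0.8, 0.4, 0.3, 0.1],
}

individual = {team: auroc(ground_truth, preds) for team, preds in submissions.items()}
community = auroc(ground_truth, aggregate(submissions))

for team, value in individual.items():
    print(f"{team}: {value:.3f}")
print(f"community aggregate: {community:.3f}")
```

In this constructed case the averaged prediction ranks every true item above every false one, while no single team does. That only happens because the teams' mistakes do not coincide; if every team made the same mistake, averaging would not remove it.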
How will these “best
responses” be used? Do you have a past example to share?
PMR: The algorithms of the best performers can be used
to generate new predictions that will be tested experimentally. For example, in
DREAM5 a challenge asked for predictions to determine the affinity of
synthetically generated peptides (peptides are small pieces of proteins) to
antibodies (proteins that rid the body of pathogens). The algorithms from the
best performers were then used to generate and test a second round of peptides
that were predicted to work better together.
In another DREAM5 challenge, a community prediction of
the gene regulatory network of
Staphylococcus aureus was created. It could be used to help find new
antibiotics against this serious bacterial pathogen that can cause infections
such as MRSA (methicillin-resistant Staphylococcus
Who can participate,
GS: Anyone is invited to participate. The more diverse
the community of participants, the better our chance of finding an innovative
methodology. Participants need to register here, and can
choose any (or all) of the four challenges.
This year’s challenges are what we call translational, in the sense that we use basic research
that can be translated into medically relevant knowledge, including areas such
as breast cancer and amyotrophic
lateral sclerosis, commonly referred to as ALS or Lou Gehrig’s disease.
We also have a number of incentives
for challenge participation. For example, in the prediction of progression of Lou
Gehrig’s disease, the non-profit Prize4Life
will award $25,000 to the best performing submission.
For all challenges, an expense-paid
speaking invitation to the DREAM conference (Nov 12-16 in San Francisco) will
be provided to the best performer. This year we are also partnering with the
journals Open Network Biology, Science Translational Medicine
and Nature Biotechnology
publication of the best perfo
* -- Besides
Stolovitzky and Meyer, IBM Research scientists Raquel Norel and Erhan Bilal are
working on the DREAM Project.