IBM Research: BlueSNP scales statistical genetics studies

At the genome level, any two people are about 99.5 percent identical. It is the half-percent difference that may hold the key to understanding many common and rare diseases and give clues about potential new treatments. At millions of places in the human genome, the DNA spelling (the arrangement of the nucleotides adenine, guanine, thymine and cytosine) varies from person to person. Single-letter spelling differences between individuals are called single nucleotide polymorphisms, or SNPs (pronounced "snips").

A new open source tool from IBM Research called BlueSNP lets genetics researchers harness computer clusters to rapidly analyze the SNPs and diseases of vast numbers of people and discover the genetic factors that influence disease predisposition.

The standard statistical genetics method for this type of analysis is a genome-wide association study (GWAS), which involves sifting through more than one million SNPs in hundreds to thousands of people to discover the handful of SNPs that alter the risk of getting a particular disease. The method works by identifying the places in the genome where people who are afflicted by a disease tend to have a certain DNA spelling, while people who are healthy tend to have a different DNA spelling. When a disease-associated SNP is in or near a gene (the genetic instructions to make a protein), it can lead to a hypothesis about how certain nucleotide spelling combinations affect disease risk and could lead to new medicines.
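
To make the case-versus-control comparison concrete, here is a minimal sketch in Python of one common single-SNP test: a chi-square test on a 2x2 table of allele counts. It is an illustration only, not BlueSNP's code. The function name, the argument names, and the threshold constant are assumptions made for this example, and the threshold applies the conventional Bonferroni correction (0.05 divided by roughly one million tested SNPs).

```python
import math

# Assumption for illustration: a Bonferroni-style genome-wide threshold,
# i.e. 0.05 divided by about one million independent SNP tests.
GENOME_WIDE_ALPHA = 0.05 / 1_000_000


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    Rows are cases and controls; columns are the alternate and reference
    allele. Returns (chi-square statistic, p-value).
    """
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_sums = [sum(row) for row in observed]
    col_sums = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_sums)

    # With an empty row or column there is nothing to compare.
    if 0 in row_sums or 0 in col_sums:
        return 0.0, 1.0

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def is_genome_wide_significant(p_value, alpha=GENOME_WIDE_ALPHA):
    """True when the p-value clears the multiple-testing threshold."""
    return p_value < alpha
```

A full GWAS repeats a test of this kind at every SNP, which is why the significance threshold has to be so strict: with a million tests, many SNPs would pass an ordinary 0.05 cutoff by chance alone.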

Translational GWAS -- from laboratory research to health system analytics

Two converging trends are transforming GWAS from a method intended for relatively small-scale research studies (one disease, thousands of people) to an analytic tool that scales to the patient population of an entire health system (thousands of diseases, hundreds of thousands of people).

The first trend concerns identifying patient groups, which is the first step of GWAS. Analytics for electronic medical records now make it possible to efficiently identify groups of patients who satisfy a myriad of inclusion and exclusion criteria. For example, a healthcare analyst can identify the thousands of people across a health system who have diabetes but not many additional health problems, and also identify thousands of people who don't have diabetes but are otherwise similar to the diabetic group (i.e., matched controls).

The second trend is that the cost of DNA sequencing and related genomic data acquisition technologies is rapidly dropping. The DNA sequencing industry is on track to achieve the 2013 goal of a $1,000 genome, about the cost of a dental crown. Measuring just the SNPs, a low-resolution alternative to full DNA sequencing, already costs far less. In genomics, the bottleneck is data analysis, not data generation.

Open Source BlueSNP: Clinical bioinformatics is big-data science

IBM Research’s BlueSNP open source software can help genetics researchers apply the GWAS method to many more people and diseases than previously possible. Based on the R language and Hadoop, it runs standard GWAS calculations on Hadoop clusters. Because BlueSNP is open source, the genetics research community can adapt the code to answer questions that have not yet been asked. BlueSNP makes analyzing thousands of diseases as easy as analyzing one disease, and by using thousands of compute cores, the results come just as fast. When combined with electronic health records data, it can open a pathway to genomics-based personalized medicine.
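
The reason this kind of scaling is possible is that each SNP-by-disease test is independent of every other, so the work splits cleanly into a "map" step over all pairs followed by a simple collection step. The sketch below illustrates that pattern in Python with a thread pool standing in for cluster workers. It is not BlueSNP's R interface or a Hadoop job: the data, the names, and the choice of an allelic chi-square test are all assumptions made for this example.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical toy data. Genotypes are alternate-allele counts (0, 1 or 2)
# per person; phenotypes are 1 (affected) or 0 (unaffected) per person.
GENOTYPES = {
    "snp_a": [2, 2, 1, 2, 0, 0, 1, 0],
    "snp_b": [1, 0, 1, 2, 1, 0, 1, 2],
}
PHENOTYPES = {
    "disease_x": [1, 1, 1, 1, 0, 0, 0, 0],
    "disease_y": [1, 0, 1, 0, 1, 0, 1, 0],
}


def association_task(pair):
    """Map step: test one (disease, SNP) pair independently of all others."""
    disease, snp = pair
    # Rows: cases, controls. Columns: alternate, reference allele counts.
    counts = [[0, 0], [0, 0]]
    for status, alt in zip(PHENOTYPES[disease], GENOTYPES[snp]):
        row = 0 if status == 1 else 1
        counts[row][0] += alt
        counts[row][1] += 2 - alt

    rows = [sum(r) for r in counts]
    cols = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    total = sum(rows)
    if 0 in rows or 0 in cols:
        return disease, snp, 1.0

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (counts[i][j] - expected) ** 2 / expected
    # Upper-tail probability of chi-square with 1 degree of freedom.
    return disease, snp, math.erfc(math.sqrt(chi2 / 2))


def run_all(max_workers=4):
    """Run every (disease, SNP) test and return results sorted by p-value."""
    pairs = list(product(PHENOTYPES, GENOTYPES))
    # Threads stand in for cluster workers here; on a real cluster each
    # task would be scheduled on a separate core or machine.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(association_task, pairs))
    return sorted(results, key=lambda r: r[2])
```

Because no task needs the result of another, adding a disease only adds more independent tasks, and adding workers shortens the wall-clock time roughly in proportion.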

Figure 1. SNPs located at the proximal end of chromosome 4 (arrow) exceed the threshold for genome-wide significance (line), indicating that the DNA in this region of the human genome is associated with the disease under investigation.