A Q&A with IBM’s Dr. Jimeng Sun
Dr. Jimeng Sun joined IBM Research after earning his PhD in healthcare analytics research from Carnegie Mellon University. For the past four years, he has been developing data mining algorithms and systems for healthcare analytic applications. His team, in partnership with Sutter Health and Geisinger Health Systems, recently earned a $2 million grant from the National Institutes of Health to “develop new analytics methods to help predict and identify early signs of risk for heart failure.”
Sun’s analytics research will help develop accurate and robust predictive models for the early detection of heart failure using Electronic Health Records (EHR) data.
How did your team access EHR data in order to test this analytics technology?
It took more than six months to get access to the real EHR data needed to build our models. Beyond the legal agreement, we also implemented a set of elaborate procedures for receiving and accessing EHR data in order to guard the security and privacy of actual patient data.
We also created a course on Protected Health Information (PHI), which we require of all researchers and developers who work with such data. It’s an issue we have to take seriously.
Medical text is also incorporated into this analytics technology. Is this the Watson technology? If not, how is it different?
We do utilize Unstructured Information Management Architecture (UIMA) to extract the known signs and symptoms of heart failure from available text. A similar Natural Language Processing (NLP) technology is also used in Watson; the difference is in how we use the data after extraction.
We use those extractions as features, along with many other features from structured information, to feed the subsequent analysis.
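As a rough illustration of how text extractions and structured fields might be merged into one feature vector, here is a minimal Python sketch. The symptom list, the field names, and the substring matching are all hypothetical stand-ins chosen for this example; the extraction described above runs on UIMA, and a naive match like this one would not handle negation ("denies ankle edema").

```python
# Hypothetical symptom phrases (a few Framingham-style symptoms).
# A real pipeline would use UIMA annotators rather than substring matching.
SYMPTOM_TERMS = {
    "ankle_edema": "ankle edema",
    "night_cough": "night cough",
    "dyspnea_on_exertion": "dyspnea on exertion",
}


def text_features(note: str) -> dict:
    """Flag each symptom phrase that appears in a clinical note."""
    lowered = note.lower()
    return {
        f"text:{name}": int(phrase in lowered)
        for name, phrase in SYMPTOM_TERMS.items()
    }


def build_feature_vector(note: str, structured: dict) -> dict:
    """Merge text-derived flags with structured EHR fields."""
    features = {f"struct:{key}": value for key, value in structured.items()}
    features.update(text_features(note))
    return features


# Illustrative (made-up) record, not real patient data.
example = build_feature_vector(
    "Patient reports ankle edema and dyspnea on exertion.",
    {"hypertension": 1, "diabetes": 0},
)
```

Prefixing the keys with `text:` and `struct:` keeps the two sources distinguishable once they sit in the same vector.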
What kind of data from the EHRs indicated a higher risk of heart disease?
For this specific dataset from Geisinger Health Systems, we have information on more than 30,000 patients, about 5,000 of whom are confirmed heart failure patients. The other patients in the dataset are controls.
These patient records account for more than 10 years of longitudinal data. While some patient records have more information than others, it is a very impressive and comprehensive dataset for studying heart failure.
The challenge in differentiating heart failure patients from the controls, prior to diagnosis, is that there is no single strong indicator. But there are many weak indicators: co-morbidities such as hypertension and diabetes, associated medications, and Framingham heart failure symptoms that we can extract from text. The hope is that by combining a large number of weak indicators we can still develop an accurate and robust predictive model.
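To see why many weak indicators can add up to a useful signal, consider a minimal sketch that combines them by adding log-odds. This is a naive Bayes-style illustration, not the model used in the project. The likelihood ratio of 1.5 per indicator is an invented number, and the independence assumption is unlikely to hold for real co-morbidities. The starting probability is simply the case share of the dataset (about 5,000 of 30,000), which is not a population prevalence.

```python
import math


def combined_risk(prior_prob: float, likelihood_ratios: list) -> float:
    """Posterior probability after applying each indicator's likelihood ratio.

    Assumes the indicators are conditionally independent, which is a
    simplification made only for this illustration.
    """
    log_odds = math.log(prior_prob / (1.0 - prior_prob))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Case share in the dataset described above: roughly 5,000 of 30,000.
prior = 5000 / 30000

# Hypothetical weak indicators, each with an assumed likelihood ratio of 1.5.
one_indicator = combined_risk(prior, [1.5])
four_indicators = combined_risk(prior, [1.5, 1.5, 1.5, 1.5])
```

Under these assumed numbers, a single indicator moves the estimate only modestly, while four together push it past one half.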
What kind of IT infrastructure does this analytics technology need?
To facilitate and speed up model development, we use a Hadoop cluster to manage and schedule tens of thousands of models in parallel. We are able to reduce the time for large-scale model building, which typically takes nine days on a single computer, to just three hours on a small cluster by developing the predictive models in Hadoop.
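The figures quoted above imply a speedup of roughly 72 times, which is plausible when the models are independent of one another and can be built side by side. The sketch below works through that arithmetic and shows one simple way independent jobs could be spread across workers. The job count of 20,000 and the round-robin assignment are assumptions for illustration only; Hadoop's own scheduler works differently.

```python
SINGLE_MACHINE_HOURS = 9 * 24  # nine days on a single computer
CLUSTER_HOURS = 3              # three hours on the cluster

# Speedup implied by the two figures quoted in the interview.
speedup = SINGLE_MACHINE_HOURS / CLUSTER_HOURS


def assign_round_robin(n_models: int, n_workers: int) -> list:
    """Spread independent model-building jobs evenly across workers."""
    buckets = [[] for _ in range(n_workers)]
    for model_id in range(n_models):
        buckets[model_id % n_workers].append(model_id)
    return buckets


# Hypothetical workload: 20,000 models over as many worker slots as the
# implied speedup, assuming ideal linear scaling.
buckets = assign_round_robin(n_models=20000, n_workers=int(speedup))
```

With ideal scaling, each worker slot ends up with about 278 models, which is why the wall-clock time can fall from days to hours.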
What would a doctor – who is using this technology versus what is available today – see in a patient’s EHR that was not there before?
With new patients, the data collection has to start from scratch. But overall, as patients stay with the same doctor or clinic over a long time (which is the case with Geisinger’s dataset), we will begin to know more and more about those patients and can use that information for prediction.
The validation in an operational clinical setting is the next step after the current project. If a patient has a high risk of developing heart failure based on our predictive model, the system will alert doctors with the risk level and the associated risk factors derived from similar patients in the past.
The NIH grant is for the next three years. What are the next steps in the partnership with Sutter Health and Geisinger Health Systems to test these predictive methods for heart failure?
We hope to conduct a subsequent clinical trial of the resulting predictive model. Through the trial, we want to show whether using the predictive model to support clinical decision making for a randomly selected group of patients works better than current clinical practice. Besides unstructured text and other structured medical information, we will also look into other data sources, such as electrocardiography (ECG) and genomic data.