Math vs. Massive Data Overload

This year, digital information will grow to 988 exabytes, the equivalent of a stack of books stretching from the sun to Pluto and back.

Sure, lots of data is great for predicting the future and producing models, but how do you know if the data is any good? IBM scientists have developed an algorithm that can tell you.

Data, data everywhere

Much of this data is gathered by sensors, actuators, RFID tags, and GPS tracking devices, which measure everything from ocean-water pollution to traffic patterns to food supply chains.
But the question remains: how do you know whether the data is good, rather than riddled with errors or anomalies, or generated by a busted sensor? For example, if a scientist attempts to predict climate change based on a broken sensor that is off by 25 degrees for an entire year, the model is going to reflect that error. As the saying goes, "garbage in, garbage out."
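To make "garbage in, garbage out" concrete, here is a minimal, purely illustrative sketch in Python. The temperature series is invented; only the 25-degree offset comes from the example above. A constant sensor bias passes straight through to any statistic or model built on the readings:

```python
# Illustrative only: an invented daily temperature series showing how a
# faulty sensor with a constant +25 degree offset distorts a yearly average.
import numpy as np

rng = np.random.default_rng(seed=0)
days = 365
true_temps = 15 + 10 * np.sin(2 * np.pi * np.arange(days) / days) + rng.normal(0, 2, days)

good_sensor = true_temps            # accurate readings
broken_sensor = true_temps + 25.0   # stuck with a +25 degree bias all year

print(f"true annual mean:   {good_sensor.mean():5.1f} degrees")
print(f"biased annual mean: {broken_sensor.mean():5.1f} degrees")
# The 25-degree error carries straight through to the model: garbage in, garbage out.
```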

“In a world with already one billion transistors per human and growing daily, data is exploding at an unprecedented pace,” said Dr. Alessandro Curioni, manager of the Computational Sciences team at IBM Research – Zurich. “Analyzing these vast volumes of continuously accumulating data is a huge computational challenge in numerous applications of science, engineering and business.”

Lines of efficiency

To solve this challenge, IBM scientists in Zurich have patented a mathematical algorithm of fewer than 1,000 lines of code that reduces the computational complexity, cost, and energy usage of analyzing the quality of massive amounts of data by two orders of magnitude.
To confirm their method, the scientists validated nine terabytes of data, that is, nine million million bytes (a number with 12 zeros), on the fourth-largest supercomputer in the world, a Blue Gene/P system at Forschungszentrum Jülich in Germany.
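The article does not spell out the patented algorithm itself, but it belongs to the broad family of randomized methods that replace an exact, expensive matrix computation with a handful of cheap matrix-vector products. The sketch below shows one well-known member of that family, a Hutchinson-style stochastic trace estimator, purely to illustrate where such savings come from; the matrix and sample count are invented, and this is not IBM's implementation:

```python
# A minimal sketch of a Hutchinson-style randomized trace estimator.
# This is NOT the patented IBM algorithm (the article does not detail it);
# it only illustrates how random sampling can replace an expensive exact
# computation with a few matrix-vector products.
import numpy as np

def hutchinson_trace(matvec, n, num_samples=30, rng=None):
    """Estimate trace(A) using only products A @ v with random +/-1 vectors."""
    if rng is None:
        rng = np.random.default_rng()
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        total += v @ matvec(v)                # E[v^T A v] equals trace(A)
    return total / num_samples

# Demo on a random covariance-like matrix standing in for one built from sensor data.
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 200))
A = X @ X.T / 200

exact = np.trace(A)
approx = hutchinson_trace(lambda v: A @ v, n, num_samples=50, rng=rng)
print(f"exact trace:     {exact:.1f}")
print(f"estimated trace: {approx:.1f}")
```

The appeal of such estimators is that they never need the full matrix explicitly, only the ability to multiply it by a vector, which is what makes them a natural fit for massively parallel machines.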

The result: what would normally have taken a day was crunched in 20 minutes. In terms of energy savings, the JuGene supercomputer at Forschungszentrum Jülich requires about 52,800 kWh for one day of operation on the full machine, while the IBM demonstration required an estimated 700 kWh, only about 1 percent of what was previously needed.
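Those headline figures are easy to sanity-check with back-of-the-envelope arithmetic:

```python
# Quick check of the figures quoted above.
full_day_kwh = 52_800   # JuGene, one day of operation on the full machine
demo_kwh = 700          # IBM demonstration (estimated)

print(f"energy fraction: {demo_kwh / full_day_kwh:.1%}")   # ~1.3%, i.e. roughly 1 percent
print(f"time speedup:    {24 * 60 / 20:.0f}x")             # a day vs. 20 minutes, about 72x
```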

“Determining how typical or how statistically relevant the data is helps us to measure the quality of the overall analysis and reveals flaws in the model or hidden relations in the data,” explains Dr. Costas Bekas of IBM Research – Zurich. “Efficient analysis of huge data sets requires the development of a new generation of mathematical techniques that target reducing computational complexity while, at the same time, allowing for efficient deployment on modern massively parallel resources.”
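The article does not say which statistical measures the method computes. As a rough illustration of what "how typical the data is" can mean in practice, the sketch below flags a reading whose z-score, its distance from the mean in standard deviations, makes it stand out; the numbers are invented:

```python
# A simple, textbook proxy for "how typical" a reading is: the z-score.
# This is only meant to make the idea of statistical relevance concrete;
# it is not the measure used by the IBM method.
import numpy as np

readings = np.array([14.8, 15.2, 15.1, 14.9, 40.1, 15.0])  # one suspicious value

z = (readings - readings.mean()) / readings.std()
for value, score in zip(readings, z):
    flag = "  <-- atypical" if abs(score) > 2 else ""
    print(f"{value:5.1f}  z = {score:+.2f}{flag}")
```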