Exabytes, Elephants, Objects and Spark

Michael Factor, IBM Fellow
Editor’s note: This article is by Michael Factor, IBM Fellow, Storage and Systems

Information. Data. It is clichéd to talk about its rate of growth. Not long ago, a terabyte was a lot of information. Now we talk freely about petabytes and even exabytes. While not many organizations are at an exabyte yet, having an exabyte of data will become more and more common.

But what is an exabyte? The simple answer is 1018 bytes. But this is still tough to conceptualize. So let’s put it in some less conventional units. Assuming we had an exabyte of data stored on 6TB disks (the most-commonly available large disks), if we piled these disks one on top of the other, the pile would be almost 6 times the height of the Burj Khalifa, the world’s tallest building. And these drives would weigh almost as much as 16 full-grown elephants.

No need for tall buildings or elephants, just object storage

This is a lot of data – and of course, actually storing and managing this data requires more resources than just an exabyte of disk capacity. So, we work with object stores, a technology that stores data as objects (versus, say, files) – which is  cost-effective for massive amounts of data. And it’s why IBM recently acquired Cleversafe, to efficiently, securely, and scalably store and manage petabytes and exabytes of data in an object store. Other mechanisms for storing data, such as block storage systems or relational databases, will not handle exabytes. They don’t even handle petabytes; and if they could, they would be prohibitively expensive.

Simply storing the data is necessary, but not sufficient. We need to be able to use the data. One element of ensuring the data can be used is to have an open interface to access the data. This helps ensure that businesses can use the cloud services they want and are not limited to a specific proprietary provider. It also means there will be broad investment in ensuring various tools work efficiently with the data. One such interface is that provided by OpenStack Swift, an interface IBM is committed to both for our products and for our cloud services such as the Bluemix object store.

It is not possible to manually process even terabytes of information, let alone petabytes or exabytes. The only way one can make sense out of these quantities of data is through analytics, machine learning, and cognitive computing. Apache Spark is rapidly growing as the bulk data analytics platform of choice and is particularly amenable for use with machine learning and cognitive algorithms.

Given the importance of Apache Spark, we needed a fast connector that would allow Apache Spark to efficiently process data in an object store. And now scientists at IBM Research – Haifa have built just that. Called Stocator, the connector allows Spark to efficiently ‘talk’ to an object store.

Building a fast bridge to the cloud

The developers of Apache Spark very wisely built on existing code to interact with external data stores. Specifically, they made it possible to connect with open source framework, Hadoop’s HDFS layer, which provides access to stored data. This gave them immediate access to a very large number of data stores. But it also meant the data access path carries with it a lot of extra code that object stores simply don’t need.

For example, when Hadoop creates the file /John/Photos/elephant.jpg, it needs to create separate unique objects: one for John, one for Photos, and one for elephant.jpg. But in an object store, we can use just a single object to represent /John/Photos/elephant.jpg.

By using Spark with Stocator, we can reduce the number of calls Spark makes to the object store by an order of magnitude! The result is faster performance and the need for fewer resources.

Stocator is object store aware, and thus has a very different architecture than the existing Hadoop drivers for object stores. Hadoop historically started using a file system to store data. In contrast, our new connector leverages object store specific features, like creating only one object for John/Photos/elephant.jpg. The impact of these changes is significant. For instance, on one application, the number of calls to the object store was reduced from 190 to 22.

Although we developed and tested this connector using the Swift API as the interface to the object store, the code can be extended to support other object stores. Even though this new connector cannot replace the Hadoop derived code in all cases, it can easily be applied to other open source projects such as Apache Flume or Secor.

We opened our code and encourage your use of it and feedback. The code is now available online. There is also a technical blog that goes into more detail on the new connector and describes how to use it with Spark. 

Labels: , , , , , , ,