IBM Research: Reflections en route from the OpenStack Summit

As I sat on the plane during the return trip from the OpenStack Summit in Tokyo, I couldn’t help but reflect on the tremendous value of open source. One might ask: Why is IBM so interested in open source?

It’s not just about OpenStack and its various projects such as Nova, Cinder, or Swift. The broad world of open source offers many benefits, and a number of its projects, OpenStack among them, are being integrated with and used in IBM’s cloud. Examples include Spark, Docker, Kafka, Elasticsearch, and Parquet. Each of these projects brings value on its own, not least by enabling consistency and choice as to where a workload runs, since they can all be deployed in local, dedicated, and public clouds. But the really big value comes from combining projects to address real-world problems. It’s a definite case of the whole being greater than the sum of its parts. In short, 1+1=3.

We can look at all of these services as puzzle pieces that, when put together, solve real-world problems. In an ideal world, the different open source projects we need would all be available as services on public, private, and dedicated clouds, and integrating those services would take no more than a point and a click.

For example, we should be able to take a message bus like Kafka, and use a simple configuration command with no coding to have it archive the messages in an object store service. Once in that service, another point-and-click could pull the data into an analytics engine (like Apache Spark).

The best (and worst) of open source times

It is the best of times for developers because of the wide array of open source puzzle pieces out there that can be used to build a solution. It is also the worst of times, since work is still needed to make these pieces snap together easily, and too often that work falls on the developer.

But my colleagues at IBM’s Research lab in Haifa and I are working on integrating several of these projects – so we really can snap together those puzzle pieces!

For example, my colleague Gil Vernik, a cloud storage, security and analytics expert in the lab, is enabling Tachyon to use Swift as its persistent underlying storage system. Tachyon is a very active, but relatively new, open source project that provides an in-memory file system with automated tiering. Although Tachyon is general purpose, one of its best-known use cases is improving Spark’s performance over conventional stores. And although it is in-memory, it needs some place to persist data that is no longer being used, or that no longer fits in the memory of its higher-performance tiers.
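To make the idea of an in-memory tier backed by a persistent store concrete, here is a minimal sketch. It is not Tachyon's code or API: the `TieredStore` class, the least-recently-used eviction rule, and the plain dict standing in for Swift are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Dict


class TieredStore:
    """Hypothetical two-tier store: a small memory tier over an under store."""

    def __init__(self, under_store: Dict[str, bytes], memory_capacity: int = 2):
        self.memory: "OrderedDict[str, bytes]" = OrderedDict()
        self.under_store = under_store          # stand-in for Swift, HDFS, S3
        self.memory_capacity = memory_capacity  # max entries held in memory

    def write(self, key: str, data: bytes) -> None:
        self.memory[key] = data
        self.memory.move_to_end(key)            # mark as most recently used
        self._evict_if_needed()

    def read(self, key: str) -> bytes:
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        data = self.under_store[key]            # raises KeyError if unknown
        self.write(key, data)                   # promote back into memory
        return data

    def _evict_if_needed(self) -> None:
        # Persist the least recently used entries once memory is full.
        while len(self.memory) > self.memory_capacity:
            old_key, old_data = self.memory.popitem(last=False)
            self.under_store[old_key] = old_data


under: Dict[str, bytes] = {}
tiers = TieredStore(under, memory_capacity=2)
for name, payload in [("a", b"1"), ("b", b"2"), ("c", b"3")]:
    tiers.write(name, payload)
value = tiers.read("a")  # "a" was evicted, so it is read from the under store
```

In this sketch the under store only ever receives data on eviction; a real system would also need to decide when to persist data that is still in memory.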

When Gil first started this work, the main choices for persisting data were HDFS or Amazon S3 (other options exist today). Both are good solutions, but each has its limitations: HDFS was not designed as a long-term, multi-tenant store, and Amazon’s S3 is available only as part of a public cloud. With the added support for OpenStack Swift, there is now a multi-tenant object store under Tachyon that can provide a long-term, cost-effective persistent store.

And Guy Hadash, another colleague on our cloud team, is developing a solution to aggregate and store messages from the Kafka message bus in OpenStack Swift. Pinterest, the visual bookmarking social media site, had a small project called Secor that knows how to subscribe to Kafka, aggregate messages, and then put them in Amazon's S3. The messages are stored sequentially in an object, which can later be retrieved for batch processing.
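The aggregation step can be pictured with a small sketch. This is not Secor's code: `MessageBatcher`, `max_messages`, the object naming scheme, and the in-memory `store` dict are hypothetical stand-ins for a Kafka consumer and an object store, used only to show messages being written sequentially into one object per batch.

```python
from typing import Dict, List


class MessageBatcher:
    """Collects messages and writes each full batch as a single object."""

    def __init__(self, store: Dict[str, bytes], topic: str, max_messages: int = 3):
        self.store = store            # stand-in for an object store (Swift, S3)
        self.topic = topic
        self.max_messages = max_messages
        self.pending: List[bytes] = []
        self.first_offset = 0         # offset of the first pending message
        self.next_offset = 0          # offset the next message will receive

    def add(self, message: bytes) -> None:
        if not self.pending:
            self.first_offset = self.next_offset
        self.pending.append(message)
        self.next_offset += 1
        if len(self.pending) >= self.max_messages:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        # Name the object after the topic and the offset of its first message.
        name = f"{self.topic}/{self.first_offset:010d}"
        self.store[name] = b"\n".join(self.pending) + b"\n"
        self.pending = []


store: Dict[str, bytes] = {}
batcher = MessageBatcher(store, topic="sensors", max_messages=3)
for i in range(7):
    batcher.add(f"reading-{i}".encode())
batcher.flush()  # write the final, partially filled batch
```

A batch job can later fetch one object and split it on newlines to recover the messages in their original order.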

First, Guy extended Secor so it could store messages via the Swift API. This gives Secor users the choice of using Kafka-Secor-Swift in all three deployment models: local, dedicated, and public. He then stored the data in Parquet format, not just as a list of messages. Parquet is designed for objects or files that contain tabular data (the kind of data often kept in CSV files). This data is stored in a columnar format, enabling efficient compression and the retrieval of only selected columns from the store. Lastly, Guy annotated the created objects with metadata, such as the minimum or maximum values of various columns in the table. We have already implemented prototypes of all three steps. A patch for the first step is already part of the community Secor code, and we are starting to work with the community on the second.
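The second and third steps can be illustrated in a few lines of plain Python. This sketch does not use Parquet or Secor; the function names and sample rows are made up, and skipping objects based on their min/max metadata is only one plausible use of such annotations, not something described above.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def to_columns(rows: List[Row]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column.

    Assumes every row has the same keys.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def min_max_metadata(
    columns: Dict[str, List[Any]], names: List[str]
) -> Dict[str, Tuple[Any, Any]]:
    """Compute (min, max) for the chosen columns, to attach to the object."""
    return {name: (min(columns[name]), max(columns[name])) for name in names}


def may_contain(
    metadata: Dict[str, Tuple[Any, Any]], column: str, low: Any, high: Any
) -> bool:
    """Return False only if no value of `column` can fall within [low, high]."""
    col_min, col_max = metadata[column]
    return col_max >= low and col_min <= high


rows = [
    {"sensor": "a", "speed": 42.0},
    {"sensor": "b", "speed": 55.5},
    {"sensor": "a", "speed": 38.0},
]
columns = to_columns(rows)
speeds = columns["speed"]  # one column can be read without touching the others
metadata = min_max_metadata(columns, ["speed"])
```

With metadata like this, a query for speeds above 60 could skip the object entirely, because `may_contain(metadata, "speed", 60, 100)` is false for it.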

Then there’s what we are doing with our partners in the context of the EU COSMOS project. Paula Ta-Shma, a cloud security and analytics expert on our team, presented this work at the recent Spark Summit in Amsterdam. The goal of this work is to help improve the timeliness of Madrid’s city buses by automating their reactions to changes in traffic.

Data is available from thousands of static sensors. Using Node-RED (an open source project for defining data flows), the data is retrieved from the sensors and put on Kafka. As described above, Secor aggregates, formats, and annotates this data before storing it in Swift. Spark is then used to analyze the data. Using standard Spark machine learning, the solution defines the expected values for speed and traffic density on different days and times. A complex event processing engine, which also subscribes to the messages from the sensors, takes the resulting threshold values and can issue warnings that trigger corrective action, such as rerouting a bus or changing the traffic light pattern, when needed.
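The last two stages, learning expected values and checking live readings against them, can be sketched as follows. This is a deliberately simplified stand-in: it uses a mean-minus-k-standard-deviations rule in plain Python rather than Spark machine learning or a real complex event processing engine, and the function names, the `k` parameter, and the sample data are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Optional, Tuple

Slot = Tuple[str, int]  # (weekday, hour of day)


def learn_thresholds(
    history: Iterable[Tuple[str, int, float]], k: float = 2.0
) -> Dict[Slot, float]:
    """Return, per (weekday, hour), the lowest speed still considered normal."""
    by_slot: Dict[Slot, List[float]] = defaultdict(list)
    for weekday, hour, speed in history:
        by_slot[(weekday, hour)].append(speed)
    return {
        slot: mean(speeds) - k * pstdev(speeds)
        for slot, speeds in by_slot.items()
    }


def check_reading(
    thresholds: Dict[Slot, float], weekday: str, hour: int, speed: float
) -> Optional[str]:
    """Return a warning if the reading is below the learned threshold."""
    limit = thresholds.get((weekday, hour))
    if limit is not None and speed < limit:
        return f"slow traffic on {weekday} at {hour}:00: {speed} < {limit:.1f}"
    return None


history = [("Mon", 8, 30.0), ("Mon", 8, 34.0), ("Mon", 8, 32.0)]
thresholds = learn_thresholds(history)
warning = check_reading(thresholds, "Mon", 8, 20.0)  # well below normal
normal = check_reading(thresholds, "Mon", 8, 31.0)   # within normal range
```

A warning string returned by `check_reading` plays the role of the event that would trigger corrective action; time slots with no history simply produce no warning.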

These examples show the tremendous value in putting puzzle pieces together from different open source projects to build something even better. We are continuing to work on integrating services and look forward to describing some of our other efforts in future blog posts.