IBM Research: Deep dive insights into Swift

Editor’s note: This blog entry is by IBM Fellow Michael Factor and IBM storage research engineer Dmitry Sotnikov from IBM Research – Haifa. The work was done together with Yaron Weinsberg from IBM Research – Haifa.

When companies deploy a system, they have specific performance, durability, and cost objectives in mind. So, before physical deployment, they run models and simulations to get a close approximation of how the system will meet those objectives. But after deployment, the system must still be checked against them. This holds true even for open source systems, which have become more and more popular because of the flexibility they offer in interchangeable hardware and software. Our team at IBM Research – Haifa, building on open source tools, has been gaining experience and developing techniques to observe and monitor one of the most popular options, OpenStack Swift. It's the leading open source object storage system that runs in public and private clouds.

Let’s dive into Swift and what we learned from the data collected during monitoring.

Dmitry Sotnikov, IBM storage engineer

Increasing system complexity means increased monitoring complexity, since huge amounts of data need to be analyzed to find out what’s really going on. That’s where our work comes in. Our methodology uses an open source toolbox and makes it possible to understand the behavior of Swift clusters by examining the data collected during monitoring.

Today, performance monitoring and troubleshooting of a running cloud-based object storage system is as much an art as a science. Although there is a plethora of open source monitoring tools to gather system metrics, the real challenge is how to use them to find the root cause of a problem.

We developed a general, open-source-based, step-by-step methodology to understand performance bottlenecks in a Swift system. Our solution uses standard tools including Logstash, collectd, StatsD, Elasticsearch, Kibana and Graphite. It also includes an additional simple Swift middleware we developed to gain further insights into the source of system bottlenecks.
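To give a feel for what a simple middleware of this kind can look like, here is a minimal sketch of a generic WSGI timing filter, since Swift middleware is built on WSGI. This is not the middleware we developed: the class name `TimingMiddleware`, the `report` callback and its arguments are all hypothetical, chosen only for illustration. In a real deployment the callback might forward timings to a metrics collector such as StatsD.

```python
import time


class TimingMiddleware:
    """Hypothetical WSGI filter that times each request (illustration only)."""

    def __init__(self, app, report=None):
        self.app = app
        # `report` is an assumed callback: report(method, path, seconds).
        self.report = report or (lambda method, path, seconds: None)

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            # Note: this measures the time until the wrapped app returns its
            # response iterable, not the time to stream the full body.
            return self.app(environ, start_response)
        finally:
            elapsed = time.monotonic() - start
            self.report(
                environ.get("REQUEST_METHOD", ""),
                environ.get("PATH_INFO", ""),
                elapsed,
            )


def filter_factory(global_conf, **local_conf):
    """Paste Deploy style factory, the usual entry point for WSGI filters."""
    def factory(app):
        return TimingMiddleware(app)
    return factory
```

Wrapping a service this way keeps the measurement separate from the storage logic, so it can be added to or removed from a pipeline without touching the rest of the system.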

Our methodology helps validate the correctness of a Swift cluster configuration, and identifies which important data should be presented in the visualization of the Swift parameters together with the system’s statistics. This visualization helps users gain a better understanding of the internals of Swift and its behavior; it also enables them to see potential problems and misconfigurations.

For example, if Swift’s network configuration needs to be validated, for instance to understand unexpectedly low performance, this can be done using our methodology and the open source toolkit on which it is based. This can be seen in the following charts, which present the network utilization between the Proxy and the Object servers, as well as the Proxy’s public network utilization for a write-only workload. Based on these charts, one can easily validate that all the data received by the Proxy is replicated three times and sent to the Object servers.
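The same check can be expressed as simple arithmetic: for a write-only workload with three replicas, the bytes the Proxy sends to the Object servers should be roughly three times the bytes it receives on its public network. The sketch below assumes the two byte counts have already been read from the monitoring data. The function names, the 10% tolerance and the sample numbers are illustrative assumptions, not measurements from our cluster.

```python
def replication_ratio(public_in_bytes, backend_out_bytes):
    """Bytes sent to object servers divided by bytes received from clients."""
    if public_in_bytes <= 0:
        raise ValueError("public_in_bytes must be positive")
    return backend_out_bytes / public_in_bytes


def is_consistent_with_replicas(public_in_bytes, backend_out_bytes,
                                replicas=3, tolerance=0.1):
    """True if the observed ratio is within `tolerance` (relative) of `replicas`.

    The tolerance is an assumed allowance for protocol overhead and
    sampling noise; choose a value that suits your own measurements.
    """
    ratio = replication_ratio(public_in_bytes, backend_out_bytes)
    return abs(ratio - replicas) <= tolerance * replicas


# Made-up sample values: 100 MB received by the proxy, 305 MB sent onward.
print(is_consistent_with_replicas(100e6, 305e6))
```

A ratio well below three on a write-only workload would suggest that some traffic is taking a different network path than expected, which is the kind of misconfiguration the charts make visible.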

At the OpenStack summit on May 20 in Vancouver, we will be demonstrating the results obtained from our approach, used with an internal deployment of Swift. You are welcome to join us at 1:50 pm, room 109.