Vasanth Bala, Manager of
Scalable Datacenter Analytics
Editor’s note: This article is by Vasanth Bala, a staff
scientist at IBM’s Thomas J. Watson Research Center.
It's inevitable. Servers crash. Applications misbehave. Even
if you troubleshoot and figure out the problem, the diagnosis will likely
involve numerous investigative actions to examine the configurations of one
or more systems, and those actions are difficult to describe in any
meaningful way. Every time you encounter a similar problem, you could end up
repeating the same complex process of diagnosis and remediation.
I deal with just such scenarios in my role as manager of the
Scalable Datacenter Analytics Department at IBM Research, and my team and I
realized we needed a way to “fingerprint” known bad configuration states of
systems. This way, we could reduce problem diagnosis time by relying on
fingerprint recognition techniques to narrow the search space.
Olive’s role
CMU and other academic organizations now manage the virtual
image library. For their work, CMU’s Mahadev Satyanarayanan
and Gloriana St. Clair
received a two-year grant from the Sloan Foundation “to support the technical
development of a platform for archiving executable content and the
environment in which it runs, as well as a plan for the institutionalization
and ongoing sustainability of work for such an archive.”
Project Origami was born from this desire to develop an
easier-to-use system for diagnosing misconfiguration problems in the data
center. Origami, today a collaboration among IBM Open Collaborative Research,
Carnegie Mellon University, the University of Toronto, and the University of
California at San Diego, is a collection of tools for fingerprinting,
discovering, and mining configuration information on a data center-wide
scale. It uses Olive, a public-domain virtual image library that began as an
idea created under this Open Collaborative Research program a few years ago.
Origami also gives users an ad hoc interface: there is no
rule language for them to learn. Instead, users give Origami an example
of what they deem to be a bad configuration, which Origami fingerprints and
adds to its knowledge base. Origami then continuously crawls systems in the
data center, monitoring the environment for configuration patterns that match
known bad fingerprints in its knowledge base. A match triggers deeper analytics
that then examine those systems for problematic configuration settings.
How Origami works
Together with Carnegie Mellon University and the University
of Toronto, we developed agent-less system crawlers that are able to
continuously scan the configuration state of virtual servers – without
requiring any scanning agents to be installed inside them. Think of these
crawlers as analogous to web crawlers that silently and non-intrusively scan
the contents of web documents to build a central index that can then be searched
or mined for insight.
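
To make the web-crawler analogy concrete, here is a minimal sketch of a central index built from crawled configuration snapshots. It is not Origami's actual data model: the `build_index` function, the system names, and the configuration keys are all hypothetical, and each snapshot is assumed to be a flat dictionary of settings.

```python
from collections import defaultdict

def build_index(snapshots):
    """Map each (key, value) setting to the set of systems that have it.

    `snapshots` is assumed to be {system_name: {setting_key: value}}.
    """
    index = defaultdict(set)
    for system, config in snapshots.items():
        for key, value in config.items():
            index[(key, value)].add(system)
    return index

# Hypothetical crawl results for three virtual servers.
snapshots = {
    "vm-a": {"sshd.PermitRootLogin": "no", "port.8080": "closed"},
    "vm-b": {"sshd.PermitRootLogin": "yes", "port.8080": "closed"},
    "vm-c": {"sshd.PermitRootLogin": "yes", "port.8080": "open"},
}

index = build_index(snapshots)

# One lookup answers "which systems have this setting?" without
# touching the systems themselves again.
root_login_enabled = sorted(index[("sshd.PermitRootLogin", "yes")])
print(root_login_enabled)
```

Like a web search index, the structure is inverted: the query starts from a setting and returns the systems that carry it.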
This crawling approach improves both usability and security:
there is no scanning agent to install and maintain on tens of thousands of
systems, and there is no agent inside those systems for malware to attack.
We are now developing advanced fingerprinting technologies
that use a concept called “search by example,” where the user provides an
example of a problematic configuration, rather than using a complex rule
language to declaratively define the details of the problem.
An example for “search by example” can also be created by
first crawling a system, then making a change to it that represents a
configuration adjustment, then re-crawling the system, and finally asking
Origami to compute the difference between the two crawled states. This
technique allows users to provide arbitrary system changes as examples.
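
The crawl, change, re-crawl, and diff sequence can be sketched as follows. This is an illustration under simple assumptions, not Origami's implementation: each crawled state is assumed to be a flat dictionary, and `diff_states` and the setting names are hypothetical.

```python
def diff_states(before, after):
    """Return the settings added, removed, or changed between two crawls."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical crawled states of one system, before and after a change.
before = {"port.8080": "closed", "pkg.openssl": "1.0.1", "tmp.cleanup": "daily"}
after = {"port.8080": "open", "pkg.openssl": "1.0.1", "svc.debugd": "enabled"}

example = diff_states(before, after)
print(example)
```

The resulting difference, rather than either full snapshot, is what would serve as the user's example of a problematic change.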
What’s happening inside Origami during all of these
processes? It internally computes a fingerprint of the example and stores it in
a fingerprint knowledge base. A fingerprint is a collection of hashes that
summarize different dimensions of the configuration data for very fast
recognition. Various heuristics then adjust the relative weights of different
features comprising the fingerprint so that important features (e.g. a network
port being opened) are distinguished from less important ones (e.g. a log file
being modified). These heuristics lower false alarms so bad configuration
patterns can be distinguished from very similar patterns that are actually
benign.
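
As a rough illustration of weighted fingerprint matching, the sketch below hashes each configuration feature, groups the hashes by dimension, and scores a candidate by the weighted share of a known-bad fingerprint it contains. The dimensions, the fixed weights, and the `fingerprint` and `match_score` functions are assumptions made for this example; Origami's actual hashing scheme and weighting heuristics are not described here.

```python
import hashlib

# Hypothetical per-dimension weights: an opened network port counts
# far more than a modified log file.
WEIGHTS = {"network": 5.0, "logfile": 0.1}

def fingerprint(features):
    """Hash each feature and group the hashes by dimension.

    `features` is assumed to be {(dimension, key): value}.
    """
    fp = {}
    for (dimension, key), value in features.items():
        digest = hashlib.sha256(f"{dimension}:{key}={value}".encode()).hexdigest()
        fp.setdefault(dimension, set()).add(digest)
    return fp

def match_score(known_bad, candidate, weights=WEIGHTS):
    """Weighted fraction of the known-bad hashes present in the candidate."""
    matched = total = 0.0
    for dimension, hashes in known_bad.items():
        w = weights.get(dimension, 1.0)
        total += w * len(hashes)
        matched += w * len(hashes & candidate.get(dimension, set()))
    return matched / total if total else 0.0

bad = fingerprint({("network", "port.8080"): "open",
                   ("logfile", "/var/log/app.log"): "rev-17"})
# Same open port, different log contents: should still look bad.
risky = fingerprint({("network", "port.8080"): "open",
                     ("logfile", "/var/log/app.log"): "rev-42"})
# Same log contents, but the port is closed: should look benign.
benign = fingerprint({("network", "port.8080"): "closed",
                      ("logfile", "/var/log/app.log"): "rev-17"})

risky_score = match_score(bad, risky)
benign_score = match_score(bad, benign)
print(round(risky_score, 3), round(benign_score, 3))
```

Because the network dimension dominates the score, a system that shares only the log file state with the bad example scores low, which is the kind of false alarm the weighting is meant to suppress.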
What’s next for Origami?
The overriding question for us is how Olive and Origami
together can lead to the production of commercially viable technologies. The
above-mentioned problem diagnosis of misconfiguration-related outages is one
clear use for the search-by-example technology that we have developed with our
Open Collaborative Research (OCR) partners.
Another technology using Origami, under development with the
University of California at San Diego, would mine many identically configured
systems in the data center to automatically learn to distinguish patterns
that tend to produce problems from those that tend to operate well.
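
One very simple way to picture this kind of mining is to rank settings by how much more often they appear on problematic systems than on healthy ones. The sketch below is only an illustration of that idea, not the technique under development: the `suspicious_settings` function, the labels, and the sample settings are all hypothetical.

```python
from collections import Counter

def suspicious_settings(healthy, problematic):
    """Rank (key, value) settings by bad-rate minus healthy-rate, highest first."""
    if not healthy or not problematic:
        return []
    healthy_counts = Counter(item for cfg in healthy for item in cfg.items())
    bad_counts = Counter(item for cfg in problematic for item in cfg.items())
    scores = {}
    for item, n_bad in bad_counts.items():
        bad_rate = n_bad / len(problematic)
        good_rate = healthy_counts[item] / len(healthy)
        scores[item] = bad_rate - good_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical systems that were deployed from the same template.
healthy = [{"tz": "UTC", "heap": "2g"}, {"tz": "UTC", "heap": "2g"}]
problematic = [{"tz": "UTC", "heap": "512m"}, {"tz": "UTC", "heap": "512m"}]

ranked = suspicious_settings(healthy, problematic)
print(ranked[0])
```

Settings shared by both groups score near zero, so the ranking surfaces only the drift that separates the problematic systems from the healthy ones.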