Name: Mike Nidd
Location: IBM Research - Zurich
Nationality: Canadian
Focus: Services Research
Many large firms across many industries
often outsource their data centers to IT service providers like IBM. The
rationale is obvious: managing thousands of servers is not a core competency for a retailer or
mining company, while IT vendors have the skills
and resources to manage millions of square feet of data center space more
efficiently.
After the often billion-dollar deals are signed,
the real work begins as the vendors migrate the systems. In some
instances this could mean literally shipping the hardware to
another secure facility, or it could mean remote management, or a complete rip
and replace.
But before any system can go offline, the teams need
to carefully study the IT environment to understand how the systems are all
interconnected. Taking one rack offline could shut down an entire store full of
cash registers, or a city full of ATMs.
IBM scientists like Mike Nidd are part of a global
team of researchers who are making the migration process as seamless as
possible - with no downtime - for clients. One of the technologies he has
contributed to is called Analytics for Logistical Dependency Mapping.
I recently spoke with Mike about this research.
What exactly does the
tool provide to the migration team?
Mike Nidd (MN): The script runs across the
entire data center and a report is compiled. Each server gets a file, which is
then zipped and shared with the migration team. When they open it, it looks like
a massive, complex “spaghetti diagram” showing all of the correlations across
the network for each individual machine. The team can zoom in and out and see
all of the connections.
We also receive a summary report which provides a
heat map showing where the most critical systems are located. These are the
systems which typically are taken offline over the weekend or during the early
hours of the day. You don’t want to make mistakes with these servers as they
could be responsible for millions of transactions per minute.
In the end, the migration team has a clear roadmap
to know which systems can be taken offline and in what order. The result is predictable, manageable downtime for the affected systems.
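As an illustration of how a dependency map can be turned into an ordering, here is a minimal Python sketch. It is not the IBM tool: the function name `offline_waves`, the server names, and the rule it applies (a server goes offline only after every server that relies on it is already offline) are all assumptions made for this example.

```python
def offline_waves(depends_on):
    """Group servers into shutdown waves.

    depends_on maps a server name to the set of servers it needs.
    Assumed rule (hypothetical): a server may go offline only once every
    server that relies on it has already gone offline.
    """
    servers = set(depends_on)
    for deps in depends_on.values():
        servers |= set(deps)

    # still_used_by[s] = servers that rely on s and are still online
    still_used_by = {s: set() for s in servers}
    for server, deps in depends_on.items():
        for dep in deps:
            still_used_by[dep].add(server)

    waves = []
    while still_used_by:
        # Servers that nothing online relies on any more
        ready = sorted(s for s, users in still_used_by.items() if not users)
        if not ready:
            # Mutually dependent servers have no safe order under this rule
            names = ", ".join(sorted(still_used_by))
            raise ValueError("cyclic dependency among: " + names)
        waves.append(ready)
        for s in ready:
            del still_used_by[s]
        for users in still_used_by.values():
            users.difference_update(ready)
    return waves


# Hypothetical example data
example = {
    "pos-frontend": {"app-server"},
    "app-server": {"database"},
    "reporting": {"database"},
}
waves = offline_waves(example)
# waves == [["pos-frontend", "reporting"], ["app-server"], ["database"]]
```

In this sketch, servers that depend on each other in a cycle raise an error instead of being ordered. In practice such a group would probably have to be moved together in one wave.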
How long does it take to scan the entire system?
MN: The scripts are run separately on all
systems. They are very lightweight, so they don't have a significant effect on
the server operations, and can be run as a one-off data-gathering pass, which
takes a minute or two. A more useful configuration is to leave them running
for a week or two to pick up dependencies that aren't coded into the
configuration files that we parse.
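To show why a longer observation window helps, the following Python sketch compares dependencies found in configuration files with connections observed over repeated samples. The data layout (pairs of server and remote host) and the function name `runtime_only_dependencies` are hypothetical and do not reflect the actual scripts' output.

```python
def runtime_only_dependencies(configured, observed_samples):
    """Return dependencies seen at runtime but absent from configuration.

    configured: iterable of (server, remote) pairs parsed from config files.
    observed_samples: iterable of samples, each an iterable of
    (server, remote) pairs seen during one observation interval.
    Both formats are assumptions made for this sketch.
    """
    observed = set()
    for sample in observed_samples:
        observed.update(sample)
    return sorted(observed - set(configured))


# Hypothetical data: a weekly batch job only shows up in a later sample
configured = [("web01", "db01")]
samples = [
    [("web01", "db01")],                        # one-off scan
    [("web01", "db01"), ("batch01", "db01")],   # later in the week
]
extra = runtime_only_dependencies(configured, samples)
# extra == [("batch01", "db01")]
```

A one-off scan corresponds to passing only the first sample, in which case the batch job's dependency on `db01` would go unnoticed.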
How was this done previously?
MN: The teams would basically start with the server
closest to the door, and a person would be assigned to each rack. They would
then need to determine the dependencies for their rack and schedule the
migration for the appropriate wave. So our method is a little more fact-based,
which can be the difference between uptime and downtime.
Sounds like a pretty high pressure situation.
MN: Discovery must be executed in a minimally
invasive manner because the IT environment to be migrated is usually in a
production state until the relocation is complete. The migrations typically take
place outside normal working hours, so it certainly can be
stressful. Thankfully, our tools are hardened and tested to work across even the
most demanding and complex environments.
Follow Mike on Twitter @menidd