Meet an IBM researcher: Michael Nidd

Name: Mike Nidd
Location: IBM Research - Zurich
Nationality: Canadian
Focus: Services Research

Many large firms across any number of industries outsource their data centers to IT service providers like IBM. The rationale is obvious: managing thousands of servers is not a core competency for a retailer or a mining company, while IT vendors have the skills and resources to manage millions of square feet of data center space more efficiently.

After the often billion-dollar deals are signed, the real work begins as the vendors migrate the systems. In some instances this could mean literally shipping or relocating the hardware to another secure facility; in others, remote management or a complete rip-and-replace.

But before any system can go offline, the teams need to carefully study the IT environment to understand how the systems are all interconnected. Taking one rack offline could shut down an entire store full of cash registers, or a city filled with ATMs.

IBM scientists like Mike Nidd are part of a global team of researchers who are making the migration process as seamless as possible - with no downtime - for clients. One of the technologies he has contributed to is called Analytics for Logical Dependency Mapping.

I recently spoke with Mike about this research.

What exactly does the tool provide to the migration team?

Mike Nidd (MN): The script runs across the entire data center and a report is compiled. Each server gets a file, which is then zipped and shared with the migration team. When they open it, it looks like a massive, complex “spaghetti diagram” showing all of the correlations across the network for each individual machine. The team can zoom in and out to see all of the connections.
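The interview doesn't describe the tool's data format, but the underlying idea - per-server connection records merged into one dependency graph - can be sketched roughly as follows. The record layout, host names, and function name here are assumptions for illustration only:

```python
from collections import defaultdict

def build_dependency_map(records):
    """Aggregate per-server connection records into an adjacency map.

    Each record is a (source_host, dest_host, dest_port) tuple, such as
    might be parsed from netstat-style output collected on each server.
    """
    deps = defaultdict(set)
    for src, dst, port in records:
        deps[src].add((dst, port))
    return deps

# Hypothetical observations from three servers.
records = [
    ("web01", "app01", 8080),
    ("app01", "db01", 5432),
    ("web02", "app01", 8080),
]
deps = build_dependency_map(records)
# web01 and web02 both depend on app01; app01 depends on db01.
```

Drawing every edge in `deps` for every machine at once is what produces the “spaghetti diagram” effect the team sees.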

We also receive a summary report with a heat map showing where the most critical systems are located. These are the systems that are typically taken offline over the weekend or in the early hours of the day. You don’t want to make mistakes with these servers, as they could be responsible for millions of transactions per minute.

In the end, the migration team has a clear roadmap to know which systems can be taken offline and in what order. The result is predictable, manageable downtime for the affected systems.
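One common way to derive such an ordering from a dependency graph is a layered topological sort: anything that no remaining system depends on can move in the current wave. This is an illustrative assumption about the planning step, not the tool's actual algorithm:

```python
def migration_waves(depends_on):
    """Group servers into migration waves (illustrative policy: a server
    moves only after everything that depends on it has already moved).

    depends_on maps each server to the set of servers it depends on.
    """
    servers = set(depends_on) | {d for ds in depends_on.values() for d in ds}
    remaining = set(servers)
    waves = []
    while remaining:
        # Servers that no remaining server depends on can move now.
        still_needed = {d for s in remaining
                        for d in depends_on.get(s, set())
                        if d in remaining and d != s}
        wave = remaining - still_needed
        if not wave:
            raise ValueError("circular dependency; needs manual review")
        waves.append(sorted(wave))
        remaining -= wave
    return waves

# Web servers depend on the app server, which depends on the database.
waves = migration_waves({"web01": {"app01"},
                         "web02": {"app01"},
                         "app01": {"db01"}})
# → [['web01', 'web02'], ['app01'], ['db01']]
```

A cycle in the graph means no safe ordering exists without breaking a dependency, which is exactly the kind of case that needs human review.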

How long does it take to scan the entire system? 

MN: The scripts are run separately on all systems. They are very lightweight, so they don't have a significant effect on server operations, and can be run as a one-off data gathering, which takes a minute or two. A more useful configuration is to leave them running for a week or two to pick up dependencies that aren't coded into the configuration files that we parse.
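The long-running mode can be pictured as repeated lightweight samples whose union catches connections that only appear occasionally, such as a weekly batch job. The sampling callable below is hypothetical; a real collector would parse netstat-style output on the server:

```python
import time

def collect_dependencies(sample_fn, num_samples, interval_s=0.0):
    """Union periodic connection samples over an observation window.

    sample_fn is a hypothetical callable returning the set of currently
    active (dest_host, dest_port) connections on this server.
    """
    seen = set()
    for _ in range(num_samples):
        seen |= sample_fn()
        time.sleep(interval_s)
    return seen

# An occasional backup connection appears in only one sample,
# but is still captured in the union.
samples = iter([
    {("db01", 5432)},
    {("db01", 5432), ("backup01", 22)},
    {("db01", 5432)},
])
deps = collect_dependencies(lambda: next(samples), num_samples=3)
```

Because each sample is cheap, the observation window can stretch over weeks while keeping the per-server overhead negligible.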

How was this done previously? 

MN: The teams would basically start with the server closest to the door, and a person would be assigned to each rack. They would then need to determine the dependencies for their rack and schedule the migration for the appropriate wave. So our method is a little more fact-based, which can be the difference between uptime and downtime.

Sounds like a pretty high pressure situation.

MN: Discovery must be executed in a minimally invasive manner because the IT environment to be migrated is usually in production until the relocation is complete. The migrations typically take place outside normal working hours, so it certainly can be stressful. Thankfully, our tools are hardened and tested to work across even the most demanding and complex environments.

Follow Mike on Twitter @menidd
