May 2009 Archives

Calendar of upcoming unplanned outages

Today was Jim Colten's last day in OIT. I think all of us have worked with Jim at some point in his 35+ years at the University. It was a definite pleasure to work with him.

At Jim's send-off party this afternoon, I said a few words about my time with Jim Colten. I remember that Jim always was ready with a story or a piece of wisdom. One truism that I will carry with me is "there is no calendar of upcoming unplanned outages".

This is underscored by the unexpected outage in one of our data centers last week. Try as you might, you can never really predict the future. Bad things sometimes just happen, even unlikely things such as a power loss inside a data center. So it's important to plan ahead and put in place technical practices for unplanned disaster recovery. I think you all did a great job in responding to the outage, but every emergency uncovers something that could be improved.

I ask that everyone - DBA, storage, systems, production services - look to your processes and make an honest determination of whether they are the best they can be. Improve what we have, identify what needs to change, and move forward in the new world as it exists now. By doing so, we can make it easier on ourselves the next time we need to respond to an emergency.

Disaster Recovery Services

Our Disaster Recovery Services team has done an excellent job in building up the University and OIT in disaster preparedness. Under Lois and John, and the rest of the DRS team, we have implemented standardized templates for building disaster recovery plans, defined the process for DR planning, and even created an introductory video for departments that are new to disaster recovery planning.

In most other organizations (that have a dedicated Disaster Recovery team) the concept of "disaster recovery" is often married to "high availability". The reasons are obvious: if you can bring the two groups together, the issues you uncover in disaster recovery planning can be merged directly into high availability architecture.

For this reason, we have been asked to move the Disaster Recovery Services team under Bill Decker, the senior manager for our High Availability architecture group. The transition became effective earlier this week.

Excellent work on Sunday

I just wanted to express my thanks for all the hard work you put in over the weekend, responding to our planned shutdown of the data center on May 3. As you know, I was scheduled to be out of the office for a conference that weekend, but I did follow along remotely and checked in for regular updates.

Sounds like everyone did a wonderful job planning and executing the WBOB outage this past Sunday. The systems admin teams successfully took down and brought back up about 700 servers with only 4 minor startup problems. The backup team successfully took offline all backup hardware and backup processes with no glitches on restart. And the storage team took down roughly 350 terabytes of storage and the SAN fabric with only one hardware hiccup on startup.

In addition, our mainframe team reported no issues with their work. Production Services was on hand to restart and verify their applications and release jobs. DBA restarted the databases and supporting processes with no problems. The DRS team had helped with communication beforehand, and was standing by to coordinate activities in case of an emergency.

While we still have much work to do in improving our technical practices for unplanned disaster recovery readiness, I think it's important to take a step back and acknowledge the successful work from this weekend. Thanks, everyone!

Update on external review

Steve Cawley sent the following update to all of OIT. I am re-posting it here in case you missed it. The external review is important to the U of M. As you know, one way the university measures our performance is by having an outside, neutral observer take notes on how we are doing and make recommendations for improvement. That is the purpose of this external review. Thanks to everyone for participating in this process - even responding to an email query related to the review was critical.

I wanted to give you an update on our external review. As you know, we have been working since last fall on preparation for our external review. Many of you helped with unit self assessments and the OIT self study portion of the review. Your help was greatly appreciated.

The external review team spent Monday and Tuesday of this week on campus meeting with OIT management, University leadership, faculty and students. At the end of the day on Tuesday the external review team met with the OIT senior management team and Senior Vice President Robert Jones to provide a summary of their findings and recommendations.

I am pleased to tell you that they gave us a very favorable report. The team commented on the University's strong culture of collaboration, trust and partnership. OIT was viewed as a collaborative partner. They also complimented us on our efforts around common good services and partnership with the colleges and campuses. They challenged us to think about OIT's leadership role in innovation, determining the right balance between service provider and innovator. They also challenged us to think about moving beyond our current six-year planning process, which is somewhat limited to central investments and resource planning, and begin to include the same degree of analysis and planning for IT investments and resources across the entire University.

The external review team will provide us with a written report on May 15th. Once we receive their written report we will prepare the OIT response.

A special thanks to Ann Hill Duin for her great leadership of the external review preparation and self study along with Bernie Gulachek, Kris Adelmann (and the entire OCA staff), Steve Carnes, John Sonnack, Renee Rivers, and everyone else involved.