A recent article in Computerworld, "Stupid Data Center Tricks," gives several examples of human error causing outages in data centers. A university network brought to its knees when someone inadvertently plugged two network cables into the wrong hub. An employee injured after an ill-timed entry into a data center. Overheated systems shut down after a worker changed a data center thermostat setting from Fahrenheit to Celsius.
It's a wake-up call to watch your data center very closely. Failure is inevitable. It's our job in IT to make sure that one failure doesn't take down the whole environment.
Now would be a good time to look at your server rooms, make an honest evaluation of what areas are most at risk, and try to find ways to simplify and automate the recovery of systems. What can you do in your area to make systems less prone to the failure of a single component? What happens to your applications if one server in the process flow (web server, app server, database server, ... other servers?) becomes unavailable?
I look at systems through three different lenses: redundancy, disaster recovery, and business continuity.
If I lose one web server, is there another that can take the load? Does the load transfer to that other server transparently? The best failure scenario is when no one notices that one component of the system had a problem, because you had enough redundancy built into it to prevent an outage.
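One way to make that question concrete is to list the servers behind each tier of the process flow and flag any tier that a single failure would take down. The following is only a minimal sketch: the inventory, the server names, and both function names are hypothetical, and it checks server counts only, not whether the surviving servers can actually carry the load.

```python
from typing import Dict, List

# Hypothetical inventory: each tier in the process flow and the servers behind it.
TIERS: Dict[str, List[str]] = {
    "web": ["web01", "web02"],
    "app": ["app01", "app02"],
    "database": ["db01"],
}


def single_points_of_failure(tiers: Dict[str, List[str]]) -> List[str]:
    """Return the tiers that have fewer than two servers."""
    return [name for name, servers in tiers.items() if len(servers) < 2]


def tiers_down_after(tiers: Dict[str, List[str]], failed_server: str) -> List[str]:
    """Return the tiers left with no server once failed_server is lost."""
    down = []
    for name, servers in tiers.items():
        remaining = [s for s in servers if s != failed_server]
        if failed_server in servers and not remaining:
            down.append(name)
    return down


if __name__ == "__main__":
    print("Single points of failure:", single_points_of_failure(TIERS))
    print("Tiers down if web01 fails:", tiers_down_after(TIERS, "web01"))
    print("Tiers down if db01 fails:", tiers_down_after(TIERS, "db01"))
```

In this made-up inventory, losing either web server leaves the web tier running, while the lone database server is the component to worry about.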
If you are unfortunate enough to experience an outage, how quickly can you bring things back online? What are the critical systems? Which systems are less important? Typically, the "dev" and "test" environments get the lowest priority, and production systems get attention first.
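Writing the recovery order down ahead of time beats deciding it in the middle of an outage. Here is a rough sketch of that idea, assuming each system is tagged with an environment and a "critical" flag. The system names, the `System` type, and the choice to rank test ahead of dev are all illustrative assumptions, not a prescribed policy.

```python
from typing import List, NamedTuple


class System(NamedTuple):
    name: str
    environment: str  # "production", "test", or "dev"
    critical: bool


# Production first; ranking test ahead of dev is an assumption for this sketch.
ENVIRONMENT_RANK = {"production": 0, "test": 1, "dev": 2}


def recovery_order(systems: List[System]) -> List[System]:
    """Sort by environment, then critical systems first, then by name."""
    return sorted(
        systems,
        key=lambda s: (ENVIRONMENT_RANK[s.environment], not s.critical, s.name),
    )


if __name__ == "__main__":
    # Hypothetical systems, listed in no particular order.
    inventory = [
        System("wiki-dev", "dev", False),
        System("payroll", "production", True),
        System("intranet", "production", False),
        System("payroll-test", "test", False),
    ]
    for position, system in enumerate(recovery_order(inventory), start=1):
        print(position, system.name, system.environment)
```

With this inventory the critical production system comes back first and the dev system last.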
IT can address the disaster recovery portion, the physical act of bringing systems back online. But while applications are down, how can the business continue to operate? Business Continuity is, by definition, the responsibility of the business owners. But are your business customers aware of that, and are they prepared to conduct business in another way (temporarily) if their main systems are unavailable?