Having worked in IT for almost 20 years, I've experienced my share of disasters. IT disasters can come in different "shapes" and "sizes": simple human error that dropped a database, a storage device that failed in just the right way, road maintenance that cut off the network, an exploding power transformer, and a bridge collapse next to our primary data center. These are all situations where IT experienced a "disaster" of some definition. Remember, the "smoking hole" disaster scenario rarely happens; most often, it starts small and just gets out of control.
We survived our disasters by having good disaster recovery practices. A key component of that is the Disaster Recovery Plan.
Note that a Disaster Recovery Plan is different from a Business Continuity Plan. DR is undertaken by the technology teams, and deals with bringing things back to normal. BC is about how to continue doing business in the face of an IT failure.
A friend once shared a story with me that demonstrates this difference:
On his way to work, my friend stopped for coffee at a local coffee shop. Most coffee shops have a PC server in the back room that tracks sales - how many espressos or cappuccinos or mochas (etc.) were sold, which helps in predicting sales and in ordering new supplies. That day, my friend learned the PC server had died, so the cash register wouldn't ring up any sales.
An IT support tech was already in the back room, working to fix the server. Maybe restoring data, or replacing a failed hard drive, or any number of things. This was Disaster Recovery, bringing the technology back to normal.
But the coffee shop still sold coffee. The barista had a reference sheet that said how much to charge for different beverages and bakery items, and she used a calculator to work out tax and change. Transactions were recorded on paper, to be entered into the system later. This was their Business Continuity, how the coffee shop was able to remain open and stay in business, despite the PC server and cash register not working.
Our responsibility in Computing Services is Disaster Recovery. When I joined Morris almost 2 years ago, we immediately started work on several initiatives. One important project was establishing a disaster recovery plan for our core web sites, including "www". This was our first DR planning effort.
While we aren't done with DR planning for all our systems at Morris (this is a long-term effort), we have made significant progress. And it's an important part of our IT stewardship at the University.
Are you planning for your next disaster? Disaster Recovery isn't optional anymore, according to several sources, including eWeek: First of all, industry analysts from Gartner and IDC say that 30 to 40 percent of all IT shops either have no disaster recovery system in place or do not know how to use it correctly. Second, even if a shop does have a DR apparatus in place and tests it occasionally, there are plenty of examples of such systems not performing according to plan.
Disaster Recovery planning shouldn't be ignored. Despite the careful planning involved, DR should be simple, at least in concept. eWeek gives these basic steps for working on your DR plan:
- Find a system that fits your business and implement it. Don't laugh; many companies don't have one.
- Select a system that includes snapshots, mirroring and/or replication to a separate location, whether that location is within the confines of the physical enterprise or a cloud-service package.
- Test the system on a regular basis, even if it involves just a portion of the system at a time.
Testing should ideally include restoring pieces of the system to a separate "test" system, so you can exercise the physical activities required for recovery. But a tabletop exercise may be sufficient, depending on your process, to work through a recovery scenario and identify gaps. For example, we ran through a constructed scenario to test our recovery plan for our web sites.
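As a rough illustration of the "restore to a separate test system" idea, here is a minimal Python sketch that checks whether a restored copy of a directory matches the original by comparing file checksums. This is not our actual procedure or any particular backup product's tooling; the function names (`file_digest`, `compare_restore`) and the directory layout are assumptions made up for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_restore(source_dir: Path, restored_dir: Path) -> dict:
    """Compare a restored tree against the source tree.

    Hypothetical helper: reports files that are missing from the
    restored copy and files whose contents differ from the source.
    """
    missing, mismatched = [], []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            mismatched.append(rel.as_posix())
    return {"missing": missing, "mismatched": mismatched}
```

After restoring last night's backup of a web site to a scratch location on a test server, you could call `compare_restore` with the live directory and the restored directory and review anything it reports. Files that legitimately changed since the backup will show up as mismatches, so the output is a starting point for review, not a pass/fail verdict.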
We've managed to reduce our risk portfolio by moving our critical transactional systems to the Twin Cities; their Disaster Recovery plans cover our systems. In case of a major disruption, OIT has procedures to restore service. While it may take several days for all systems to come back online, we are reasonably assured that our critical systems are safe under OIT's care. This includes our Housing system, core web sites, and other key services and applications.
