You are in your favourite bar one Saturday night when, suddenly, you hear your mobile phone ring. You pick up the phone and hear a screaming voice on the other end (no, it’s not your wife telling you to go home and take out the trash). The background noise is preventing you from understanding what is actually being said. You check the number that registered on the phone – it’s your manager. You step out of the bar to hear more clearly, but all you catch is the last phrase, “the production database has been in recovering state for more than an hour now…” And then your battery goes dead. Sound familiar?
In a previous blog post, I talked about the different acronyms that come with the term disaster recovery. In this blog post, I’ll talk about key items that we sometimes tend to ignore when creating a disaster recovery strategy – the lion, the “switch” and the wardrobe (I’ve been a fan of the Narnia movie series, from which I got the idea). And, yes, I did get a phone call similar to that while I was driving with my family; I had to pull over and guide the person on the line as they tried to recover the database.
I call them the lions because they represent people with authority and responsibility over the infrastructure that you are preparing the disaster recovery strategy for. You definitely need to include them. If you’re the DBA and you’re designing a disaster recovery strategy for the database server, then that lion is you. However, there are cases where the server administrator is not necessarily the DBA. That other lion is the server administrator. Oh, and isn’t your database server connected to the network? Then you have another lion in the pack – the network administrator. And wasn’t it a faulty hard drive that caused the disaster? Do you know who the supplier was for that hardware? Yes, that lion belongs to your pack as well. I can go on and on and include a ton of people in this list – the service provider for your network link, the company that stores your tapes offsite, the junior staff who need to know what to do in case you’re on vacation, your IT manager who needs to make the tough calls in case the need arises. Make sure you know who the lions are in your pack and how to get in touch with them. Document who is responsible for what, because a missing lion in the pack will definitely affect your service level agreement.
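One way to picture "document who is responsible for what" is a small roster that maps each lion to the components they answer for. The sketch below is only an illustration: the `Lion` class, the `who_to_call` and `uncovered` helpers, and every name and phone number are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Lion:
    """One person (or vendor) with responsibility over part of the stack."""
    role: str
    name: str                # placeholder names, not real people
    phone: str               # placeholder numbers
    covers: Tuple[str, ...]  # components this lion is responsible for


# Hypothetical roster; replace with your own pack.
ROSTER = [
    Lion("DBA", "A. Example", "555-0100", ("database",)),
    Lion("Server administrator", "B. Example", "555-0101", ("operating system", "cluster")),
    Lion("Network administrator", "C. Example", "555-0102", ("network", "switch")),
    Lion("Hardware supplier", "D. Example", "555-0103", ("disk", "SAN")),
]


def who_to_call(component: str, roster: List[Lion] = ROSTER) -> List[Lion]:
    """Return every lion who covers the given component."""
    return [lion for lion in roster if component in lion.covers]


def uncovered(components: List[str], roster: List[Lion] = ROSTER) -> List[str]:
    """Return the components that no lion in the roster covers."""
    return [c for c in components if not who_to_call(c, roster)]


if __name__ == "__main__":
    for lion in who_to_call("SAN"):
        print(f"SAN problem: call {lion.name} ({lion.role}) at {lion.phone}")
    print("No owner documented for:", uncovered(["database", "SAN", "UPS"]))
```

Running `uncovered` against the full list of components in your environment is a quick way to spot a missing lion before the outage does it for you.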
Yes, the spelling is intentional. The switch (not the witch) represents the other types of “hardware” that affect your service reliability. I’ve had discussions in the past with a former customer who had high availability built into their database servers. Their SQL Server instances ran on top of a Windows failover cluster, which they designed after upgrading to Windows Server 2008. However, one of their past outages clearly showed that Windows Failover Clustering is totally meaningless if you do not consider the other components of the hardware stack. While the multiple nodes provided high availability for the servers, the main culprit for the outage was the shared storage. Their SAN was on a dedicated network that was causing a bit of an issue with routing. To make matters worse, the “switch” to which the SAN was connected was shutting down unexpectedly due to power outages that might have been caused by improper wiring on the UPS. They focused so much on the availability of the cluster that they didn’t consider the storage and the network as potential causes of outages.
Also, in my previous life as a data center engineer, we had an incident where the production server suddenly experienced performance issues. We couldn’t figure out why, because even our remote access sessions wouldn’t go through to let us troubleshoot. Then one of the heat sensors in the data center went off. The high CPU utilization was caused by overheating: one of the air conditioners had shut down, causing a drastic temperature increase inside the data center. Those who spend time in a data center know that you need to wear a thick enough coat to keep yourself warm while working. Air conditioners are used to control the temperature and humidity to help prevent equipment overheating and, potentially, disaster. While fixing a faulty air conditioner won’t happen in less than a day, designing the data center to allow for such incidents should be considered as part of the disaster recovery strategy (we had to bring in electric fans and portable air conditioners to temporarily keep the temperature from rising while the air conditioners were being fixed). You also need to know the lion in the pack responsible for data center management in case you have your servers co-located somewhere. The bottom line is that you need to consider the other types of hardware that affect your service reliability and include them in your disaster recovery strategy.
The wardrobe represents storage of stuff. And stuff could be anything that affects your service reliability. One of my favourite wardrobes as far as disaster recovery is concerned is the runbook. It stores the information for a particular system that can be used by anybody should the need arise. Not too many DBAs or IT professionals like the idea of documentation but, as I’ve heard from a few, it’s a necessary evil. If you need to rebuild the server because of hardware upgrades or, worse, disaster recovery, the runbook will be your guide to having the server rebuilt just as it was before. If you don’t have one, chances are that you won’t be able to rebuild your server with the exact same configuration as before. With a runbook, you can have junior staff go through the process themselves simply by following the written steps. You can even include processes for recovering databases based on your backup strategy. A common rule of thumb for runbooks is simply this – write it so that even the most junior staff can figure it out. The challenge is keeping the runbook updated with the changes made on the system. However, runbooks are definitely a must for disaster recovery strategies.
What about backups? Where are you keeping them? Do you have access to your backups? Are the backup tapes labelled intuitively? Are they safe? Are they stored offsite? This type of wardrobe should be documented as well so you will know where to look for your backups when you need them.
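The questions above can be turned into a simple inventory check. This is a minimal sketch, not a prescribed format: the `BackupRecord` fields and the `gaps` helper are invented for this example, and the sample label and location are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BackupRecord:
    """What you would want written down about one backup tape or file."""
    label: str      # what is written on the tape or file name
    location: str   # where to look for it
    offsite: bool   # is a copy kept away from the primary site?
    who_has_access: List[str] = field(default_factory=list)


def gaps(record: BackupRecord) -> List[str]:
    """List the questions this record cannot answer yet."""
    found = []
    if not record.label.strip():
        found.append("no label")
    if not record.location.strip():
        found.append("location not recorded")
    if not record.offsite:
        found.append("not stored offsite")
    if not record.who_has_access:
        found.append("nobody listed with access")
    return found


if __name__ == "__main__":
    # Placeholder records for illustration only.
    good = BackupRecord("SALESDB full backup, week 1", "Offsite vault, shelf 3", True, ["DBA on call"])
    bad = BackupRecord("", "Server room cabinet", False)
    print(gaps(good))
    print(gaps(bad))
```

A record that comes back with an empty list answers every question in the paragraph; anything else tells you what to document next.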
How about storage media? Yes, the media for installing your operating system, your database server, your patches, service packs, application software, etc. Have you heard the one about a legacy application that is supported only on Windows 2000 Server, where the installation media turned out to be missing after the server crashed? Create backups of your installation media and document them accordingly so that you can be sure they’re there when you need them.
And have you even considered yourself as a wardrobe? Yes, you’ve got a ton of information in your head that needs to be shared within your team so that you don’t get as many of those emergency phone calls. You can set up a mentoring session with the junior staff, write documentation (or a blog post like this), do an internal presentation on how to perform a test restore of a backup – anything to make sure you’re not the only person on your team who can do the job. Most people don’t like this idea for fear of losing their job. But this is one thing that would make you more valuable to the organization. For now, I’ll leave this topic for a professional development blog post.
I’m tempted to dive into the technical details of SQL Server disaster recovery and high availability, but I realize that to really appreciate the technology aspects of disaster recovery, we need to understand the other aspects that affect and influence it.