Knowing and Understanding Your Disaster Recovery Strategy

This last weekend, I had the opportunity to speak at SharePoint Saturday Montreal – being one of the few English speakers for the event – on the topic of SQL Server high availability and disaster recovery options for SharePoint. After my presentation, one of my SharePoint MVP friends approached me about a database corruption issue that happened on their SharePoint farm. Now, being a SQL Server geek, I certainly would love to get my hands dirty and start looking at the error messages that DBCC CHECKDB reported. Unfortunately, since they do not have a dedicated SQL Server DBA, the only error message that they have is the one logged on the Windows Event Log as reported by SharePoint – Event ID 5586  from SharePoint Foundation. Unfortunately, there aren’t a lot of information from the Windows Event Log that tells any SharePoint (nor SQL Server) administrator that this is a database corruption issue. I asked if there were any Error Numbers 823, 824 or 825 reported and, sure enough, there was.

If it was only the database…

As a SQL Server administrator who has encountered any form of database corruption, our first instinct is to run DBCC CHECKDB, identify the level of corruption and how to possibly fix it. I’ve highlighted in a previous blog post that the output of DBCC CHECKDB can tell us what the corruption is and what possible options are available to resolve it. Armed with a good, working backup to use for repairing the corrupted database, we can safely restore the pages (if you’re on Enterprise Edition) or rows that were affected by the corruption and save the day. This could easily make any SQL Server DBA’s day a great one.

But, let’s pause for a minute and think, “what is this database used for?

What is the database used for?

If the database is used by an internal, custom built application, you can simply ask the development team about how the data is stored and what the impacts are in terms of the corruption. Of course, we’re also assuming that this is just ONE database, not many. But what if there were many databases used by the application? What if it was a third-party application? What is covered by the support contract? Do we have that in writing? What is the service-level agreement (SLA) for this application? These are just some of the questions that you need to ask yourself before a real disaster strikes. Here are some of the ways that you can find out what your disaster recovery strategy will be like so you can prepare way ahead of time should it happen.

  1. Know what application uses your database. Knowing the application that uses your databases can help you get more information about your possible options. As a SQL Server DBA, we are sometimes blindsided by the fact that our approach only revolves around SQL Server – restoring from backups, repairing corruption, provisioning high availability and disaster recovery solutions, etc. We need to know the application to prepare our plan of action.
  2. Know what other databases are involved. Enterprise applications use either one or more databases to implement their functionality. SharePoint is a classic example that I use when explaining this concept – it uses a farm configuration database, an administration site content database, a bunch of content databases used by different site collections, several service application databases, etc. (a more detailed list of the different SharePoint databases are available in this TechNet article.) This is because the transactional consistency of the application no longer depends on just one database. Even if you managed to repair a corrupted database in an application that uses multiple databases, there is no guarantee that the application is transactionally consistent.
  3. If a third-party application is used, know what the application vendor considers as SUPPORTED. I mentioned Microsoft KB 841057 in my previous blog post and how the REPAIR_ALLOW_DATA_LOSS option in DBCC CHECKDB is considered to be an unsupported database change. When my friend told me that one of the IT staff ran DBCC CHECKDB using the REPAIR_ALLOW_DATA_LOSS option, my immediate response was, “he should not have done that.” By doing so, it rendered the application – in this case, SharePoint – unsupported. Furthermore, because the impacts of the REPAIR_ALLOW_DATA_LOSS was not considered, the SharePoint content database became even more corrupted.Now, another option that we SQL Server DBAs would probably do is to restore the SharePoint configuration database. This approach is not officially supported as per Microsoft KB 948725. This means that regardless of whether or not your SharePoint database backups are valid and working, restoring them on the existing farm is not supported. Now, this is just one of many examples. Even SAP has it’s own supported way of dealing with corrupted databases. SAP Notes # 1597910Handling of database corruptions on SQL Server – discusses what is supported and what isn’t in terms of dealing with database corruption. This blog post from Luis Darui (SAP Support Engineer) on this particular topic talks about how not to use REPAIR_ALLOW_DATA_LOSS without guidance from SAP.
  4. Plan your disaster recovery strategy based on collected information. At this point, there really is no other option to recover the SharePoint databases other than restore all of the database backups taken at a specific point in time on a DIFFERENT SQL Server instance to keep the farm transactionally consistent. Or, if at all possible, rebuild the whole farm using the documented process to keep the farm configuration consistent and restore the content databases to it. This will definitely take the whole farm offline while restoring the backups and lose all of the data after the backups were taken. This is why it is important to know what the defined and agreed upon recovery objectives (RPO/RTO) and SLAs are in order to properly design a disaster recovery solution. A SQL Server Availability Group configuration that uses a different storage subsystem could have been implemented to address the database corruption issue. Once the disaster recovery strategy is implemented and documented, it is important to test it out on a regular basis to make sure that recovery objectives and SLAs are met.

In my presentations, I always joke around the fact that we don’t really need a high availability and a disaster recovery solution – until we do. It’s like the health insurance policy that we don’t really need – until we do. But more than just the database for the SQL Server DBA, it is important to know and understand the overall disaster recovery strategy from the ground up, involving individuals from different teams, the different components that make up the application architecture and how to execute the strategy when the real disaster strikes. It’s also the main reason why I recorded the video below to emphasize that while we SQL Server DBAs need to be focused on the database, we need to have an understanding of the overall application architecture that our databases are being used for.

 

 

Comments

comments

Please note: I reserve the right to delete comments that are offensive or off-topic.

Leave a Reply

Your email address will not be published. Required fields are marked *