Reviewing the SQL Server Error Logs to Find the Two Most Common Keywords That Can Cause Availability Issues

Troubleshooting Availability of SQL Server Workloads Running on Windows Server Failover Cluster - Part 2

SQL Server Failover Clustered Instances (FCI) and Availability Groups (AG) depend a lot on Windows Server Failover Clustering (WSFC). Understanding how the underlying WSFC platform works can help us maintain availability of our databases

This blog post is the second in a series that talks about the process that I follow when troubleshooting availability of SQL Server failover clustered instances and Availability Groups. Note that the focus of this series is primarily on availability – identifying and dealing with downtime. 

 

As I was driving out of my driveway, I noticed something that is considered to be every car owner’s worst nightmare: the check engine light. And what’s frustrating about it is that it comes without a warning and no explanation whatsoever.

I’ve seen this light go off in the past. So, I did what I did before to get it to turn off. I opened the gas tank, cleaned the gas cap and the surrounding area and closed it real tight. That didn’t work. I was hoping to save a few hundred bucks and a trip to the car mechanic. I was unlucky. I had to take it to the mechanic.

Modern vehicles are equipped with sensors and mini-computers that track what’s wrong with your car. Your car dealer or mechanic will have devices that can read these sensors and mini-computers to identify the code that is triggered the check engine light. If you really want to know what caused that check engine light to turn on, you have to read the error codes with a scanner tool. See, even cars have error codes.

The SQL Server Error Log

The SQL Server error log contains information about the operational aspects of your databases. Depending on how you log operational events, it can contain everything from failed logins, corrupted data pages, failed (and annoyingly successful) backups, out-of-memory issues, etc. Reading the SQL Server error log can be overwhelming, especially if you don’t know what you’re looking for.

Troubleshooting availability issues with a SQL Server failover clustered instance (FCI) or Availability Groups (AGs) can feel like reading the codes that triggered your car’s check engine light. But, hey, you don’t need any special code reader to do it. You can use any text editor/reader to open the SQL Server error log even when the SQL Server instance is offline. In fact, this is one common misconception about reading the SQL Server error log: you can read it even when the SQL Server instance is offline. You just need to know where to find it.

Two Common Keywords That Identify Potential Availability Issues

I could list a ton of reasons why a SQL Server FCI or AG is offline. Which is why my first step in troubleshooting availability issues on a SQL Server FCI or AG is to look at the Cluster Dependency Report. More often than not, you can quickly identify availability issues for a SQL Server FCI using the Cluster Dependency Report. The most common availability issues for a SQL Server FCI are caused by the underlying Windows Server Failover Cluster (WSFC). Just take a look at this Failover Cluster Troubleshooting guide to see what I mean.

But what about SQL Server AGs?

Because SQL Server AGs work on the database-level, anything related to availability issues at the database-level can cause issues at the AG-level. Below are two of the common keywords that identify SQL Server AG availability issues. You can use them to filter the results in the SQL Server error log. These are simply meant to give you a sense of how you can use the SQL Server error log to start identifying availability issues.

RESOLVING

This could be caused by a number of different reasons. Here’s one example.

Error 983, Severity 14: Unable to access database ‘%.*ls’ because its replica role is RESOLVING which does not allow connections.

This could be caused by a network communication issue between the primary and the secondary.

Here’s a tip that helps me simplify this idea: Think of this as a typical client-server communication issue with the primary replica acting as the client (sending log records) to the secondary replica and vice versa.

  • firewall – Windows Firewall or otherwise
  • endpoint permissions – the SQL Server service account does not have access to the endpoint
  • bad cable – you need to keep rats off your data center
  • corrupted storage on the secondary – this renders your database offline regardless
  • offline secondary WSFC node – this renders your database server offline regardless
  • large number of VLFs on database that is taking time to analyze during a restart of the secondary – this is just like any other databases undergoing crash recovery
  • Transparent Data Encryption enabled on the primary but not on the secondary

I could go on and on and list other reasons. But the point is to find this keyword in the SQL Server error log and think of reasons why your primary replica could not connect to the secondary and vice versa.

Now, this does not mean that your SQL Server AG is offline. If the WSFC is properly configured and you have proper quorum, your SQL Server AG will still be online with the primary replica just happily doing its thing. It’s like being able to drive your car even with the check engine light on.

FAILOVER

This could be related to the previous item RESOLVING. Now, you can be as paranoid as I am to get alerted when a SQL Server AG (or FCI) failover occurred so that you can quickly identify what caused it. But you can also be confident that your system just works as expected.

The failover process is initiated by the WSFC. But in order for the WSFC to automatically failover the SQL Server AG from the primary to the secondary replica, the following conditions should be met.

  • The WSFC has quorum and can decide to automatically failover
  • Maximum failure threshold has not been exceeded
  • Cluster service on the secondary replica is running and healthy
  • The secondary replica configured for automatic failover is fully synchronized

Noticed that the first three items in the list all pertain the WSFC. The last item applies specifically to the database. Which is why you need to quickly identify and resolve any error messages that contain the keyword RESOLVING to guarantee automatic failover.

Here’s an example error message.

Error 19406, Severity 10: The state of the local availability replica in availability group ‘%.*ls’ has changed from ‘%ls’ to ‘%ls’. 

Of course, you won’t see anything like this when a SQL Server FCI fails over because that is no different from starting a SQL Server instance. The SQL Server error log would be recycled as a side effect of the failover.

NOTE: Automatic failover is not available on SQL Server AGs when one of the replicas is running a SQL Server FCI because the SQL Server FCI automatic failover takes precedence over the SQL Server AG automatic failover.

Reading the SQL Server Errror Log

I’ve been using the undocumented sp_readerrorlog (was xp_readerrorlog) system stored procedure ever since I can remember to search for keywords inside the SQL Server error log. I’m still not good with SQL Server Management Studio, even SQL Server Enterprise Manager from the good old days of SQL Server 2000 and earlier. To use the sp_readerrorlog stored procedure to search the current SQL Server error log for the RESOLVING keyword,


--to search for the RESOLVING keyword

--in the current SQL Server error log

sp_readerrorlog 0,1, 'RESOLVING'

My friend and SQL Server senior escalation engineer Balmukund Lakhani (blog | Twitter) wrote about the different parameters that you can use when calling this system stored procedure. This makes it easy for me to search the SQL Server error log to look for these keywords.

UPDATE:  I’ve included Rob Sewell’s (blog | Twitter) one-line PowerShell script example below. This just demonstrates how powerful PowerShell is when performing tasks like this.

(Get-SQLErrorLog -Server SERVER/INSTANCE).Where{$_.Message -like '*Resolving*' -or $_.Message -like '*Failover*'}

Additional Resources

Please note: I reserve the right to delete comments that are offensive or off-topic.

Leave a Reply

Your email address will not be published. Required fields are marked *

4 thoughts on “Reviewing the SQL Server Error Logs to Find the Two Most Common Keywords That Can Cause Availability Issues

  1. Thank you Edwin for sharing great stuff ! The kind of correlation you establish is just awesome to make it easier to understand and retain the same for longer.

    Currently Balmukund sir is a Escalation Engineer – I guess it doesn’t matter much…

    Thank you.

  2. (Get-SQLErrorLog -Server SERVER/INSTANCE).Where{$_.Message -like ‘*Resolving*’ -or $_.Message -like ‘*Failiver*’}

    If you have the latest sqlserver PowerShell module by downloading SSMS 2016