When A Log Backup Does Not Truncate Your SQL Server Log Files In An Availability Group

This was one of the questions asked during my Monitoring SQL Server Availability Group webinar two weeks ago. If you want to get notified for upcoming webinars, be sure to sign up for my email newsletter.

There are only two ways that can truncate your SQL Server log files – a checkpoint process when the database is in simple (or pseudo-simple) recovery model or a log backup when the database is in full or bulk-logged recovery models. In order for log truncation to occur, the virtual log files (VLFs) need to be full and marked as inactive. SQL Server will note that those VLFs can be overwritten the next time the transaction log file wraps around. This is true regardless of whether your database is running as a standalone or participating in a replication configuration (database mirroring, replication, Availability Groups.)

There are several reasons why SQL Server won’t truncate the log file even when you’re in simple recovery model or running regular log backups. Have a look at the list here in SQL Server Books Online and check out the log_reuse_wait_desc value of AVAILABILITY_REPLICA.  The question asked during the webinar was: “How come my transaction log file on the primary replica still grows unexpectedly even when I run regular backups?

The Availability Group Secondary Replica Is Not At Par With The Primary

From the definition, it says that “A secondary replica of an availability group is applying transaction log records of this database to a corresponding secondary database. ” This means that the log backups on the primary will not truncate the log if ALL of the secondary replicas have not completed the processing of the log records for REDO*.

This blog post was referenced in one of the discussions between Microsoft Data Platform MVPs. There have been some misconceptions on the processes behind when the transaction log gets truncated on the primary replica (I was confused myself in the good-old-days of database mirroring). Paul Randal (Twitter | blog)  pointed out the ambiguity in the post and provided a great definition for when the transaction log gets truncated on the primary replica:

“The virtual log file (VLF) on the primary *cannot* be overwritten/truncated until it is no longer required on the secondaries, as the log block write on the primary would cause the log block to be written to the same location on the secondary logs, which cannot be allowed to happen until that VLF has been processed for redo

Thank you, sir.

There are several side effects that you can take away from this definition, all of which you should really pay attention to.

  • The transaction log file of your primary replica database will continue to grow if there are no inactive portions of the log to truncate
  • A transaction log file growth is considered a transaction. It will generate transaction log records that will get sent to your secondary replicas, causing their transaction log file to also grow
  • Crash recovery starts reading transaction log records from the start of the active portion of the log. It means the UNDO portion of crash recovery will take longer when you failover to any of the secondary replica
  • Your primary replica is at the mercy of the SLOWEST SECONDARY REPLICA

The last item is one that you should really pay attention to. Often times, the Availability Group secondary replicas are provisioned primarily as a standby system – they only exist for the “just-in-case-something-happens” situation.  So they are either hardware with lower specifications than the primary replica or are virtual machines running on a physical host with lots of other virtual machines. This causes the REDO portion of crash recovery to take longer because of slower resources on the secondary.

There are several things that you can do to check this.

  1. Monitor the log_reuse_wait_desc value of the sys.databases catalog and see if you get AVAILABILITY_REPLICA. This will show up if your primary replica is waiting for a secondary replica to complete the processing of log records for a REDO.
  2. Monitor the SQL Server:Database Replica > Redo Bytes Remaining Performance Monitor counter on the secondary replicas. If this counter continues to increase, it means that the secondary replica could not process the log records fast enough for REDO, causing the primary replica to wait and for SQL Server to not be able to truncate the log on the primary
  3. Monitor the SQL Server:Database > Log bytes flushed\sec Performance Monitor counter on the primary replica. This is a good counter to monitor regardless of whether or not your database is in an Availability Group. This will tell you the amount of log records generated by transactions on the database. This will definitely show you some things that you can do to optimize your transaction log. For example, you might be unnecessarily rebuilding all of your indexes which generates a lot of transaction log records. Or maybe you can modify your existing table schemas to use the smallest possible data type that meets your business requirements. In a previous blog post, I highlighted how a simple data type change can impact the amount of transaction log records generated and its effects on your high availability and disaster recovery solutions.

Of course, this assumes that you are only running your Availability Groups secondary replicas as a hot standby for high availability purposes. But what about if you are using your Availability Groups secondary replicas for read-only workloads? I won’t have to tell you that you need to have the readable secondary replica licensed before doing so. In a future blog post, I will explain and demonstrate how using your Availability Groups secondary replicas for read-only workloads can affect log truncation on the primary replica.

Additional Resources


Don’t put your databases at risk when deploying SQL Server Always On Availability Groups

 
Did you know that SQL Server Always On Availability Groups depend so much on Windows Server Failover Clustering (WSFC)?  The availability of the SQL Server databases relies heavily on the WSFC. And if you don’t know these other external dependencies, you are putting your mission-critical SQL Server databases at risk. Don’t be caught off guard. I’ve lost count of the number of times that I have had to bring a SQL Server Always On Availability Group online in an emergency just to explain that it was the WSFC that caused the outage.

I’ve been thinking about those heart-pounding moments when I needed to resolve an outage in the middle of the night involving SQL Server Always On Availability Groups. It doesn’t have to be painful and extremely complex.

That’s why I created this training class on SQL Server Always On Availability Groups: The Senior DBA’s Field Guide. Because I hate complicated, complex and confusing documentation from Microsoft. I want to help SQL Server database administrators learn complex technologies in very simple terms, using analogies that anyone can relate to.

So, if you want to be more confident in deploying, implementing and managing SQL Server Always On Availability Groups, sign up for this 3-day, online training class on December 20-22, 2017 (Wed-Fri) 9:00-5:00 Eastern (14:00-22:00 UTC). This is a live, online training class, hosted in GoToMeeting.

Use this link to register for the training class.

Please note: I reserve the right to delete comments that are offensive or off-topic.

Leave a Reply

Your email address will not be published. Required fields are marked *