When A Log Backup Does Not Truncate Your SQL Server Log Files In An Availability Group

This was one of the questions asked during my Monitoring SQL Server Availability Group webinar two weeks ago. If you want to get notified for upcoming webinars, be sure to sign up for my email newsletter.

There are only two ways that can truncate your SQL Server log files – a checkpoint process when the database is in simple (or pseudo-simple) recovery model or a log backup when the database is in full or bulk-logged recovery models. In order for log truncation to occur, the virtual log files (VLFs) need to be full and marked as inactive. SQL Server will note that those VLFs can be overwritten the next time the transaction log file wraps around. This is true regardless of whether your database is running as a standalone or participating in a replication configuration (database mirroring, replication, Availability Groups.)

There are several reasons why SQL Server won’t truncate the log file even when you’re in simple recovery model or running regular log backups. Have a look at the list here in SQL Server Books Online and check out the log_reuse_wait_desc value of AVAILABILITY_REPLICA.  The question asked during the webinar was: “How come my transaction log file on the primary replica still grows unexpectedly even when I run regular backups?

The Availability Group Secondary Replica Is Not At Par With The Primary

From the definition, it says that “A secondary replica of an availability group is applying transaction log records of this database to a corresponding secondary database. ” This means that the log backups on the primary will not truncate the log if ALL of the secondary replicas have not completed the processing of the log records for REDO*.

This blog post was referenced in one of the discussions between Microsoft Data Platform MVPs. There have been some misconceptions on the processes behind when the transaction log gets truncated on the primary replica (I was confused myself in the good-old-days of database mirroring). Paul Randal (Twitter | blog)  pointed out the ambiguity in the post and provided a great definition for when the transaction log gets truncated on the primary replica:

“The virtual log file (VLF) on the primary *cannot* be overwritten/truncated until it is no longer required on the secondaries, as the log block write on the primary would cause the log block to be written to the same location on the secondary logs, which cannot be allowed to happen until that VLF has been processed for redo

Thank you, sir.

There are several side effects that you can take away from this definition, all of which you should really pay attention to.

  • The transaction log file of your primary replica database will continue to grow if there are no inactive portions of the log to truncate
  • A transaction log file growth is considered a transaction. It will generate transaction log records that will get sent to your secondary replicas, causing their transaction log file to also grow
  • Crash recovery starts reading transaction log records from the start of the active portion of the log. It means the UNDO portion of crash recovery will take longer when you failover to any of the secondary replica
  • Your primary replica is at the mercy of the SLOWEST SECONDARY REPLICA

The last item is one that you should really pay attention to. Often times, the Availability Group secondary replicas are provisioned primarily as a standby system – they only exist for the “just-in-case-something-happens” situation.  So they are either hardware with lower specifications than the primary replica or are virtual machines running on a physical host with lots of other virtual machines. This causes the REDO portion of crash recovery to take longer because of slower resources on the secondary.

There are several things that you can do to check this.

  1. Monitor the log_reuse_wait_desc value of the sys.databases catalog and see if you get AVAILABILITY_REPLICA. This will show up if your primary replica is waiting for a secondary replica to complete the processing of log records for a REDO.
  2. Monitor the SQL Server:Database Replica > Redo Bytes Remaining Performance Monitor counter on the secondary replicas. If this counter continues to increase, it means that the secondary replica could not process the log records fast enough for REDO, causing the primary replica to wait and for SQL Server to not be able to truncate the log on the primary
  3. Monitor the SQL Server:Database > Log bytes flushed\sec Performance Monitor counter on the primary replica. This is a good counter to monitor regardless of whether or not your database is in an Availability Group. This will tell you the amount of log records generated by transactions on the database. This will definitely show you some things that you can do to optimize your transaction log. For example, you might be unnecessarily rebuilding all of your indexes which generates a lot of transaction log records. Or maybe you can modify your existing table schemas to use the smallest possible data type that meets your business requirements. In a previous blog post, I highlighted how a simple data type change can impact the amount of transaction log records generated and its effects on your high availability and disaster recovery solutions.

Of course, this assumes that you are only running your Availability Groups secondary replicas as a hot standby for high availability purposes. But what about if you are using your Availability Groups secondary replicas for read-only workloads? I won’t have to tell you that you need to have the readable secondary replica licensed before doing so. In a future blog post, I will explain and demonstrate how using your Availability Groups secondary replicas for read-only workloads can affect log truncation on the primary replica.

Additional Resources


Feeling helpless and confused when dealing with Windows Server Failover Clustering  (WSFC) for your SQL Server databases?

You’re not alone. I’ve heard the same thing from thousands of SQL Server administrators throughout my entire career. These are just a few of them.

“How do I properly size the server, storage, network and all the AD settings which we do not have any control over?”

“I don’t quite understand how the Windows portion of the cluster operates and interacts with what SQL controls.”

“I’m unfamiliar with multi-site clustering.”

Our servers are setup and configured by our parent company, so we don’t really get much experience with setting up Failover Clusters.

If you feel the same way, then, this course is for you. It’s a simple and easy-to-understand way for you to learn and master how Windows Server Failover Clusters can keep your SQL Server databases highly available. Be confident in designing, building and managing SQL Server databases running on Windows Server Failover Clusters.

But don’t take my word for it. Here’s what my students have to say about the course.

“The techniques presented were very valuable, and used them the following week when I was paged on an issue.”

“Thanks again for giving me confidence and teaching all this stuff about failover clusters.”

“I’m so gladdddddd that I took this course!!”

“Now I got better knowledge to setup the Windows FC ENVIRONMENT (DC) for SQL Server FCI and AlwaysON.”

NOTE: Registration for my online course Windows Server Failover Clustering (WSFC) for the Smart SQL Server DBA will re-open in January 2018. But be sure you do not miss out. This will be the last time that the course will be offered. After this, you will no longer be able to register for the course.

 

Get notified about the next batch of enrollment so you don’t miss out.

* indicates required




Please note: I reserve the right to delete comments that are offensive or off-topic.

Leave a Reply

Your email address will not be published. Required fields are marked *