Have you ever been made aware of an item in your household that seems common and familiar but could have an impact on your safety? Take that little remote you tuck in the sun visor of your car to open your garage. In case you are not aware, someone might be able to hack your garage door open without you knowing.
Similarly, when you are deploying or managing a Windows Server Failover Cluster (WSFC) for either a traditional SQL Server Failover Clustered Instance (FCI) or an Availability Group (AG), there are properties that we almost always ignore or, even worse, are not even aware of. These properties affect the availability of your WSFC, which in turn affects all of the clustered applications running on top of it. I was not aware of them myself until I started designing and deploying multi-subnet WSFCs in Windows Server 2008. These properties are SameSubnetDelay and SameSubnetThreshold.
The WSFC Heartbeat
Inter-node communication is critical to the proper operation of a WSFC. This is where the concept of a heartbeat comes in. The heartbeat is the communication between nodes in a cluster that determines their status. This communication exists only within the cluster and is transported through the available network adapters in the nodes. I simply look at the heartbeat communication as a means for the cluster to know what is going on with its members. Imagine the cluster asking the nodes, “are you OK?” on a regular basis.
The SameSubnetDelay property is the interval between one “are you OK?” question and the next. This property value is set in milliseconds, with a default of 1000, or one (1) second. That means the WSFC will ask the “are you OK?” question to all of the nodes every second. Now, if this was my friend asking the question, it could be very annoying to hear it every second. It’s like getting a text message on my phone every second and having to make sure I respond immediately.
The SameSubnetThreshold property defines how many consecutive times the question can go unanswered before the WSFC concludes that there is something wrong. This property value is a whole number, with a default of five (5). That means that if the WSFC asks the “are you OK?” question five times in a row without getting a response, which takes about five seconds with the default delay, it treats the node as unavailable. It’s like getting a text message on my phone every second for 5 seconds but not responding to any of them. My friend would probably panic and assume that there’s something wrong. Now, you might be thinking, “how can that be so important?” I’m glad you asked.
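To make the arithmetic concrete, here is a minimal Python sketch of how the two properties combine. It is only a simplified model of the behavior described above, not how the cluster service is implemented, and the function names (detection_window_ms, first_failure_index) are made up for this illustration.

```python
from typing import Iterable, Optional


def detection_window_ms(delay_ms: int = 1000, threshold: int = 5) -> int:
    """Approximate time before an unresponsive node is considered unavailable.

    Defaults mirror the values quoted in this post: SameSubnetDelay of
    1000 ms and SameSubnetThreshold of 5.
    """
    return delay_ms * threshold


def first_failure_index(responses: Iterable[bool], threshold: int = 5) -> Optional[int]:
    """Return the index of the heartbeat at which the node is declared down.

    `responses` holds one entry per heartbeat: True if the node answered,
    False if it did not. Only consecutive misses count; any answer resets
    the counter. Returns None if the threshold is never reached.
    """
    missed = 0
    for index, answered in enumerate(responses):
        missed = 0 if answered else missed + 1
        if missed >= threshold:
            return index
    return None


if __name__ == "__main__":
    print(detection_window_ms())  # delay x threshold, in milliseconds
    sample = [True, False, False, True, False, False, False, False, False]
    print(first_failure_index(sample))
```

Notice that the two early misses in the sample do not count toward the failure, because the node answered again before the threshold was reached.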
How the WSFC Heartbeat Affects the Quorum
In a previous blog post, I talked about why quorum matters and how it affects the availability of the WSFC. If a node in a WSFC does not respond within the configured SameSubnetDelay and SameSubnetThreshold values, it is considered to be unavailable and, therefore, cannot vote towards the quorum. Eventually, when the WSFC no longer has a majority of votes because of the unavailable nodes, it will take itself offline. Unfortunately, in a traditional 2-node WSFC configuration where both nodes in a cluster are in the same data center, we barely even notice these properties. In the past, it was common to use cross-over cables to connect two servers directly for dedicated heartbeat communications; for more than two nodes in a WSFC, a dedicated router/switch was used. Because the cluster heartbeat communication goes through a dedicated network path, there are no interruptions or noticeable latency.
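The majority rule can be sketched in a few lines of Python. This is a deliberately simplified model: it assumes every voter (nodes plus any witness) carries exactly one vote and ignores features such as dynamic quorum. The function name has_quorum is hypothetical.

```python
def has_quorum(total_votes: int, available_votes: int) -> bool:
    """Return True if the available voters form a strict majority.

    Assumption for this sketch: each voter has exactly one vote, and
    `total_votes` already includes any witness.
    """
    if total_votes <= 0 or not 0 <= available_votes <= total_votes:
        raise ValueError("vote counts are out of range")
    return available_votes > total_votes // 2


if __name__ == "__main__":
    # Three voters, one missed its heartbeats: two of three is still a majority.
    print(has_quorum(total_votes=3, available_votes=2))
    # Four voters, two unavailable: exactly half is not a majority.
    print(has_quorum(total_votes=4, available_votes=2))
```

In this model, every node that misses enough heartbeats lowers available_votes by one, which is how a network hiccup can end up taking the whole cluster offline.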
The Appropriate Values For These Properties
In a perfect world, we don’t really need to change these default values. But as more components are added to your network infrastructure (virtualization, network routing, firewalls, etc.) on top of the traffic that is already going through it, the heartbeat communication might suffer. Imagine driving on a highway with five lanes. Even on a very wide road, traffic congestion will not allow you to go at your usual speed. But even on a single-lane road, if you are the only one using it, you are guaranteed to go at the recommended speed. It is the same with the heartbeat communications. While it is OK to accept the default values of 1000 milliseconds (one second) and five (5) for the SameSubnetDelay and SameSubnetThreshold properties, respectively, you may need to modify them to suit your network. Talk to your network engineers about the current traffic that goes through your network. They will have a profile of the network traffic: what time of the day the network is busy, what application is consuming most of the network traffic, etc. Measure the network latency between nodes in your WSFC. If you currently only have two nodes in your WSFC, a cross-over cable can still be used for dedicated cluster heartbeat communications. You just need to document everything in case you decide to add nodes to your WSFC. Of course, in the modern data center setting, I doubt that you have access to the physical servers, or that they are even physical servers at all.
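One rough way to reason about whether the defaults leave enough room is to look at a sample of heartbeat-style probes between two nodes and see how close the worst streak of misses came to the threshold. The Python sketch below assumes you have already collected such a sample as a list of True/False values. How you collect it is up to you and your network engineers, and the function names are hypothetical.

```python
from typing import Iterable


def longest_miss_run(responses: Iterable[bool]) -> int:
    """Length of the longest streak of consecutive unanswered probes."""
    longest = current = 0
    for answered in responses:
        current = 0 if answered else current + 1
        longest = max(longest, current)
    return longest


def threshold_headroom(responses: Iterable[bool], threshold: int = 5) -> int:
    """How many more consecutive misses the sample could have absorbed.

    A result of zero or less means the worst streak in the sample would
    have reached the threshold and the node would have been treated as
    unavailable.
    """
    return threshold - longest_miss_run(responses)


if __name__ == "__main__":
    # Hypothetical sample: one probe per SameSubnetDelay interval.
    sample = [True, False, False, True, False, True, True]
    print(longest_miss_run(sample))
    print(threshold_headroom(sample))
```

A small headroom during the busy periods your network engineers point out is a hint that the values deserve a closer look. It is not a rule for what the new values should be.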
- Tuning Failover Cluster Network Thresholds
- Windows Server 2008 Failover Clusters: Networking (Part 1) (still applies to Windows Server 2012 and higher)
- SQL Server 2012 Multi-Subnet Cluster Part 2 (an article I wrote about how this relates to SQL Server workloads)