But I’ll tell you what hermits realize. If you go off into a far, far forest and get very quiet, you’ll come to understand that you’re connected with everything.
– Alan Watts –
WARNING: There will be a ton of acronyms in this blog post. Make sure to keep your notebook handy for note-taking reference.
Whenever I talk to customers about SQL Server Availability Groups (AG) and Failover Clustered Instances (FCI,) I make sure I highlight the importance of Active Directory (AD) and DNS in their configuration. That’s because most of the presentations, video tutorials and resources that I see online only focus on the underlying Windows Server Failover Cluster (WSFC) layer. While it is worthwhile to really dig deep into the WSFC aspect, we sometimes miss the other components that make this technology work. Even worse, we fail to realize that those outside of WSFC play a crucial role into keeping our SQL Server databases highly available. Let me explain.
When I was still working as a data center engineer, one of the regular tasks assigned to me was to install Microsoft security patches released every month. I was responsible for patch management for all of the Windows servers in the data center, using Windows Software Update Services, PowerShell/VB scripting and AD to properly install security patches. We have a process in place that separates different environments – development, staging, production, DR and infrastructure services – for patch management. Then, the time came to patch the production servers. As I was rebooting the active (I can still call it active since it’s a Windows Server 2003 running SQL Server 2005) node on one of our SQL Server clusters, one of the systems engineers rebooted two out of the three domain controllers in the production environment. Unaware that my SQL Server FCI was impacted by the domain controllers being unavailable and that it should successfully failover to the standby node, I did my usual PING test to make sure that the nodes were available. After about five minutes into the process, I started getting email alerts about the SQL Server FCI being unavailable. This got me thinking. Why would my FCI be unavailable when all I did was to reboot a cluster node? Shouldn’t that automatically cause it to failover on the standby node? And, so, I thought.
The Authentication Piece
Versions of Windows Server prior to 2012 R2 depended heavily on AD. That’s because when you create a WSFC, a computer name object (CNO) gets created in AD. When you create a FCI in a WSFC, that also creates a virtual computer object (VCO) in AD. Refer to the screenshot below of a WSFC named AG-CLUS and an AG listener named AG-PROD-LN.
That means that, while totally outside the scope of a typical SQL Server DBA, SQL Server AG (since it also depends on WSFC) and FCI are now at the mercy of AD. That also means that your database availability are now dependent on AD. That’s what happened to my SQL Server FCI when I rebooted one of the nodes.
The WSFC is no different from any other computers in AD. It is required that computers that will be used as nodes in a WSFC be members of the same AD domain. Computer accounts – similar to user accounts – authenticate the identity of the computer in AD to authorize or deny access to domain resources. Also, similar to user accounts, the computer accounts have passwords – we just don’t know them like we do with user accounts. The computer account communicates directly with AD for all of the things that we are familiar with concerning user accounts – password policies, password changes, ACLs, group policies, etc. Just imagine a computer account, after rebooting, connecting to the network and asking AD to authenticate so it can do what it needs to do.
When I rebooted the cluster node that I patched, the WSFC simply took the FCI offline as part of the failover process. Think of this as the computer getting rebooted. The WSFC, then, has to bring the FCI online on the standby node. As it brings the FCI on the standby node, the VCO that corresponds to the FCI has to authenticate in AD by finding the nearest domain controller in the domain. This is influenced by your primary/preferred and secondary/alternate DNS configuration on the network adapter.
screenshot of the TCP/IP configuration of my domain controller
If the VCO could not find any domain controller to authenticate, like what happened when one of our systems engineers rebooted two out of three domain controllers, it won’t come online on the WSFC. And since the SQL Server FCI clustered resource is dependent on the VCO (also known as the virtual network name,) the FCI won’t come online. The same is true when the VCO gets either disabled or deleted
But It’s Someone Else’s Job
Managing AD is definitely outside the scope of any SQL Server DBA’s job description (unless you’re originally an AD guy who now manages the SQL Server databases or you’re the only IT guy in your organization.) And since we’re not responsible for that, we wouldn’t know what is being done in AD that may impact availability of our SQL Server databases. Like what happened to my FCI after rebooting one of the cluster nodes, I expected that the cluster would failover the FCI to the standby node as it always has. It may be true that it’s someone else’s job, the reality is that my job as a SQL Server DBA now depends on it. As I highlight in most of my WSFC presentations, “we SQL Server DBAs are now responsible for things that we don’t even know we’re responsible for.” Besides, we still need to write the root cause analysis (RCA) report after the incident.
Talk to your AD guy. Be nice to him. Heck, buy him coffee every once in a while. Understand what his typical day looks like. And, while you’re at it, ask how you can help him make his job easier. I’m not suggesting that you take on the role of managing AD, managing SQL Server databases is still more fun and exciting, as far as I’m concerned. If you’re on his side, it would be easier to tell him that your job now depends on his. To meet your database recovery objectives and service level agreements, you need to know who’s on your pack. And they need to know that you’re on the same page.
Or Upgrade to Windows Server 2012 R2
Windows Server 2012 R2 introduced AD-detached clusters. The idea behind this is that you can now deploy a WSFC that does not require creating a CNO in AD. That being said, any cluster-aware application that you run on top of the WSFC – AG and FCI, for example – will not have corresponding VCOs as well. The cluster nodes still need to be member servers in an AD domain. This decouples your SQL Server availability from anything happening in AD.
Deploying AD-detached clusters is only available using Windows PowerShell (sorry, GUI folks, this is where command-line trumps mouse-clicking.)
In the example code, I’m creating a WSFC named, AG-CLUS using the servers AG1 and AG2 as nodes, passing a static IP address of 172.16.0.57 but only using DNS as the administrative access point. I still need DNS for my WSFC because that’s how my client applications will access my WSFC and all the cluster-aware applications running on top of it. The only caveat here at the moment is that you can only use SQL Server (mixed mode) authentication for your AG and FCI.
The more we are aware of the dependencies that our databases have on external systems, the better we can prepare for and include them when designing and maintaining highly available SQL Server databases. AD and DNS are just two of the other dependencies that you need to consider when deploying SQL Server AGs and FCIs.
- Failover Cluster Step-by-Step Guide: Configuring Accounts in Active Directory
- How Domain Controllers Are Located in Windows (still applicable even though it says Windows 2000)
- Failover Clustering and Active Directory Integration
- Deploy an Active Directory-Detached Cluster
IF YOU ENJOYED THIS POST, GET MORE UPDATES STRAIGHT TO YOUR INBOX.