Someone writing in a note book

Sometimes, you just have to write it down.

A while back, I was looking into a very specific problem with SQL Server Database Mirroring. Yes, I’m aware that Availability Groups are now all the rage. I *did* say it was a while back.

Anyway, I keep getting asked about how database mirroring works, and what with remote working being the order of the day, I thought I’d write it up. Or down. Whatever. This very specific problem was that although the Principal had moved, connections from an application had not.

Let us begin….

Assume we only have one application server. Let’s call that AppServerA.

For this scenario, we have a SQL Server database that is protected from hardware failure by High Safety (Automatic Failover) Database Mirroring.

For that to work, we need 3 servers:

One to act as the principal to be the active database. Let’s call that DBServerA.

One to act as a standby in case DBServerA fails for some reason. Let’s call that DBServerB.

One to act as a witness to the database mirroring session to keep track of which server is acting as the principal, and to arbitrate which server should be the principal in the event of either of the other servers failing. Let’s call that DBServerC.

It is important to remember at this point that AppServerA does *not* know about DBServerC – it has no need to.

In the setup for database mirroring, AppServerA needs a connection string to set up the initial connection to whichever server is the current principal. This is done by adding parameters to the connection string that AppServerA has within its locally stored configuration.

The connection string contains “Server=DBServerA,50505; Failover_Partner=DBServerB,50505”. This tells AppServerA to attempt a connection to DBServerA on TCP port 50505, and in the event of that being unsuccessful, AppServerA should attempt a connection to DBServerB on TCP port 50505. As per best practice, you’ll note that a custom port is used. It’s not massively important what the port number is right now, it’s just one that’s available.

In this way, the *initial* connection to the database should always be directed to the active principal.

In the scenario, AppServerA successfully connects to DBServerA on the first attempt. What happens then is DBServerA *updates* the information for the “Failover_Partner” that AppServerA originally got from its connection string to whatever DBServerA has recorded in its internal tables. In this case, the partner information is stored as DBServerB\INSTANCE01, not as DBServerB,50505.

At this point, this is of no consequence, as DBServerA is up and functioning correctly. If DBServerA fails and SQLServer on that server is not contactable, AppServerA drops all its existing connections and attempts new connections using its connection string to use DBServerB as the principal. When DBServerA is available again, it assumes the role of the standby (mirror) in case DBServerB fails.

The application on APPServerA is using connection pooling. Connection pooling saves AppServerA resources (CPU and memory) by re-using a connection to the database if one already exists – the initial connection is *not* dropped and recreated, it is simply reset by the next application call that needs a connection to the database. There is a timeout value for idle connections in the connection pool that will drop unneeded connections. At this point, the original connection information has come from DBServerA and has DBServerB\INSTANCE01 as the failover partner.

Imagine that for some reason, there was an interruption in communication between DBServerA and DBServerB. There will be an event recorded in DBServerB’s logs that indicates that the connection between the two servers had been down for 10 seconds or more. In this situation, DBServerB contacts the witness (DBServerC) to confirm that DBServerC can contact DBServerA.

If the witness confirms that DBServerA is online and contactable, DBServerB remains as the standby and data that needs to be transferred from DBServerA will be queued on DBServerA until the link between it and DBServerB is restored.

If the witness confirms that DBServerA is indeed unavailable, DBServerB assumes the role of principal and data that is required to be send to the standby is queued until DBServerA comes back online. When DBServerA comes back, it assumes the role of standby until such time as DBServerB becomes unavailable.

The specific problem that I was asked to look into was that although DBServerB became the principal (as designed), connection pooling on AppServerA meant that the connections to DBServerA were not dropped, merely recycled. Connectivity between AppServerA and DBServerA was *not* interrupted in the same way as between DBServerA and DBServerB. It is not known at this point what the threshold for idle connections to be dropped from AppServerA’s connection pool is, but it assumed to be longer than the 10 seconds that it took for DBServerB to assume the principal role. Did I mention that AppServerA is super-massively chatty ? No ? Well, it is, because Developers.

This is where the distinction between ‘DBServerB,50505’ and ‘DBServerB\INSTANCE01’ becomes important. The former is a direct request to a specific TCP port on DBServerB, the latter is a request for DBServerB to provide the TCP port number for the instance named INSTANCE01. To serve this request, AppServerB issues a request to the SQL Browser service on DBServerB to do the instance name to port number translation.

The SQL Browser service (in *all* installations of SQL Server, not just those for database mirroring setups) listens on UDP port 1434 for requests.

On many firewalls, port 1434 (all protocols) is not open, so any request to UDP 1434 from AppServerA would get blocked by the firewall and not succeed.

So, given that the initial successful connection to DBServerA replaced the partner information of ‘DBserverB,50505’ with ‘DBServerB\INSTANCE01’, any failure of connection from that point on would result in AppServerA attempting to connect to ‘DBServerB\INSTANCE01’ in the event of DBServerA being unavailable.

This attempt would not succeed because of the firewall rules.

Even when DBServerA becomes available again, it is now functioning as the standby and so cannot serve data requests. If the connection is reused from the connection pool, AppServerA will still be trying to contact DBServerB on UDP port 1434 for the instance to port information. Only when DBServerA becomes the principal again will connections from AppServerA be successful if those connections are coming from an application pool that has not been restarted or otherwise cleared.

There are several ways to recover from this condition. You can bounce DBServerA, bounce the SQL Server instance on DBServerA, or recycle the application pool on AppServerA.

I favour the latter, because there’s NOTHING WRONG with any of the SQL Servers.

And for once, it’s *not* DNS…

Leave a comment