Sometimes, you just have to write it down.

A while back, I was looking into a very specific problem with SQL Server Database Mirroring. Yes, I’m aware that Availability Groups are now all the rage. I *did* say it was a while back.

Anyway, I keep getting asked about how database mirroring works, and what with remote working being the order of the day, I thought I’d write it up. Or down. Whatever. This very specific problem was that although the Principal had moved, connections from an application had not.

Let us begin….

Assume we only have one application server. Let’s call that AppServerA.

For this scenario, we have a SQL Server database that is protected from hardware failure by High Safety (Automatic Failover) Database Mirroring.

For that to work, we need 3 servers:

One to act as the principal and host the active database. Let’s call that DBServerA.

One to act as a standby in case DBServerA fails for some reason. Let’s call that DBServerB.

One to act as a witness to the database mirroring session to keep track of which server is acting as the principal, and to arbitrate which server should be the principal in the event of either of the other servers failing. Let’s call that DBServerC.

It is important to remember at this point that AppServerA does *not* know about DBServerC – it has no need to.

In the setup for database mirroring, AppServerA needs a connection string to set up the initial connection to whichever server is the current principal. This is done by adding parameters to the connection string that AppServerA has within its locally stored configuration.

The connection string contains “Server=DBServerA,50505; Failover_Partner=DBServerB,50505”. This tells AppServerA to attempt a connection to DBServerA on TCP port 50505, and in the event of that being unsuccessful, AppServerA should attempt a connection to DBServerB on TCP port 50505. As per best practice, you’ll note that a custom port is used. It’s not massively important what the port number is right now, it’s just one that’s available.

In this way, the *initial* connection to the database should always be directed to the active principal.

In the scenario, AppServerA successfully connects to DBServerA on the first attempt. What happens then is that DBServerA *updates* the “Failover_Partner” information that AppServerA originally got from its connection string, replacing it with whatever DBServerA has recorded in its internal tables. In this case, the partner information is stored as DBServerB\INSTANCE01, not as DBServerB,50505.
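If you want to see exactly what the principal is holding, the mirroring metadata is exposed in sys.database_mirroring. A quick sketch (run on the current principal, and assuming you have permission to query the view):

-- What partner and witness details is this principal holding ?
SELECT  DB_NAME(database_id)        AS [Database],
        mirroring_role_desc,
        mirroring_partner_name,      -- partner endpoint address
        mirroring_partner_instance,  -- server\instance name, e.g. DBServerB\INSTANCE01
        mirroring_witness_name
FROM    sys.database_mirroring
WHERE   mirroring_guid IS NOT NULL;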

At this point, this is of no consequence, as DBServerA is up and functioning correctly. If DBServerA fails and SQLServer on that server is not contactable, AppServerA drops all its existing connections and attempts new connections using its connection string to use DBServerB as the principal. When DBServerA is available again, it assumes the role of the standby (mirror) in case DBServerB fails.

The application on AppServerA is using connection pooling. Connection pooling saves AppServerA resources (CPU and memory) by re-using a connection to the database if one already exists – the initial connection is *not* dropped and recreated, it is simply reset by the next application call that needs a connection to the database. There is a timeout value for idle connections in the connection pool that will drop unneeded connections. At this point, the original connection information has come from DBServerA and has DBServerB\INSTANCE01 as the failover partner.

Imagine that for some reason, there was an interruption in communication between DBServerA and DBServerB. There will be an event recorded in DBServerB’s logs that indicates that the connection between the two servers had been down for 10 seconds or more. In this situation, DBServerB contacts the witness (DBServerC) to confirm that DBServerC can contact DBServerA.

If the witness confirms that DBServerA is online and contactable, DBServerB remains as the standby and data that needs to be transferred from DBServerA will be queued on DBServerA until the link between it and DBServerB is restored.

If the witness confirms that DBServerA is indeed unavailable, DBServerB assumes the role of principal and data that is required to be sent to the standby is queued until DBServerA comes back online. When DBServerA comes back, it assumes the role of standby until such time as DBServerB becomes unavailable.

The specific problem that I was asked to look into was that although DBServerB became the principal (as designed), connection pooling on AppServerA meant that the connections to DBServerA were not dropped, merely recycled. Connectivity between AppServerA and DBServerA was *not* interrupted in the same way as between DBServerA and DBServerB. It is not known at this point what the threshold for idle connections to be dropped from AppServerA’s connection pool is, but it is assumed to be longer than the 10 seconds that it took for DBServerB to assume the principal role. Did I mention that AppServerA is super-massively chatty ? No ? Well, it is, because Developers.

This is where the distinction between ‘DBServerB,50505’ and ‘DBServerB\INSTANCE01’ becomes important. The former is a direct request to a specific TCP port on DBServerB, the latter is a request for DBServerB to provide the TCP port number for the instance named INSTANCE01. To serve this request, AppServerA issues a request to the SQL Browser service on DBServerB to do the instance name to port number translation.

The SQL Browser service (in *all* installations of SQL Server, not just those for database mirroring setups) listens on UDP port 1434 for requests.
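As an aside, if you want to know which TCP port a named instance is actually listening on (so a connection string can go straight to it and skip the Browser lookup entirely), a quick check on the database server itself will tell you. A sketch, assuming you have VIEW SERVER STATE:

-- Which TCP port(s) is this instance listening on for current connections ?
SELECT DISTINCT local_tcp_port
FROM   sys.dm_exec_connections
WHERE  net_transport = 'TCP';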

On many firewalls, port 1434 (all protocols) is not open, so any request to UDP 1434 from AppServerA would get blocked by the firewall and not succeed.

So, given that the initial successful connection to DBServerA replaced the partner information of ‘DBServerB,50505’ with ‘DBServerB\INSTANCE01’, from that point on any failure of DBServerA would result in AppServerA attempting to connect to ‘DBServerB\INSTANCE01’.

This attempt would not succeed because of the firewall rules.

Even when DBServerA becomes available again, it is now functioning as the standby and so cannot serve data requests. If the connection is reused from the connection pool, AppServerA will still be trying to contact DBServerB on UDP port 1434 for the instance-to-port information. As long as those connections keep coming from an application pool that has not been restarted or otherwise cleared, they will only succeed again once DBServerA becomes the principal once more.

There are several ways to recover from this condition. You can bounce DBServerA, bounce the SQL Server instance on DBServerA, or recycle the application pool on AppServerA.

I favour the latter, because there’s NOTHING WRONG with any of the SQL Servers.

And for once, it’s *not* DNS…

Absolutely stunning

Virtual Machines. VMs. Guests. Hypervisees.

Everything’s gone virtual, and we *love* it. We get unlimited power over unlimited compute resources, and in IaaS/MPC a nice familiar GUI so all the Windows SysAdmins feel like they’re contributing.

I’ve never been much of a fan of running SQL Server on virtual machines from waaay back – it used to be touch and go whether your server would stay up, and it wasn’t really clear whose fault it was when it went down. Thankfully, things have moved on: virtualisation is a mature discipline, software is now built to run anywhere, and the location and physical existence of a server are not things we need to worry about right now.

So, assuming we want to do things the hard way and set up a managed private cloud on-prem so the Windows SysAdmins don’t have to worry about networking and firewalls, let’s create an estate of VMWare hypervisors and run *lots* of Windows guest VMs on them. Super. All tickety-boo.

We’re going to want backups with that so let’s leverage all that VMWare goodness, like snapshotting disks once they’ve been quiesced so we get a consistent backup. Hell, let’s live life on the edge and NOT quiesce the OS drives – you know, for fun !

We now have Windows servers (don’t care where or how) and we’re going to put SQLServer on some of them. Don’t worry, it’ll be fine. Trust me. We’ve moved on. Tell you what, we won’t back up the disks your MDFs, LDFs and NDFs are on. Happy now ? Good.

You might very well think that all is well in the world. Your servers are serving, your users are using, and your apps are…. apping ? Whatever. And it’s all being backed up and the DR people are happy.

You are fooling yourself.

Let’s take a closer look at what VMWare actually does when it takes a snapshot.

To get a consistent backup while the OS is running, VMWare uses the Windows Volume Shadow Copy service to create the snapshot. IO requests are then queued somewhere (I’m not *that* into the nuts and bolts of this) while the backup is taken, and then the snapshot is removed.

It is in this removal phase that the queue of stuff that’s waiting to be written to the VM is applied WHILE THE MACHINE IS STILL RUNNING. And what does VMWare do to make sure your hyperactive VM takes a breather to allow those writes ? It ‘stuns’ the machine – the digital equivalent of giving your server a quick slap in the face and telling it to calm the F down. It may need to stun the server several times to complete the outstanding writes.

Don’t worry, though. All of this stunning is for very short periods, and there are checks to make sure it won’t have an impact, right ?

Again, you are fooling yourself.

What VMWare does *before* it starts to stun a server is estimate how long it will take to apply the pending updates. If it thinks that will take less than 15 seconds, it will go ahead (note: It normally takes milliseconds). If it thinks the operation will take more than 15 seconds, it waits for 10 seconds and then has another go.

VMWare will do this estimation process 9 times. On the 10th, it just gives up and gives your server a really hard slap (however long it may take to apply the changes) so that it can complete the job and move on. Your server will be stunned for however long it takes. End of. You can read more about it here. The example quoted is 403 seconds, or a little over 6.5 minutes in old money.

As you might imagine, as a SQLServer DBA with high-transaction applications, I’m not 100% happy with my precious being slapped about or otherwise Tasered just because someone went and spent all that time and effort doing an on-prem MPC.

VMWare themselves have admitted that if you’re running busy SQL Servers on their software, you probably shouldn’t use their backup process and use an agent-based solution on the guest instead (Section 3.8.2.5 refers – and yes, the numbering sequence in the document is truly messed up).
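For what it’s worth, the sort of thing those agent-based solutions end up driving is a plain native backup taken inside the guest. A minimal sketch, with the database name and path made up for illustration:

-- Native full backup run inside the guest; database name and path are placeholders
BACKUP DATABASE MyDB1
TO DISK = N'H:\SQLBackups\MyDB1_Full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;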

Virtual Machines. VMs. Guests. Hypervisees.

Be careful what you wish for….

On withdrawing my session from SQLBits

This morning, I shared this on Twitter.

This post is some more information…. I’m not sure it is information, it’s more about feelings, anxiety, etc. You know, standard 2020 stuff.

The response from the #SQLFamily has been overwhelming, and affirms my long held belief that SQL Server people are among the very best people in the world.

[apologies for the next bit – it’s taken from messages I’ve already sent privately but I’d like to share the text here and expand a bit]

It’s not been the best of years to be honest, and on a personal note 2019 ended badly with a bereavement followed by another in the early part of this year. I’ve been a bit adrift work-wise for a couple of months, and the ever-changing times we find ourselves in meant that I was lurching from one thing I couldn’t control to another thing I had even less chance of dealing with.

It’s weird, all this stuff. Lockdowns, home schooling, home working, and international issues have forced me out of my bubble and put many terrible things front and center. I thought I was doing OK. Sure, there were good days and bad days, but on balance, things weren’t too bad.

I didn’t realise how much it had all got to me until I started to prep for recording my session and I just couldn’t do it. Like, total freeze. I couldn’t even sit at the keyboard and read what I’d done previously. I felt that ‘No one wants to see this. What *on earth* were you thinking??’. I guess part of the anxiety was seeing all these great people lining up great content, and I just *knew* I couldn’t live up to that. I’m not sure I could have given SQLBits the content they deserve.

Another large part of that was that with other things going on in the world, I couldn’t help but feel, well, helpless. So, although my heart was very much in it, my head just wouldn’t put all this useless noise to one side and then it kept telling me that I couldn’t do it even if I tried every time I even thought about it.

It’s not like I’m broken or anything, but I needed to make one change so I can make others. I’m really sorry it had to be SQLBits, and I feel a bit like I’ve let you all down. Trust me, I’ll get over it and I’ll be back presenting as soon as it’s right.

Admitting this on Twitter is the best thing that happened this year. Well, maybe not the actual best, but it’s in the top two. Top three, definitely. After eating the frog this morning and actually formally withdrawing, I’ve admitted to myself that I can’t deal with it right now and it turns out that is OK. It’s also removed that voice that tells me I can’t do other stuff either. This is a Good Thing, because now I can get back to doing what I love.

So, thank you for reading, thank you for having my back, and most importantly thank you for being such a supportive part of my life. I hope to pay that back in the future.

 

How to **** everything up

I originally wrote this back in September 2019 on another blog platform, but thought it worth reproducing here because 2020…

So, apropos of a complete breakdown in service, here’s a few handy tips and action points to make sure your legacy lives on in the most psychologically painful way possible.

They also have the useful side effect of demoralising your team so much that they actively consider leaving a good job and a great employer because they’re unchallenged, unvalued, and unheard.

  1. When choosing a new technology for your product or service, actively consider how badly you messed it up last time, and seek at all times to replicate those ‘winning’ strategies.
  2. Consider how the new technology can be bent to your will.
  3. How will you hobble performance ?
  4. How will you make your new design less resilient ?
  5. What’s the lowest level of function that you can provide and could you push that bar even lower ?
  6. When putting together your team, get the brightest minds and force them into scoping meetings that last for weeks but don’t actually have an output.
  7. Getting money to do things is hard – always try to foster a ‘who will pay for this?’ mindset.
  8. Embrace new ways of working, but only as far as they support and perpetuate your existing business practices – no one likes change.
  9. If change *must* happen, make sure it’s delayed for as long as possible, but announce delays in small increments (or better still, not at all).
  10. If your customer is important, make sure that everything is process-centric – consistent customer service is better than good customer service.
  11. You’re a person outside of work – when you enter the workplace, don’t let on that you’re a functioning member of society.
  12. Never ask a direct question – words mean far more than actions.
  13. If buying a product off-the-shelf, make sure that customisations you implement mean that it can never be upgraded – you made it better, so that’s how it’ll stay. This is especially important if considering SaaS.
  14. Ask your user community what features of technology they’d like to see, then implement policies or processes that mean those goals can never be reached.
  15. In the event of your solution being accidentally viable, create management and authorisation frameworks so labyrinthine that people give up trying to request enhancements or alterations.
  16. Remember, even a delivery of 10% of the promised functions equates to a 100% successful delivery.
  17. Celebrate success with members of staff who had little or no input and who in all likelihood were not even aware of your project.
  18. If you’re making progress, claim that one in front of senior management – your team is all behind you.
  19. If you’re not making progress, micromanage. Two-hourly status calls are a great way to focus the team on solving problems and removing time-sapping blockers.
  20. When preparing presentations, consider your audience. Put as many words on a slide as you can, and ensure each slide contains information on at least three unrelated topics.
  21. When presenting, always stand behind your screen. Make sure to say ‘I’m not just going to read from the slide’ and then read from the slide.
  22. Block-book 15-seater meeting rooms well in advance, then use them for 1-to-1’s every other week.
  23. Bookend each workday with all-team meetings with titles like ‘DIARY-BUSTER’ and ‘ALL MUST ATTEND’.
  24. Send those awkward ‘bomb-shell’ emails out of hours. Copy in as many people as you think might make it look important.
  25. Above all, remember there is no ‘I’ in ‘Team’. There is an ‘I’ in failure, but that’s not an option.

 

This is by no means an exhaustive list, but hopefully they can help you be a better version of you if you’re into that kind of thing. Or you could just have read it, recognised some behaviours, smiled nervously, nodded sagely and moved on.

The decision, as they say, is YOURS !

Hunting high and log

Sometimes, you just lose stuff. There. I said it.

Well, maybe not ‘lose’ as such, but database files can be put anywhere by anyone with permissions, and that’s not always a good thing. Keeping everything nice and Kentucky is a DBA trait, so how do you track down where things are ?

You can use WMI (oof), PowerShell (oofoof) or maybe even SQLServer DMVs (tell me more).

I’ve never really got to grips with PowerShell – it’s on my list, for sure – but there are some occasions where PowerShell might not work for you, for example in restricted environments where you don’t have access to the OS directly. Here’s a way of getting database and mount point space information if you only have ‘public’ access to master…

DMVs are one of my favourite things ever. There’s just so many of them these days and we’re getting more all the time. My *other* favourite thing is CROSS APPLY – also known as the ‘fairy join’ – because I don’t know what it does and yet I can use it to get information that might be hard to join up otherwise !

Let’s put these two to work together to get a list of databases files, where they are, and how much space there is free on the disk where they reside:

--All database mdf mount points and disk free space in MB
SELECT DISTINCT
       DB_NAME(f.database_id)             AS [Database],
       vs.volume_mount_point,
       (vs.total_bytes / 1024) / 1024     AS [Total MB],
       (vs.available_bytes / 1024) / 1024 AS [Available MB]
FROM   sys.master_files AS f
CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id) AS vs
WHERE  f.file_id = 1;

 

Database         volume_mount_point   Total MB    Available MB
tempdb           H:\SQLTemp\            557733          545029
MyDB1            H:\SQLData\          12046999          346111
MyTestDatabase   H:\SqlData\          12046999          346111
OtherDB1         H:\SQLLogs\            334638          142694

 

You’ll see here that I’ve taken the easy route and used file_id = 1. That’s the MDF – the primary database file – and you *must* have one of those. Oh, and I can see that someone created a database with its MDF on a mount point intended for transaction logs. I’ll follow that one up later…

A small tweak, and you’ve got the same information for log files (file_id = 2, the LDF):
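Something like this should do it. It’s the same query as before, just with the filter pointed at the log file:

--All database ldf mount points and disk free space in MB
SELECT DISTINCT
       DB_NAME(f.database_id)             AS [Database],
       vs.volume_mount_point,
       (vs.total_bytes / 1024) / 1024     AS [Total MB],
       (vs.available_bytes / 1024) / 1024 AS [Available MB]
FROM   sys.master_files AS f
CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id) AS vs
WHERE  f.file_id = 2;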

 

Database         volume_mount_point   Total MB    Available MB
tempdb           H:\SQLLogs\            334638          142694
MyDB1            H:\SQLLogs\            334638          142694
MyTestDatabase   H:\SQLLogs\            334638          142694
OtherDB1         H:\SQLLogs\            334638          142694

 

That’s better – everything’s lined up nicely and where it should be, except that tempdb’s transaction log appears not to be on a dedicated disk. Another item for the backlog…

You might be wondering why I’ve used the DISTINCT keyword in the script. That’s because if you have multiple NDFs (secondary database files) you only really want the one entry back. I haven’t included a script for NDFs (file_id > 2) because I’m not in control of where *you* put them. Besides, giving you everything takes some of the fun out of it, don’t you think ?