Absolutely stunning

Virtual Machines. VMs. Guests. Hypervisees.

Everything’s gone virtual, and we *love* it. We get unlimited power over unlimited compute resources, and in IaaS/MPC (managed private cloud) a nice familiar GUI so all the Windows SysAdmins feel like they’re contributing.

I’ve never been much of a fan of running SQL Server on virtual machines from waaay back – it used to be touch and go whether your server would stay up, and it wasn’t really clear whose fault it was when it went down. Thankfully, things have moved on: virtualisation is a mature discipline, software is now built to run anywhere, and the location and physical existence of a server are not things we need to worry about right now.

So, assuming we want to do things the hard way and set up a managed private cloud on-prem so the Windows SysAdmins don’t have to worry about networking and firewalls, let’s create an estate of VMware hypervisors and run *lots* of Windows guest VMs on them. Super. All tickety-boo.

We’re going to want backups with that, so let’s leverage all that VMware goodness, like snapshotting disks once they’ve been quiesced so we get a consistent backup. Hell, let’s live life on the edge and NOT quiesce the OS drives – you know, for fun!

We now have Windows servers (don’t care where or how) and we’re going to put SQL Server on some of them. Don’t worry, it’ll be fine. Trust me. We’ve moved on. Tell you what, we won’t back up the disks your MDFs, LDFs and NDFs are on. Happy now? Good.

You might very well think that all is well in the world. Your servers are serving, your users are using, and your apps are… apping? Whatever. And it’s all being backed up and the DR people are happy.

You are fooling yourself.

Let’s take a closer look at what VMware actually does when it takes a snapshot.

To get a consistent backup while the OS is running, VMware uses the Windows Volume Shadow Copy Service (VSS) to create the snapshot. IO requests are then queued somewhere (I’m not *that* into the nuts and bolts of this) while the backup is taken, and then the snapshot is removed.

It is in this removal phase that the queue of stuff that’s waiting to be written to the VM is applied WHILE THE MACHINE IS STILL RUNNING. And what does VMware do to make sure your hyperactive VM takes a breather to allow those writes? It ‘stuns’ the machine – the digital equivalent of giving your server a quick slap in the face and telling it to calm the F down. It may need to stun the server several times to complete the outstanding writes.

Don’t worry, though. All of this stunning is for very short periods, and there are checks to make sure it won’t have an impact, right?

Again, you are fooling yourself.

What VMware does *before* it starts to stun a server is estimate how long it will take to apply the pending updates. If it thinks that will take less than 15 seconds, it will go ahead (note: it normally takes milliseconds). If it thinks the operation will take more than 15 seconds, it waits for 10 seconds and then has another go.

VMware will do this estimation process 9 times. On the 10th, it just gives up and gives your server a really hard slap so that it can complete the job and move on. Your server will be stunned for however long it takes. End of. You can read more about it here. The example quoted is 403 seconds, or about 6.7 minutes in old money.
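If it helps to see that retry dance written down, here is a rough Python sketch of the logic as described above. It is a toy model, not VMware’s code: the function and parameter names (`consolidate`, `estimate_seconds`, `stun_and_apply`) are made up for illustration, and the only numbers in it are the ones quoted in this post.

```python
import time
from typing import Callable, Tuple

# Figures quoted in the post; everything else here is hypothetical.
MAX_STUN_SECONDS = 15    # an estimate below this means "go ahead and stun"
RETRY_WAIT_SECONDS = 10  # pause before having another go
MAX_ESTIMATES = 9        # estimation attempts before giving up on being polite


def consolidate(
    estimate_seconds: Callable[[], float],
    stun_and_apply: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bool]:
    """Toy model of the snapshot-removal retry loop.

    Returns (attempt_number, forced), where forced is True when the
    stun happened regardless of how long the estimate said it would take.
    """
    for attempt in range(1, MAX_ESTIMATES + 1):
        # The post does not say what happens at exactly 15 seconds;
        # this sketch treats it as "too long" and retries.
        if estimate_seconds() < MAX_STUN_SECONDS:
            stun_and_apply()
            return attempt, False
        sleep(RETRY_WAIT_SECONDS)

    # 10th go: no more estimating, stun for however long it takes.
    stun_and_apply()
    return MAX_ESTIMATES + 1, True
```

On a busy server where the estimate never drops below 15 seconds, this model burns through nine estimates and 90 seconds of waiting, and then stuns the machine anyway.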

As you might imagine, as a SQL Server DBA with high-transaction applications, I’m not 100% happy with my precious being slapped about or otherwise Tasered just because someone went and spent all that time and effort doing an on-prem MPC.

VMware themselves have admitted that if you’re running busy SQL Servers on their software, you probably shouldn’t use their backup process and should use an agent-based solution on the guest instead (Section 3.8.2.5 refers – and yes, the numbering sequence in the document is truly messed up).

Virtual Machines. VMs. Guests. Hypervisees.

Be careful what you wish for…