Linked-up thinking

Last weekend, I had the pleasure (and, let’s face it, sometimes pain) of being a co-organiser of a full-day Azure Synapse conference – Data Toboggan.

We run three of these events per year: a Winter Edition (the one just gone), a Cool Runnings Edition in the summer, and an Alpine Coaster Edition towards the end of the year. This last one is a little bit different to the others. Maybe I’ll go into more details another time (or maybe you could just attend !)

There were over 350 sign-ups for the last event, and over 100 actual attendees. It was split into three ‘rooms’ called ‘Azure’, ‘Synapse’, and ‘Analytics’ (see what we did there ??), so there were plenty of meeting-on-hold shenanigans going on and it got a bit hectic (well, for the event team it did – apparently everyone else thought it was really slick).

Sidebar: Please please please upvote this Teams Feature Request – it’s to let attendees self-select breakout rooms (a feature that’s already in other collaboration platforms). Thank you.

Anyway, what was I saying ? Oh yes, conference stuff.

What’s great about online events (other than attending in your PJs from wherever you happen to be) is that people post helpful comments and other things in the chat if you let them. Yes, opening up chat in meetings can be a bit scary, but if everyone knows there’s a Code Of Conduct like this one and the organisers ensure compliance, it can bring rewards !

One thing I thought of at this last event was that there were some *awesome* additional resources and references that people posted in the chat window. If you did that, THANK YOU ! because that’s what a sharing community is about.

Problem is, when you leave the meeting, you lose access to the chat. I guess that’s a Teams ‘feature’ that we’re all going to have to live with. As an organiser, I’m *not* going to create identities in our M365 tenant for people to use because Yikes That’s A Lot Of Work. It seems to work fine with the speakers being Guests in our tenant and everyone else is an attendee…

Anyway, what was I saying ? Oh yes, conference stuff.

So, what follows at the end of this post is *all* of the additional resources, web pages, github repos and the like that were posted into any of the chats during the day. They’re not in any order. Well, actually they are – you get bonus points and a reasonable amount of Kudos for working out what it is. They may even all open in new windows, but who knows ? I loves me a bit of jeopardy, I doos.

Hopefully that might be useful to some people. It would be a shame to lose all that additional insight and further reading just because the event isn’t on any more.

Thanks for reading, and maybe catch you at the next Data Toboggan ? Or SQL Bits, perhaps ? The Data Toboggan team will all be there 🙂

Back soon….

https://learn.microsoft.com/en-us/azure/synapse-analytics/guidance/security-white-paper-network-security

https://www.kneedeepintech.com/

https://learn.microsoft.com/en-us/certifications/azure-data-fundamentals/

https://learn.microsoft.com/en-us/certifications/azure-data-engineer/?WT.mc_id=DP-MVP-5004032

https://learn.microsoft.com/en-us/certifications/azure-enterprise-data-analyst-associate/?WT.mc_id=DP-MVP-5004032

https://techcommunity.microsoft.com/t5/microsoft-learn-blog/coming-soon-the-new-azure-enterprise-data-analyst-associate/ba-p/3254422?WT.mc_id=DP-MVP-5004032

https://www.databricks.com/learn/training/login

https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4MbYT

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-openrowset

https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/understand-synapse-dedicated-sql-pool-formerly-sql-dw-and/ba-p/3594628

This PDF

https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/how-to-use-cetas-on-serverless-sql-pool-to-improve-performance/ba-p/3548040

https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/synapse-spark-delta-time-travel/ba-p/3646789

https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/monitoring-synapse-serverless-sql-pool-query-history/ba-p/3710423

https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/monitoring-serverless-sql-ended-requests-by-using-log-analytics/ba-p/3650383

https://learn.microsoft.com/en-us/azure/synapse-analytics/migration-guides/

https://techcommunity.microsoft.com/t5/fasttrack-for-azure/azure-synapse-compatibility-checks/ba-p/3603939

https://learn.microsoft.com/en-us/azure/synapse-analytics/synapse-link/synapse-link-for-sql-known-issues

https://modelstorming.com/templates

https://github.com/brianbonk/public/tree/master/Speaks/2023/2023-01-28%20Data%20Toboggan

https://learn.microsoft.com/en-us/azure/cosmos-db/synapse-link?context=%2Fazure%2Fsynapse-analytics%2Fcontext%2Fcontext

https://github.com/MicrosoftLearning/DP-203-Data-Engineer

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html

https://stackoverflow.com/questions/62260468/convert-databricks-noteboooks-dbc-into-standard-py-files

https://github.com/microsoft/Azure_Synapse_Toolbox

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-log-analytics

http://aka.ms/SynapseCommunityResources

https://github.com/pawelpo/presentations

https://www.youtube.com/c/DataToboggan

https://detective.kusto.io/

https://events.sqlbits.com/2023/training-days

Synapse Analytics and Synthetic Data

‘Fake it ’til you make it’

I love it when people provide you with something that lets you get started and is easy to implement. Azure Synapse Analytics Database Templates are (in my opinion) a bit short of the mark, though. They are only as good as what someone else thinks is good – for example, the Consumer Goods template has tables for Customer, Communication, and Warehouse, but none of the columns look like they’re for a physical address (and don’t get me started on the Transaction table).

That said, they still have value – if you’re just poking around and getting used to the environment they’re a great place to start. I’ve already talked about the Map Data tool to get your data into the templates (that’s an in-person only type thing – not something I think I can write down because it would be a loooooong read), but what if you didn’t have to ? What if there was a way of generating synthetic data and using that instead ?

Sidebar: Synthetic Data is one of the new hot topics. Generally, products get all excited about training Machine Learning models, and using them to generate fake data in a consistent manner. Opinion: This actually leans more into good RegEx patterns than ML, but let’s not go there right now.

Helpfully, there is a resource in GitHub that will generate data in your chosen empty template 🙂 It doesn’t even need to know the schema because it works that out itself – it just needs to know the name of your Lake database !

Oh, and it also needs you to set up some stuff and assumes that you know how to do something that’s not immediately apparent when you have set up the stuff. So close !!

The data is (are ?) generated by a Python notebook that’s provided, but it requires you to set up a Spark Pool to run the notebook, and also to make sure that a specific Python library is available at runtime.

OK, let’s take a look at the mechanics…

Here’s a bright, shiny, empty Consumer Goods lake database that was created from the Consumer Goods template (duh):

As mentioned, we’re going to need a Spark Pool to run the Notebook. It doesn’t need to be massive, but if it’s too small (not enough cores or RAM) generating the synthetic data will take yonks, and Spark Pools are billed by the hour…

Head over to the ‘Manage’ icon, then ‘Spark Pools’ and select ‘+ New’:

The recommended configuration is in the GitHub documentation, so we’ll just use that.

Go right ahead and create the Spark Pool.

We’ve read the documentation that says we need “A Spark pool with the python library ‘Faker’ installed (simplest method is to have a requirements.txt with a line of text with the library name “faker” loaded in the cluster configuration). ” We have just created the Spark Pool. The simplest method would actually be to explain how you do that in the markdown. Here goes…

Create a text document called ‘requirements.txt’ on your device’s filesystem. Open it, and type the word ‘Faker’ on the first line, with no leading or trailing spaces or other whitespace that might mess you up (it’s not YAML, but good practice is good practice, right ?). Save that file and remember where you saved it. <– I can’t always remember where I saved the document I just created 🙂

We now come across another one of those special moments in Azure where things aren’t where you’d expect them to be. What we want to do now is configure our Spark Pool to always load the Faker library on start-up, so you’d think that would be in the configuration blade…

It’s not. It’s here:

We need to enable session packages in the Spark Pool (they’re disabled by default) and tell it what our requirements are for default packages to apply at startup by uploading the requirements.txt file we created earlier:

When the file is uploaded, hit ‘Apply’ and possibly put the kettle on. It can take a few minutes to apply. My advice is to keep the Notifications blade open and do *nothing* until you see the ‘success’ message 🙂

So now we have a Spark Pool with Faker pre-initialised / loaded / installed / whatever the term is. Next step is to copy the notebook contents from the GitHub resource and paste them into a new notebook in Synapse Studio.

Once you’ve got the notebook created (I called mine ‘Synthetic Data Generator’), attach it to your Spark Pool, and change the database name in the first code cell to the name of your shiny new empty templated Lake database:

When we run the first code cell, it will detect the schema in your database (tables, columns, datatypes, etc) and use those in the next cell to generate the data.

Note: These next bits can take quite some time. It’ll take a few minutes to spin up the Spark Pool, and generating the data takes a while as well.

Yeah, OK. That’s just the first code cell. Feel free to run the second, but be prepared to wait *even* longer… at least we don’t have to wait for the pool to start up…

When that’s completed (my, that’s a lot of jobs!), we have successfully generated quite a few rows in each table. It’s not 100% clear how much data can be faked using this method, but a few hundred thousand rows seems like a reasonable data set.

Going back into the database, we can see that there’s data there, and that there is *some* referential integrity preserved through relationships in the schema, but since there are no foreign keys in the schema, they’re not enforced in the generated dataset. For example, there are no customers that do not also have a transaction recorded (which you’d expect from a store)…

… but there are transactions that have customer ids that are not in the customer table…

That’s probably more to do with the fact that the schemas in the templates do not enforce referential integrity, and that’s probably due to the fact that the data is stored in block blobs in your DataLake Gen 2 storage, but I guess you already knew that… 🙂
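If you want to check the generated data for yourself, a serverless SQL query along these lines will show the orphans. This is a quick sketch only – the table and column names are my guesses at the Consumer Goods template, so adjust to suit your own Lake database:

-- Sketch: transactions whose customer id has no matching Customer row
-- (table and column names are assumptions - check your own template)
SELECT COUNT(*) AS OrphanedTransactions
FROM dbo.[Transaction] AS t
LEFT JOIN dbo.[Customer] AS c
    ON c.CustomerId = t.CustomerId
WHERE c.CustomerId IS NULL;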

So that’s just a quick look at creating synthetic data in Azure Synapse Analytics Database Templates.

Hope that’s useful ?

Back soon….

Sticky labels

Azure Synapse Analytics Dedicated SQL Pools – that’s a whole load of fun right there. Practically limitless compute and data, yumyumyum…

Apart from all the Data Science things that Azure Synapse enables, those lovely people at Microsoft have added a feature that on the face of it looks pretty innocuous but will, I promise you, prove to be very useful down the line. Especially when it comes to tracing who or what did what and how. That feature is called ‘labels’, and it’s available in Dedicated and Serverless SQL Pools (according to the documentation – I’ve only ever used it in Dedicated).

Underwhelmed ? I thought so. Let’s walk through this together…

We’re all familiar with the standard OPTION keyword that’s available across all supported SQL Server versions (on-prem and cloud) that lets you supply a query hint to the database engine (this example yoinked straight off the internet):

USE AdventureWorks2012;
GO
SELECT ProductID, OrderQty, SUM(LineTotal) AS Total
FROM Sales.SalesOrderDetail
WHERE UnitPrice < $5.00
GROUP BY ProductID, OrderQty
ORDER BY ProductID, OrderQty
OPTION (HASH GROUP, FAST 10);
GO

So far so yawn, but for Azure Synapse and PDW, there’s an additional one – LABEL:

SELECT * FROM sys.tables
OPTION (LABEL = 'My Query Label');

Also a bit yawn. But here’s where it gets interesting. You see, adding a label to a query adds that label to a DMV, so you can:

SELECT * FROM sys.dm_pdw_exec_requests r
WHERE r.[label] = 'My Query Label';

Note, this is sys.dm_pdw_exec_requests. It’s not in the normal sys.dm_exec_requests that we all know and love. Here’s hoping they add it to lower versions of SQL Server soon 🙂

This means that (if there’s a will and a coding standard to adhere to) it is possible to label executions and identify them right in the database engine.

For example, you might want to label each step of an ELT process that’s done in the database so you can see the performance of each step ? Simples – OPTION (LABEL = 'FunkyELT'); or even OPTION (LABEL = 'FunkyELTStep1'); (followed by a separate label for each step / statement).

Maybe you want to track who’s doing what ? Again, OPTION (LABEL = 'DaveDidThis');

Or maybe refer back to the Jira tickets that brought the statement into being (or tuned it when it turned out to have a sub-optimal construct). Duh – OPTION (LABEL = 'JiraTicketRef');

Tracing them back is now as easy as:

SELECT * FROM sys.dm_pdw_exec_requests r
WHERE r.[label] = 'YourLabelHere';
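And if you’ve used a common prefix per step (like the FunkyELT example above), the same DMV gives you per-step timings too. A rough sketch – the prefix here is obviously just whatever you chose:

SELECT r.request_id, r.[label], r.[status],
       r.start_time, r.end_time, r.total_elapsed_time
FROM sys.dm_pdw_exec_requests AS r
WHERE r.[label] LIKE 'FunkyELT%'
ORDER BY r.start_time;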

OK, OK, so it’s not a silver bullet, killer app, or anything – but as someone who’s always interested in what’s going on under the hood, a small tweak that helps you group stuff or uniquely identify a specific workload is *very* welcome !

Back soon…

Peas

“I eat my peas with honey;
I’ve done it all my life.
It makes the peas taste funny,
But it keeps them on the knife.” – Anonymous
 
 

I’m the sort of person who hoards random leftover stuff in the toolshed because reasons. One reason is that they’re useful (wrenches). Another is that I use them a lot (screwdrivers). Yet another is that I know other people use them, so I’ve got one in case I need one (chisels).

Perhaps the most maddening reason is that there’s some stuff I have just because someone offered it to me and I just couldn’t say no – like the bench-mounted double grinder (I don’t have a bench) and a mitre box. When I think something might be cool or useful, it goes in the shed, perhaps never to be used, but I’d rather have a tool that I might need than not have a tool I do need.

With data, there are just so many tools though ! Knowing which one to use is difficult at the best of times – they all do one thing very well and other things beside – and it sets off my anxiety just thinking about the range of possibilities.

Helpfully, there’s someone in the data community (isn’t there always) who takes it upon themselves to document all the available options to make choosing a tool easier. In this case, it’s Matt Turck. Every year for the past five years, maybe longer, Matt has produced a handy infographic of the Big Data Landscape, accompanied by a very insightful blog which you should make a point of reading. This is the picture for 2016, and this is the one for 2021. Please take a look at them before reading further…

The thing that’s most striking about these images is how much more there is in the later one. The explosion in data and analytics over the last few years, coupled with the availability of more open-source software has led (in my view) to a bewildering array of options for data – so much so that it’s probably easier to stick with what you know.

Making the right technology choice for your thing is important. Making the right tooling choice for manipulating, moving, transforming, aggregating, tagging, reporting, analysing, and visualising your data is also very important. And just as hard, if not harder.

Imagine designing a system in a NoSQL datastore like Apache Cassandra, only to realise that reporting is *not* as simples as a SELECT * FROM table JOIN otherTable ON commonColumn, and you should really redesign your entire keyspace to get that kind of data materialised up front (please, no comments on the relative merits of Materialised Views in Cassandra – that can only end badly).

My point here is that sometimes a plethora of choices can lead to suboptimal decisions. Tooling for the data professional is not ‘When all you’ve got is a hammer, everything starts to look like nails’ – it’s a delicate balancing act between function and usability. Not everyone is going to know every tool. No one expects them to, except possibly recruiters, and nor should anyone denigrate one tool in favour of another to another potential user.

It’s a personal choice, unless your org has a specific toolset, but I’m guessing that most larger orgs won’t have one that covers all of the current tech stack (or if they do, it’s so out of date it’s not funny any more).

The choice comes down to whether you continue to eat peas with honey, buy a spoon, borrow a fork, or stop eating peas in the first place.

Back soon….

Washing up

I’ve always hated washing dishes.

This comes from early childhood when, on Sundays, my sibling and I would be banished to the kitchen after the Family Dinner to perform the washing up, one parent having toiled for what seemed like the whole day to prepare the meal and the other having done nothing more arduous than read the newspaper and be older.

There are 3 specific things I hate about washing up:

  1. When there are things hidden in other things, like cutlery in a pan that has a lid on.
  2. When someone drops something into the bowl while you’re washing another.
  3. When someone adds items to the queue of things to be washed after foraging elsewhere in the house.

I guess I hate these things because they change the planned task. You know how many pots there are to wash, you plan what order you want to wash things in, and you set about executing that plan. Then you find the hidden stuff that means there’s more to wash than originally planned. And then some inconsiderate git (a sibling / parent) nonchalantly adds to your workload without so much as a ‘by-your-leave’, which means your plan is now broken, but you still have to do the pots.

Fast forward some <mumble> years, and I’m a DBA working with transactions in SQL Server. And it seems me and SQL Server have something in common.

Imagine a workload submitted to SQL Server. An execution plan will be generated based on statistics and data *at the time the request was submitted*, and SQL Server will attempt to execute that plan.

Now, imagine that one of the elements in the workload is a table variable. SQL Server will always assume that the row count of that table is one (actually, that’s no longer true, but the change is recent and it spoils my illustration). Table variables are the closed lunchbox of the washing up – you only know what’s in there when you have to open it to clean it. If someone else has crammed stuff in there, that will have a knock-on effect on your ability to complete the work as planned. Same for SQL Server – processing more rows than planned can be expensive.

Imagine that the workload has a lookup that uses a large table with wildly out-of-date statistics. When SQL Server comes to that part, the assumptions made when the execution plan was generated will be wrong, and there’s a whole lot more lifting-and-shifting to be done than was originally planned.

Also imagine that the workload batch-processes data from a table that is constantly being added to. At the time the plan is decided on, there may be only 50,000 rows in the table, but while the batches are being fetched, a further 200,000 rows are added (yes, I know it’s a contrived and unlikely example) and those new rows will also need to be processed before the workload can complete.

With SQL Server, there are a number of things you can do to avoid this, though:

  1. Use table variables sparingly, and only when you are sure of the size and content. In my experience anything more than 1000 rows in a table variable could lead to complications, so I’d use a temporary table instead (there’s a quick sketch of the swap just after this list). *other experiences are available
  2. Keep your table statistics up to date. Whether you do this via UPDATE STATISTICS or whatever method, I don’t mind, but keep them up to date – your batch processes will thank you later.
  3. Be mindful of transaction management. Make sure you know how much data you’re picking up and how to ignore late-arriving data. Explicit transactions or transaction isolation levels such as RCSI might help you out, but as ever ‘It Depends’.
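To put point 1 into something you can paste and play with, here’s a minimal sketch of the swap – same data in both, but only the temporary table gives the optimiser real row counts (and statistics) to plan with:

-- Table variable: the optimiser has historically assumed 1 row, however many you put in
DECLARE @Stuff TABLE (Id int PRIMARY KEY, Payload nvarchar(128));
INSERT INTO @Stuff (Id, Payload)
SELECT object_id, name FROM sys.objects;   -- could be thousands of rows

-- Temporary table: gets statistics, so the plan reflects the actual row count
CREATE TABLE #Stuff (Id int PRIMARY KEY, Payload nvarchar(128));
INSERT INTO #Stuff (Id, Payload)
SELECT object_id, name FROM sys.objects;

SELECT s.Id, s.Payload
FROM #Stuff AS s
JOIN sys.objects AS o ON o.object_id = s.Id;

DROP TABLE #Stuff;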

With washing up, there is only 1 thing you can do – get a dishwasher.

Back soon…

World Azure Synapse Day ’21

So, while there’s been all this stuff going on, a couple of data friends and I started a conference series and a user group both focused on Azure Synapse Analytics. Because, you know.. you can never have too many online conferences, training sessions, and presentations, right ?

We’ve probably all attended online lerns recently, been to meetings online, chatted with friends and family over new and sometimes unfamiliar apps like Teams and Zoom (yes, the nerds have been using them for a while, but collab apps just went *wild* recently).

But we (as in me and the aforementioned data friends) wanted to do something different. Something more…. interactive. More…. conversational. Less “I’ll sign up and download the recordings later, maybe“. And so, somewhat conceitedly, we came up with ‘World Azure Synapse Day 2021’, where we can share, chat, learn, and laugh (maybe) to round out the year. A sort of retrospective on the future, if you like.

Full disclosure: we didn’t come up with the format ourselves. A few months back, a company called DataStax did something similar for the worldwide Apache Cassandra community, where they had short talks from the engineering team, the sales people, and their marketing folks. We’re aiming for something waaaayyyy less corporate, and more about you 🙂 We just borrowed the good bits and left out the sales pitch.

So we’re doing three sessions on the day, timed to hopefully coincide with people’s availability in each of APAC, EMEA, and AMER time-zones. We hope it will feel a bit more like a meeting than a presentation, as there’ll be several short talks and interviews in each. The sessions will *not* be recorded (sorry, you’re going to have to be there to see it !) – it’s really important to us that each session flows and everyone can participate, and recording that doesn’t seem fair. We’re also not even publishing an agenda, just a schedule of who will be talking at each session – we want the speakers to speak for themselves in every way possible – so there’s no pressure on anyone to ‘get it right first time’, and each and every contribution is welcome.

If you want to attend (and we hope you do), please sign up to the MeetUp for Data Toboggan here. There’s also a code of conduct under the heading ‘Be Excellent To Each Other’ that we’d ask you to adhere to, but that’s all you have to do. If you want to have your voice heard, just submit some thoughts through Sessionize, and we’ll be in contact. And don’t hesitate to ask any question you like in the MeetUp group – we’ll make sure you get a response 🙂

Hope to see you at Data Toboggan – World Azure Synapse Day 2021 ! Until then, stay safe.

The Data Toboggan Team

Month / End ?

We’ve all been there. Month End. What a nightmare ! So much data processing ! So much nervousness – ‘Will the servers hold up ? Is there enough storage ?” – that senior managers have that kinda hang-dog look on their faces and have things crossed that they didn’t even know could be crossed….

You hope, nay you pray that everything works out OK, that no-one gets called out in the middle of the night because ‘OMG it fell over !’, and that the C-Suite gets all their lovely reports on time.

And when it’s all done, everyone breathes a collective sigh of relief, calms down, and forgets about all the anxiety – until next month end.

……

Batch processing has been The Way ever since there was a need for processing large amounts of data. It used to be the reason why cheques took a while to clear, and your bank balance took a day or two to update (there are still no BACS runs on certain days, which is why your bank balance doesn’t change at weekends). Processing data in this way is great for large amounts of data. It’s a well-trodden path to gather up all your data at the end of the month and then work out what happened and how that affected your bottom line.

But maybe, just maybe, there’s a second Way.

Reporting, especially financial reporting, is essentially counting things up and (if you’re really clever) doing some other math on those numbers. Plain and simple – it’s just sums of numbers. Things that have been counted and added up.

Once something has happened, it can’t unhappen. You can’t untake a breath. You can’t uneat breakfast. Sure, there sometimes can be another event that reverses one action (eww), but essentially it remains done. The number of heartbeats you have had increases all the time (until something tragically prevents that happening). Yes, the number of beats per minute fluctuates and the number of beats per day is similarly fluid, but the counter is incremental and the variations are just counts with context. In financial transactions, money leaves an account and arrives in someone else’s. Sometimes (not frequently enough) it’s the other way round. In stocktaking, inventory items are taken and used, and replenished from somewhere else. In business, you gain some customers and you lose some customers.

All of these things cannot un-happen. They are all single, unrelated events. One does not cause the other, rather one triggers an action that causes the other.

So why wait until month end to count them and do the complex math over potentially massive data sets and make everyone feel that month-end anxiety ? Maybe, counting and math-ing each day would be less of a strain (in both compute and mental terms), and then your month-end would be just summing all the counts and making pictures, no ?
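In SQL terms, the idea is nothing fancier than a nightly roll-up into a small summary table, so month-end becomes a sum over a handful of rows. A sketch only – the table and column names here are completely made up:

-- Nightly: roll yesterday's transactions into a daily summary table
INSERT INTO dbo.DailySales (SalesDate, ProductID, TotalQty, TotalValue)
SELECT CAST(t.TransactionDate AS date), t.ProductID, SUM(t.Qty), SUM(t.LineValue)
FROM dbo.SalesTransactions AS t
WHERE t.TransactionDate >= CAST(GETDATE() - 1 AS date)
  AND t.TransactionDate <  CAST(GETDATE() AS date)
GROUP BY CAST(t.TransactionDate AS date), t.ProductID;

-- Month-end: just maths over the daily summaries, no heavy lifting
SELECT ProductID, SUM(TotalQty) AS MonthQty, SUM(TotalValue) AS MonthValue
FROM dbo.DailySales
WHERE SalesDate >= DATEFROMPARTS(YEAR(GETDATE()), MONTH(GETDATE()), 1)
GROUP BY ProductID;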

We’re not talking Stream Analytics here, just some sensible use of overnight compute resource. Cloud computing considerations come into play here – it may be more cost-effective to have one massive processing event every month than do 28+ shorter processing events (given capacity-on-demand, it’s a thought) – but on-premises you’re already paying for the servers to be up (with their fixed compute resources and licence fees) – why not look into using them differently ? Smaller, more frequent batches could save you time, money, and anguish.

Maybe then people would relax about month-end because it’s just maths and not data movement and cleansing that could go wrong. That ‘rinse-and-receive’ process happens every day and is easier to fix if it goes wrong (if you design it to be robust and recoverable).

Who knows, maybe the C-Suite reports would be available sooner ? And that’s a win, right ?

Back soon…

Sometimes, you just have to write it down.

A while back, I was looking into a very specific problem with SQL Server Database Mirroring. Yes, I’m aware that Availability Groups are now all the rage. I *did* say it was a while back.

Anyway, I keep getting asked about how database mirroring works, and what with remote working being the order of the day, I thought I’d write it up. Or down. Whatever. This very specific problem was that although the Principal had moved, connections from an application had not.

Let us begin….

Assume we only have one application server. Let’s call that AppServerA.

For this scenario, we have a SQL Server database that is protected from hardware failure by High Safety (Automatic Failover) Database Mirroring.

For that to work, we need 3 servers:

One to act as the principal to be the active database. Let’s call that DBServerA.

One to act as a standby in case DBServerA fails for some reason. Let’s call that DBServerB.

One to act as a witness to the database mirroring session to keep track of which server is acting as the principal, and to arbitrate which server should be the principal in the event of either of the other servers failing. Let’s call that DBServerC.

It is important to remember at this point that AppServerA does *not* know about DBServerC – it has no need to.

In the setup for database mirroring, AppServerA needs a connection string to set up the initial connection to whichever server is the current principal. This is done by adding parameters to the connection string that AppServerA has within its locally stored configuration.

The connection string contains “Server=DBServerA,50505; Failover_Partner=DBServerB,50505”. This tells AppServerA to attempt a connection to DBServerA on TCP port 50505, and in the event of that being unsuccessful, AppServerA should attempt a connection to DBServerB on TCP port 50505. As per best practice, you’ll note that a custom port is used. It’s not massively important what the port number is right now, it’s just one that’s available.

In this way, the *initial* connection to the database should always be directed to the active principal.

In the scenario, AppServerA successfully connects to DBServerA on the first attempt. What happens then is DBServerA *updates* the information for the “Failover_Partner” that AppServerA originally got from its connection string to whatever DBServerA has recorded in its internal tables. In this case, the partner information is stored as DBServerB\INSTANCE01, not as DBServerB,50505.
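You can see this for yourself by querying sys.database_mirroring on the principal – note that the partner is held as an instance name, separately from its TCP endpoint address, rather than as the ‘server,port’ string the application was given:

SELECT DB_NAME(database_id)       AS [Database],
       mirroring_role_desc,
       mirroring_partner_name,     -- TCP address of the partner's endpoint
       mirroring_partner_instance  -- server\instance name, e.g. DBServerB\INSTANCE01
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;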

At this point, this is of no consequence, as DBServerA is up and functioning correctly. If DBServerA fails and SQLServer on that server is not contactable, AppServerA drops all its existing connections and attempts new connections using its connection string to use DBServerB as the principal. When DBServerA is available again, it assumes the role of the standby (mirror) in case DBServerB fails.

The application on AppServerA is using connection pooling. Connection pooling saves AppServerA resources (CPU and memory) by re-using a connection to the database if one already exists – the initial connection is *not* dropped and recreated, it is simply reset by the next application call that needs a connection to the database. There is a timeout value for idle connections in the connection pool that will drop unneeded connections. At this point, the original connection information has come from DBServerA and has DBServerB\INSTANCE01 as the failover partner.

Imagine that for some reason, there was an interruption in communication between DBServerA and DBServerB. There will be an event recorded in DBServerB’s logs that indicates that the connection between the two servers had been down for 10 seconds or more. In this situation, DBServerB contacts the witness (DBServerC) to confirm that DBServerC can contact DBServerA.

If the witness confirms that DBServerA is online and contactable, DBServerB remains as the standby and data that needs to be transferred from DBServerA will be queued on DBServerA until the link between it and DBServerB is restored.

If the witness confirms that DBServerA is indeed unavailable, DBServerB assumes the role of principal and data that is required to be sent to the standby is queued until DBServerA comes back online. When DBServerA comes back, it assumes the role of standby until such time as DBServerB becomes unavailable.

The specific problem that I was asked to look into was that although DBServerB became the principal (as designed), connection pooling on AppServerA meant that the connections to DBServerA were not dropped, merely recycled. Connectivity between AppServerA and DBServerA was *not* interrupted in the same way as between DBServerA and DBServerB. It is not known at this point what the threshold for idle connections to be dropped from AppServerA’s connection pool is, but it is assumed to be longer than the 10 seconds that it took for DBServerB to assume the principal role. Did I mention that AppServerA is super-massively chatty ? No ? Well, it is, because Developers.

This is where the distinction between ‘DBServerB,50505’ and ‘DBServerB\INSTANCE01’ becomes important. The former is a direct request to a specific TCP port on DBServerB, the latter is a request for DBServerB to provide the TCP port number for the instance named INSTANCE01. To serve this request, AppServerA issues a request to the SQL Browser service on DBServerB to do the instance name to port number translation.

The SQL Browser service (in *all* installations of SQL Server, not just those for database mirroring setups) listens on UDP port 1434 for requests.

On many firewalls, port 1434 (all protocols) is not open, so any request to UDP 1434 from AppServerA would get blocked by the firewall and not succeed.

So, given that the initial successful connection to DBServerA replaced the partner information of ‘DBServerB,50505’ with ‘DBServerB\INSTANCE01’, any failure of connection from that point on would result in AppServerA attempting to connect to ‘DBServerB\INSTANCE01’ in the event of DBServerA being unavailable.

This attempt would not succeed because of the firewall rules.

Even when DBServerA becomes available again, it is now functioning as the standby and so cannot serve data requests. If the connection is reused from the connection pool, AppServerA will still be trying to contact DBServerB on UDP port 1434 for the instance to port information. Only when DBServerA becomes the principal again will connections from AppServerA be successful if those connections are coming from an application pool that has not been restarted or otherwise cleared.

There are several ways to recover from this condition. You can bounce DBServerA, bounce the SQL Server instance on DBServerA, or recycle the application pool on AppServerA.

I favour the latter, because there’s NOTHING WRONG with any of the SQL Servers.

And for once, it’s *not* DNS…

Absolutely stunning

Virtual Machines. VMs. Guests. Hypervisees.

Everything’s gone virtual, and we *love* it. We get unlimited power over unlimited compute resources, and in IaaS/MPC a nice familiar GUI so all the Windows SysAdmins feel like they’re contributing.

I’ve never been much of a fan of running SQL Server on virtual machines from waaay back – it used to be touch and go whether your server would stay up, and it wasn’t really clear whose fault it was when it went down. Thankfully, things have moved on: virtualisation is a mature discipline, software is now built to run anywhere, and the location and physical existence of a server are not things we need to worry about right now.

So, assuming we want to do things the hard way and set up a managed private cloud on-prem so the Windows SysAdmins don’t have to worry about networking and firewalls, let’s create an estate of VMWare hypervisors and run *lots* of Windows guest VMs on them. Super. All tickety-boo.

We’re going to want backups with that so let’s leverage all that VMWare goodness, like snapshotting disks once they’ve been quiesced so we get a consistent backup. Hell, let’s live life on the edge and NOT quiesce the OS drives – you know, for fun !

We now have Windows servers (don’t care where or how) and we’re going to put SQLServer on some of them. Don’t worry, it’ll be fine. Trust me. We’ve moved on. Tell you what, we won’t back up the disks your MDFs, LDFs and NDFs are on. Happy now ? Good.

You might very well think that all is well in the world. Your servers are serving, your users are using, and your apps are…. apping ? Whatever. And it’s all being backed up and the DR people are happy.

You are fooling yourself.

Let’s take a closer look at what VMWare actually does when it takes a snapshot.

To get a consistent backup while the OS is running, VMWare uses the Windows Volume Shadow Copy service to create the snapshot. IO requests are then queued somewhere (I’m not *that* into the nuts and bolts of this) while the backup is taken, and then the snapshot is removed.

It is in this removal phase that the queue of stuff that’s waiting to be written to the VM is applied WHILE THE MACHINE IS STILL RUNNING. And what does VMWare do to make sure your hyperactive VM takes a breather to allow those writes ? It ‘stuns’ the machine – the digital equivalent of giving your server a quick slap in the face and telling it to calm the F down. It may need to stun the server several times to complete the outstanding writes.

Don’t worry, though. All of this stunning is for very short periods, and there are checks to make sure it won’t have an impact, right ?

Again, you are fooling yourself.

What VMWare does *before* it starts to stun a server is estimate how long it will take to apply the pending updates. If it thinks that will take less than 15 seconds, it will go ahead (note: It normally takes milliseconds). If it thinks the operation will take more than 15 seconds, it waits for 10 seconds and then has another go.

VMWare will do this estimation process 9 times. On the 10th, it just gives up and gives your server a really hard slap (however long it may take to apply the changes) so that it can complete the job and move on. Your server will be stunned for however long it takes. End of. You can read more about it here. The example quoted is 403 seconds, or a little over 6.5 minutes in old money.

As you might imagine, as a SQLServer DBA with high-transaction applications, I’m not 100% happy with my precious being slapped about or otherwise Tasered just because someone went and spent all that time and effort doing an on-prem MPC.

VMWare themselves have admitted that if you’re running busy SQL Servers on their software, you probably shouldn’t use their backup process and use an agent-based solution on the guest instead (Section 3.8.2.5 refers – and yes, the numbering sequence in the document is truly messed up).

Virtual Machines. VMs. Guests. Hypervisees.

Be careful what you wish for….

Hunting high and log

Sometimes, you just lose stuff. There. I said it.

Well, maybe not ‘lose’ as such, but database files can be put anywhere by anyone with permissions, and that’s not always a good thing. Keeping everything nice and Kentucky is a DBA trait, so how do you track down where things are ?

You can use WMI (oof), PowerShell (oofoof) or maybe even SQLServer DMVs (tell me more).

I’ve never really got to grips with PowerShell – it’s on my list, for sure – but there are some occasions where PowerShell might not work for you, for example in restricted environments where you don’t have access to the OS directly. Here’s a way of getting database and mount point space information if you only have ‘public’ access to master…

DMVs are one of my favourite things ever. There’s just so many of them these days and we’re getting more all the time. My *other* favourite thing is CROSS APPLY – also known as the ‘fairy join’ – because I don’t know what it does and yet I can use it to get information that might be hard to join up otherwise !

Let’s put these two to work together to get a list of databases files, where they are, and how much space there is free on the disk where they reside:

--All database mdf mount points and disk free space in MB
SELECT DISTINCT
    db_Name(f.database_id)          AS [Database],
    volume_mount_point,
    (total_bytes / 1024) / 1024     AS [Total MB],
    (available_bytes / 1024) / 1024 AS [Available MB]
FROM sys.master_files AS f
CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id)
WHERE f.file_id = 1;

 

Database        volume_mount_point   Total MB   Available MB
tempdb          H:\SQLTemp\            557733         545029
MyDB1           H:\SQLData\          12046999         346111
MyTestDatabase  H:\SqlData\          12046999         346111
OtherDB1        H:\SQLLogs\            334638         142694

 

You’ll see here that I’ve taken the easy route and used file_id = 1. That’s the MDF – the primary database file – and you *must* have one of those. Oh, and I can see that someone created a database with its MDF on a mount point intended for transaction logs. I’ll follow that one up later…

A small tweak, and you’ve got the same information for log files (file_id = 2, the LDF);
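For the record, that tweak is nothing more than changing the file_id filter – everything else stays the same:

--All database ldf mount points and disk free space in MB
SELECT DISTINCT
    db_Name(f.database_id)          AS [Database],
    volume_mount_point,
    (total_bytes / 1024) / 1024     AS [Total MB],
    (available_bytes / 1024) / 1024 AS [Available MB]
FROM sys.master_files AS f
CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id)
WHERE f.file_id = 2;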

 

Database        volume_mount_point   Total MB   Available MB
tempdb          H:\SQLLogs\            334638         142694
MyDB1           H:\SQLLogs\            334638         142694
MyTestDatabase  H:\SQLLogs\            334638         142694
OtherDB1        H:\SQLLogs\            334638         142694

 

That’s better – everything’s lined up nicely and where it should be, except that tempdb’s transaction log appears not to be on a dedicated disk. Another item for the backlog…

You might be wondering why I’ve used the DISTINCT keyword in the script. That’s because if you have multiple NDFs (secondary database files) you only really want the one entry back. I haven’t included a script for NDFs (file_id > 2) because I’m not in control of where *you* put them. Besides, giving you everything takes some of the fun out of it, don’t you think ?