Washing up

I’ve always hated washing dishes.

This comes from early childhood when, on Sundays, my sibling and I would be banished to the kitchen after the Family Dinner to perform the washing up, one parent having toiled for what seemed like the whole day to prepare the meal and the other having done nothing more arduous than read the newspaper and be older.

There are 3 specific things I hate about washing up:

  1. When there are things hidden in other things, like cutlery in a pan that has a lid on.
  2. When someone drops something into the bowl while you’re washing another.
  3. When someone adds items to the queue of things to be washed after foraging elsewhere in the house.

I guess I hate these things because they change the planned task. You know how many pots there are to wash, you plan what order you want to wash things in, and you set about executing that plan. Then you find the hidden stuff, which means there’s more to wash than originally planned. And then some inconsiderate git of a sibling / parent nonchalantly adds to your workload without so much as a ‘by-your-leave’, which means your plan is now broken, but you still have to do the pots.

Fast forward some <mumble> years, and I’m a DBA working with transactions in SQL Server. And it seems me and SQL Server have something in common.

Imagine a workload submitted to SQL Server. An execution plan will be generated based on statistics and data *at the time the request was submitted*, and SQL Server will attempt to execute that plan.
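
If you want to see the estimates SQL Server baked into a plan (and compare them with what actually happened), you can request the plan alongside the results. A minimal sketch – the query here is just a placeholder for your own workload:

```sql
-- Return the actual execution plan (as XML) alongside the results, so the
-- optimiser's estimated row counts can be compared with the actual ones.
SET STATISTICS XML ON;

SELECT name, create_date
FROM   sys.objects;      -- placeholder query; substitute your own workload

SET STATISTICS XML OFF;
```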

Now, imagine that one of the elements in the workload is a table variable. SQL Server will always assume that the row count of that table is one (no longer strictly true in recent versions, but the change is recent and spoils my illustration). Table variables are like a closed lunchbox in the washing up – you only know what’s in there when you have to open it to clean it. If someone else has crammed stuff in there, that will have a knock-on effect on your ability to complete the work as planned. Same for SQL Server – processing more rows than planned can be expensive.
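
To make the lunchbox a bit more concrete, here’s a hedged sketch (table and column names are invented). On versions without table variable deferred compilation, the optimiser estimates a single row coming out of the table variable however much has been crammed into it, whereas a temporary table gets its own statistics:

```sql
-- Hypothetical illustration: a table variable vs a temporary table.
DECLARE @Lunchbox TABLE (ItemId INT PRIMARY KEY, ItemName NVARCHAR(128));

INSERT INTO @Lunchbox (ItemId, ItemName)
SELECT object_id, name
FROM   sys.objects;               -- could easily be far more than one row

-- Without deferred compilation, the plan for this join estimates 1 row
-- coming out of @Lunchbox, no matter how many rows were inserted above.
SELECT l.ItemId, l.ItemName
FROM   @Lunchbox AS l
JOIN   sys.columns AS c ON c.object_id = l.ItemId;

-- A temporary table gets real statistics, so the estimate tracks reality.
CREATE TABLE #Lunchbox (ItemId INT PRIMARY KEY, ItemName NVARCHAR(128));

INSERT INTO #Lunchbox (ItemId, ItemName)
SELECT object_id, name
FROM   sys.objects;

SELECT l.ItemId, l.ItemName
FROM   #Lunchbox AS l
JOIN   sys.columns AS c ON c.object_id = l.ItemId;

DROP TABLE #Lunchbox;
```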

Imagine that workload has a lookup that uses a large table with wildly out-of-date statistics. When SQL Server comes to that part, the assumptions made when the execution plan was generated will be wrong, and there’s a whole lot more lifting-and-shifting to be done than was originally planned.
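
Checking just how stale those statistics are, and refreshing them, might look something like this – a sketch rather than a prescription, and the table name is made up:

```sql
-- How stale are the statistics on a (hypothetical) big lookup table?
SELECT s.name                  AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter  -- rows modified since the last update
FROM   sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE  s.object_id = OBJECT_ID('dbo.BigLookupTable');

-- Refresh them. FULLSCAN is thorough, but can be heavy on very large tables.
UPDATE STATISTICS dbo.BigLookupTable WITH FULLSCAN;
```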

Also imagine that the workload batch-processes data from a table that is constantly being added to. At the time the plan is decided on, there may be only 50,000 rows in the table, but while the batches are being fetched, a further 200,000 rows are added (yes, I know it’s a contrived and unlikely example), and those new rows will also need to be processed before the workload can complete.
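
One way to stop those late arrivals from dragging the run out is to pin the scope of the run before the first batch is fetched. A rough sketch, with invented table and column names:

```sql
-- Fix the upper boundary of the run once, before batching starts; anything
-- that arrives later waits for the next run.
DECLARE @HighWaterMark BIGINT,
        @LastId        BIGINT = 0,
        @NextLastId    BIGINT,
        @BatchSize     INT    = 10000;

SELECT @HighWaterMark = MAX(OrderId)
FROM   dbo.IncomingOrders;

WHILE 1 = 1
BEGIN
    -- Work out where this batch ends.
    SELECT @NextLastId = MAX(OrderId)
    FROM  (SELECT TOP (@BatchSize) OrderId
           FROM   dbo.IncomingOrders
           WHERE  OrderId > @LastId
             AND  OrderId <= @HighWaterMark    -- ignore late arrivals
           ORDER BY OrderId) AS nextbatch;

    IF @NextLastId IS NULL BREAK;              -- nothing left in scope

    INSERT INTO dbo.ProcessedOrders (OrderId, Amount)
    SELECT OrderId, Amount
    FROM   dbo.IncomingOrders
    WHERE  OrderId > @LastId
      AND  OrderId <= @NextLastId;

    SET @LastId = @NextLastId;
END;
```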

With SQL Server, there are a number of things you can do to avoid this, though:

  1. Use table variables sparingly, and only when you are sure of the size and content. In my experience, anything more than 1,000 rows in a table variable could lead to complications, so I’d use a temporary table instead (*other experiences are available).
  2. Keep your table statistics up to date. Whether you do this via UPDATE STATISTICS or whatever method, I don’t mind, but keep them up to date – your batch processes will thank you later.
  3. Be mindful of transaction management. Make sure you know how much data you’re picking up and how to ignore late-arriving data. Explicit transactions, or options such as snapshot isolation and RCSI, might help you out (there’s a short sketch after this list), but as ever ‘It Depends’.
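
On point 3, here’s the sketch mentioned above. It shows the RCSI / snapshot isolation options in their simplest form – the database and table names are hypothetical, and both settings deserve proper testing before you flip them in anger:

```sql
-- Database-level: read committed snapshot (RCSI), so readers work from a
-- consistent, versioned view instead of blocking on (or being surprised by)
-- concurrent writers.
ALTER DATABASE SalesReporting              -- hypothetical database name
    SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Session-level alternative: snapshot isolation for a specific piece of work.
ALTER DATABASE SalesReporting SET ALLOW_SNAPSHOT_ISOLATION ON;
GO

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

BEGIN TRANSACTION;
    -- Every read in this transaction sees the data as it was when the
    -- snapshot was taken, so late-arriving rows don't creep into the run.
    SELECT COUNT(*) AS OrdersInScope
    FROM   dbo.IncomingOrders;             -- hypothetical table name
COMMIT TRANSACTION;
```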

With washing up, there is only 1 thing you can do – get a dishwasher.

Back soon…

World Azure Synapse Day ’21

So, while there’s been all this stuff going on, a couple of data friends and I started a conference series and a user group both focused on Azure Synapse Analytics. Because, you know.. you can never have too many online conferences, training sessions, and presentations, right ?

We’ve probably all attended online learning sessions recently, been to meetings online, chatted with friends and family over new and sometimes unfamiliar apps like Teams and Zoom (yes, the nerds have been using them for a while, but collab apps just went *wild* recently).

But we (as in me and the aforementioned data friends) wanted to do something different. Something more…. interactive. More…. conversational. Less “I’ll sign up and download the recordings later, maybe”. And so, somewhat conceitedly, we came up with ‘World Azure Synapse Day 2021’, where we can share, chat, learn, and laugh (maybe) to round out the year. A sort of retrospective on the future, if you like.

Full disclosure: we didn’t come up with the format ourselves. A few months back, a company called DataStax did something similar for the worldwide Apache Cassandra community, with short talks from their engineering team, sales people, and marketing. We’re aiming for something waaaayyyy less corporate, and more about you 🙂 We just borrowed the good bits and left out the sales pitch.

So we’re doing three sessions on the day, timed to hopefully coincide with people’s availability in each of APAC, EMEA, and AMER time-zones. We hope it will feel a bit more like a meeting than a presentation, as there’ll be several short talks and interviews in each. The sessions will *not* be recorded (sorry, you’re going to have to be there to see it !) – it’s really important to us that each session flows and everyone can participate, and recording that doesn’t seem fair. We’re also not even publishing an agenda, just a schedule of who will be talking at each session – we want the speakers to speak for themselves in every way possible – so there’s no pressure on anyone to ‘get it right first time’, and each and every contribution is welcome.

If you want to attend (and we hope you do), please sign up to the MeetUp for Data Toboggan here. There’s also a code of conduct under the heading ‘Be Excellent To Each Other’ that we’d ask you to adhere to, but that’s all you have to do. If you want to have your voice heard, just submit some thoughts through Sessionize, and we’ll be in contact. And don’t hesitate to ask any question you like in the MeetUp group – we’ll make sure you get a response 🙂

Hope to see you at Data Toboggan – World Azure Synapse Day 2021 ! Until then, stay safe.

The Data Toboggan Team

Month / End ?

We’ve all been there. Month End. What a nightmare ! So much data processing ! So much nervousness – ‘Will the servers hold up ? Is there enough storage ?’ – that senior managers have that kinda hang-dog look on their faces and have things crossed that they didn’t even know could be crossed….

You hope, nay you pray that everything works out OK, that no-one gets called out in the middle of the night because ‘OMG it fell over !’, and that the C-Suite gets all their lovely reports on time.

And when it’s all done, everyone breathes a collective sigh of relief, calms down, and forgets about all the anxiety – until next month end.

……

Batch processing has been The Way ever since there was a need to process large amounts of data. It used to be the reason why cheques took a while to clear, and your bank balance took a day or two to update (there are still no BACS runs on certain days, which is why your bank balance doesn’t change at weekends). Processing data this way works well at scale. It’s a well-trodden path to gather up all your data at the end of the month and then work out what happened and how that affected your bottom line.

But maybe, just maybe, there’s a second Way.

Reporting, especially financial reporting, is essentially counting things up and (if you’re really clever) doing some other math on those numbers. Plain and simple – it’s just sums of numbers. Things that have been counted and added up.

Once something has happened, it can’t unhappen. You can’t untake a breath. You can’t uneat breakfast. Sure, there sometimes can be another event that reverses one action (eww), but essentially it remains done. The number of heartbeats you have had increases all the time (until something tragically prevents that happening). Yes, the number of beats per minute fluctuates and the number of beats per day is similarly fluid, but the counter is incremental and the variations are just counts with context. In financial transactions, money leaves an account and arrives in someone else’s. Sometimes (not frequently enough) it’s the other way round. In stocktaking, inventory items are taken and used, and replenished from somewhere else. In business, you gain some customers and you lose some customers.

All of these things cannot un-happen. They are all single, unrelated events. One does not cause the other; rather, one triggers an action that causes the other.

So why wait until month end to count them and do the complex math over potentially massive data sets and make everyone feel that month-end anxiety ? Maybe, counting and math-ing each day would be less of a strain (in both compute and mental terms), and then your month-end would be just summing all the counts and making pictures, no ?
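
In practice that could be as simple as a nightly summary table plus a month-end rollup over the summaries. A sketch, with invented table and column names:

```sql
-- Nightly: summarise yesterday's transactions into a small daily table.
INSERT INTO dbo.DailySales (SalesDate, OrderCount, TotalAmount)
SELECT CAST(OrderDate AS DATE),
       COUNT(*),
       SUM(Amount)
FROM   dbo.Orders
WHERE  OrderDate >= CAST(DATEADD(DAY, -1, GETDATE()) AS DATE)
  AND  OrderDate <  CAST(GETDATE() AS DATE)
GROUP BY CAST(OrderDate AS DATE);

-- Month end: no heavy scan of dbo.Orders, just sum up the pre-counted days.
SELECT YEAR(SalesDate)  AS SalesYear,
       MONTH(SalesDate) AS SalesMonth,
       SUM(OrderCount)  AS OrderCount,
       SUM(TotalAmount) AS TotalAmount
FROM   dbo.DailySales
GROUP BY YEAR(SalesDate), MONTH(SalesDate)
ORDER BY SalesYear, SalesMonth;
```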

We’re not talking Stream Analytics here, just some sensible use of overnight compute resource. Cloud computing considerations come into play – it may be more cost-effective to have one massive processing event every month than 28+ shorter processing events (given capacity-on-demand, it’s a thought) – but on-premises you’re already paying for the servers to be up (with their fixed compute resources and licence fees), so why not look into using them differently ? Smaller, more frequent batches could save you time, money, and anguish.

Maybe then people would relax about month-end because it’s just maths and not data movement and cleansing that could go wrong. That ‘rinse-and-receive’ process happens every day and is easier to fix if it goes wrong (if you design it to be robust and recoverable).

Who knows, maybe the C-Suite reports would be available sooner ? And that’s a win, right ?

Back soon…