CETAS ? Simples.

Like some people, I really love/hate the fact that things in the Azure Portal change.

It’s great to see new functionality being added to make things easier (love), but it means that you have to keep going back and updating previous content (hate), and all the stuff you learned you now have to re-learn (hate). On the plus side though, going back to Azure after a while and finding something new and fun – that’s a LOVELOVELOVE from me !

Having taken a break for reasons I won’t go into rn, recently someone asked me about using CSV and Parquet files in Data Lake Storage Gen2 with Synapse Analytics, and crafting queries and notebooks in a Synapse Workspace.

Naturally, you start with the OPENROWSET() feature and point your queries directly at the files, but this gets quite cumbersome with complex queries from multiple sources. It would be far easier to just use them like database tables – you know, old skool stylee.
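Just for context, a direct query over a file looks something like this – a minimal sketch, and the storage account, container and path are all placeholders, so swap in your own:

```sql
-- Query a Parquet file in ADLS Gen2 straight from the built-in serverless SQL pool.
-- Storage account, container and folder names below are placeholders.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/covid/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```

Fine for one file, but once you’re joining three or four of these together, it gets old fast.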

Again, quite naturally from that you start to look into Create External Table As Select (CETAS) and all the syntactic joy and heartache that CETAS can bring. Seriously, why isn’t there an easy way to do this ??
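To give you a flavour of the heartache, doing it by hand means an external data source, an external file format, and *then* the external table itself – something along these lines (every name and path here is made up, so treat it as a sketch rather than a copy-paste job):

```sql
-- Rough shape of a hand-rolled CETAS on a serverless SQL pool.
-- All names and paths are placeholders.
CREATE EXTERNAL DATA SOURCE MyDataSource
WITH ( LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer' );

CREATE EXTERNAL FILE FORMAT MyParquetFormat
WITH ( FORMAT_TYPE = PARQUET );

CREATE EXTERNAL TABLE [dbo].[MyExternalTable]
WITH (
    LOCATION = 'output/myexternaltable/',   -- where the result files get written
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = MyParquetFormat
)
AS
SELECT *
FROM OPENROWSET(
    BULK 'covid/*.parquet',
    DATA_SOURCE = 'MyDataSource',
    FORMAT = 'PARQUET'
) AS [source];
```

That’s a lot of ceremony just to get something that behaves like a table.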

Well….

…there is !😊

What follows is me getting rabbit-holed by stuff (easy, amirite?), and assumes that you’ve already linked the Bing COVID-19 sample dataset to your workspace:

* If you’ve also linked your ADLSGen2, you can do this from there as well, but let’s stick with something that’s easily reproducible.

A quick click on the ellipsis shows you ‘New SQL Script’, and following the arrow shows you… Create External Table (tadaaaa):

Happy-happy-joy-joy ! A thing to decomplexitate the complex thing !!

Right, let’s go… we’re gonna CETAS LIKE A BOSS !

A blade opens up on the right-hand side…

OK, so it wants to use the built-in serverless pool. That’s awesome – something else I don’t have to think about. Oh, but it wants a database to create the table in, and I don’t actually have a SQL database at the moment, so let’s create a new one from the dropdown…

Once that’s done, we can then select it in the dropdown. We need to give the resulting table a schema and a name (I lazily typed [dbo].[SampleData] because I’m not that creative), and there are a couple of interesting radio buttons after that – one to J*DI and give me a script that just shows the first hundred rows from my new external table, and one to show me a script that does all the work, and *then* shows the first hundred rows. Being an impetuous type, I J want to *ing D I, so I selected the automatic option.

*If you’re into scripting or want to re-use the code elsewhere, feel free to use the other option. It’s just not massively interesting at this point, is all I’m saying.

Boom ! We have a script for the first hundred rows, and when we run it, we get… 100 rows !
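In case you’re curious, the business end of the generated script boils down to something like this (your database, schema and table names will obviously differ):

```sql
-- Query the shiny new external table like any other table
USE MyDB;
GO

SELECT TOP 100 *
FROM [dbo].[SampleData];
```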

Awesome ! I can see that it’s using my new database ‘MyDB’ as well. Imma just head over to the Workspace page and look at my new database 🤩..

Where’s my new database ? I’ve got a Lake Database, sure, but shouldn’t I be seeing a SQL Database here by now ? Time for a coffee.

………………

I’m back ! Sorry, had to have a chocolate Hob Nob as well. Hmm, I don’t want to lose my open script, but there’s nowhere to refresh *just* the workspace pane..

Refreshing the whole page will add the node for SQL Databases (but you’ll lose any unsaved/unpublished edits), and now it appears:

You could also just add a new empty SQL Database, and that would get the SQL Database node created in the left-hand pane, but that’s kinda my point – Azure is sometimes a love/hate call, but whatever it is right now, it’ll probably have changed next time you look at it. It would be nice if the creation of the SQL Database from the Create External Table process added the node to the Workspace pane as well (if you already have a SQL database, the new one will just appear in the list when you refresh it).

On balance, it’s still AMAZING that to CETAS, you don’t need to know all the ins and outs of CETAS, and you can carry on with your Old Skool database stuff without getting derailed. So it’s still a *love* from me.

Back soon(er)….

Linked in

‘Azure Synapse Link. You keep saying those words. I do not think they mean what you think they mean….’

OK, so, let me start by saying I *love* the idea of being able to query my Azure SQL Database from Synapse Analytics – who wouldn’t – but there are a couple of things to bear in mind when setting that up. If you don’t mind, I’ll step through the process so you can see what I mean…

Firstly, you set this up at the Azure SQL Database end. I’m assuming here that you already have a database you want to link to Synapse Analytics, and that you already have an Azure Synapse workspace to query from ? Got that ? Great, let’s go…

Over in the Azure Portal, go find the database you want to link. Here I’ll be linking ‘MyDatabase’, but you’ll be linking ‘YourDatabase’ 🙂

Scroll down the.. what is that called ? A picker ? A menu ?? Whatever, scroll down it until you find ‘Azure Synapse Link’:

Helpfully, you’re taken to a ‘Getting Started’ tab on a new page. Change to the ‘Enable Azure Synapse Link’ tab, and you’ll see this:

This is why you need a workspace to connect your database to 🙂 Fill in your details for the Subscription, Resource Group, and Workspace, stopping for a moment to read the notice at the top…

I’ll come back to that in a bit.

Setting up a Synapse Link has some pre-reqs that need to be in place on your SQL Database – Public Network Access should be enabled, Azure services need permission to access the database, and a system-assigned managed identity needs to be enabled. I’m not going to go into that rn, but that’s what needs to be in place, so that’s what I’ll do.

Hitting the ‘Configure for prerequisites’ button tells you what’s going to happen. In my case, all the pre-reqs are already met. If you don’t already meet them, there’ll be a red ‘x’ instead of a green tick, but when you hit ‘Continue’, the changes will be made. Notice also that these settings affect the instance that your SQL Database is on, not just the database – handy if you have a number of databases on the same server that you want to link to Synapse (hint: maybe do some pre-planning around this and make sure everything is in the right place).

The changes are then made and you get a confirmatory tick before hitting ‘Enable Link’:

That doesn’t actually enable the link, in case you were wondering if that’s it. It connects you to your Azure Synapse workspace, starts Synapse Studio and presents you with some *more* choices to make in a ‘New Link Connection’. Side note: it doesn’t insert the name of the database you just enabled, for reasons which will become apparent. Any existing databases already linked will appear in the drop-down. MyDatabase isn’t there, so I’m going to have to create a new one. No, not a new database – a new Source Linked Service.

Name it something memorable, and note ‘Auto-resolve Integration Runtime’ – you can change this if you want and use a new (or existing) one if you prefer. I’m kinda basic, so the default is good for me.

Make sure the FQDN is the name of the instance that your database is on and that the database you want to create the link to is correct (these should be pre-populated), sprinkle over some credentials to use to access the database, and hit ‘Create’ when you’re happy.

I have a table called ‘Customers’ in my database. It’s lonely because it’s the only table 😦 It also contains some data types that are not supported in Azure Synapse SQL pools, but the link process is smart enough to tell me that it can ignore them for replication. It also points out that I don’t have a PRIMARY KEY on my table (for shame!), so it won’t let me continue…

brb…..
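(While I’m off fixing it: the fix is nothing more exotic than a bog-standard ALTER TABLE back on the Azure SQL Database – the column name here is just an example, yours will differ:)

```sql
-- Add a primary key so Synapse Link will accept the table for replication.
-- CustomerID is a placeholder column name, and it needs to be NOT NULL.
ALTER TABLE dbo.Customers
ADD CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerID);
```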

OK, that’s fixed and I now have a PK on the table. All’s good. I’ll select the one solitary table in my database to link to Synapse and hit ‘Continue’:

Right. So. Remember earlier on where it said that there would be cost implications in setting up Azure Synapse Link ? This is why:

This needs a little explanation, I think.

We all know that Synapse has two types of SQL Pools – Serverless and Dedicated.

Serverless is largely pay-as-you-go (OK, you’re charged for the data your queries process, plus your storage costs, but that’s about it) whereas Dedicated means you have to stand up some compute explicitly, and *that* can come at a cost depending on the size of pool you need (not going into that rn either).

Since Synapse Link is a way of querying your Azure SQL Database data from Synapse Studio via a copy that stays up to date, you *need* something that’s always there to receive the records you put in the source. Did you notice the words ‘not be replicated’ in the table selection ? It’s not a linked server as you might have used previously on-prem, it’s a replica (much like the secondary replicas in an Availability Group scenario), and as such it requires a Dedicated SQL Pool in the Synapse world.

Side note: Take a look at Managed Instance Link for Availability Groups – it’s very much like that, if not identical.

To save costs, you can pause a dedicated pool (but then you won’t receive updates into your replica, so…), but you still have to have one for Azure Synapse Link to work.

Anyway, back to the pretty pictures….

If you already have a dedicated SQL pool spun up, use that and hit ‘Continue’. If not, create one and then hit it 🙂

Everything needs a name, and you need to make some decisions about how many cores to use (really depends on velocity of records going into your Azure SQL Database), and whether you want to run this all the time or on a schedule. I’m just going to use ‘Continuous’ mode for now:

You then get some *more* decisions to make around Distribution and Structure Types. Again, I’m just using the defaults. Don’t feel like you have to, though…

Since it’s Azure Synapse now, you need to publish the link connection before you can start it:

Once it’s published, you can start it. This might take a minute or two because your Dedicated SQL Pool is being spun up and connected to your source. Note also that you can pause the link here as well:

Indeed it does. Especially as I’d forgotten that I’d paused the pool a while back to save costs ! D’oh ! But eventually, when the stars align, it starts:

You can add/remove data/tables/other stuff on your Azure SQL Database and that’ll be replicated to the .. err… replica in Azure Synapse, and you can do what you do in Synapse Studio 🙂

One last thing to point out, though. In Synapse Studio, you connect to the Dedicated Pool to query the data that’s now in Synapse. Don’t expect to see the details of the source database in there !
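Once the link is up and running, you just query the replicated table in the dedicated pool like any other table (names here are my example ones, obviously):

```sql
-- Run this against the dedicated SQL pool, not the source Azure SQL Database
SELECT COUNT(*) AS ReplicatedRows
FROM dbo.Customers;
```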

I guess in summary, Azure Synapse Link for Azure SQL Database isn’t the best name, since it actually creates a replica of your source database and costs you money, rather than just being a ‘special’ connection type. But I have to admit, it’s great tech whichever way you slice it.

Back soon…

Spring cleaning

I don’t know about you, but sometimes I just create stuff in Azure and then don’t touch it again for ages, and then when I go back I can’t remember why I created stuff in the first place…

In these financially-constrained times (and also because when I looked at my resources in the Azure Portal and went ‘YIKES’), I thought that some housekeeping was long overdue. Turns out deleting stuff is both *extremely* satisfying and *slightly* annoying, but the pay-off is a tidy portal, and that has to be good, right ?

Anyway, so I deleted some stuff that I hadn’t used in ages and that was costing me money. I was fairly ruthless about it as well: CosmosDB for Cassandra ? Gone. Same for MongoDB. Gone. Same for some Azure Table storage… I figured that if I need it again, I can just recreate it anyways.

I also thought that my Azure Synapse Workspace was a little.. well, messy, to be frank, so while I was in cleansing mode, I decided to bin that as well and start again fresh. I can just recreate it…

So I did, following my own advice here. (Side note: I’m soooo happy I wrote this down somewhere. The steps are the same even if the UI has changed a little)

After having considered my location and region, I went for a local Azure region – feels nice to be close to your data, right ? Making sure that all the pre-reqs were in place and that my names were sufficiently unique and memorable, I got to the part where it said ‘Create’.

Happy days. The portal went off and did the things it does so well, submitting the required tasks and actioning them..

But… disaster ! Oh noes !! I haz a sad:

The storage accounts were created OK, but there was something wrong with my Synapse Workspace creation !! How can this be ?!? I filled in all the things like a good Data Engineer, didn’t I ? I’m sure I did !

Diving into this a little deeper by choosing the Operation Details link, the problem is clear:

Well, clearer. Let’s take a look in the new Resource Group’s activity log for some more info. Ah yes, here it is:

Switching to the JSON view of the failure, we get a *lot* of information, but also details of the actual error:

OK, so that’s suboptimal. Never mind, I can just delete the deployment and start again.

Also, suboptimal. OK, so I can move the new resources into another datacen… ?

Also, nope. I don’t know if it’s possible via some CLI majick, but on the face of it, it looks like it’s a no-no.

Now, I *could* try and create the missing resources manually (probably), but rn I’m a little tired and disheartened, so I’m just going to delete the whole shebang and try again in a different region 😦

This time, success !

So, there are a couple of things to unpick here.

First, although the error seems to say that my subscription isn’t allowed to create new workspaces any more, in reality it looks like a resource constraint in that region (so an Azure thing). That’s good news.

Secondly, I didn’t actually request any ‘Windows Azure SQL Database servers’ when setting up the workspace, and this one is quite interesting as it exposes a part of Serverless SQL Pools that people sometimes forget.

Serverless SQL Pools are *not* serverless, it’s just that you can’t control (or even see) the servers that your queries are running on. For any query to work, there *has* to be some compute allocated, even if you’re just using OPENROWSET() to query an external file. Conceptually, the Workspace SQL Query window is just an abstraction of a normal query environment in that it’s exposed *only* via the web UI – it still has to connect to a database engine *somewhere*.

I guess UK West is just a bit busy at the moment. And why wouldn’t it be ? Who doesn’t love the South West of the UK ? We have some great beaches for surfing, some excellent restaurants, and breathtaking scenery 🙂

Back soon…