Surely it's just text data?


Right then.


I'm at a bit of a loss on how to start this one… So here goes…


The last few posts have been about estimating the size of indexes (indices?), primarily driven by the challenge of having a multi-terabyte table in development. People are getting a bit nervous about what 'multi-terabyte' actually means in dollar value, and are wondering how to shave off a couple of terabytes here and there.


Without going into too much detail about how the table will be physicalised (but you can bet that it'll be partitioned to death), I was looking into the various ways of storing textual data in this monster, and which one would use the least amount of disk space.


While doing this, someone mentioned that SQL Server stores XML data in a double-byte format if you use the native XML datatype. I thought that this was a bit odd, since surely the database doesn't care what the text actually is; after all, the character set used in XML is declared within the XML itself (in the 'encoding' attribute). So I set up a simple test of this, just to make sure:


CREATE TABLE XMLTest
        (stdText VARCHAR (1000),
        xmlText XML,
        unicodeText NVARCHAR (1000))
       
GO


That gives me 3 columns to insert the same XML into using 3 different text datatypes, and since I'm a completely lazy DBA, I wanted to reuse the CREATE TABLE instead of creating 3 tables 🙂 After each test, I simply dropped the table and started again – there didn't seem a lot of point in persisting the data between tests.
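
For the record, the measure-and-reset step between tests looks something like this (a quick sketch, reusing the CREATE TABLE above):


-- Check how much space the current test used...
EXEC sp_spaceused 'XMLTest'

-- ...then throw the table away and start afresh for the next run
DROP TABLE XMLTest
GO

CREATE TABLE XMLTest
        (stdText VARCHAR (1000),
        xmlText XML,
        unicodeText NVARCHAR (1000))
GO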


Here's the XML I wanted to insert (some of you may recognise this from Sams 'Teach Yourself XML in 14 Days' – an excellent publication, I might add):


\’


       
                Remember to buy some milk on the way home from work
       
\’


Since I'm still a completely lazy DBA, using GO 5000 is a quick way of inserting a few thousand rows. Again, I'm not interested in the uniqueness of each record, just the storage required to hold it (so shoot me).


Insert #1 goes like this:


INSERT XMLTest (stdText)
        VALUES
                (\’


       
                Remember to buy some milk on the way home from work
       
\’)
GO 5000


A quick sp_spaceused 'XMLTest' tells me that these 5000 rows take about 1480KB of disk space. OK, I'm happy with that (for now) since it's just in a normal VARCHAR column.


Moving swiftly on to the next test – inserting 5000 rows into the native XML datatype…
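
The statement is the same as before, just pointed at the XML column – something like:


INSERT XMLTest (xmlText)
        VALUES
                ('<?xml version="1.0" encoding="UTF-8"?>
<memo>
        <body>
                Remember to buy some milk on the way home from work
        </body>
</memo>')
GO 5000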


This time, sp_spaceused 'XMLTest' tells me that exactly the same amount of data (albeit in a different datatype) requires about 2760KB of storage.


This in itself is odd, don't you think? If SQL Server stores XML in double-byte, shouldn't it take twice the storage? 2 x 1480 is 2960, not 2760…


For the sake of my sanity, I also did the same inserts into a Unicode datatype, NVARCHAR. Unicode is double-byte, after all, so I'm expecting these 5000 rows to take near-as-dammit 2960KB. Are you?
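
Again, it's the same statement, this time aimed at the NVARCHAR column, along the lines of:


INSERT XMLTest (unicodeText)
        VALUES
                ('<?xml version="1.0" encoding="UTF-8"?>
<memo>
        <body>
                Remember to buy some milk on the way home from work
        </body>
</memo>')
GO 5000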


Imagine, then, how perplexed I was after the unicode insert. This time, the sp_spaceused 'XMLTest' shows that 3400KB is used. For THE SAME DATA!!! Damn you, SQL Server storage for messing with my otherwise neatly ordered world view!!!


So, the moral of this story could be seen as:


"If you don't need to store XML data in the XML datatype (i.e. you're not planning on doing any clever XQuery stuff on it), then don't – it'll cost you storage in the long run."

 

Although it's not for me to second-guess your design decisions – I just thought I should mention it…

 
Back soon!


Estimating the size of a non-clustered index


After all the fun and games when doing this for a clustered index, I said I'd post up the non-clustered version if it was interesting.


Turns out, it's not that different from the clustered index version, so if you've already gone to the trouble of doing that, then you're most of the way there.


Again, there's a very good MSDN HowTo about it at:


http://msdn.microsoft.com/en-us/library/ms190620.aspx



…so I won't repeat either that or my own calculations from before. One thing to bear in mind, though, is the provision for estimating any indexes with included columns (one of the nicest indexing changes in SQL 2005, in my opinion), but it's all detailed in the HowTo.
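
If you haven't played with included columns yet, they look like this (a throwaway example with made-up table and column names, nothing to do with the monster table):


-- Non-clustered index keyed on OrderDate, with two extra columns
-- carried along at the leaf level only, so they don't fatten the key
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
        ON dbo.Orders (OrderDate)
        INCLUDE (CustomerID, TotalDue)
GO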


Of course, you may not always have a clustered index on your table, but you've at least got a primary key (which, by default, is enforced with a clustered index), so that counts 🙂


Oh, and just an aside about column nullability: I don't mean to teach you to suck eggs or anything, but please give nullability a thought. When you're estimating ANY index size, it's quite important.


If a column can contain nulls, SQL Server has to keep track of it using a null bitmap. This null bitmap has a storage overhead in the index, so when dealing with very large tables with several indexes, you might want to keep nullable columns to a minimum. I'm just sayin'…
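
To put a rough number on it, the per-row cost of that bitmap follows the formula from the MSDN HowTo. A quick sketch with a made-up column count:


-- Null bitmap overhead per index row, per the MSDN formula:
--     2 + ((number of columns in the index row + 7) / 8), rounded down to whole bytes
DECLARE @index_cols INT
SET @index_cols = 6        -- key + included columns; made-up figure
SELECT 2 + ((@index_cols + 7) / 8) AS null_bitmap_bytes_per_row    -- gives 3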


Of course, there is a whole other piece on SQL 2008 Sparse Column support, but that brings with it its own idiosyncrasies when it comes to indexing and data compression, so I'll attempt to cover those off in a later post. Maybe.


Back soon!


Estimating the size of a Clustered Index


OK, let me first start by saying that I'm not an expert in SQL Server storage by any stretch of the imagination, and I'm sure you're aware that estimating ANYTHING in SQL is a bit of a dark art. So that's got that out of the way.


I recently had occasion to estimate the size of a multi-terabyte database, and thought I should factor in the Clustered Indexes (Indices?) to be a little more sure of the numbers. After wishing that I hadn't agreed to do this, I set to, attempting to make some sense out of this article:

http://msdn.microsoft.com/en-us/library/ms178085.aspx


I advise you to read this, and then lie down in a darkened room for a while. Then perhaps grab something to eat / watch a movie / take a bath / whatever you need to do to give you the strength to read it again.


Thankfully, I've distilled this for you into a form that can easily be reproduced in your Office Spreadsheet program of choice.


Storage required @ Leaf level

Rows                     1,000          Number of rows in the table
Num_Cols                 1              Total number of columns (fixed and variable length)
Fixed_Data_Size          0              Total byte-size of all fixed-length columns
Num_Variable_Cols        0              Number of variable-length columns
Max_Variable_Size        0              Maximum byte-size of all variable-length columns
Null_Bitmap              3              2 + ((Num_Cols + 7) / 8)
Variable_Data_Size       2              2 + (Num_Variable_Cols * 2) + Max_Variable_Size
Row_Size                 9              Fixed_Data_Size + Variable_Data_Size + Null_Bitmap + 4
Rows_Per_Page            736            8096 / (Row_Size + 2)
Fill_Factor              80             Index fill factor
Free_Rows_Per_Page       147.2          8096 * ((100 - Fill_Factor) / 100) / (Row_Size + 2)
Num_Leaf_Pages           2              Num_Rows / (Rows_Per_Page - Free_Rows_Per_Page)
Leaf_Space_Used          16384          8192 * Num_Leaf_Pages

Storage for Index information

Num_Key_Columns          1              Total number of key columns (fixed and variable length)
Fixed_Key_Size           0              Total byte-size of all fixed-length key columns
Num_Variable_Key_Cols    0              Number of variable-length key columns
Max_Var_Key_Size         8              Maximum byte-size of all variable-length key columns
Index_Null_Bitmap        3              2 + ((Number of columns in the index row + 7) / 8)
Variable_Key_Size        10             2 + (Num_Variable_Key_Cols * 2) + Max_Var_Key_Size
Index_Row_Size           18             Fixed_Key_Size + Variable_Key_Size + Index_Null_Bitmap + 1 + 6
Index_Rows_Per_Page      405            8096 / (Index_Row_Size + 2)
LOG function             2.607455023    LOG(Index_Rows_Per_Page)
Non_Leaf_Levels          2              1 + LOG function * (Num_Leaf_Pages / Index_Rows_Per_Page)
Num_Index_Pages          2              Num_Leaf_Pages / (Index_Rows_Per_Page ^ 2) + Num_Leaf_Pages / (Index_Rows_Per_Page ^ 1)
Index_Space_Used         16384          8192 * Num_Index_Pages

There are a few things to note here.

Firstly, this calculation estimates the storage required for data at the leaf level – essentially, all the data in the table. Adding the clustered index 'changes' the table into a more ordered structure (please don't shout at me about this, I'm trying to make it easy to understand, not 100% accurate – I said I wasn't an expert).


If you've already got a better way of estimating the size of a heap, then feel free to stick with that (I did in my example), but you'll need some of the numbers in the top half to calculate the bottom half!
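
If it helps, the top half boils down to a few lines of arithmetic. Here it is as a T-SQL scratchpad using the example numbers from the table above (swap in your own figures):


-- Leaf-level storage, using the example figures from the table
DECLARE @Rows FLOAT, @Row_Size FLOAT, @Fill_Factor FLOAT
SET @Rows = 1000
SET @Row_Size = 9           -- Fixed_Data_Size + Variable_Data_Size + Null_Bitmap + 4
SET @Fill_Factor = 80

DECLARE @Rows_Per_Page FLOAT, @Free_Rows_Per_Page FLOAT, @Num_Leaf_Pages FLOAT
SET @Rows_Per_Page = FLOOR(8096 / (@Row_Size + 2))                                  -- 736
SET @Free_Rows_Per_Page = 8096 * ((100 - @Fill_Factor) / 100) / (@Row_Size + 2)     -- 147.2
SET @Num_Leaf_Pages = CEILING(@Rows / (@Rows_Per_Page - @Free_Rows_Per_Page))       -- 2

SELECT @Num_Leaf_Pages AS Num_Leaf_Pages,
       8192 * @Num_Leaf_Pages AS Leaf_Space_Used                                    -- 16384 bytes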


Most of the math is fairly self-explanatory, so I'll not go through that. The article referenced above gets complicated at around Section 7, so if you split out the LOG function into a separate calculation, it'll make your life a bit easier (as in the table above).


The bit I really wanted to mention is the calculation of Num_Index_Pages. And here's the rub: you divide Num_Leaf_Pages by Index_Rows_Per_Page raised to the power of Non_Leaf_Levels, then add that to the next term, where the power is Non_Leaf_Levels - 1, and you keep adding these together until the power comes down to 1. Not a problem when you have only a few Non_Leaf_Levels…


However, one table I was dealing with had > 8000 Non_Leaf_Levels. Thinking 'There goes my weekend!!', I headed out for a bite to eat so I didn't have to think about it for a while.


And then a dirty hack came to me…


I'm not proud of it, but it got me my weekend back…


And here it is……


OK, each time you do Num_Leaf_Pages / (Index_Rows_Per_Page ^ whatever), you then round the answer up to the nearest whole number. You do. Believe me. It's in the article. The bit beginning "For each of the summands…". See?


What this means is that you're actually only interested in any term here that results in a value > 1. Everything else just rounds up to 1. All you have to do is work out (somewhere else) which power you raise to gives the first answer > 1, and then just add all the other terms in as a single number (they contribute one page each).


For example, for a table that has 4135 Non_Leaf_Levels, your calculation becomes (and I'm using a common spreadsheet program here):


4133+ROUNDUP((N19/(N28^2)),0)+ROUNDUP((N19/(N28^1)),0)


…because the previous power (3) resulted in a number less than 1.


Sneaky!
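
If spreadsheets aren't your thing, here's a rough T-SQL version of the same shortcut (the figures below are made up, apart from the 405 index rows per page from the table above, so plug in your own):


-- Every term of Num_Leaf_Pages / (Index_Rows_Per_Page ^ power) that works out
-- below 1 just rounds up to a single page, so only the low powers need
-- calculating and the rest can be added in as one lump.
DECLARE @Num_Leaf_Pages FLOAT, @Index_Rows_Per_Page FLOAT, @Non_Leaf_Levels INT
SET @Num_Leaf_Pages = 1000000        -- made-up figure
SET @Index_Rows_Per_Page = 405
SET @Non_Leaf_Levels = 4135

DECLARE @power INT, @term FLOAT, @Num_Index_Pages FLOAT
SET @power = 1
SET @Num_Index_Pages = 0
SET @term = @Num_Leaf_Pages / POWER(@Index_Rows_Per_Page, @power)

-- Work up through the powers while each term is still worth more than one page...
WHILE @term > 1 AND @power <= @Non_Leaf_Levels
BEGIN
        SET @Num_Index_Pages = @Num_Index_Pages + CEILING(@term)
        SET @power = @power + 1
        SET @term = @Num_Leaf_Pages / POWER(@Index_Rows_Per_Page, @power)
END

-- ...then add one page for every remaining power, since they all round up to 1
SET @Num_Index_Pages = @Num_Index_Pages + (@Non_Leaf_Levels - @power + 1)

SELECT @Num_Index_Pages AS Num_Index_Pages,
       8192 * @Num_Index_Pages AS Index_Space_Used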


And I've tested it out using an actual table with real data, and the estimation is almost spot-on. There's a whole other piece in estimating the size of a non-clustered index. If that gets as exciting as this, then I'll post that up as well.


Back soon!!!