Estimating the size of a Clustered Index

OK, let me first start by saying that I'm not an expert in SQL Server storage by any stretch of the imagination, and I'm sure you're aware that estimating ANYTHING in SQL Server is a bit of a dark art. Now that's out of the way.


I recently had occasion to estimate the size of a multi-terabyte database, and thought I should factor in the Clustered Indexes (Indices?) to be a little more sure of the numbers. After wishing that I hadn't agreed to do this, I set to, attempting to make some sense of this article:

http://msdn.microsoft.com/en-us/library/ms178085.aspx


I advise you to read this, and then lie down in a darkened room for a while. Then perhaps grab something to eat / watch a movie / take a bath / whatever you need to do to give you the strength to read it again.


Thankfully, I've distilled this for you into a form that can easily be reproduced in your Office Spreadsheet program of choice.


Storage required @ Leaf level

Name                 Value    Formula / description
Num_Rows             1,000    Number of rows in the table
Num_Cols             1        Total number of columns (fixed and variable length)
Fixed_Data_Size      0        Total byte-size of all fixed-length columns
Num_Variable_Cols    0        Number of variable-length columns
Max_Var_Size         0        Maximum byte-size of all variable-length columns
Null_Bitmap          3        2 + ((Num_Cols + 7) / 8)
Variable_Data_Size   2        2 + (Num_Variable_Cols * 2) + Max_Var_Size
Row_Size             9        Fixed_Data_Size + Variable_Data_Size + Null_Bitmap + 4
Rows_Per_Page        736      8096 / (Row_Size + 2)
Fill_Factor          80       Index fill factor
Free_Rows_Per_Page   147.2    8096 * ((100 - Fill_Factor) / 100) / (Row_Size + 2)
Num_Leaf_Pages       2        Num_Rows / (Rows_Per_Page - Free_Rows_Per_Page), rounded up
Leaf_Space_Used      16384    8192 * Num_Leaf_Pages
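For reference, here's the leaf-level half as a quick Python sketch (the variable names are mine; the values are the example numbers from the table above):

```python
import math

# Example inputs, matching the table above.
num_rows = 1000
num_cols = 1                # total columns, fixed and variable length
fixed_data_size = 0         # total bytes of fixed-length columns
num_variable_cols = 0
max_var_size = 0            # max bytes of variable-length columns
fill_factor = 80

null_bitmap = 2 + (num_cols + 7) // 8
variable_data_size = 2 + (num_variable_cols * 2) + max_var_size
row_size = fixed_data_size + variable_data_size + null_bitmap + 4
rows_per_page = 8096 // (row_size + 2)
free_rows_per_page = 8096 * ((100 - fill_factor) / 100) / (row_size + 2)
num_leaf_pages = math.ceil(num_rows / (rows_per_page - free_rows_per_page))
leaf_space_used = 8192 * num_leaf_pages

print(rows_per_page, num_leaf_pages, leaf_space_used)  # -> 736 2 16384
```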
                       
Storage for Index information

Name                    Value         Formula / description
Num_Key_Cols            1             Total number of key columns (fixed and variable length)
Fixed_Key_Size          0             Total byte-size of all fixed-length key columns
Num_Variable_Key_Cols   0             Number of variable-length key columns
Max_Var_Key_Size        8             Maximum byte-size of all variable-length key columns
Index_Null_Bitmap       3             2 + ((Num_Key_Cols + 7) / 8)
Variable_Key_Size       10            2 + (Num_Variable_Key_Cols * 2) + Max_Var_Key_Size
Index_Row_Size          18            Fixed_Key_Size + Variable_Key_Size + Index_Null_Bitmap + 1 + 6
Index_Rows_Per_Page     405           8096 / (Index_Row_Size + 2)
LOG_Function            2.607455023   LOG(Index_Rows_Per_Page)
Non_leaf_Levels         2             1 + LOG_Function * (Num_Leaf_Pages / Index_Rows_Per_Page), rounded up
Num_Index_Pages         2             Num_Leaf_Pages / (Index_Rows_Per_Page ^ 2) + Num_Leaf_Pages / (Index_Rows_Per_Page ^ 1), each term rounded up
Index_Space_Used        16384         8192 * Num_Index_Pages
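If you want a sanity check on Non_leaf_Levels without the logs, you can simulate the b-tree fan-out directly. Here's a rough Python sketch (the helper is my own, not from the article); note that the rounding in the log-based spreadsheet version can add an extra level, which only makes the estimate more generous:

```python
import math

def non_leaf_levels(num_leaf_pages, index_rows_per_page):
    """Count b-tree levels above the leaf: keep dividing the page
    count by the fan-out until a single root page remains."""
    levels, pages = 0, num_leaf_pages
    while pages > 1:
        pages = math.ceil(pages / index_rows_per_page)
        levels += 1
    return levels

# 1000 leaf pages -> 3 intermediate pages -> 1 root page
print(non_leaf_levels(1000, 405))  # -> 2
```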

There are a few things to note here.

Firstly, this calculation estimates the storage required for data at the leaf level – essentially, all the data in the table. Adding the clustered index 'changes' the table into a more ordered structure (please don't shout at me about this; I'm trying to make it easy to understand, not 100% accurate – I said I wasn't an expert).


If you've already got a better way of estimating the size of a heap, then feel free to stick with that (I did in my example), but you'll need some of the numbers in the top half to calculate the bottom half!


Most of the math is fairly self-explanatory, so I'll not go through that. The article referenced above gets complicated at around Section 7, so if you split out the LOG function into a separate calculation (as in the table above), it'll make your life a bit easier.


The bit I really wanted to mention is the calculation of Num_Index_Pages. And here's the rub: you divide Num_Leaf_Pages by Index_Rows_Per_Page raised to the power of the number of Non_Leaf_Levels, then add that to the next term, which uses the power Non_Leaf_Levels - 1, and you keep adding these together until the power comes down to 1. Not a problem when you have only a few Non_Leaf_Levels……
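Written out as code, that summation is just a loop (a Python sketch of the article's formula; the function name is mine):

```python
def num_index_pages(num_leaf_pages, index_rows_per_page, non_leaf_levels):
    # One term per non-leaf level, each rounded up as the article
    # instructs ("For each of the summands...").
    total = 0
    for level in range(1, non_leaf_levels + 1):
        power = index_rows_per_page ** level
        total += (num_leaf_pages + power - 1) // power  # exact ceiling division
    return total

print(num_index_pages(2, 405, 2))  # -> 2, matching Num_Index_Pages in the table
```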


However, one table I was dealing with had > 8000 Non_Leaf_Levels. Thinking 'There goes my weekend!!', I headed out for a bite to eat so I didn't have to think about it for a while.


And then a dirty hack came to me…


I'm not proud of it, but it got me my weekend back….


And here it is……


OK, for each time you do the Num_Leaf_Pages / (Index_Rows_Per_Page ^ whatever), you then round the answer up to the nearest whole number. You do. Believe me. It's in the article – the bit beginning "For each of the summands…". See?


What this means is that you're actually only interested in any term here that results in a value > 1. Everything else will just be 1. All you have to do is work out (somewhere else) the highest power that still gives an answer > 1, and then add all the remaining terms as a single number.


For example, for a table that has 4135 Non_Leaf_Levels, your calculation becomes (I'm using a common spreadsheeter, where N19 holds Num_Leaf_Pages and N28 holds Index_Rows_Per_Page):


4133+ROUNDUP((N19/(N28^2)),0)+ROUNDUP((N19/(N28^1)),0)


…because the term for the previous power (3), and every power above it, came out at less than 1, so each of those 4133 terms just rounds up to 1.
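In code terms, the same dodge looks like this (my own Python sketch, equivalent to the spreadsheet formula above): once Index_Rows_Per_Page ^ level reaches Num_Leaf_Pages, every remaining term rounds up to exactly 1, so they can all be added in one go instead of computing thousands of tiny divisions:

```python
def num_index_pages_fast(num_leaf_pages, index_rows_per_page, non_leaf_levels):
    total = 0
    for level in range(1, non_leaf_levels + 1):
        power = index_rows_per_page ** level
        if power >= num_leaf_pages:
            # Every term from here on is <= 1 and rounds up to 1:
            # add them all at once, just like the 4133 above.
            total += non_leaf_levels - level + 1
            break
        total += (num_leaf_pages + power - 1) // power  # exact ceiling division
    return total

print(num_index_pages_fast(2, 405, 2))  # -> 2, matching the table
```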


Sneaky !


And I've tested it out using an actual table with real data, and the estimate is almost spot-on. There's a whole other piece on estimating the size of a non-clustered index. If that gets as exciting as this, then I'll post that up as well.


Back soon !!!

