Estimating the size of a Clustered Index
OK, let me first start by saying that I'm not an expert in SQL Server storage by any stretch of the imagination, and I'm sure you're aware that estimating ANYTHING in SQL Server is a bit of a dark art. Right, that's out of the way.
I recently had occasion to estimate the size of a multi-terabyte database, and thought I should factor in the clustered indexes (indices?) to be a little more sure of the numbers. After wishing I hadn't agreed to do this, I set to, attempting to make some sense of this article:
http://msdn.microsoft.com/en-us/library/ms178085.aspx
I advise you to read this, and then lie down in a darkened room for a while. Then perhaps grab something to eat / watch a movie / take a bath / whatever you need to do to give you the strength to read it again.
Thankfully, I've distilled this for you into a form that can easily be reproduced in your office spreadsheet program of choice.
Storage required @ Leaf level

Rows                 |       1,000 | Number of rows in the table
Num_Cols             |           1 | Total number of columns (fixed and variable length)
Fixed_Data_Size      |           0 | Total byte-size of all fixed-length columns
Num_Variable_Cols    |           0 | Number of variable-length columns
Max_Variable_Size    |           0 | Maximum byte-size of all variable-length columns
Null_Bitmap          |           3 | 2 + ((Num_Cols + 7) / 8)
Variable_Data_Size   |           2 | 2 + (Num_Variable_Cols * 2) + Max_Var_Size
Row_Size             |           9 | Fixed_Data_Size + Variable_Data_Size + Null_Bitmap + 4
Rows_Per_Page        |         736 | 8096 / (Row_Size + 2)
Fill_Factor          |          80 | Index fill factor
Free_Rows_Per_Page   |       147.2 | 8096 * ((100 - Fill_Factor) / 100) / (Row_Size + 2)
Num_Leaf_Pages       |           2 | Num_Rows / (Rows_Per_Page - Free_Rows_Per_Page), rounded up
Leaf_Space_Used      |       16384 | 8192 * Num_Leaf_Pages
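If you'd rather sanity-check the spreadsheet in code, here's the leaf-level arithmetic as a small Python function. This is just my sketch of the formulas in the table above (the function and parameter names are mine, not Microsoft's):

```python
import math

def leaf_level_storage(num_rows, num_cols, fixed_data_size,
                       num_variable_cols, max_var_size, fill_factor):
    """Leaf-level storage estimate, following the formulas in the
    table above (all sizes in bytes)."""
    null_bitmap = 2 + (num_cols + 7) // 8
    variable_data_size = 2 + (num_variable_cols * 2) + max_var_size
    row_size = fixed_data_size + variable_data_size + null_bitmap + 4
    rows_per_page = 8096 // (row_size + 2)           # rows are never split across pages
    free_rows_per_page = 8096 * ((100 - fill_factor) / 100) / (row_size + 2)
    num_leaf_pages = math.ceil(num_rows / (rows_per_page - free_rows_per_page))
    return 8192 * num_leaf_pages                     # pages are 8 KB

# The worked example from the table: 1,000 rows, one column, 80% fill factor.
print(leaf_level_storage(1000, 1, 0, 0, 0, 80))      # 16384
```
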
Storage for Index information

Num_Key_Columns        |           1 | Total number of key columns (fixed and variable length)
Fixed_Key_Size         |           0 | Total byte-size of all fixed-length key columns
Num_Variable_Key_Cols  |           0 | Number of variable-length key columns
Max_Var_Key_Size       |           8 | Maximum byte-size of all variable-length key columns
Index_Null_Bitmap      |           3 | 2 + ((number of columns in the index row + 7) / 8)
Variable_Key_Size      |          10 | 2 + (Num_Variable_Key_Cols * 2) + Max_Var_Key_Size
Index_Row_Size         |          18 | Fixed_Key_Size + Variable_Key_Size + Index_Null_Bitmap + 1 + 6
Index_Rows_Per_Page    |         405 | 8096 / (Index_Row_Size + 2)
LOG function           | 2.607455023 | LOG(Index_Rows_Per_Page)
Non_Leaf_Levels        |           2 | 1 + (LOG(Num_Leaf_Pages) / LOG(Index_Rows_Per_Page)), rounded up
Num_Index_Pages        |           2 | Num_Leaf_Pages / (Index_Rows_Per_Page ^ 2) + Num_Leaf_Pages / (Index_Rows_Per_Page ^ 1), each term rounded up
Index_Space_Used       |       16384 | 8192 * Num_Index_Pages
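And the same thing for the non-leaf levels, again as my own Python sketch of the formulas above. A word of warning: the rounding choices here (FLOOR for rows per page, ROUNDUP elsewhere) are mine, so the intermediate numbers may differ by one or two from a spreadsheet that rounds differently, even though the final page count comes out the same for this example:

```python
import math

def index_storage(num_leaf_pages, num_key_cols, fixed_key_size,
                  num_variable_key_cols, max_var_key_size):
    """Non-leaf-level storage estimate, following the formulas in the
    table above (all sizes in bytes)."""
    index_null_bitmap = 2 + (num_key_cols + 7) // 8
    variable_key_size = 2 + (num_variable_key_cols * 2) + max_var_key_size
    index_row_size = (fixed_key_size + variable_key_size
                      + index_null_bitmap + 1 + 6)
    index_rows_per_page = 8096 // (index_row_size + 2)
    non_leaf_levels = math.ceil(
        1 + math.log(num_leaf_pages) / math.log(index_rows_per_page))
    # Sum Num_Leaf_Pages / Index_Rows_Per_Page ^ level from the top level
    # down to 1, rounding each summand up to a whole page.
    num_index_pages = sum(
        math.ceil(num_leaf_pages / index_rows_per_page ** level)
        for level in range(non_leaf_levels, 0, -1))
    return 8192 * num_index_pages

# The worked example: 2 leaf pages, one variable-length key of up to 8 bytes.
print(index_storage(2, 1, 0, 0, 8))                  # 16384
```
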
There are a few things to note here.
Firstly, this calculation estimates the storage required for data at the leaf level – essentially, all the data in the table. Adding the clustered index 'changes' the table into a more ordered structure (please don't shout at me about this; I'm trying to make it easy to understand, not 100% accurate – I said I wasn't an expert).
If you've already got a better way of estimating the size of a heap, then feel free to stick with that (I did in my example), but you'll need some of the numbers in the top half to calculate the bottom half!
Most of the maths is fairly self-explanatory, so I won't go through it. The article referenced above gets complicated at around Section 7, so if you split the LOG function out into a separate calculation, it'll make your life a bit easier (as in the table above).
The bit I really wanted to mention is the calculation of Num_Index_Pages. And here's the rub: you raise Index_Rows_Per_Page to the power of Non_Leaf_Levels and divide Num_Leaf_Pages by that, then add the next term, which raises to the power of Non_Leaf_Levels - 1, and you continue adding these terms together until the power reaches 1. Not a problem when you have only a few Non_Leaf_Levels…
However, one table I was dealing with had > 8000 Non_Leaf_Levels. Thinking 'There goes my weekend!!', I headed out for a bite to eat so I didn't have to think about it for a while.
And then a dirty hack came to me…
I'm not proud of it, but it got me my weekend back…
And here it is……
OK, each time you do Num_Leaf_Pages / (Index_Rows_Per_Page ^ whatever), you then round the answer up to the nearest whole number. You do. Believe me. It's in the article – the bit beginning "For each of the summands…". See?
What this means is that you're actually only interested in any term here that gives a value > 1. Everything else just rounds up to 1. All you have to do is work out (somewhere else) the highest power that still gives an answer > 1, and then add all the remaining terms together as a single number.
For example, for a table that has 4135 Non_Leaf_Levels, your calculation becomes (and I'm using a common spreadsheet program):
4133+ROUNDUP((N19/(N28^2)),0)+ROUNDUP((N19/(N28^1)),0)
…because the previous power (3) resulted in a number less than 1.
Sneaky !
And I've tested it out using an actual table with real data, and the estimate is almost spot-on. There's a whole other piece in estimating the size of a non-clustered index. If that gets as exciting as this, then I'll post that up as well.
Back soon !!!