Estimating the size of a Clustered Index

OK, let me first start by saying that I'm not an expert in SQL Server storage by any stretch of the imagination, and I'm sure you're aware that estimating ANYTHING in SQL Server is a bit of a dark art. Now that's out of the way.


I recently had occasion to estimate the size of a multi-terabyte database, and thought I should factor in the Clustered Indexes (Indices?) to be a little more sure of the numbers. After wishing that I hadn't agreed to do this, I set to, attempting to make some sense of this article:

http://msdn.microsoft.com/en-us/library/ms178085.aspx


I advise you to read this, and then lie down in a darkened room for a while. Then perhaps grab something to eat / watch a movie / take a bath / whatever you need to do to give you the strength to read it again.


Thankfully, I've distilled this for you into a form that can easily be reproduced in your Office Spreadsheet program of choice.


Storage required @ Leaf level

Name                 Value    Formula / description
Num_Rows             1,000    Number of rows in the table
Num_Cols             1        Total number of columns (fixed and variable length)
Fixed_Data_Size      0        Total byte-size of all fixed-length columns
Num_Variable_Cols    0        Number of variable-length columns
Max_Var_Size         0        Maximum byte-size of all variable-length columns
Null_Bitmap          3        2 + ((Num_Cols + 7) / 8)
Variable_Data_Size   2        2 + (Num_Variable_Cols * 2) + Max_Var_Size
Row_Size             9        Fixed_Data_Size + Variable_Data_Size + Null_Bitmap + 4
Rows_Per_Page        736      8096 / (Row_Size + 2)
Fill_Factor          80       Index fill factor
Free_Rows_Per_Page   147.2    8096 * ((100 - Fill_Factor) / 100) / (Row_Size + 2)
Num_Leaf_Pages       2        Num_Rows / (Rows_Per_Page - Free_Rows_Per_Page), rounded up
Leaf_Space_Used      16384    8192 * Num_Leaf_Pages
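For reference, here's the leaf-level half as a quick Python sketch (the variable names are mine; the values are the example numbers from the table above):

```python
import math

# Example inputs, matching the table above.
num_rows = 1000
num_cols = 1                # total columns, fixed and variable length
fixed_data_size = 0         # total bytes of fixed-length columns
num_variable_cols = 0
max_var_size = 0            # max bytes of variable-length columns
fill_factor = 80

null_bitmap = 2 + (num_cols + 7) // 8
variable_data_size = 2 + (num_variable_cols * 2) + max_var_size
row_size = fixed_data_size + variable_data_size + null_bitmap + 4
rows_per_page = 8096 // (row_size + 2)
free_rows_per_page = 8096 * ((100 - fill_factor) / 100) / (row_size + 2)
num_leaf_pages = math.ceil(num_rows / (rows_per_page - free_rows_per_page))
leaf_space_used = 8192 * num_leaf_pages

print(rows_per_page, num_leaf_pages, leaf_space_used)  # -> 736 2 16384
```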
                       
Storage for Index information

Name                    Value         Formula / description
Num_Key_Cols            1             Total number of key columns (fixed and variable length)
Fixed_Key_Size          0             Total byte-size of all fixed-length key columns
Num_Variable_Key_Cols   0             Number of variable-length key columns
Max_Var_Key_Size        8             Maximum byte-size of all variable-length key columns
Index_Null_Bitmap       3             2 + ((Num_Key_Cols + 7) / 8)
Variable_Key_Size       10            2 + (Num_Variable_Key_Cols * 2) + Max_Var_Key_Size
Index_Row_Size          18            Fixed_Key_Size + Variable_Key_Size + Index_Null_Bitmap + 1 + 6
Index_Rows_Per_Page     405           8096 / (Index_Row_Size + 2)
LOG_Function            2.607455023   LOG(Index_Rows_Per_Page)
Non_leaf_Levels         2             1 + LOG_Function * (Num_Leaf_Pages / Index_Rows_Per_Page), rounded up
Num_Index_Pages         2             Num_Leaf_Pages / (Index_Rows_Per_Page ^ 2) + Num_Leaf_Pages / (Index_Rows_Per_Page ^ 1), each term rounded up
Index_Space_Used        16384         8192 * Num_Index_Pages
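If you want a sanity check on Non_leaf_Levels without the logs, you can simulate the b-tree fan-out directly. Here's a rough Python sketch (the helper is my own, not from the article); note that the rounding in the log-based spreadsheet version can add an extra level, which only makes the estimate more generous:

```python
import math

def non_leaf_levels(num_leaf_pages, index_rows_per_page):
    """Count b-tree levels above the leaf: keep dividing the page
    count by the fan-out until a single root page remains."""
    levels, pages = 0, num_leaf_pages
    while pages > 1:
        pages = math.ceil(pages / index_rows_per_page)
        levels += 1
    return levels

# 1000 leaf pages -> 3 intermediate pages -> 1 root page
print(non_leaf_levels(1000, 405))  # -> 2
```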

There are a few things to note here.

Firstly, this calculation estimates the storage required for data at the leaf level – essentially, all the data in the table. Adding the clustered index 'changes' the table into a more ordered structure (please don't shout at me about this; I'm trying to make it easy to understand, not 100% accurate – I said I wasn't an expert).


If you've already got a better way of estimating the size of a heap, then feel free to stick with that (I did in my example), but you'll need some of the numbers in the top half to calculate the bottom half!


Most of the math is fairly self-explanatory, so I'll not go through that. The article referenced above gets complicated at around Section 7, so if you split out the LOG function into a separate calculation (as in the table above), it'll make your life a bit easier.


The bit I really wanted to mention is the calculation of Num_Index_Pages. And here's the rub: you divide Num_Leaf_Pages by Index_Rows_Per_Page raised to the power of the number of Non_Leaf_Levels, then add that to the next term, which uses the power Non_Leaf_Levels - 1, and you keep adding these together until the power comes down to 1. Not a problem when you have only a few Non_Leaf_Levels……
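Written out as code, that summation is just a loop (a Python sketch of the article's formula; the function name is mine):

```python
def num_index_pages(num_leaf_pages, index_rows_per_page, non_leaf_levels):
    # One term per non-leaf level, each rounded up as the article
    # instructs ("For each of the summands...").
    total = 0
    for level in range(1, non_leaf_levels + 1):
        power = index_rows_per_page ** level
        total += (num_leaf_pages + power - 1) // power  # exact ceiling division
    return total

print(num_index_pages(2, 405, 2))  # -> 2, matching Num_Index_Pages in the table
```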


However, one table I was dealing with had > 8000 Non_Leaf_Levels. Thinking 'There goes my weekend!!', I headed out for a bite to eat so I didn't have to think about it for a while.


And then a dirty hack came to me…


I'm not proud of it, but it got me my weekend back….


And here it is……


OK, for each time you do the Num_Leaf_Pages / (Index_Rows_Per_Page ^ whatever), you then round the answer up to the nearest whole number. You do. Believe me. It's in the article – the bit beginning "For each of the summands…". See?


What this means is that you're actually only interested in any term here that results in a value > 1. Everything else will just be 1. All you have to do is work out (somewhere else) the highest power that still gives an answer > 1, and then add all the remaining terms as a single number.


For example, for a table that has 4135 Non_Leaf_Levels, your calculation becomes (I'm using a common spreadsheeter, where N19 holds Num_Leaf_Pages and N28 holds Index_Rows_Per_Page):


4133+ROUNDUP((N19/(N28^2)),0)+ROUNDUP((N19/(N28^1)),0)


…because the term for the previous power (3), and every power above it, came out at less than 1, so each of those 4133 terms just rounds up to 1.
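In code terms, the same dodge looks like this (my own Python sketch, equivalent to the spreadsheet formula above): once Index_Rows_Per_Page ^ level reaches Num_Leaf_Pages, every remaining term rounds up to exactly 1, so they can all be added in one go instead of computing thousands of tiny divisions:

```python
def num_index_pages_fast(num_leaf_pages, index_rows_per_page, non_leaf_levels):
    total = 0
    for level in range(1, non_leaf_levels + 1):
        power = index_rows_per_page ** level
        if power >= num_leaf_pages:
            # Every term from here on is <= 1 and rounds up to 1:
            # add them all at once, just like the 4133 above.
            total += non_leaf_levels - level + 1
            break
        total += (num_leaf_pages + power - 1) // power  # exact ceiling division
    return total

print(num_index_pages_fast(2, 405, 2))  # -> 2, matching the table
```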


Sneaky !


And I've tested it out using an actual table with real data, and the estimate is almost spot-on. There's a whole other piece on estimating the size of a non-clustered index. If that gets as exciting as this, then I'll post that up as well.


Back soon !!!

