Surely it's just text data?

Right then.


I'm at a bit of a loss on how to start this one… So here goes…


The last few posts have been about estimating the size of indexes (indices?), primarily driven by the challenge of having a multi-terabyte table in development. People are getting a bit nervous about what 'multi-terabyte' actually means in dollar value, and are wondering how to shave off a couple of terabytes here and there.


Without going into too much detail about how the table will be physicalised (but you can bet that it'll be partitioned to death), I was looking into the various ways of storing textual data in this monster, and which one would use the least amount of disk space.


While doing this, someone mentioned that SQL Server stores XML data in a double-byte format if you use the native XML datatype. I thought that this was a bit odd, since surely the database doesn't care what the text actually is, and the character set used in XML is defined within the XML itself (in the 'encoding' attribute of the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>). So I set up a simple test of this, just to make sure:


CREATE TABLE XMLTest
        (stdText VARCHAR (1000),
        xmlText XML,
        unicodeText NVARCHAR (1000))
       
GO


That gives me 3 columns to insert the same XML into, using 3 different text datatypes, and since I'm a completely lazy DBA, I wanted to reuse the CREATE TABLE instead of creating 3 tables 🙂 After each test, I simply dropped the table and started again – there didn't seem to be much point in persisting the data between tests.


Here's the XML I wanted to insert (some of you may recognise this from Sams 'Teach Yourself XML in 14 Days' – an excellent publication, I might add):


'
<memo>
        <body>
                Remember to buy some milk on the way home from work
        </body>
</memo>
'


Since I'm still a completely lazy DBA, using GO 5000 is a quick way of inserting a few thousand rows. Again, I'm not interested in the uniqueness of each record, just the storage required to hold it (so shoot me).


Insert #1 goes like this:


INSERT XMLTest (stdText)
        VALUES
                ('
<memo>
        <body>
                Remember to buy some milk on the way home from work
        </body>
</memo>
')
GO 5000
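For anyone following along at home, the space check after each batch is just a call to the system procedure sp_spaceused, which reports rows, reserved, data, index_size and unused for the table:

EXEC sp_spaceused 'XMLTest'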


A quick sp_spaceused 'XMLTest' tells me that these 5000 rows take about 1480KB of disk space. OK, I'm happy with that (for now), since it's just in a normal VARCHAR column.


Moving swiftly on to the next test – inserting 5000 rows into the native XML datatype…
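The second insert is the same shape as the first, just aimed at the xml column – a sketch, assuming the same sample document as before (the string literal is implicitly converted to the xml type on insert):

INSERT XMLTest (xmlText)
        VALUES
                ('
<memo>
        <body>
                Remember to buy some milk on the way home from work
        </body>
</memo>
')
GO 5000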


This time, sp_spaceused 'XMLTest' tells me that exactly the same amount of data (albeit in a different datatype) requires about 2760KB of storage.


This in itself is odd, don't you think? If SQL Server stores XML in double-byte, shouldn't it take twice the storage? 2 x 1480 is 2960, not 2760… (my best guess is that the native xml type isn't stored as raw double-byte text at all, but as an internal binary representation, which would explain why it doesn't cost the full double-byte price).


For the sake of my sanity, I also did the same inserts into a Unicode datatype, NVARCHAR. Unicode is double-byte, after all, so I'm expecting these 5000 rows to take near-as-dammit 2960KB. Are you?
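For the record, the third insert is the same again but into the NVARCHAR column – note the N prefix, which makes the string literal itself a double-byte Unicode string (again a sketch, assuming the same sample document):

INSERT XMLTest (unicodeText)
        VALUES
                (N'
<memo>
        <body>
                Remember to buy some milk on the way home from work
        </body>
</memo>
')
GO 5000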


Imagine, then, how perplexed I was after the Unicode insert. This time, sp_spaceused 'XMLTest' shows that 3400KB is used. For THE SAME DATA!!! Damn you, SQL Server storage, for messing with my otherwise neatly ordered world view!!!


So, the moral of this story could be seen as:


"If you don't need to store XML data in the XML datatype (i.e. you're not planning on doing any clever XQuery stuff on it), then don't – it'll cost you storage in the long run."
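For reference, the 'clever XQuery stuff' in question means the xml type's built-in query methods, which the plain (N)VARCHAR columns simply don't have. A sketch against the test table – the element names are just the ones from my sample document, so substitute your own:

SELECT  xmlText.value('(/memo/body/text())[1]', 'VARCHAR(1000)') AS bodyText
FROM    XMLTest
WHERE   xmlText.exist('/memo/body') = 1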

 

Although it's not for me to second-guess your design decisions. I just thought I should mention it…

 
Back soon!

