Losing your \'master\' database. Only parts of the world have ended…

Losing your \’master\’ database. Only parts of the world have ended…

One of the questions that DBAs automatically know the answer to automatically is:


\”If \’master\’ is unavailable, then is it game over ?\”


Until today, I\’d have to answer with a resounding \’YES\’, because that\’s all that I knew.


There was a situation that surprised me today, in that SQLServer CAN stay up without a \’master\’ database if certain conditions are met. I know, it sounds weird, but as long as nothing touches the database (for example, to authenticate an incoming connection) then it will quite happily stay up for a while without major issue.


This isn\’t heresy – it happened. I kid you not. And here\’s how….


Going back through the logs, I could see that there were no events in the error log after about 11:00 (that is, until another DBA restarted the instance to get service back up and running). There were no \’login failed\’s, no \’Log was backed up\’, nothing. And yet there were no service incidents raised until about 12:15. So, let\’s dive into the Application Event log to see what was going on between 11:00 and 12:15.


It turns out that a chkdsk was run (not deliberately, I assure you, and certainly not by me) on the drive that holds the SQL binaries and the data files for \’master\’, \’MSDB\’ and \’model\’:

Event Type:     Information
Event Source:   Chkdsk
Event Category: None
Event ID:       26212
Date:           24/01/2011
Time:           11:00:46
User:           N/A
Computer:       ExampleServer
Description:
Chkdsk was executed in read-only mode on a volume snapshot.


Now, I\’ve not checked up on this (my bad), but I\’m fairly sure that running chkdsk on a drive that holds running instances of SQLServer is probably not recommended ! Notwithstanding that point, it was run, and a couple of minutes after this event, this was logged:


Event Type:     Information
Event Source:   Chkdsk
Event Category: None
Event ID:       26214
Date:           24/01/2011
Time:           11:02:10
User:           N/A
Computer:       ExampleServer
Description:
Chkdsk was executed in read/write mode.


So it would appear that chkdsk was executed again with the /f switch to fix inconsitency errors reported by the first run. It was almost exactly at this time that SQL events stopped being logged.


The first we knew of the problem was at about 12:15, when calls started to come in indicating that the \’master\’ database was unavailable.


It is my theory that the chkdsk process \’fixed\’ a detected error in the database physical files which led to SQL not being able to access it properly. Reason: If there\’s no reason to talk to the master database for anything, then it behaves pretty much like any other database that\’s not being accessed – it stays dormant, can be offlined, whatever, and you only find out when you need to use it again and it\’s unavailable. The service that was running on the instance already had all of it\’s database connections in place, and was already authenticated when the \’corruption\’ happened. I think we also lost MSDB by the same method at the same time, so no jobs could run that would invoke any xp_s that are held in master. It was only when the connections from the application got recycled (forcing the re-authentication) that there was any visible issue at all.


So, by my reckoning, that tells me that SQL Server is capable of running without a fully functioning master database. Just how long for depends on what the instance is actually doing at the time.


But I could be wrong….


Back soon !