Anatomy of a Solid State Disk Failure

Solid state drives can fail just like anything can. I’ve been told that they fail gracefully and I’ll have enough time to rescue my files.  Is that true?  Read on.

For a couple of weeks my home PC’s SSD (a 60GB OCZ Agility 1 I bought on sale two years ago) has been reporting errors when I run a disk check. It was just one or two until this week, when hundreds of files and other artifacts were lost. I have no idea which ones, but Windows dutifully counted them out. And recently Windows started flaking out on me: for example, Event Viewer won’t load, although I could read the event files from another program. After running SFC (System File Checker), I confirmed that some system files are corrupted… including some used by Event Viewer. Apparently, when the drive wrote data, it didn’t verify the write, fail it and remap to another sector the way I had believed it would. Maybe only enterprise-class drives do that; this one wrote the data and carried on just like a consumer-class magnetic drive would.

Fortunately, I bought my wife a shiny new 120GB SSD for Christmas, and it sat on a table for over three months, waiting for me to prioritize it. Last night, it became mine. I created an image backup of the old 60GB boot SSD using Acronis True Image Home 2012, because it’s been recommended to me a few times.

I’m impressed. Acronis finished the backup in about 15 minutes, and most of that time was the external drive’s USB 2.0 interface running flat out. It even compressed the image by 50%. To see how fast it CAN get, I copied the backup file to my super-fast RAID drive, made a boot CD, installed the new SSD (an OCZ Agility 3 120GB, bought on NewEgg for about $110) and booted directly into Acronis. It let me choose the backup file from the external drive or the internal one, so I chose internal for speed. Formatting, validation and restore completed in 8 minutes, sustaining writes at about 200MB/sec, which barely challenged the new drive. A later file copy test showed the same speed, and the new drive was only about 10% ‘busy’ where the old one would have been at 100%. So I’m pretty happy with the new drive and the backup tool.

So, how did I know the disk was failing, and what does it look like when a 2010-era SSD fails?

Symptom 1: Changes don’t seem to save. Programs (including parts of Windows) crash randomly and say that a file is corrupt. If you see that, scan your disk, but be aware that if you tell it to fix errors, it will amputate to save the patient. Windows 7 recovers files, but I have no idea where my recovered files went, or which ones were lost. If you tell it to scan for bad sectors, watch the screen constantly: “33 bad clusters found/marked” flashed by and the machine rebooted. I can’t see any report, possibly because the event log files are corrupted. As soon as you see bad clusters, run, don’t walk, and get a replacement disk. Every disk write operation could be corrupting files. Windows 7 is much more resilient than XP, but nothing’s immune.
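Not knowing which files were lost is the part that stings. One way to find out, sketched here in Python and not something I actually had in place, is to record a hash of every file while the disk is healthy and compare against it later. The function names (`hash_file`, `build_manifest`, `diff_manifests`) are made up for this example, and a changed hash only means the content differs: it can’t tell corruption from a legitimate edit, so it is most useful on folders that shouldn’t change.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its hash.

    Files that cannot be read are recorded as None instead of aborting
    the scan, since unreadable files are exactly what we want to notice.
    """
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            rel = os.path.relpath(full, root)
            try:
                manifest[rel] = hash_file(full)
            except OSError:
                manifest[rel] = None
    return manifest


def diff_manifests(old, new):
    """Return (missing, changed) lists of relative paths."""
    missing = sorted(p for p in old if p not in new)
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    return missing, changed
```

You would save the result of `build_manifest` somewhere off the suspect disk (as JSON, say), then rebuild it after a scare and pass both to `diff_manifests`.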

Symptom 2: If you run SFC /VERIFYONLY or SFC /SCANNOW at an admin command prompt, it may tell you system files are corrupted. This seems odd because system files don’t change, and SSDs are supposed to fail only on writes. But remember the updates? Yes, Windows is constantly updating OS files, recompiling the .NET Framework and updating Windows Search indexes, all of which on an SSD use new sectors, some of which may be bad. So in theory I could install the old drive as read-only and it wouldn’t develop any more bad sectors. Anyone want to place bets?

I’ve encountered spinning drive failures before, and in almost all cases it was a controller or armature type failure. For the low, low price of $3,000 I could get the data restored by a rescue service. I prefer prevention, thanks. I’ve done file backups for years, but if I ever lost my boot disk, I’d lose hundreds of hours to reinstalling apps. I think I’ll be keeping OS image backups from now on, and I think Acronis is doing a good job. I’ll have to test the incremental backup, but seriously, a sector-by-sector image in under an hour, without a reboot? That’s worth money. I have no vested interest in Acronis, so if you know of a similar product that does well, please mention it in the comments.

So, would I rely on an SSD for my important stuff? Yes, with backups, because everyone should have backups anyway. I think most SSD media failures would show warning signs early, so a regular full ‘checkdisk’ scan is a good idea for home users. The speed boost over a spinning disk is so incredible that the ‘new’ technology (really, it’s not that new) is worth the unknowns. I absolutely would use enterprise-class SSDs in a production environment, subject to cost/benefit analysis, as long as I had them set up as RAID10 or RAID5 (as appropriate) with a few spare drives on hand (which is what enterprise IT should do anyway, right?). And your backup tool should be easy enough to use that you can do a restore test on a whim or on a schedule. That goes for any type of backup.
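For file-level backups, a restore test can be as simple as restoring to a scratch folder and checking that every file matches the original byte for byte. Here is a minimal Python sketch of that check. The `compare_trees` name and the two-folder layout are my own assumptions for illustration, and it has nothing to do with how Acronis validates its images.

```python
import filecmp
import os


def compare_trees(source, restored):
    """Compare every file under source with its restored counterpart.

    Returns (missing, different): relative paths that are absent from
    the restored tree, and paths whose contents do not match.
    """
    missing, different = [], []
    for root, _dirs, files in os.walk(source):
        rel_root = os.path.relpath(root, source)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            src = os.path.join(source, rel)
            dst = os.path.join(restored, rel)
            if not os.path.isfile(dst):
                missing.append(rel)
            # shallow=False forces a content comparison rather than
            # trusting file size and modification time alone.
            elif not filecmp.cmp(src, dst, shallow=False):
                different.append(rel)
    return sorted(missing), sorted(different)
```

Two empty lists mean the restore matched. It only checks files present in the source, so extra files in the restored folder are ignored.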

 
