Maybe the title is a bit dramatic – since I built my device, I have suffered no ill effects yet, but that is possibly because I tend to take good care of my devices. But I did allude to some potential problems with SSDs and their behaviour when faced with an unexpected power-down.
Fortunately, the recently held USENIX FAST ’13 conference has something to say about this! This paper on Understanding the Robustness of SSDs under Power Fault by Mai Zheng, Joseph Tucek, Feng Qin and Mark Lillibridge made the Slashdot front page this morning.
This is perfectly applicable to my concerns with SSDs in external enclosures – and their summary that “thirteen of fifteen tested SSDs exhibit surprising failure behaviours under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption and total device failure” calls for a closer examination of the results.
A Microscopic Examination
The paper’s attributions are to The Ohio State University and HP Labs – both respectable entities. Mai Zheng is a PhD student at OSU in CSE, and this is his first paper to focus on SSDs. Feng Qin is an Assistant Professor at OSU in CSE with a focus on the dependability of computer systems. Joseph Tucek is a member of the Storage Research Group at HP Labs, while Mark Lillibridge is a Principal Research Scientist with an interest in storage research amongst other things. This is a formidable team to examine the effects of power faults – I would love to see more from them in this area in the future, as SSD reliability under power failure has rarely been examined this carefully.
The paper tries to quantify and detect these errors by writing carefully formed records to the disk, interrupting power to the drive in hardware mid-workload, and then verifying the records that were written. They expected to see the following failure modes (from their paper):
- Bit Corruption – Records exhibit random bit errors
- Flying Writes – Well-formed records end up in the wrong place
- Shorn Writes – Operations are partially done at a level below the expected sector size
- Metadata Corruption – Metadata in FTL is corrupted
- Dead Device – Device does not work at all, or mostly does not work
- Unserializability – Final state of storage does not result from a serializable operation order
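To make the detection approach concrete, here is a rough sketch of how self-describing records can expose several of these failure modes at once. This is my own illustration, not the authors’ actual test harness – the record layout, field names and sizes are entirely my assumptions:

```python
import struct
import zlib

RECORD_SIZE = 512  # one logical sector

def make_record(seq: int, lba: int) -> bytes:
    """Build a self-describing 512-byte record: header + payload + CRC."""
    header = struct.pack("<QQ", seq, lba)  # sequence number, intended LBA
    payload = bytes((seq + i) % 256 for i in range(RECORD_SIZE - len(header) - 4))
    body = header + payload
    return body + struct.pack("<I", zlib.crc32(body))

def classify(record: bytes, actual_lba: int) -> str:
    """Classify what we find at a given LBA after a power cut."""
    body, (crc,) = record[:-4], struct.unpack("<I", record[-4:])
    if zlib.crc32(body) != crc:
        return "corrupt"       # bit corruption, or a shorn write
    seq, intended_lba = struct.unpack("<QQ", body[:16])
    if intended_lba != actual_lba:
        return "flying write"  # well-formed record, wrong place
    return "ok"

rec = make_record(seq=42, lba=1000)
print(classify(rec, actual_lba=1000))  # -> ok
print(classify(rec, actual_lba=2000))  # -> flying write
```

The checksum catches records mangled in-place, while the embedded intended-LBA field catches well-formed records that landed somewhere else entirely.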
It is important to note that FTL stands for Flash Translation Layer, which is the mapping of logical drive sectors to the flash cells. If this is damaged, the drive can be expected to behave erratically and be unable to read correct data. Metadata Corruption was my main concern when it came to SSDs in a USB enclosure. As corruption of the FTL can also result in data being incorrectly mapped, flying writes can potentially be due to FTL update failure or corruption.
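To see why FTL damage is so destructive, here is a toy illustration – my own sketch, nothing like a real controller, which also does wear levelling and garbage collection – of a logical-to-physical map:

```python
# Toy flash translation layer: maps logical sector numbers to physical
# flash pages. Losing or corrupting the map loses the data, even though
# every flash cell still holds perfectly good bits.
ftl = {0: 812, 1: 97, 2: 4410}                       # logical sector -> physical page
flash = {812: b"boot", 97: b"data", 4410: b"more"}   # physical page -> contents

def read_sector(n: int) -> bytes:
    return flash[ftl[n]]

print(read_sector(1))  # -> b'data'

# A power fault corrupting even one mapping entry makes the same
# logical read silently return unrelated data:
ftl[1] = 4410
print(read_sector(1))  # -> b'more'  (wrong data, no error reported)
```

Note that the drive has no way to signal the problem here – the read “succeeds”, just with the wrong contents, which is exactly why FTL corruption is nastier than a plain read error.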
They also identified shorn writes, which I would also have expected of any consumer-level device that has its power interrupted – it merely means a “partly written” sector. Random bit corruption may occur due to the way flash memory works when it is only partially programmed.
Of course, the most worrying are dead device, which could be quite costly, and unserializability, which means that data written in a certain order – even with flush buffer commands – is not committed to storage in that order. This can result in inconsistent states in certain types of files (databases, for example). This is probably due to manufacturer optimization (scheduling writes when it is convenient for the drive) without consideration of the consequences of power failure (possibly considered rare).
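Unserializability can be tested for by issuing monotonically increasing sequence numbers with flush barriers between them: after a power cut, whatever survived should form a prefix of the issued order. A sketch of that check – my own illustration of the idea, not the paper’s harness:

```python
def is_serializable(survived: list[int], issued: list[int]) -> bool:
    """A drive honouring write barriers should persist some prefix of
    the issued sequence: no write may survive while an earlier one is
    lost."""
    persisted = set(survived)
    seen_gap = False
    for seq in issued:
        if seq in persisted:
            if seen_gap:
                return False  # a later write outlived an earlier one
        else:
            seen_gap = True
    return True

issued = [1, 2, 3, 4, 5]
print(is_serializable([1, 2, 3], issued))  # -> True  (clean prefix)
print(is_serializable([1, 2, 5], issued))  # -> False (5 outlived 3 and 4)
```

The second case is exactly the database-killer: a commit record (write 5) that persisted while the data it commits (writes 3 and 4) evaporated.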
While 15 drives were used, there were no identifiers to the drives or chipsets, but some characteristics are noted:
We can’t reliably unravel the drives’ identities, but there are some hints:
- “no longer registering on the SAS bus at all after 136 fault cycles, and another suffering one third of its blocks becoming inaccessible after merely 8 fault cycles.”
- “We observed random bit corruption in three SSDs.” “… cannot be completely hidden from the device-level in some devices. One common way to deal with bit errors is using ECC. However, by examining the datasheet of the SSDs, we find that two of the failed devices have already made use of ECC for reliability.”
- “Contrary to our expectations, we observed shorn writes on three drives: SSDs no. 5, 14, and 15. Among the three, SSD#5 and SSD#14 are the most expensive ones – supposedly “enterprise-class” in our experiments.” “This is an interesting finding indicating that some SSDs use a sub-page programming technique internally treats 512 bytes as a programming unit, contrary to manufacturer claims.”
- “SSD#3 has 256GB of flash memory visible to users …”
So what can we take away from this?
We can see that the number of simulated faults for the hard disks is relatively low (with just one tested failure of the consumer hard drive), so I don’t think one can claim an exhaustive understanding of hard-drive behaviour for comparison. HDD#1 is the consumer 5400rpm drive, whereas HDD#2 appears to be an enterprise SAS hard drive at 15,000rpm.
The first hint is that SSD#3 had metadata corruption after 8 cycles and has 256GB of flash memory visible to users. This means SSD#3 is not a Sandforce SF-2xxx device, nor a Samsung 840 (as these would be provisioned at 240GB and 250GB respectively). Phew!
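The elimination here is just over-provisioning arithmetic: a controller that reserves spare flash exposes less than the raw 256GB. A quick check, using typical visible capacities from memory (so treat the exact figures as assumptions):

```python
raw_gb = 256  # raw flash on a typical 256GB-class drive
visible = {
    "SandForce SF-2xxx (typical)": 240,
    "Samsung 840": 250,
    "SSD#3 (paper)": 256,
}
# Percentage of raw flash held back from the user:
overprov = {name: 100 * (raw_gb - gb) / raw_gb for name, gb in visible.items()}
for name, pct in overprov.items():
    print(f"{name}: {visible[name]}GB visible, {pct:.2f}% reserved")
# SSD#3 exposes the full 256GB -- no visible reserve at all, which is
# incompatible with SandForce's ~6% or Samsung's ~2% provisioning.
```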
The table also tells us that SSD#6, 7, 10, 15 contain power failure protection, which means that they are likely to be enterprise devices and not consumer devices – so I’ll probably ignore them.
As none of the devices are TLC, that rules out the Samsung 840 (which is TLC) from being part of this group. SLC devices are not commonly available to consumers, ruling out #5 and #14 from being of interest.
This leaves results potentially interesting for consumers in SSDs #1, 2, 4, 8, 9, 11, 12 and 13. All of these drives bar SSD#1 (death) are serialization offenders. These may be Sandforce devices – we already know they are quite advanced in wear levelling and background garbage collection – could this extend to lazy writing? But the consumer-level hard disk is also known to be a serialization offender, so is this really a big problem? It is also noted that SSD#2 is an older SSD from 2010.
We can also see that SSDs #4 and #13, #5 and #14, #6 and #10, #8 and #9, #11 and #12 are the same vendor and model – they are duplicates in their testing – and yet the observed results for these pairs are identical, which inspires confidence in their test regime.
Interestingly, when one examines the price per GB in their paper, many of those with power loss protection also turn out to be the cheaper SSDs on a per-GB basis. This might be because they were recent acquisitions, but it also makes for very unusual reading. SSD#7 and #15, despite power loss protection, still saw problems, which was surprising and was put down to possibly lazy FTL updates.
The observation of no flying writes was a good result, as it means we are unlikely to see data splattered all over an area causing widespread corruption.
I guess one has to be careful with the SSD they choose for their enclosures, but unfortunately, with no real identification of manufacturers, this provides very little purchasing guidance. It does, however, appear that serialization errors are the common failure mode for consumer-grade SSDs, once we exclude the drives which appear to be enterprise devices. I guess they may have tested more enterprise devices, but we can take heart from some of these examples that very few devices exhibit sudden death.
Unfortunately, tests on older devices can prove poorly representative of the current population, as consumer devices churn very quickly and new products supersede older ones.