Two weeks ago, a “friend” of mine turned up with an external hard drive that had been declared dead and wanted to see what could be done about recovering the data from it. Naturally, being no novice when it comes to tech, the first thing I did was grab the enclosure (a third-party one with a name engraved on it, hence not pictured), crack it open and take out the drive.
As it turns out, it’s one of the early Western Digital GreenPower series hard drives, a 500GB unit with the model code WD5000AAVS, fitted with the smallest 8MB cache available at the time. The serial number has been blanked out to avoid identification.
What actually happened to this drive is not known. Even if it were, most of the time users tend to claim that they did “nothing” to it and it just “broke”. Given its manufacture date in 2008, it’s well into its 8th year, well out of warranty and well past its design life of 3-5 years. It should really have been replaced.
A reader of this blog sent me a write blocker which would have been very handy for this recovery, but unfortunately, I didn’t even have the time to unpack it, document it, set it up and test it, so recovery proceeded with the drive taken out of its enclosure (to avoid enclosure related problems with faulty power supplies and bridge chips) and connected directly to the motherboard ports of my “recovery box” running Debian. I’ll get around to it one day …
The drive itself triggered no SMART warnings on boot-up, it spun up well and it was detected by the BIOS just fine. I set ddrescue on the case imaging the drive, and it promptly hit bad sectors and slowed to a crawl. As the data would be needed by a deadline, the recovery was left to run for as long as possible.
Once the image had been created, the remaining bad sectors were filled with a pattern, WinHex was used to analyze the image and carve out all of the files, grep was used to detect corrupted files within the recovered batch, and the data was returned – around 240GiB, consisting of everything on the drive at the time save for about 40 corrupted files. All in all, the standard procedure and not much noteworthy in that regard.
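The fill-and-detect step can be sketched in a few lines of Python: ddrescue’s mapfile records each unread extent as a “position size status” line, so those byte ranges in the image can be overwritten with a recognisable marker, and any carved file containing the marker must overlap an unreadable region. The marker string and file handling below are illustrative assumptions, not the exact pattern used in this recovery.

```python
# Sketch, assuming ddrescue's mapfile format: hex "pos size status"
# triples, where '+' marks blocks that were read successfully.
# MARKER and the file names are illustrative, not from this recovery.

MARKER = b"BAD!"  # pattern written into unreadable regions

def bad_extents(mapfile_text):
    """Yield (offset, size) for every block ddrescue could not read."""
    for line in mapfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split()
        # data lines have 3 fields; the status line has only 2
        if len(fields) == 3 and fields[2] != "+":
            yield int(fields[0], 0), int(fields[1], 0)

def fill_image(image_path, mapfile_text):
    """Overwrite unread regions of the image with a repeating marker."""
    with open(image_path, "r+b") as img:
        for offset, size in bad_extents(mapfile_text):
            img.seek(offset)
            pattern = (MARKER * (size // len(MARKER) + 1))[:size]
            img.write(pattern)

def is_corrupted(file_path):
    """A carved file containing the marker overlapped an unread sector."""
    with open(file_path, "rb") as f:
        return MARKER in f.read()
```

This is essentially what ddrescue’s own `--fill-mode` does, followed by a grep for the marker over the carved output.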
The drive itself was interesting. It was a very good patient. It was a very patient patient. When asked to read, it would dutifully try and try again until it timed out, or the data was returned. There was no need to interrupt power to the unit, it didn’t drop off the bus unexpectedly. You couldn’t have wished for a better drive to recover from.
The SMART data (abridged) collected after the recovery was interesting as well:
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

Model Family:     Western Digital Caviar Green
Device Model:     WDC WD5000AAVS-00ZTB0
Firmware Version: 01.01B01
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--  200   200   051    -    29746
  3 Spin_Up_Time            PO----  168   162   021    -    4566
  4 Start_Stop_Count        -O--CK  098   098   000    -    2689
  5 Reallocated_Sector_Ct   PO--CK  200   200   140    -    0
  7 Seek_Error_Rate         -OSR--  200   200   051    -    0
  9 Power_On_Hours          -O--CK  064   064   000    -    26948
 10 Spin_Retry_Count        -O--C-  100   100   051    -    0
 11 Calibration_Retry_Count -O--C-  100   253   051    -    0
 12 Power_Cycle_Count       -O--CK  100   100   000    -    58
192 Power-Off_Retract_Count -O--CK  200   200   000    -    34
193 Load_Cycle_Count        -O--CK  195   195   000    -    16697
194 Temperature_Celsius     -O---K  106   090   000    -    41
196 Reallocated_Event_Count -O--CK  200   200   000    -    0
197 Current_Pending_Sector  -O--C-  189   189   000    -    943
198 Offline_Uncorrectable   ----C-  200   200   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK  200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--  200   200   051    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
Even though the drive reported 29,746 read errors to the host, the Raw_Read_Error_Rate attribute never failed – its normalized value remained pinned at 200, far above the threshold of 51. Even though the drive has 943 sectors it knows it cannot read (pending sectors) that might need reallocation, that attribute plays no part in the health assessment because its threshold is zero. As a result, judged on threshold values alone, the drive believes itself to be healthy when it evidently is not. This is why it is important to pay attention to the SMART attribute raw values rather than just comparing normalized values against their thresholds.
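The gap between the two verdicts can be sketched directly: judge the attribute table once the way the firmware does (normalized value against threshold) and once the way a cautious human should (raw values of the error-related attributes). The rows below are taken from the output above; the list of attributes worth watching is my own judgment call, not a smartmontools convention.

```python
# Sketch: two health verdicts from the same SMART attribute table.
# Rows are (id, name, flags, value, worst, thresh, raw); the ones used
# here come from the smartctl output above.

# error-related attributes whose raw values deserve a close look
WATCH_RAW = {"Raw_Read_Error_Rate", "Current_Pending_Sector",
             "Reallocated_Sector_Ct", "Offline_Uncorrectable"}

def assess(table):
    """Return (threshold_verdict, raw_warnings) for a SMART table."""
    warnings = []
    healthy = True
    for _id, name, _flags, value, worst, thresh, raw in table:
        if thresh and value <= thresh:      # how SMART itself judges health
            healthy = False
        if name in WATCH_RAW and raw > 0:   # how a cautious human judges it
            warnings.append((name, raw))
    return healthy, warnings

rows = [
    (1,   "Raw_Read_Error_Rate",    "POSR--", 200, 200, 51,  29746),
    (5,   "Reallocated_Sector_Ct",  "PO--CK", 200, 200, 140, 0),
    (197, "Current_Pending_Sector", "-O--C-", 189, 189, 0,   943),
]
healthy, warns = assess(rows)
# healthy is True (thresholds all pass), yet warns flags 29746 read
# errors and 943 pending sectors - exactly the mismatch described above.
```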
Even though the drive has lots of issues reading some sectors, it never reallocated any of them. It suggests to me that something sudden may have happened to the drive resulting in its inability to read what was already written, with no writes to those areas occurring since the failure. This could indicate it had been toppled or dropped while running (as the unit has ramp load/unload, we don’t expect head damage when it is powered off as it is quite robust).
This is where things start to get a little interesting. After ddrescue had completed its first three phases and scraped all the blocks, only 2816kB of the drive remained unread. It wasn’t clear whether we would be able to recover all of the data in time, and since the drive has its own long-winded retry algorithms, I wasn’t expecting much. As some time remained before the deadline, I decided to leave it running to see just how much we could get back, up until the moment it was to be returned to its owner, periodically checking in and recording the “amount of data to go”.
The recovery pace seems to follow a piecewise function with two segments. The first is a pretty poor fit, but it has a fairly steep gradient compared to the latter: according to Excel’s fitting, about 105.05kB per day for the first 26 hours. This then quickly settled down to a different rate – only 6.13kB per day, with a fairly good fit.
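For anyone wanting to reproduce the fit without Excel, an ordinary least-squares slope per segment does the job. The (hours, kB recovered) samples below are made up for illustration – only the two-rate shape mirrors the real log.

```python
# Least-squares slope per segment, as a stand-in for the Excel fit.
# The sample points are invented; only the fast-then-slow shape
# matches the recovery described in the article.

def slope(points):
    """Ordinary least-squares slope of y against x."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

# illustrative log: fast recovery for ~26 hours, then a trickle
early = [(0, 0.0), (10, 44.0), (20, 87.0), (26, 114.0)]
late = [(26, 114.0), (50, 120.0), (100, 133.0), (150, 146.0)]

early_rate = slope(early) * 24   # kB/hour -> kB/day
late_rate = slope(late) * 24
```

With these invented samples the two slopes land near 105 and 6 kB/day respectively, in the same ballpark as the article’s figures.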
My personal expectation was to see something similar to a flipped exponential function. Initially, the rate of data recovery should be quite high, as the sectors are only getting their first of many retries, and those which are less damaged have a higher probability of returning their data. This would “tail off” as time progresses, leaving only the critically damaged (permanently unrecoverable) sectors as the “asymptote”.
In our case, I only had about 7 days to work with the drive and quite a bit of time was lost in phases 1-3, so we may not have reached the asymptotic part of the curve yet, implying that more data would have been read back if time permitted. However, the rate of data recovery is so slow that it is not really time economical to wait – you could easily type faster than 6kB in a day (which is approximately one character every 14.4 seconds).
During the recovery, I actually “declared” the recovery finished at the red point because no data had been recovered in about 18 hours. I copied off the image and log file for analysis on another machine while the recovery continued on the recovery box. This determination proved to be wrong, so I later used the ddrescue logs to update the recovery image with the newly recovered sectors (totalling 19kB), which helped save a few corrupted files.
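Back-porting those late arrivals can be sketched by diffing the two mapfiles: any extent marked finished (“+”) in the newer mapfile but not in the older one is copied from the live image into the working copy. This is a simplification – ddrescue can split or merge extents between snapshots, and ddrescuelog offers proper mapfile arithmetic – and all file names here are illustrative.

```python
# Sketch: copy newly recovered extents from the still-running recovery
# image into an earlier working copy, by diffing two ddrescue mapfiles.
# Assumes extents were not split/merged between the two snapshots.

def finished(mapfile_text):
    """Set of (offset, size) extents ddrescue has read successfully."""
    done = set()
    for line in mapfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines
        fields = line.split()
        if len(fields) == 3 and fields[2] == "+":
            done.add((int(fields[0], 0), int(fields[1], 0)))
    return done

def backport(live_image, working_copy, old_map, new_map):
    """Copy extents finished since old_map from live image into copy."""
    fresh = finished(new_map) - finished(old_map)
    with open(live_image, "rb") as src, open(working_copy, "r+b") as dst:
        for offset, size in sorted(fresh):
            src.seek(offset)
            dst.seek(offset)
            dst.write(src.read(size))
    return sorted(fresh)
```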
As this was still a long way from 0kB of remaining data to recover, I don’t think it would have ever been worth waiting for it to finish. I would have needed to wait another 426 days from the end assuming the trend still holds. Of course, such a trend is likely to be different depending on the drive, the cause of the failure to access the data, etc. It’s probably the case that the data was recorded fine, but the heads (and associated circuitry) may have been damaged just enough to reduce the read channel margin to the point that data read-back is intermittent. The right combination of conditions (e.g. fly height, temperature, atmospheric pressure, track misalignment/wobble, platter wobble) may come together “by pure chance” to create a positive readback once every n attempts.
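The 426-day figure is just the remaining unread data divided by the late-phase rate. Assuming roughly 2.6MB was still unread at the end – a figure inferred from the article’s own numbers (the initial 2816kB less what phase 4 clawed back), not measured directly – the arithmetic works out as follows.

```python
# Back-of-envelope check on the wait time: remaining unread data divided
# by the late-phase recovery rate, assuming the linear trend holds.
# The ~2.6 MB remaining is an inferred, illustrative figure.

LATE_RATE_KB_PER_DAY = 6.13

def days_to_finish(remaining_kb, rate_kb_per_day):
    """Days needed to clear the backlog at a constant recovery rate."""
    return remaining_kb / rate_kb_per_day

wait = days_to_finish(2600, LATE_RATE_KB_PER_DAY)  # over 400 days
```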
A Western Digital GreenPower drive came in, and proved to be a very well-behaved recovery patient. The majority of the drive’s data was returned, and the corrupted areas did not prove critical to parsing the filesystem and recovering most of the data that remained. A few files were corrupted, and detected as such, but this is much better than losing all of your data.
It does illustrate why having SMART enabled is not enough – the drive’s threshold-based health assessment may pass and appear to the system to be good, even while the drive is obviously having trouble. Paying keen attention to the SMART attribute raw values will help identify the signs of a problematic hard drive before SMART itself raises any flags.
The data recovery trends with retrying in phase 4 ddrescue recovery were interesting – a rapid portion spanning about 26 hours was followed by a much slower but consistent period of recovery. Such trends are likely to differ from drive to drive and case to case, but given the time-sensitive nature of many recovery tasks, having some idea of whether a significant amount of data is likely to be yielded or not in a timeframe could be useful in deciding when to throw in the towel because of diminishing returns.