Experiment: WD Green Data Recovery – Recovered Data vs Time?

Two weeks ago, a “friend” of mine turned up with an external hard drive which had been declared dead, wanting to see what could be done about recovering the data from it. Naturally, being no novice when it comes to tech, the first thing I did was grab the enclosure (a third-party one with an identifying name engraved on it, hence not pictured), crack it open and take out the drive.

[Photo: the drive removed from its enclosure]

As it turns out, it’s one of the early Western Digital GreenPower series hard drives: a 500GB unit with the model code WD5000AAVS, carrying the smallest 8MB cache available at the time. The serial number has been blanked out in the photo to avoid identification.

What actually happened to this drive is not known. Even if it were, most of the time users tend to claim that they did “nothing” to it and it just “broke”. Given its manufacture date in 2008, it’s well into its 8th year, well out of warranty and well past its design life of 3-5 years. It should really have been replaced.

Recovery

A reader of this blog sent me a write blocker which would have been very handy for this recovery, but unfortunately I didn’t even have the time to unpack it, document it, set it up and test it. Recovery therefore proceeded with the drive taken out of its enclosure (to avoid enclosure-related problems such as faulty power supplies and bridge chips) and connected directly to the motherboard SATA ports of my “recovery box” running Debian. I’ll get around to the write blocker one day …
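
For what it’s worth, even without a hardware write blocker, the block device can at least be flagged read-only in software before imaging. A minimal sketch (the device name /dev/sdX is hypothetical, substitute the drive as detected on your system):

# Ask the kernel to treat the device as read-only so stray writes are refused
blockdev --setro /dev/sdX
# Verify the flag has stuck (prints 1 when read-only)
blockdev --getro /dev/sdX

This only protects against accidental writes issued through the host OS, so it’s no substitute for a proper hardware write blocker, but it costs nothing to do.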

The drive itself triggered no SMART warnings on boot-up, it spun up well and it was detected by the BIOS just fine. I proceeded to get ddrescue on the case, imaging the drive, which promptly hit bad sectors and began taking forever. As the data would be needed by a deadline, the recovery was left to proceed for as long as possible.
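
For those curious, a typical invocation looks something like the following (device and file names here are hypothetical, and this is a sketch of the approach rather than the exact commands used):

# First pass: grab everything that reads easily, skipping the slow scraping
# of bad areas for now (-d = direct disc access, -n = no scrape)
ddrescue -d -n /dev/sdX wd5000aavs.img wd5000aavs.log
# Follow-up run: scrape the bad areas and retry each failed sector up to
# three times; the retry count can be raised (or set to -1 for unlimited
# passes) if time permits
ddrescue -d -r3 /dev/sdX wd5000aavs.img wd5000aavs.log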

Once the image had been created, the remaining bad sectors were filled with a pattern, WinHex was used to analyze the image and carve out all of the files, and grep was used to detect corrupted files within the recovered batch. The data was then returned: around 240GiB, consisting of everything on the drive at the time save for about 40 corrupted files. All in all, standard procedure and not much noteworthy in that regard.
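
In case the fill-and-grep step is unfamiliar, the idea is roughly as follows (file and directory names are hypothetical, and this is a sketch rather than the exact commands used). ddrescue’s fill mode overwrites the blocks still marked bad in the log/map file with a marker pattern, so any carved file that contains the marker must have overlapped an unreadable sector:

# Fill the remaining bad-sector blocks in the image with a recognisable
# marker string ('-' selects bad-sector blocks; the pattern in marker.bin
# is repeated as needed to cover each block)
printf 'xxBADSECTORxx ' > marker.bin
ddrescue --fill-mode=- marker.bin wd5000aavs.img wd5000aavs.log
# After carving files out of the image, list every file containing the
# marker, i.e. every file touched by an unreadable sector
grep -rl 'xxBADSECTORxx' carved_files/ > corrupted_list.txt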

The drive itself was interesting. It was a very good patient. It was a very patient patient. When asked to read, it would dutifully try and try again until it timed out, or the data was returned. There was no need to interrupt power to the unit, it didn’t drop off the bus unexpectedly. You couldn’t have wished for a better drive to recover from.

The SMART data (abridged) collected after the recovery was interesting as well:

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

Model Family:     Western Digital Caviar Green
Device Model:     WDC WD5000AAVS-00ZTB0
Firmware Version: 01.01B01
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   200   200   051    -    29746
  3 Spin_Up_Time            PO----   168   162   021    -    4566
  4 Start_Stop_Count        -O--CK   098   098   000    -    2689
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR--   200   200   051    -    0
  9 Power_On_Hours          -O--CK   064   064   000    -    26948
 10 Spin_Retry_Count        -O--C-   100   100   051    -    0
 11 Calibration_Retry_Count -O--C-   100   253   051    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    58
192 Power-Off_Retract_Count -O--CK   200   200   000    -    34
193 Load_Cycle_Count        -O--CK   195   195   000    -    16697
194 Temperature_Celsius     -O---K   106   090   000    -    41
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--C-   189   189   000    -    943
198 Offline_Uncorrectable   ----C-   200   200   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   051    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

Even though the drive reported 29,746 read errors to the host, it never failed that SMART attribute, whose normalized value remained static at 200. Even though the drive knows of 943 sectors it cannot read (pending sectors) which might need reallocation, that attribute is not used for the health assessment as it has a threshold of zero. As a result, based on the threshold values alone, the drive believes itself to be healthy when it evidently is not. This is why it is important to pay attention to the SMART attribute raw values rather than just comparing the normalized values against their thresholds.
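
As a quick illustration of how cheap such a check is (device name hypothetical, and assuming the usual smartctl attribute table layout where the raw value is the last column), something like this flags the problem that the threshold-based assessment misses:

# Print the attribute table and warn on a non-zero Current_Pending_Sector
# raw value, which the normalized-value-versus-threshold check never trips on
smartctl -A /dev/sdX | awk '$2 == "Current_Pending_Sector" && $NF+0 > 0 {
    print "WARNING: " $NF " sectors pending reallocation"
}'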

Even though the drive has lots of trouble reading some sectors, it never reallocated any of them. This suggests to me that something sudden may have happened to the drive, leaving it unable to read what was already written, with no writes to those areas occurring since the failure. It could indicate the drive had been toppled or dropped while running (as the unit has ramp load/unload, head damage while powered off is not expected, since the parked heads are quite robust).

Trends

This is where things start to get a little interesting. From the outset, after ddrescue had completed its first three phases and scraped all the blocks, only 2816kB of the drive remained unread. It wasn’t clear whether we would be able to recover all of the data in time, and as the drive already runs its own long-winded internal retry algorithms, I wasn’t expecting much more. Since there was still time before the deadline, I decided to leave it running in the retry phase to see just how much we could get back before the drive had to be returned to its owner, periodically checking in and recording the “amount of data to go”.
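
The “amount of data to go” figures were recorded by periodically checking in; something like the following could automate that record-keeping (file names hypothetical, using the ddrescuelog utility that ships with ddrescue):

# Append a timestamped snapshot of the log/map file status (rescued size,
# bad areas, etc.) every hour while the retries grind away
while true; do
    { date; ddrescuelog -t wd5000aavs.log; echo; } >> remaining.log
    sleep 3600
done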

[Graph: phase 4 recovery – remaining data to recover vs time]

The recovery pace seems to follow a piecewise function with two segments. The first segment is a pretty poor fit, but it has a much steeper gradient than the second: according to Excel’s fitting, about 105.05kB per day for the first 26 hours. The rate then quickly settled down to only 6.13kB per day, with a fairly good fit.

My personal expectation was to see something resembling a flipped exponential (decay) curve. Initially, the rate of data recovery should be quite high, as the sectors are only getting their first of many retries and those which are less damaged have a higher probability of returning their data. This would “tail off” as time progresses, leaving only the critically damaged (permanently unrecoverable) sectors as the “asymptote”.

In our case, I only had about 7 days to work with the drive and quite a bit of that time was lost in phases 1-3, so we may not have reached the asymptotic part of the curve yet, implying that more data would have been read back if time permitted. However, the rate of data recovery is so slow that it is not really time-economical to wait: you could easily type faster than 6kB a day, which works out to approximately one character every 14.4 seconds.
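
For the curious, the arithmetic behind that figure is simply the length of a day divided by the (rounded) daily yield in bytes:

# Seconds per recovered byte at roughly 6,000 bytes per day
echo "scale=1; 86400 / 6000" | bc    # prints 14.4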

During the recovery, I actually “declared” the recovery to be over at the red point, because no data had been recovered in about 18 hours. I copied off the image and log file for analysis on another machine while the recovery continued on the recovery box. This determination proved to be wrong, so I later used the ddrescue logs to update the copied image with the new sectors that had since been recovered (totalling 19kB), which helped save a few corrupted files.
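
One way to do this kind of update without re-imaging from scratch is ddrescue’s domain log option (-m), which restricts a copy to the blocks marked finished in a given log/map file, allowing the live image to be merged into the earlier copy. A sketch with hypothetical file names (not necessarily the exact method used here):

# Copy the blocks marked finished in the newer log from the still-updating
# image into the earlier copy; blocks recovered before the copy was taken
# are simply re-copied with identical data, while the newly recovered
# sectors overwrite the stale data in the old copy
ddrescue -m newer.log live_image.img earlier_copy.img update.log

ddrescuelog’s map-combining options could trim this down to copying only the blocks that changed, but re-copying the whole finished area is the simpler (if slower) route.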

As this was still a long way from 0kB of remaining data to recover, I don’t think it would have ever been worth waiting for it to finish. I would have needed to wait another 426 days from the end assuming the trend still holds. Of course, such a trend is likely to be different depending on the drive, the cause of the failure to access the data, etc. It’s probably the case that the data was recorded fine, but the heads (and associated circuitry) may have been damaged just enough to reduce the read channel margin to the point that data read-back is intermittent. The right combination of conditions (e.g. fly height, temperature, atmospheric pressure, track misalignment/wobble, platter wobble) may come together “by pure chance” to create a positive readback once every n attempts.

Conclusion

A Western Digital GreenPower drive came in and proved to be a very well-behaved recovery patient. The majority of the drive’s data was returned, and the corrupted areas did not prove critical to parsing most of the filesystem and recovering most of the data that remained. A few files were corrupted, but these were detected; that is much better than losing all of your data.

It also shows why merely having SMART enabled is not enough – many times the drive’s threshold-based health assessment will still pass and appear good to the system, even though the drive is obviously having trouble. Paying keen attention to the SMART attribute raw data will help identify the signs of a problematic hard drive before SMART raises any flags.

The data recovery trend during the retrying in phase 4 of the ddrescue recovery was interesting – a rapid portion spanning about 26 hours was followed by a much slower but consistent rate of recovery. Such trends are likely to differ from drive to drive and case to case, but given the time-sensitive nature of many recovery tasks, having some idea of whether a significant amount of data is likely to be yielded within a given timeframe could be useful in deciding when to throw in the towel because of diminishing returns.

5 Responses to Experiment: WD Green Data Recovery – Recovered Data vs Time?

  1. matson says:

    The WWN is unique. Many treat it as equivalent to another serial number, and therefore decide to suppress/withhold _both_, never just one or the other.

    (In this specimen you will find the WWN printed on the label, to the left of the spindle, below the LBA count.)

    Season’s greetings!

    • lui_gough says:

      Indeed, the world-wide number is unique and I took some care to remove it from the smartctl output. But obviously I was too busy and didn’t notice it on the label. How silly of me. D’oh!

      Thanks for letting me know – should be fixed now on the server, although browsers may cache the old version of the image for a few days …

      – Gough

  2. rasz_pl says:

    At that point (having recovered as much as possible in the limited time frame), it might be worth experimenting with temperature: heat the top of the drive with a hair dryer to ~50°C and see if it’s able to read more, then try again after cooling it (slowly) down to 5°C.

    • lui_gough says:

      I suppose that’s a good idea if you have nothing to lose – after all, the larger the range of conditions the drive is exposed to (within its operational limits), the greater the potential for more data to be recovered. Heating is somewhat safe, although cooling would always risk condensation build-up. Unfortunately, as it goes, this “patient” had to be returned to the client, so there wasn’t any chance for further experiments.

      – Gough

      • sparcie says:

        I used to do data recoveries like this using an air conditioner set as low as possible and leaving the drive in its air path to get it as cool as possible. The humidity is reduced by the air con, and the drive doesn’t get very cold, so condensation is only a problem if you then take the drive somewhere humid enough for condensation to form. I’ve also used heatsinks and fans placed on a drive to increase heat dissipation when overheating is a problem.

        Cooling a drive can help with some mechanical and electrical problems and keep a drive recovering for longer when it isn’t as well behaved. I’ve not tried heating one; they normally heat themselves up enough.

        I’ve heard of some people wrapping a drive in plastic and putting it in the freezer, but that’s a recipe for condensation when it’s removed, which isn’t such a problem for the electronics, but is a problem for the mechanical internals (which do share the external air).

        As for failure, I’ve noticed hot weather and storm season are two of the biggest periods for computer failures in general, so it’s not necessarily the user’s fault. A couple of my own drives have come up with SMART errors of concern that caused me to back up my data (before it became a problem), and each time it happened at the beginning of summer.

        After I got my data back, I wrote zeros to the entire drive (using the WD diagnostic tool, which is about as close to a low-level format as you can get) and that seemed to fix the marginal sectors (without any reallocation). This allowed me to get a little more use out of the drives, which didn’t misbehave or fail afterwards. I speculate that in my case the reason for the sectors being marginal was thermal expansion due to the higher average temperature rather than an actual problem with the media.

        Anyway, it’s interesting to read about others’ struggles with data recovery. It’s been a while since I’ve had to do it myself, fortunately.

        Sparcie
