Moments to Disaster: Hard Drive Failure & SMART Data

While I was away on holiday, blissfully enjoying my time outdoors, I wasn’t aware that a disaster was brewing at home. I got a message from my brother that his machine had started making a clicking noise, and it was already too late. I was sure that a hard drive had failed.

Luckily for him, the machine I built for him had an SSD for a boot drive, so after a bit of waiting, the unit still booted and continued to work. However, whatever was stored on the hard drive was inaccessible.

Attempted Rescue

When I came back from holidays, I tried to rescue it. The drive in question was a re-certified Samsung HD502IJ drive from my uni salvage spree back in 2015. The particular unit was serial number S1W3J9ASB00270, received with 21,886 hours on the clock, and it had passed commissioning tests just fine.

Upon pulling the drive and sticking it in my recovery cradle, I confirmed that the drive spun up just fine, but it was failing to “find” its location and read out the firmware from the disk. I pulled its serial-number mate S1W3J9ASB00276 from my spare pool and swapped the PCBs over. No change in behaviour. Click-click-click-click … and eventually, spin-down.

At this stage, I was pretty sure recovery would be quite far-fetched. Professional data recovery would have been a potential option, but it would probably be quite expensive, and I didn’t feel it would be worth it to send the drive away. However, I was determined not to give up, and rather than waste my time taking photos of things, I just got on with it and cracked open both drives and “ghetto” swapped the whole head-stack assembly from one drive to the other. It was a long-shot, especially without all the fancy tools a proper lab might have, but I thought it was better than nothing.

After powering up, the drive exhibited no change in its behaviour. It still couldn’t read. Worse still, transplanting the heads back into the original drive resulted in the original drive failing in a very similar way. Now I had two dead drives and nothing to show for it. At least they didn’t cost me anything in the first place, except for some of my brother’s data. I did have a back-up from before I left for holidays, but apparently quite a bit of additional data had been added to the drive since then … unfortunately, I can’t do anything about that.

Declared Dead – Here’s the Autopsy

At this stage, I declared both drives dead, and proceeded to dismantle them into component parts for disposal (as I usually do).

The Samsung HD502IJ is a 7200rpm PMR drive built on 334GB-per-platter technology, with fly-on-demand (FOD) control and a rotational vibration sensor. It uses old-fashioned contact start-stop instead of the newer ramp-loading technology for the heads.

When I removed the platters, of which there are two in this 500GB drive, I noticed some unusual damage.

It’s not the first time I’ve taken apart a drive and noticed the platter surfaces “ground down” in concentric rings. Notice how both the landing-zone section and the mid-platter region show significant wear on the three active surfaces, although the degree of wear varies. The unused surface had no head flying over it, and thus, without contact, it suffered no wear.

A closer look at the broadest wear band shows an interesting striated pattern, almost like the washboarding of dirt roads.

The width of the wear band differs from surface to surface, and the edges can be irregular. In the case of the lightest wear, it was a thin “scratch” which was impossible to get into good focus.

The Drive’s “Medical Records”

As I wasn’t home to watch over the drive as it failed, I relied on my installation of CrystalDiskInfo (CDI) to record the drive’s information. As with most of my systems with actively spinning drives, CDI runs in the background, collecting health status data every 10 minutes and writing it to a set of CSV files (one per SMART attribute) stored in the smart sub-folder inside the installation directory.
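
For the curious, here’s a minimal sketch of how these logs could be pulled in for analysis. The path, file naming and two-column (timestamp, value) layout are assumptions on my part, as the exact layout may differ between CDI versions and drives, so adjust to suit:

```python
# Minimal sketch: load CrystalDiskInfo's per-attribute CSV logs for one drive.
# ASSUMPTIONS: the folder path, one-CSV-per-attribute naming, and a simple
# two-column (timestamp, value) layout. Check against your own installation.
from pathlib import Path
import pandas as pd

SMART_DIR = Path(r"C:\Program Files\CrystalDiskInfo\Smart")  # example path

def load_attribute_logs(drive_folder: Path) -> dict[str, pd.DataFrame]:
    """Read every attribute CSV in a drive's log folder into DataFrames."""
    logs = {}
    for csv_file in sorted(drive_folder.glob("*.csv")):
        df = pd.read_csv(csv_file, header=None, names=["time", "value"],
                         parse_dates=["time"])
        logs[csv_file.stem] = df  # keyed by the attribute file name
    return logs
```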

There have been a number of arguments that SMART is absolutely hopeless at predicting drive failure. However, there is a possibility that this viewpoint is biased towards those who have suffered unexpected failures “out of the blue”. Was this failure one of those cases?

Upon examining the SMART data, I plotted only the variables which changed over time in the lead-up to the event, with the exception of temperature. CDI only records the raw data for some attributes, and the SMART normalized value for others, so I can only work with what is available.
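
Continuing the loading sketch above, selecting just the attributes that moved (and plotting them) might look something like the following. The exclusion of temperature mirrors what I did, and the attribute names depend on CDI’s file naming:

```python
# Sketch: plot only the attributes whose logged value actually changed,
# excluding temperature as described in the text.
import matplotlib.pyplot as plt

def plot_changing_attributes(logs, exclude=("Temperature",)):
    fig, ax = plt.subplots(figsize=(10, 6))
    for name, df in logs.items():
        if name in exclude or df["value"].nunique() <= 1:
            continue  # skip excluded attributes and flat (unchanging) lines
        ax.plot(df["time"], df["value"], label=name)
    ax.set_xlabel("Date")
    ax.set_ylabel("Logged value (raw or normalized, as recorded by CDI)")
    ax.legend()
    plt.show()
```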

Based on the power-on hours count, the drive ran for about 474 hours in its new home (although the last week of data does not appear to have been recorded for this variable, something I’ve seen happen in the past). The time from commissioning into the “new” system for my brother to failure was 146.3 days, an average of only about 3.2 power-on hours per day. I left for my holidays around the 27th of June, which corresponds to the “flat” section in the curve, as my brother was also on holidays at the time and the system sat unused.

The “vital signs” show that on the 7th of September, the drive did show a sign of potential impending failure, with a sky-rocketing pending sector count. In general, any non-zero pending sector count causes me to investigate a drive, as it’s a sign the drive knows of sectors on the surface it cannot read. Two weeks later, it finally had its first reallocation event of one sector. The drive lived almost three weeks, or about 85 power-on hours, between the first sign of failure and total failure, which means that SMART did give some warning. Whether acting on the first warning would have been sufficient to rescue most of the data, or whether failure was already assured at that stage, we cannot determine for sure.
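
The rule I follow can even be automated. Below is a hedged sketch using smartmontools’ smartctl to flag any non-zero pending sector count; the device path is only an example, and the raw-value parsing assumes the usual plain-integer format:

```python
# Sketch: flag any non-zero Current Pending Sector count (attribute 197/0xC5)
# by parsing the attribute table printed by `smartctl -A`.
import subprocess

def pending_sectors(device="/dev/sda"):  # device path is just an example
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    # Note: no check=True, as smartctl uses bit-mask exit codes and a
    # non-zero status is not necessarily a hard error.
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "197":
            return int(fields[9])  # RAW_VALUE is the last column
    return 0

if pending_sectors() > 0:
    print("Pending sectors detected: back up and investigate this drive!")
```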

The SMART normalized values for Raw and Soft Read-Error Rate both degraded slightly, but not sufficiently to trip any alarms. This does, however, correlate with the increasing difficulties experienced by the drive.

Interestingly, the Spin-Up Time, a parameter which is normally an indication of mechanical problems, showed increased variance about four days after the first sign of failure. The variance trends towards lower SMART normalized values, indicating decreasing health. This suggests that roughness may have built up in the landing-zone area so much as to oppose the motor “ramping up to speed”, delaying “take-off” and also increasing wear due to the prolonged period of contact with the surface.
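
As a rough illustration, this kind of variance could be spotted programmatically with a rolling standard deviation over the logged normalized values. The window size and trip factor below are arbitrary illustrative choices, not tuned values:

```python
# Sketch: flag increased variance in the normalized Spin-Up Time log using
# a rolling standard deviation. Window and factor are illustrative only.
def spin_up_variance_alert(df, window=144, factor=3.0):
    """df has 'time' and 'value' columns; 144 samples is roughly one day
    at CDI's 10-minute logging interval."""
    rolling_std = df["value"].rolling(window).std()
    # Take the early part of the log as the "healthy" baseline.
    baseline = rolling_std.dropna().iloc[:window].mean()
    return df["time"][rolling_std > factor * baseline]  # alert timestamps
```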

If only someone had noticed the alerts that CDI would have popped up and taken appropriate action, as there appears to have been a window where at least partial recovery may have been possible.

However, it seems that SMART on its own may not have sounded any alarms without “third party” monitoring software installed, as the trip points are set quite conservatively, and the number of pending sectors and the spin-up variations would not have tripped them. In fact, looking at the SMART parameter listing from another drive, it seems that of the attributes that showed change, 01 Raw Read-Error Rate would have to fall to 51 or below to trigger an error, and it only fell to 97. Likewise, attribute 03 Spin-Up Time has a threshold of 11, and the worst value recorded was about 33, thus no warning would have been issued. The attributes for Current Pending Sector Count, Reallocation Event Count and Reallocated Sectors Count are all set to a threshold of 0, meaning they can never trigger a SMART alarm despite being among the most indicative attributes of impending failure.
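
To make the threshold logic concrete, here’s a minimal sketch using the values quoted above. An alarm only trips when the normalized value falls to or below a non-zero threshold; the worst value shown for the pending-sector attribute is a placeholder, since with a threshold of 0 its exact value cannot matter:

```python
# Sketch: how manufacturer-set SMART thresholds are evaluated. An attribute
# "fails" only when its normalized value drops to or below a non-zero
# threshold. Values are those quoted in the text for this drive, except
# the pending-sector worst value, which is a placeholder.
attributes = {
    # name: (worst normalized value seen, manufacturer threshold)
    "01 Raw Read-Error Rate":    (97, 51),
    "03 Spin-Up Time":           (33, 11),
    "C5 Current Pending Sector": (100, 0),  # threshold 0 can never trip
}

for name, (worst, thresh) in attributes.items():
    tripped = thresh > 0 and worst <= thresh
    print(f"{name}: worst={worst}, threshold={thresh} -> "
          f"{'ALARM' if tripped else 'no alarm'}")
```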

As a result, there seems to be some merit to the viewpoint that SMART is pretty useless. However, if you interpret the SMART data more closely (rather than blindly following manufacturer-set thresholds), the drive does provide some prior warning of its unhealthy status.

Cause of Death?

Given all of the data, pinning down the cause of failure is almost impossible. I’m no hard drive expert, but I have a few hypotheses.

Modern hard drives have operational head-flying clearances in the single- to double-digit nanometre range. That’s a gap far smaller than the wavelength of visible light. This impressive feat is achieved by a mixture of technologies, including a head heater for active fly-height control. The heater changes the position of the head “on demand”, so as to bring it close to the disk when needed for reads and writes. With such small clearances, occasional contact is inevitable due to embedded surface defects from manufacturing and particles in the chamber.

When a head makes contact with a defect or contamination on the disk, this is traditionally known as a head crash and can cause damage to the disk surface and the head. However, there is a more technical term for this: a thermal asperity. This is because the head heats up as a result of hitting a defect, which has a secondary effect of shifting the baseline of the analog voltage produced by the GMR heads, making data recovery difficult. It appears that it’s probably impossible to make perfect platters, so having some thermal asperities is something we need to cope with in real life.

As a result, the platter is coated with a diamond-like coating and a lubricant layer to minimise the damage from a head-to-disk contact. The sliders and head components are also made of resilient materials. Unfortunately, even with this, it is possible to damage the head slider through such events, resulting in scoring of the slider surface, with signal changes and potential fly-height control difficulties as the aerodynamics of the slider are affected. The shape of a defect has an important bearing on how damaging it is: the smallest defects are “mowed down” by the head without damage, and rounded defects result in a deviation in fly height but no damage. The defects that sit in the middle are the potentially dangerous ones.

As a result, I think the cause of failure was probably a platter defect in a specific location. Perhaps this already existed at the time of platter manufacturing, or was caused by mechanical shock while in operation or with the heads not parked in the landing zone. If not, another cause could potentially be a very unusual data access pattern (unlikely, given the low duty cycle of “home” storage use) resulting in the head dwelling over a track for prolonged periods, causing localized lubricant loss and reduced fly heights.

This may have very slightly damaged the slider every time the location was used, but not enough to render the drive critically damaged. Over time, possibly owing to where the data was stored and accessed in our use, the drive probably did rest its heads over the defective area and accumulated damage to its sliders. This, in turn, reciprocally damaged the platter surface.

Once damage was done to the platters, even microscopically, it seems quite likely that it would have rapidly accelerated, as the defects affect the flying performance of the head-slider assembly, which becomes more likely to make contact with the surface, further damaging both head and platter. The stiffness of the suspension may have something to do with the “washboarding” patterns seen, as there could have been some “resonance” occurring.

The fact that damage was coincident on the three recording surfaces, to varying degrees, suggests the possibility of simultaneous damage (external shock, for example), embedded defects in similar locations on all surfaces (unlikely) or cross-coupled damage. I suspect the latter could happen if, say, any “resonance” were excited; the vibrations could be transmitted up the whole head-stack and cause the other heads to fly in an oscillatory manner, also damaging their surfaces through an increased probability of contact.

However, as I’m no expert on the tribology of hard drive mechanics, I’ll leave all of the above as just a hypothesis. That said, Youyi Fu’s PhD thesis (2016) on the Tribological Study of Contact Interfaces in Hard Disk Drives, done at UC San Diego, was quite an interesting read, especially for someone who is not an expert in the area.

Conclusion

Unfortunately, data was lost in this circumstance, but it seems extremely unlikely that any affordable service would have been able to restore it after the fact. The damage to the platters was extensive, and the cause is not conclusively determined. However, it seems that SMART did provide some evidence of the drive’s unhealthy state prior to its total failure (about three weeks or 85 power-on hours, based on the recorded data), although whether this is enough warning to make a successful or mostly-successful recovery is not known. If assessing SMART based on manufacturer-provided thresholds, this drive does not appear to have tripped any of them, and thus would not have triggered warnings from the BIOS or the “inbuilt” features of most operating systems. Only more detailed third-party software, with a keen eye on the SMART raw values themselves, would have uncovered the developing issues.

