The holiday season is a great time to relax and take some time off. That is, until something goes wrong and you have to attend to it. This was one of those situations where things could have gone a lot worse.
The Call – The Situation
Just before Christmas, bad signs began to brew in the HP Microserver N36L that runs the services at my other house. I began to get some SMART alerts that the boot drive, a Samsung HD204UI, was beginning to show signs of potential failure.
It was no big surprise. In fact, that server was a bit of an experiment to see just how long these mechanical drives would last. The N36L was purchased in 2011 (seven years old) and replaced a floodwater-damaged server. The Samsung HD204UI migrated into the N36L along with two other drives: the “brother” HD204UI that was bought at the same time but used elsewhere, and a Western Digital WD30EZRX 3TB drive, which was incidentally my first 3TB-class drive. All have been running 24/7 nearly non-stop, which is not easy for mechanical drives.
The drive that was in trouble was the younger HD204UI, having 49,257 hours (or about 5.6 years of operation) on the clock when pending sectors began to appear. Pending sectors are sectors that could not be read back but haven’t been reallocated yet. My first step was to try to read the whole drive back to see whether they would clear themselves through an automatic drive rewrite.
When discovered, the pending count was five. Attempting to scrub the drive provoked the pending sector count to climb to 19, before self-rectifying back to two. This seems to indicate the drive isn’t in good health – any more “remote” interventions could make things worse.
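Keeping an eye on the pending count is easier with a small script. The following is a minimal sketch that pulls raw values out of an attribute table in the column layout that `smartctl -A` prints. The sample text is illustrative only: apart from the raw pending count of five, the values are placeholders and not the drive's actual readout.

```python
# Illustrative excerpt in the column layout of `smartctl -A`.
# All values are placeholders except the raw pending count of 5.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       5
"""

def raw_values(table):
    """Map attribute ID -> raw value for every attribute row in the table."""
    result = {}
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # the header row starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            result[int(fields[0])] = int(fields[9])
    return result

values = raw_values(SAMPLE)
print("pending:", values[197], "reallocated:", values[5])
```

Logging these two numbers on a schedule makes a climb such as five to 19 obvious without having to read the full table each time.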
Of note is that its brother drive has 65,559 hours or 7.5 years of operation on the clock and still reads relatively healthy. It’s now on failure-watch as well, as I don’t expect it to have too much longer to live.
The other drive in the box has 51,927 hours on the clock, or 5.9 years of service, making all of the drives in the box longer-lived than I had ever expected. In fact, the whole computer itself seems to have lived longer than expected.
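For reference, the hours-to-years figures quoted for the three drives can be reproduced with a few lines of arithmetic. This sketch assumes a year of 365.25 days (8,766 hours); the dictionary labels are only descriptive.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours, assuming 365.25 days per year

# Power-on hours as quoted in the text.
power_on_hours = {
    "HD204UI (boot drive with pending sectors)": 49_257,
    "HD204UI (brother drive)": 65_559,
    "Western Digital 3TB drive": 51_927,
}

def hours_to_years(hours):
    """Convert power-on hours to years of continuous operation."""
    return hours / HOURS_PER_YEAR

for name, hours in power_on_hours.items():
    print(f"{name}: {hours:,} h = {hours_to_years(hours):.1f} years")
```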
A Little Ingenuity?
I decided to attend to the drive when I went over to the house on a visit to attend to other things. For the data migration, my first thought was to bring a larger spare drive that I could image to. Unfortunately, as the machine is strictly BIOS boot, it doesn’t support boot drives larger than 2TB, so the spare had to be smaller instead. While I could have upgraded it to an SSD, I saw no reason to, as the machine was infrequently booted (mostly idling/serving) and the amount of data was just over 600GB. I found a relatively “young” Seagate 7200.12 1TB hard drive and decided to have that step in, even though both drives were manufactured in 2010.
To do the transfer, I brought along my laptop and two USB 3.0 to SATA bridge boards, hoping to make a straight ddrescue clone. Unfortunately, I didn’t check and was caught out by the power supplies. I had brought supplies with a 2.1mm tip, but the bridge boards had 2.5mm pins, so now I was on location without the right tools.
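The 2TB ceiling is most likely down to the MBR partition table that BIOS booting relies on, which stores sector addresses as 32-bit numbers. That is an assumption on my part rather than something confirmed for this machine, as is the 512-byte sector size used below. Under those assumptions, the sizing works out like this:

```python
SECTOR_BYTES = 512      # assumed logical sector size
MAX_SECTORS = 2 ** 32   # 32-bit sector addresses in an MBR partition table

mbr_limit_bytes = SECTOR_BYTES * MAX_SECTORS

GB = 10 ** 9
TB = 10 ** 12

replacement_drive = 1 * TB   # Seagate 7200.12
shrunk_partition = 700 * GB  # size the partitions were shrunk to
data_in_use = 600 * GB       # "just over 600GB", rounded down here

print(f"MBR limit: {mbr_limit_bytes / TB:.2f} TB")
print("1TB drive bootable:", replacement_drive <= mbr_limit_bytes)
print("3TB drive bootable:", 3 * TB <= mbr_limit_bytes)
print("Shrunk partition fits:", data_in_use < shrunk_partition <= replacement_drive)
```

So a 1TB drive is comfortably under the limit, while a 3TB drive would not be usable as the boot drive.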
I then thought I would use the existing Microserver to do the recovery and clone. But I had only a monitor and mouse attached to the unit – no keyboard, so no access to BIOS, etc.
There was nothing too special about the offending drive, but I was keen to take it out of service as it seemed to be physically related to the Samsung HD501IJ which failed in my brother’s computer without much warning, taking some data with it.
Using a little daring ingenuity, I shoved my bootable Ubuntu USB key into the internal USB port and removed all hard drives from the internal non-hot-swap rack, forcing it to boot using the USB key. As soon as the BIOS finished detecting the key, I hotplugged the drives (against manufacturers’ claims that it was a non-hot-plug bay) so that they would be detected. From there, I would use the accessibility on-screen keyboard in Ubuntu to enter commands. Unfortunately, I found the special keys to be missing and the shift key not to work, which made working with the terminal slightly more of a chore.
The first hurdle was that there was no ddrescue, so I had to install gddrescue. To do that, I had to add the universe repository first and download it over the internet.
After a bit of scrubbing by ddrescue, no data was lost and a clone was made. I then used gparted to shrink the partitions to 700GB, then cloned that using dd across to the 1TB drive, then re-expanded the partition to fill the drive.
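For anyone repeating this, a common way to drive ddrescue is in two passes: a quick first pass that skips the slow scraping of bad areas, then a second pass that retries the remaining bad areas with direct disc access. The sketch below only builds and prints the command lines; it does not run anything. The device names and map file name are hypothetical placeholders, and this is not necessarily the exact invocation I typed on the on-screen keyboard.

```python
import shlex

def ddrescue_passes(source, target, mapfile):
    """Return argv lists for a two-pass ddrescue clone (nothing is executed)."""
    # Pass 1: -f allows overwriting the target device, -n skips the scraping phase.
    first = ["ddrescue", "-f", "-n", source, target, mapfile]
    # Pass 2: -d uses direct disc access, -r3 retries bad areas up to three times.
    second = ["ddrescue", "-f", "-d", "-r3", source, target, mapfile]
    return [first, second]

# Hypothetical device names: double-check which disk is which before running.
commands = ddrescue_passes("/dev/sdX", "/dev/sdY", "rescue.map")
for argv in commands:
    print(shlex.join(argv))
```

The map file is what lets the second pass pick up where the first left off, so it should live somewhere other than the two drives involved.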
After a chkdsk on reboot, the unit came back to life on the Seagate hard disk, but there was still some tidying up that needed to be done.
Hard Drive Firmware Upgrade
The Samsung HD204UI itself had a firmware upgrade due to a data corruption bug that could occur if an ATA Identify command was received during a write operation, and I remember applying it to both drives.
But rather interestingly, the spare Seagate 7200.12 1TB hard drive that I had chosen to replace it with also needed a firmware upgrade, from CC38 to CC49.
Luckily, the firmware update utility can be executed in Windows and reboots into its own Linux-based environment to perform the update.
The update was performed successfully – indeed, it’s the first time I’ve ever done it on a Seagate drive, so I was glad it went smoothly. The drive’s vital signs look perfectly fine.
Analysing the SMART Data & Rehabilitation
This naturally led me to wonder whether the drive was really cactus, or whether this was a sign that there was a transient issue (e.g. power loss, interference, environmental causes) that could have caused the drive to become unreliable.
I decided to challenge the drive through a USB 3.0 cradle, and it was capable of a full erase and verify with random fill. This reset the SMART statistics with no pending sectors and no reallocations, suggesting that the fault was “soft” and transient.
But given the evidence from before that the drive continued to grow pending sectors, I suspect the drive may have issues nonetheless and it was good to take it out of critical service. To better understand this, I delved into the recorded SMART data database.
From the graph, we see that the drive entered server service at about 15,000 hours and has been rebooted relatively infrequently in that time, operating continuously.
Looking at the other vital signs that have changed since then, we see that the drive is subject to seasonal temperature changes, but mostly runs between 20 and 45 degrees Celsius. The Spin-Up time seems to vary up and down as if that is normal, whereas the G Sense Error Rate condition seems to fall (perhaps indicating increasing vibration or knocks to the server). An unknown parameter, B5, was also showing declines in value to 96, whereas its older brother still has a B5 value of 99. I’m not sure what that parameter is measuring, but its decline does suggest the drive may have suffered some wear.
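As an aside, B5 is the attribute ID in hexadecimal. Tools such as smartctl list attributes by decimal ID, so converting is handy when trying to look an unknown parameter up. A quick sketch, with the other IDs included for comparison:

```python
# SMART attribute IDs in hexadecimal, as some monitoring tools display them.
hex_ids = {
    "Spin-Up Time": "03",
    "G Sense Error Rate": "BF",
    "Current Pending Sector": "C5",
    "Unknown parameter": "B5",
}

def to_decimal(hex_id):
    """Convert a hexadecimal attribute ID string to its decimal form."""
    return int(hex_id, 16)

for name, hex_id in hex_ids.items():
    print(f"{name}: 0x{hex_id} = {to_decimal(hex_id)}")
```

That puts the unknown parameter at ID 181 in decimal listings. What it actually measures on this drive remains an open question.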
The Samsung HD204UI that was the boot drive for an HP Microserver N36L developed pending sectors, suggesting potential drive instability and future failure. Data was successfully migrated and the drive seemed to test okay after a complete wipe. SMART data variables show potential signs of wear, but the drive continues to operate even after having close to 50,000 hours on the counter. The other two drives in the chassis have an even greater number of power-on hours recorded with no notable degradation in SMART parameters. It seems that hard drives with ramp load/unload and FDB motors can operate reliably well past their intended lifetime.
In the meantime, this HD204UI may be repurposed as a cold-store archive drive for unimportant data.