Late last month, I was contacted by a fellow reader of the blog with a sob story about his Comsol 8GB USB flash drive. Unfortunately, it had failed in a way that rendered the data inaccessible. He also claimed to have sought assistance from various IT-related people who were unable to restore his data, and wondered if I’d like to have a crack at it, especially as it had contained some rather important data.
At the time, judging from the information provided, I naturally assumed his chances were slim. Many lower-cost flash drives (and, in my experience, even branded ones) fail suddenly, and their inaccessibility often stems from complete flash failure, or a failure of the controller to understand its physical format, which results in incorrect capacities, failed read-back and, ultimately, no way to restore the device without destroying all the data within it. These are failures I am powerless to do anything about, not owning any specialist flash read-out equipment. Even those who possess the equipment know just how much hassle it is to desolder the memory and track down the correct parameters to make sense of how data is arranged on the flash, as many controllers exist and the strategies differ from vendor to vendor and chip to chip. Worse still, the Comsol drive uses a very peculiar package: a microSD-card form factor that cannot work as a microSD card even when desoldered, relying instead on the connections on the underside for direct flash access.
Despite this, I did say that if he was desperate and wanted me to have a go, he should send it to me and I’d see what I could do.
From the outside, there was nothing untoward. The drive looked like it had been treated well, although used with regularity, judging by the scoring on the metal shell of the USB connector. Its demise was definitely not related to any physical abuse.
Plugging it in revealed the dreaded dialogue that asks you to format the disk. I’m glad the reader did not go through with that, as formatting is a recipe for certain failure. Instead, this suggested that the disk no longer had a recognizable file system.
This required further investigation, so I grabbed the tool of choice – WinHex, a general purpose hex editor that also features direct device access, filesystem and partition table interpretation, file signature analysis, forensics, and other features.
Accessing the device listing, we can see this case is somewhat special, as whatever had happened, the drive still retained its physical format information – notice how the drive showed the correct capacity, rather than 0MB, 8MB or some other bogus value that failed-flash candidates often report.
If we stopped here, we would have concluded that the drive had somehow wiped itself, and there wasn’t anything left to see.
Of course, looking at the first few sectors doesn’t tell you everything. Were all 7.4GB of the drive readable? Was there any data left on it?
Imagine my surprise when, at 0x4000 (16,384 bytes) into the drive, the familiar MBR partition table appeared. It should have been at 0x0 (0 bytes) but had somehow landed here instead. Many of the following sectors were blank, but this gave some hope that data still existed.
In theory, if it was a simple “shift” in the data position (the physical-to-logical mapping), then it would be a simple case of having WinHex interpret this partition table instead. Using the “Scan for lost partitions” feature, we attempted to access this partition …
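WinHex did this interactively, but the underlying idea of spotting a misplaced partition table is simple enough to sketch. A minimal illustration in Python, assuming a raw image read into memory and 512-byte sectors (the file name and scan limit are my inventions):

```python
# Sketch: scan a raw drive image for sectors ending in the MBR/VBR
# boot signature 0x55 0xAA, which is how a shifted partition table
# like the one at 0x4000 on this drive can be spotted.
SECTOR = 512

def find_boot_signatures(image: bytes, limit: int = 1 << 20):
    """Return offsets of sectors ending in the 0x55 0xAA boot signature."""
    hits = []
    for off in range(0, min(len(image), limit), SECTOR):
        sector = image[off:off + SECTOR]
        if len(sector) == SECTOR and sector[510:512] == b"\x55\xaa":
            hits.append(off)
    return hits
```

A hit at a non-zero, sector-aligned offset (here, 0x4000) is the hint that the data may simply have been shifted rather than destroyed.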
… but were ultimately unsuccessful, as the root directory appeared to be destroyed. Complete recovery seemed unlikely, especially for files stored in the root directory.
In this case, it seems the cause is not user error. It’s probably down to a combination of things – bad luck, bad timing, bad controller design, and/or bad flash. The result seems to be corruption of the flash mapping table, leading to a partial blanking and/or scrambling of the data.
To explain this in an analogy, think of yourself as a secretary with a 100-page notebook. Each page represents a storage block, and you can write into empty or partially filled pages to store information. Each page can also be erased to change the information within it, but to change anything on a page, the whole page must be wiped. Each page can be written and erased only a limited number of times. Your boss expects you to store his life story, which is exactly 80 pages long, but he keeps changing his words from time to time and asking you to amend the story.
A flash drive is kind of the same thing. Because of these constraints, if you’ve written down 61 pages of the story, but your boss requests a change on page 34, you will have to erase all of page 34 and write it again. This is highly inefficient, so you instead propose to copy the contents of page 34 with the correction onto page 62, and instead note down elsewhere that page 62 is the updated version of page 34.
Maybe your boss is very indecisive, and has decided to rewrite pages so many times that they have literally worn out. Those bad pieces of paper cannot be used either, so you have to keep track of that too.
By the end of the story, we find that the story is made up of many pages but their ordering isn’t sequential. Instead, you have an index somewhere, which tells you which pages to read in what order, to deliver the promised 80 page story from a 100 page notebook. The index also tells you which of the 20 spare pages are still usable, which ones are clean and ready for re-use, and which ones are worn out and not usable again.
Now that we have this analogy, the importance of the index should dawn on you. The index is just another page or set of pages in the book. It also follows the same rule – to change the index requires copying it somewhere else with the changes before erasing the original one. Without the index, the pages of the story could be in any order, with spare blank pages or pages with stale information littering the main storyline! To reassemble this story would be extremely difficult to impossible.
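The “index” in this analogy corresponds to the controller’s logical-to-physical mapping table. A toy sketch of the idea in Python may make it concrete – everything here is invented for illustration, and real controllers add wear levelling, error correction and journaling of the map itself:

```python
# Toy flash translation layer: logical pages map to physical pages.
# Updates never erase in place; they go to a fresh page and the map
# is updated, exactly like the secretary's copy-to-page-62 trick.
class ToyFTL:
    def __init__(self, physical_pages: int):
        self.flash = [None] * physical_pages   # physical pages (None = erased)
        self.mapping = {}                      # the "index": logical -> physical
        self.free = list(range(physical_pages))

    def write(self, logical: int, data: bytes):
        phys = self.free.pop(0)                # take a fresh physical page
        self.flash[phys] = data
        old = self.mapping.get(logical)
        if old is not None:                    # retire the stale copy
            self.flash[old] = None
            self.free.append(old)
        self.mapping[logical] = phys           # update the index

    def read(self, logical: int):
        phys = self.mapping.get(logical)
        return None if phys is None else self.flash[phys]
```

Notice that if `self.mapping` is lost, the data in `self.flash` still physically survives – but in arbitrary order, with stale and blank pages mixed in. That is precisely the sort of scrambling this drive exhibited.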
Bringing this back to flash memory – every flash drive has the same sorts of constraints to work around. For this reason, drives have overprovisioning (i.e. the capacity they tell you they can store is less than the actual capacity of the chips) so as to have spare blocks to compensate for failures, improve performance, maintain the flash mapping table (analogous to the index) and even out the wear rate of the blocks – sometimes by moving data that rarely changes onto a more worn-out block so that the relatively fresh block it occupied can be reused.
Designers are not stupid, of course – they realize the importance of the flash mapping table and take good measures to protect it. One way is to keep more than one copy of it: multiple copies allow you to go back to an older table in case the latest one was damaged because the page was chewed up by a dog (a failure in the memory). This also helps if power was removed while the table was being updated, which would corrupt it – although the back-up copy would be stale and likely involve some data loss. Another way is to limit how often the table gets rewritten by keeping a log of differences on a fresh page – appending to a page doesn’t require erasing the whole page – which narrows the “window” in which the mapping table could get damaged.
In this case, it seems highly likely that something happened to this mapping table which resulted in:
- The pages being re-ordered somewhat
- Some logical sectors not returning their original data because they are pointing at the wrong blocks (e.g. erased blank blocks)
- Data scrambling, where the order of the data is not consistent with the way it was stored
It’s important to note that this happens in the controller at the physical level, whereas a filesystem overlays on top of this and itself organizes the files into logical blocks (clusters) which may be non-contiguous (fragmentation). Because the corruption had impacted the filesystem, the loss of this information negatively impacts the chance of recovery for files, especially large ones, which are likely to be split. You can think of this as a “double” scramble – if you had the file system (treasure map), you could dig where the data was as pointed by the map, but it doesn’t help if an earthquake had rearranged the ground (failure of the flash mapping table), because you’d be digging in all the wrong spots!
The (Partial) Solution
Regardless, getting some data is better than getting nothing at all. Because of the sensitivity of the data, I will not discuss the files themselves, and there will be no screenshots.
Because some data was still strewn across the drive, the whole device was imaged to protect its state and allow for faster analysis and recovery. All sectors claimed to read out successfully and no data errors were reported by the device.
A file-system structure search was performed using “Refine Volume Snapshot” after the errant partition was identified. This allowed us to identify file-system data for the folders that had existed on the drive, as those scraps of file-system data still survived. Using this data, the files visible in folders were copied out on the assumption that the filesystem references were still valid. Files recovered in this manner retained their original file names, thanks to the use of filesystem data.
A second strategy, called file signature recovery, was invoked for the most popular types of files. This involved scanning the drive for “sequences” which are characteristic of the type of file (e.g. GIF89a at the beginning of GIF files). This type of recovery only works properly for files which occupy contiguous space on the drive after the flash mapping disaster, and cannot correctly identify file sizes in all cases, as there is no reliable way to detect the end of certain files where there is no “end sequence”. Regardless, this worked reasonably well and turned up a lot of files, many of which were noise.
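To illustrate the principle (not WinHex’s actual implementation), here is a deliberately naive carving sketch for GIFs, one of the few formats that also has a usable “end sequence” – the `GIF89a` header and the `0x3B` trailer byte both come from the GIF specification:

```python
# Minimal file-signature ("carving") sketch for GIF files.
# Naive on purpose: 0x3B can also occur inside image data, so a real
# carver parses the GIF block structure instead of trusting the first
# trailer byte it finds.
def carve_gifs(image: bytes):
    """Return (start, end) byte ranges that look like complete GIFs."""
    results = []
    start = 0
    while True:
        start = image.find(b"GIF89a", start)   # header signature
        if start == -1:
            break
        end = image.find(b"\x3b", start)       # GIF trailer byte
        if end != -1:
            results.append((start, end + 1))
        start += 6                             # continue past this header
    return results
```

For formats without a trailer (many office documents, for example), the carver must guess the length from internal structures or simply cut at the next header – which is exactly why this approach turns up truncated files and noise.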
Neither method has any way to differentiate between current files and deleted files, and thus lots of old data may also turn up. To reduce the filtering burden, duplicate files between the two types of recovery were eliminated (with preference for the files with filenames) and a quick “walk” of the files was used to eliminate most corrupted files. The file-signature recovery was especially helpful in bringing back smaller files which may have been stored in the root directory.
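The de-duplication step can be sketched by hashing file contents and preferring the named copies – the function, dictionary shapes and choice of SHA-256 are my assumptions for illustration, not how WinHex does it internally:

```python
# Sketch: merge filesystem-recovered files (with real names) and
# signature-carved files (anonymous), dropping carved duplicates.
import hashlib

def dedupe(named_files: dict, carved_files: dict) -> dict:
    """Merge two {name: bytes} sets, preferring named files on ties."""
    seen = {hashlib.sha256(data).hexdigest() for data in named_files.values()}
    merged = dict(named_files)
    for name, data in carved_files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:                 # genuinely new content
            merged[name] = data
            seen.add(digest)
    return merged
```

Anything the carver found that matched a named file byte-for-byte is discarded, leaving the carved set to contribute only files the filesystem pass missed – such as those from the destroyed root directory.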
After all this was completed, the lot was zipped up with a password and sent back via a Dropbox HTTPS share, which was promptly removed once the data was confirmed to have been received. All my copies of the data were then securely erased and destroyed.
Failure of any computer storage device can be extremely devastating because of the amount of reliance we have on digital files. That being said, important files should always be backed up on multiple forms of media in multiple locations. If you fail to backup, data loss may just be a matter of time.
In this case, the failure was particularly special. A partial failure of the flash mapping table led the device to still correctly report its physical geometry and read out the data, albeit shifted, with random “holes” of empty pages, and some rearrangement. While a full recovery in light of this was not possible (at least, with the skills and tools I have at my disposal), it was a better outcome than a geometry failure which would have led to the drive not returning any data to play with.
It’s important to remember, in the cost-competitive flash memory segment, especially the bottom end, devices are being built using planar TLC memory which is unsuited for long term storage. A desire to maximise capacity may lead to minimal spare sectors which can accelerate time to failure. Sub-optimal controller design may lead to such “bad timing” incidents where power was removed just as the table was being updated causing corruption that the controller could not catch and resolve. Instead, such devices are mostly suited for “interchange” purposes – copying files from one machine to another, rather than long term storage.
It’s also worth remembering that not every flash device is made equal – different flash memories have different failure rates because they come in different grades with different numbers of bad blocks out of the factory. Different devices can also use different controllers with different strategies and capabilities. Further to this, devices such as SSDs have more sophisticated controllers with stronger error correction and more advanced algorithms to ensure data safety. Regardless, in my experience, flash devices, while robust, tend to fail without warning due to electrical and/or mechanical failures – which further underscores the need for keeping back-ups, because they often won’t warn you before they suddenly stop working entirely.
In the end, the reader was very thankful for the return of the recovered data (if only partial), and a donation was received that will go towards keeping things running at goughlui.com. Thanks!