TIFU case study: destroying an SSD with IO problems

I'm writing this after 4 months of my data became only theoretically recoverable. Partly as a warning and partly for digital closure.

Also learn from my mistakes. Don't be like me.

-1. The setup

  • WD Green 480 GB SSD + Hitachi 3TB

Both disks had LVM on LUKS. The SSD was split mostly between / and the LVM cache [1], plus a small /boot, /boot/EFI and swap. The HDD held just the data side of the LVM cache and a swap with a high priority.

The encryption key for the HDD was a file located on the SSD. This was the only encryption key for the HDD.

0. The cat-astrophe

One day I tried to wait out an IO freeze. When the machine did not deal with the problem by itself (4 hours vs. the usual 5-10 minutes), I decided to reboot and was welcomed by an SSD that had decided to obliterate itself down to 16 KiB.

1. The road to dis-aster (a.k.a. all those small things that synergize in your face).

An incomplete backup solution.

For many reasons I kept putting off fixing this until later™. Until later™ bit me in the ass.

A lot of important files were saved by Syncthing to another machine I'm working on in parallel, but this is not a backup.

Ignoring high IO-related freezes and their impact on the SSD caching.

This was the main issue, and I tried multiple solutions for it. Some of them are described in this section.

My system was plagued by freezes that were probably caused by excessive swapping. When the physical RAM was exhausted, the system's responsiveness went out the window, and in a hurry.

Usually this happened when I wanted to play modded Minecraft and did not close Firefox (do note, I have enough tabs and extensions to make Chrome look slim and well-mannered at the table).

When I could observe the system at all (sshd was not responding to new connections), iotop showed mostly JBD2 operations, the RAM-intensive processes and, often, Syncthing doing heavy IO [2]. This plagued me even before I bought the SSD, back when I had swap on a separate drive.

As one of the symptoms was the noise of the HDD working hard, I disregarded the potential impact on the SSD. The only thing I tried was adding "commit=15" to the mount options of the SSD file systems, but I did not see any results.
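A crude early warning might flag this kind of memory pressure before the machine locks up. The following is only a minimal sketch of the idea, assuming the Linux /proc/meminfo format (values in kB). The function names and the thresholds are made up for illustration and are not something I actually ran.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_pressure(info, min_available_ratio=0.10, max_swap_used_ratio=0.50):
    """Return True when little RAM is available or swap is heavily used.

    Both thresholds are arbitrary assumptions; tune them to your machine.
    """
    available_ratio = info["MemAvailable"] / info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    swap_used_ratio = 0.0
    if swap_total > 0:
        swap_used_ratio = (swap_total - info.get("SwapFree", 0)) / swap_total
    return (available_ratio < min_available_ratio
            or swap_used_ratio > max_swap_used_ratio)


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live memory statistics (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling memory_pressure(read_meminfo()) from a cron job or a small loop and showing a desktop notification would be one way to use it. It only tells you to close something; it does not fix the underlying problem.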

Back when it was not yet trashing my drive, the system still reacted to Magic SysRq. But...

Trying to wait out the problem.

When the IO problem first occurred, I decided to REISUB my way out of it. Just add boot time, restart the apps and you can get back to whatever you were doing.

But at some point Pidgin or libpurple-hangouts on my machine decided to stop remembering Hangouts OAuth keys (on a second machine the same setup works flawlessly(!)) and I had to renew them on each reboot. So I eventually decided to just wait out the freezes to avoid the lost time. It was just 5-10 minutes. [3]

The last time I did this, the system tried to sort itself out for a very long time (I think ~4 hours) and stopped reacting to Magic SysRq.

Buying a DRAM-less SSD.

I'm not certain about this, but maybe, just maybe, with a DRAM cache the SSD firmware would not have written all those transactions as fast as they occurred. Instead they could have been written in bulk, or the LVM cache could have been allowed to write them directly to the HDD.

In my defense, it was hard to find a definitive answer on whether a not-top-tier disk has DRAM or not (unless it's hidden under the term "cache").

Having the swap on the SSD.

I was strongly opposed to this, only to end up with one anyway. I thought that priorities would save me from problems and that this swap would save me in hard times of memory usage. Now I may never know if it worked at all.

In the end I will chalk it up to an unnecessary risk.

Having only one encryption key to the HDD or not using a password manager to back it up.

This is why my data remains theoretically recoverable. I only need to find the 1 KiB (if I remember correctly) of random data that was my key.

And I was very sure that I had a second password on that drive. Turns out I did have a second password. ...for a different disk...

The password manager/GPG encryption/printer part goes without saying.
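For what it's worth, a binary keyfile can be turned into plain text that fits into a password manager note or onto a printed page. Below is a minimal sketch of that idea, assuming a keyfile of around 1 KiB. The function names and the "sha256:" header line are my own illustration, not part of any LUKS tooling, and the text form is exactly as sensitive as the key itself.

```python
import base64
import hashlib


def keyfile_to_text(key: bytes, width: int = 64) -> str:
    """Encode a binary key as base64 lines, preceded by a SHA-256 checksum line."""
    digest = hashlib.sha256(key).hexdigest()
    encoded = base64.b64encode(key).decode("ascii")
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    return "sha256:" + digest + "\n" + "\n".join(lines) + "\n"


def text_to_keyfile(text: str) -> bytes:
    """Decode the text form back to bytes, refusing copies that fail the checksum."""
    header, _, body = text.partition("\n")
    if not header.startswith("sha256:"):
        raise ValueError("missing checksum line")
    # Whitespace and line breaks in the body are ignored on purpose.
    key = base64.b64decode("".join(body.split()), validate=True)
    expected = header[len("sha256:"):].strip()
    if hashlib.sha256(key).hexdigest() != expected:
        raise ValueError("checksum mismatch: the copy is damaged or mistyped")
    return key
```

The checksum is there so that a restore fails loudly on a typo instead of producing a key that silently does not open the disk. A backup that was never test-restored is still only a theory.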

Buying the SSD from a lame-ass, crooAHEM... vendor.

Finally, I could have found a tiny bit of solace in RMA'ing the drive for a new SSD. But as I had decided to go a little bit too cheap, the vendor had sourced the drive from outside of EMEA, and the WD site says I can go and f*ck myself.

Also, the WD site has a broken registration form and I can't even get any support without solving some dumb HTML/JS form problems.

2. Conclusions

Detailed conclusions may not interest you. Well, those are simple: recover what you can, delete what you can't, get a system running.

But I will at least offer a few words that could help you avoid crashing and burning completely.

  • your memory can lie to you[4]; that's why password managers are your friends, especially combined with Syncthing,
    • an online password manager is better than no password manager (unless you use a password sharing site that disguises itself as a password manager),
    • diceware is a good help but may require longer phrases these days,
  • there are two kinds of people: those that have recoverable backups and everybody else,
    • bonus irony: I work in a disaster recovery team and regular DR tests are our work; but as we say in Poland: the shoemaker walks barefoot,
  • always have a backup plan (other than your backup),
    • in other words, if you plan something, think about a contingency plan for getting back up to speed if your plan fails,
  • fix your sh!t, or at least work around it,
    • "Once is happenstance. Twice is coincidence. Three times is enemy action."
  • a written plan is better than the one in your head (see memory lying).
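On the diceware point above: the "longer phrases" part is simple arithmetic. A standard diceware list has 6^5 = 7776 words, so each randomly chosen word adds about 12.9 bits of entropy. The sketch below works this out; the function names are hypothetical, the target strengths are up to you, and make_passphrase assumes you supply your own word list.

```python
import math
import secrets

# Five dice rolls per word give 6**5 possible words in a standard diceware list.
DICEWARE_LIST_SIZE = 6 ** 5


def entropy_bits(words, list_size=DICEWARE_LIST_SIZE):
    """Entropy of a passphrase made of `words` independently random words."""
    return words * math.log2(list_size)


def words_needed(target_bits, list_size=DICEWARE_LIST_SIZE):
    """Smallest number of words that reaches at least `target_bits` of entropy."""
    return math.ceil(target_bits / math.log2(list_size))


def make_passphrase(wordlist, words):
    """Pick words with a cryptographically secure RNG (assumes a real word list)."""
    return " ".join(secrets.choice(wordlist) for _ in range(words))
```

For example, words_needed(90) comes out to 7 words. The estimate only holds if the words really are picked at random and not "improved" by hand afterwards.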

TL;DR: I overlooked a big but "manageable" problem with disk activity, and that problem eroded my SSD. Then a few mistakes chained up and left me with a perfectly fine HDD holding my data, encrypted with no password.

3. Footnotes

[1] More or less like described under the Usage paragraph: http://man7.org/linux/man-pages/man7/lvmcache.7.html

[2] This is still a mystery to me: why Syncthing showed up so often in this case. "Write at any cost"?

[3] A bonus middle finger to Google here.

[4] Seriously, this is a psychological phenomenon: https://en.wikipedia.org/wiki/False_memory As a bonus story, I once had a phrase-based password and a simple drawing to give me hints that, supposedly, only I could use to re-remember the passphrase. Unfortunately, due to inflections in the Polish language, I could not remember the proper forms that went with the picture. ...or the order that things went. ...or maybe I interpreted the wrong words?

I made this password two years ago. To this day I don't know what it is.