2013-06-09

Today is a bad day. A local workstation sent off DHCP requests. It received offers but no leases. Having a look at the home server, which acts as DHCP server, I discovered that its root filesystem was mounted read-only, because of errors.

What do you do if you have errors in the filesystem? Running fsck(8) seems to be a natural repair action. After all, how many of us can out-guess fsck when it comes to repairing internal file system structures? Probably the top 10 developers for each file system, which leaves the other 99.99% of us with the -y switch. [0] Well, the filesystem is consistent again ... but the files are no longer where they should be! What's better: a slightly inconsistent filesystem with the files where they should be, or a consistent filesystem with hundreds of inumber-named files in lost+found (and a system unable to reboot)?

Six years ago, I had the same problem already. [1] Have I learned anything? ... rather not. I could have split the soft-RAID1 and worked on one disk only, keeping the other as it was. I could have bought better hard disks that survive more than six years of non-stop power-on. I could have exchanged the hard disks every five years. But maybe the problem wasn't the hardware but the software ...

That's the problem with computer stuff: you never know. It's an illusion that the systems work as expected over a long time. They never do. Computer systems are for short-term use only. If you want something enduring, better not use computers. It doesn't matter how many RAIDs you have, or how many backups, or how many mirrors: computers fail and you have to fix them, constantly.

How do I do it better next time? How do I build a system that won't fail? The problem is not so much the static data that you store on a system, which can be backed up quite well; it's more the operating system that you've personalized, which is data that lives. You can't easily back that up. Equally, you can't simply create a personalized clone of the system as a ``respawn point''. The system lives, thus it changes constantly.

As you can't be sure whether your software (there you might be able to influence things yourself) or your hardware (there you can hardly change anything) will fail, one system cannot guarantee itself. You need two, or better three or five or seven systems, so that the failure (software or hardware) of one does not matter to the overall service. In the best case, these systems have different hardware. Different software is hardly possible, as such a setup is only feasible if managing seven systems is no more work than managing one. It must be a plug'n'play approach, a cluster. If one machine fails (who tells me which?), I unplug it and add a new (arbitrary) machine to the grid, or two more if I like. There should be no more work to do. Like a RAID with spare devices.

Such approaches exist for the simple storage of files. If the file on one system differs from the copies on many other systems, that one system likely has a damaged file. To store files, the underlying structure/software/filesystem is unimportant; it can be different on some of the machines.
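As a rough illustration of that majority idea, here is a minimal Python sketch: it compares checksums of the same file as seen by several replicas and flags any copy that disagrees with the rest. The node names and mount paths are made up, and it assumes each peer's copy is reachable as a local path (say, an NFS mount); it is a sketch of the principle, not of any particular tool.

    #!/usr/bin/env python3
    # Majority vote over file replicas: a copy whose checksum disagrees
    # with most of the others is probably the damaged one.
    # The node names and paths below are placeholders.

    import hashlib
    from collections import Counter
    from pathlib import Path

    REPLICAS = {
        "node1": Path("/mnt/node1/data/notes.txt"),
        "node2": Path("/mnt/node2/data/notes.txt"),
        "node3": Path("/mnt/node3/data/notes.txt"),
    }

    def checksum(path):
        """Return the SHA-256 hex digest of a file, read in chunks."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
        return h.hexdigest()

    def find_outliers(replicas):
        """Name the replicas whose copy differs from the majority digest."""
        digests = {name: checksum(path) for name, path in replicas.items()}
        majority, _ = Counter(digests.values()).most_common(1)[0]
        return [name for name, digest in digests.items() if digest != majority]

    if __name__ == "__main__":
        bad = find_outliers(REPLICAS)
        if bad:
            print("probably damaged copies on:", ", ".join(bad))
        else:
            print("all copies agree")

With three or more replicas, a single damaged copy simply loses the vote; with only two, you can tell that the copies differ, but not which one is wrong.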
The operating system is different in this respect. It is in execution; its files are much more dynamic. The filesystem the operating system works on must never be inconsistent. How can the case of an inconsistent filesystem be solved?

Let's take Plan 9. There you have generic CPU nodes (machines) that you can simply plug'n'play. The same for all kinds of peripheral devices. The only critical part is the file server. That's the non-generic node. Well, you have WORM (through Venti). As long as that works, the data can be backed up easily. But you need a filesystem on top of it, too: fossil. If that is inconsistent or loses some scores, you have the same problem again, don't you?

So, maybe you could put your data on a bunch of Venti servers that sync automatically. They could be on non-self-controlled machines as well, because you can't get the data out of Venti if you don't know the scores. With the use of different storage backends, you would have quite good data safety. This is not the critical part. The one and only critical part appears to be the filesystem. (In the Plan 9 case: Fossil.) How do you care for that one? Again by having multiple instances of it? But in this case you need to have them under your control, because the filesystem is the door to your data.

In the end, you realize that there is no way around running a couple of servers under your own control. (Don't ask how high their power consumption is!) Also, the overall system is necessarily quite complex because of the redundancy, and thus the syncing, you need and want. Eventually, you end up doing a lot of management overhead work, whereas you wanted to avoid exactly that.

Is it worth all that? Should I rather move important information into the analog domain and stop trying to preserve digital information at all?

Digital for now, analog forever. [2]

Digital information lasts forever - or five years, whichever comes first. [3]

[0] http://lwn.net/Articles/248180/
[1] http://marmaro.de/lue/txt/2007-09-26.txt
[2] Popular slogan in the archiving community
[3] Jeff Rothenberg

http://marmaro.de/lue/
markus schnalke