[Dailydave] The Small Company's Guide to Hard Drive Failure and Linux
From: Dave Aitel (daveimmunitysec.com)
Date: Thu Nov 18 2004 - 08:49:09 CST
So I learned a lesson about reliability and I thought I'd share.
Recently, the main hard drive, a little 40 gigger that runs
www.immunitysec.com, which also happens to be mail.immunitysec.com and
dns.immunitysec.com, started to display read errors in the kernel logs
(viewable via "dmesg" if you are root). This also caused data
corruption in a few cases, and some other badness such as long pauses
during writes. Jeremy says that he's never had a hard drive fail on him,
but it happens all the time, so it'll probably happen to you, probably
the day after you sign a big contract with someone and need to do
something other than mess with your hard drive.
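One way to keep an eye on this is to filter the kernel log for disk error lines. This is only a minimal sketch: count_disk_errors is a hypothetical helper name, and the match patterns are assumptions based on typical IDE-era kernel messages, so adjust them to whatever your own dmesg shows.

```shell
# Hypothetical helper: count kernel log lines that look like disk errors.
# The patterns are assumptions; tune them to your kernel's wording.
count_disk_errors() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at zero when nothing matches.
    grep -ciE 'I/O error|UncorrectableError|BadCRC' || true
}

# Usage (as root): dmesg | count_disk_errors
```

A count that keeps climbing between runs is the cue to start preparing a replacement drive.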
Once this starts happening, you have a few options. The first one is to
use fsck -c to mark blocks bad (this requires you to boot to single user
mode). This is a stop-gap measure while you go prepare a new drive,
since the bad sectors are a sign that your drive is about to die.
So your real option is to back up the hard drive and replace it. It's
nearly impossible to maintain a "GOLD" hard drive which you can
insta-replace your linux box with. Lots of things happen on a
minute-to-minute basis - mail, cvs checkins, log entries, etc. These all
need to be backed up and restored perfectly on the new drive. So some
downtime is probably necessary.
One might think you could use dd to duplicate your drive. I initially
tried this, and my results were not good. I did remember to use dd
if=/dev/hda of=/dev/hdc conv=noerror (the noerror flag is important).
However, this takes forever and a day. Basically it'll take all night.
So be prepared for that, even on a small drive and a fast system. They
sell devices that can do it faster, I think, but you don't have one, do you?
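For reference, a dd run along these lines might be wrapped as follows. Treat it as a sketch: clone_disk is a hypothetical name, and the sync flag and the larger bs are additions here, not part of the command quoted above. With noerror alone, a block that fails to read produces no output, so everything after it can land at the wrong offset on the copy; sync pads the failed or short read with zeros so offsets stay aligned. A larger block size is usually faster than the 512-byte default, at the cost of zeroing a whole block when one sector in it is bad.

```shell
# Hypothetical wrapper around dd for copying a failing drive.
# $1 = source, $2 = destination (e.g. /dev/hda and /dev/hdc).
clone_disk() {
    # noerror: keep going after read errors.
    # sync: pad short or failed reads with zeros so offsets stay aligned.
    # bs=64k: an assumption; pick what suits your hardware.
    dd if="$1" of="$2" bs=64k conv=noerror,sync
}

# Usage (as root; the destination gets overwritten):
#   clone_disk /dev/hda /dev/hdc
```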
The other issue with dd is that it typically replaces every sector on the
disk. So you'll need a disk EXACTLY the same size as the previous disk.
My disk was about one gig smaller (40.0 gigs instead of 40.9 gigs). This was an
So instead of that, one nice option is just to get your new drive, make
the partitions manually with fdisk on it to replicate the original
drive, and then use tar to copy the contents across. I used knoppix
The command for tar is:
cd /mnt/hda1 # drive to copy from
tar cf - . | (cd ../hdd1 && tar xpf -) # && keeps tar from unpacking in the wrong place if cd fails
This will maintain the users and permissions and stuff. After you're
done copying all the partitions (you don't do the swap partitions,
obviously), you then double check them to make sure they're right.
Mounting them and doing a df -k is useful so you can make sure they look
the same. (This method will allow you to expand or resize partitions as
well, and is faster than dd, although still slow).
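The copy and the double check can be sketched as a pair of small functions. Both names (copy_tree, same_file_list) are hypothetical, and the check only compares file names, not contents, so it's a sanity check alongside df -k rather than proof of a perfect copy.

```shell
# Hypothetical helper: copy one mounted partition onto another with tar.
# $1 = source mount point, $2 = destination mount point.
copy_tree() {
    # -p on extract keeps permissions; run as root so ownership is kept too.
    (cd "$1" && tar cf - .) | (cd "$2" && tar xpf -)
}

# Hypothetical check: succeeds if both trees contain the same set of paths.
same_file_list() {
    a=$(cd "$1" && find . | sort)
    b=$(cd "$2" && find . | sort)
    [ "$a" = "$b" ]
}

# Usage (as root, both partitions mounted):
#   copy_tree /mnt/hda1 /mnt/hdd1 && same_file_list /mnt/hda1 /mnt/hdd1
```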
The next step is to re-lilo (or grub) your hd. (hahaha, that sentence
made no sense, but bear with me here). Anyways, you want to use the lilo
on your hd, not the lilo that comes with knoppix, which doesn't seem to
work. The trick to this is to use lilo -r /mnt/hda1 (your root
partition, which has /etc/ and /boot/ on it). However, this won't work
with knoppix, since the default mount points have the "nodev" option
set. You'll need to remount them with the dev option set before you can
run lilo on them.
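The remount-then-lilo step could look roughly like this. has_nodev is a hypothetical helper that reads /proc/mounts-style text on stdin; /mnt/hda1 is just the example mount point used above.

```shell
# Hypothetical helper: succeeds if the given mount point is mounted nodev.
# $1 = mount point; reads /proc/mounts-style lines on stdin.
has_nodev() {
    awk -v mp="$1" '
        $2 == mp {
            # Field 4 is the comma-separated option list.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "nodev") found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage (as root, from knoppix):
#   if has_nodev /mnt/hda1 < /proc/mounts; then
#       mount -o remount,dev /mnt/hda1
#   fi
#   lilo -r /mnt/hda1
```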
After that, the drive should be bootable. I'd move it to the "master"
drive, if you haven't already, and test that part out.
One thing I did that you might also do is just go to the co-lo (your
hardware IS at a co-lo, right?) take the drive out, and bring it home.
Like many cheap boxes, the box in the co-lo didn't even have a cdrom,
and wasn't the fastest box. Bringing it home actually saved time in the
long run, since I could use a real desktop (with swappable drive bays -
always get those) to do the work. You'll want to get a few spare drives,
since the first drive I tried to restore onto was bad as well. This is
an important note - never buy recertified drives. Always get spankin'
new drives. It might be fun to do a strings on some of those recertified
drives, but I didn't have time.
One thing my lilo did that was weird was rewrite the fstab to use
"LABEL=/" instead of /dev/hda1. If you happen to be hosted at Pilosoft
(or another co-lo that is run by someone on the local linux users group
list - very good idea!) they might jump in and save your butt when you
load it up and it doesn't work and you're too tired to figure out why.
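If the box won't mount root after the swap, it may be worth checking whether fstab refers to filesystems by label. A sketch, with fstab_labels as a hypothetical helper name; e2label comes with e2fsprogs and only applies to ext2/ext3 filesystems, which is an assumption about what the drive uses.

```shell
# Hypothetical helper: print the labels that an fstab file mounts by.
# $1 = path to fstab (e.g. /mnt/hda1/etc/fstab).
fstab_labels() {
    # Keep entries whose first field starts with LABEL=, then strip the prefix.
    # Commented-out lines start with "#" and so never match.
    awk '$1 ~ /^LABEL=/ { sub(/^LABEL=/, "", $1); print $1 }' "$1"
}

# Usage (as root):
#   fstab_labels /mnt/hda1/etc/fstab   # which labels does fstab expect?
#   e2label /dev/hda1                  # which label does the new partition have?
#   e2label /dev/hda1 /                # set it to "/" if fstab wants LABEL=/
```

The other way out is to edit fstab back to plain device names such as /dev/hda1.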
The next step after doing all this is typically to make a plan that
involves not having to ever do this again. For those of you not in the
know - you want a hardware supported (get a good modern motherboard)
RAID-1 solution and you want to be able to swap out one of your two
drives (mirrored) when Linux tells you that one is bad. You also want to
have some sort of backup solution running (of course), and you want to
have a secondary DNS server and a backup machine somewhere in another
state (or country) that can take over if your main CO-LO goes under or
something. Something that can provide basic mail and web services is
nice. It might be good to hire an admin who is not you.
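As for Linux telling you that one half of the mirror is bad: with hardware RAID that depends on the controller's driver and tools, so the sketch below assumes Linux software RAID (md) instead. That is a swap-in for illustration, not the hardware setup recommended above. degraded_arrays is a hypothetical helper that reads /proc/mdstat-style text and prints the arrays whose status field shows a missing member (an underscore where a U should be).

```shell
# Hypothetical helper: list md arrays that are running degraded.
# Reads /proc/mdstat-style text on stdin. Assumes software RAID (md).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                 # remember the current array
        /\[[U_]*_[U_]*\]/ { print name }            # e.g. [_U] means one half is gone
    '
}

# Usage: degraded_arrays < /proc/mdstat
```

Run from cron with the output mailed to you, something like this could flag a dead drive before the second one goes.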
It's not uncommon for a linux machine not to work properly when you
reboot. This is because it's probably been a few years since you
rebooted, and you probably redid a lot of libraries in the meantime,
some of which aren't in the right places. So after you do all this, it's
good to test out all your services and make sure they are, in fact,
doing what you think they are. Your co-lo (whom you bought the computer
from, most likely) often provides a guarantee of hard drives (Pilosoft
does). Don't save the 50 bucks by taking them up on this, since that bad
hard drive still has all your corporate data on it. You should be
getting swamped with email now, since your mail was down and now it's up
and the interweb is resending things for you.
Oh, and always run grsecurity kernels. SELinux is just a pale imitation.
Recently grsecurity has added things like brute force prevention, which
detects exploit attempts and stops them. I assume he's doing a check to
see if eip is pointing to a writable page, and if so, counting down from
5 or something, and if 0, then not allowing a fork or exec. Either way,
it's a good thing.