Category Archives: Problems & Fixes

Fix mdadm SparesMissing event detected

Sometimes when you create a RAID array using mdadm and do not explicitly specify that the array has no spare drives, the mdadm monitoring daemon will report to you in daily emails that:

A SparesMissing event had been detected on md device /dev/md*

If you check your /etc/mdadm/mdadm.conf file, you’ll most likely find the ARRAY assembly line specifies a “spares” parameter that is non-zero.

To stop this daily nagging email message, just change the “spares” parameter to 0.

ARRAY /dev/md0 metadata=1.2 spares=0 name=<mymachine>:0 UUID=<myUUID>

Then reload the monitoring daemon. In Debian:

service mdadm reload

Currently, that command will actually stop and start the daemon, rather than reloading it, but hey, who knows what the future modern world will bring?
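If you’d rather script the edit, a pattern-based sed works just as well. Here’s a minimal sketch, demonstrated on a temporary file so nothing real is touched; the ARRAY line below is a made-up example, not your actual config:

```shell
# Sketch: rewrite any non-zero "spares=" value to spares=0.
# Demonstrated on a temp file; the ARRAY line is a made-up example.
conf=$(mktemp)
cat > "$conf" <<'EOF'
ARRAY /dev/md0 metadata=1.2 spares=1 name=myhost:0 UUID=abcd1234
EOF

sed -i 's/spares=[0-9][0-9]*/spares=0/' "$conf"
grep 'spares' "$conf"    # now shows spares=0

# On a real system you would run the same sed against
# /etc/mdadm/mdadm.conf (after backing it up), then:
#   service mdadm reload
rm -f "$conf"
```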

Also, this is all assuming that you really do have no spares in your array, and the mdadm system just couldn’t believe such audacity, and concluded that you must have at least one that you just didn’t tell it about.

Fix for MRTG Generating SNMP_Session Error in Debian Wheezy (and possibly Ubuntu)

Lately, after an upgrade from Debian Squeeze to Debian Wheezy, MRTG has been sending emails every few minutes when it runs from the crontab, quickly filling up my Inbox. The error message is as follows:

Subroutine SNMP_Session::pack_sockaddr_in6 redefined at /usr/share/perl/5.14/Exporter.pm line 67.
 at /usr/share/perl5/SNMP_Session.pm line 149

After waiting a while to see if a fix came from Debian, I decided to look around on my own. It seems there is a patch that works, but it has not been propagated out to the repositories. Luckily, the problem is easy to fix.

You can apply the patch at the link above, or you can just edit a file and make 2 quick changes: Edit the file

/usr/share/perl5/SNMP_Session.pm

Change line #149:

old: import Socket6;
new: Socket6->import(qw(inet_pton getaddrinfo));

Then change line #609:

old: import Socket6;
new: Socket6->import(qw(inet_pton getaddrinfo));

That seems to fix the problem quite well. Hopefully the Debian maintainer will get that change in sooner rather than later so others don’t have to bother!

Note: Someone commented that the line numbers listed were a bit off from version to version of Debian. Not entirely unexpected. It’s the change to calling the class directly that counts.
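Since the line numbers drift between versions, a pattern-based edit avoids counting lines entirely. A minimal sketch, demonstrated here on a temporary file rather than the real /usr/share/perl5/SNMP_Session.pm (back that file up before editing it for real):

```shell
# Sketch: replace every "import Socket6;" call with the explicit
# class-method form, matching by pattern rather than by line number.
# Demonstrated on a temp file standing in for SNMP_Session.pm.
pm=$(mktemp)
printf 'import Socket6;\nsome_other_code();\nimport Socket6;\n' > "$pm"

sed -i 's/import Socket6;/Socket6->import(qw(inet_pton getaddrinfo));/' "$pm"
grep -c 'Socket6->import' "$pm"    # prints 2 (both occurrences rewritten)
rm -f "$pm"
```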

How To Deal With Udev Children Overwhelming Linux Servers With Large Memory

A few months ago a client purchased a new server and asked me to set them up with Linux-based virtual machines to consolidate their disparate hardware. It was a big server, with 128 Gigs of memory. I chose to use Debian Wheezy and QEMU KVM for the virtualization. We’d be running a couple Windows Server instances, and a few Debian GNU/Linux instances on it.

Unfortunately, we ended up encountering some strange problems during heavy disk IO. The disk subsystem is a very fast SAS array, also on board the server, which is a Dell system. The VM’s disks are created as logical volumes on the array using LVM.

Each night, in addition to the normal backups they do within the Windows servers, they also wanted a snapshot of the disk volumes sent off to a backup server. No problem with LVM, even with the VM’s running, when you use the LVM snapshot feature. This does tend to create a lot of disk IO, though.
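As a rough illustration of that nightly job (the volume group, LV names, snapshot size, and backup host below are all made up for the sketch; a real setup will differ):

```shell
# Sketch of a nightly LVM snapshot copy. All names and sizes here
# are assumptions for illustration; adjust to your volume layout.

# Create a snapshot of the running VM's logical volume.
lvcreate --snapshot --size 10G --name win1-snap /dev/vg0/win1

# Stream the frozen snapshot off to the backup server.
dd if=/dev/vg0/win1-snap bs=4M | ssh backuphost 'cat > /backup/win1.img'

# Drop the snapshot so it stops accumulating copy-on-write changes.
lvremove -f /dev/vg0/win1-snap
```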

What ended up happening was that occasionally, every week or two, the larger of the Windows server instances would grind slowly to a halt, and eventually lock up. The system logs on the real server would begin filling with timeouts from Udev – about it not being able to kill children. This would, in turn, affect the whole system – making a reboot of the whole server necessary. Very, very ugly, and very, very embarrassing.

I tried a couple off-the-cuff fixes that were shots in the dark, hoping for an easy fix. But the problem didn’t go away. So I had to dig in and research the problem.

It turns out that Udev decides how many worker children it will allow based upon the amount of memory in the server. In our case, 128G – which is quite a lot. The number of allowed children was a simple one-to-one ratio, based upon memory. However, with this much memory, that many children seemed to be overloading the threaded IO capacity of this monster server, causing blocks while live LVM snapshots were being copied.

What I ended up doing was manually specifying that the maximum number of allowed children for Udev would be 32 instead of the obscene number the inadequate calculation in the Udev code came up with. Since doing this, the server has run perfectly, without a hitch, for a good, long time.

So this is for anyone who may have run into a similar problem. I could find no information about this on the Internet at the time. But I did manage to find how to change the number of children Udev allows. You can do it while the system is running (which will un-happen once the server is rebooted), or you can put in a kernel boot parameter to make it permanent, until the Udev developers fix their code to provide a sane value for the maximum number of children allowed in systems with a large amount of memory.

At the command line, this is how. I used 32. You might like something different, of course.

udevadm control --children-max=32

And, as a permanent thang, the Linux kernel boot parameter is “udev.children-max=”.
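On Debian, that typically means appending the parameter to the kernel command line in /etc/default/grub and regenerating the boot config. A sketch (32 was my value; the rest of the line is whatever options you already have):

```shell
# /etc/default/grub (sketch; append to your existing options)
GRUB_CMDLINE_LINUX_DEFAULT="quiet udev.children-max=32"

# Then regenerate the grub config and reboot:
#   update-grub
```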

Hopefully this will save some of you some of my headache.