nagios, mdadm and snmp

I found this script while looking for a simple way to monitor mdadm arrays. The script is fine, but it has a subtle bug: it will never report an error, because the --detail parameter is missing in the call to mdadm. I modified the script a bit, like so:

#!/bin/sh
# (c) 2008 Jasper Spaans 

worst=0
msg=""

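for dev in /dev/md?* ; do
  # With --detail, -t (--test) makes mdadm's exit status reflect the array state:
  # 0 = ok, 1 = at least one failed device, 2 = unusable.
  # (4, an error getting information, is not handled here.)
  mdadm --misc -t --detail "$dev" >/dev/null
  status=$?
  if [ "$status" -eq 0 ]; then
    msg="${msg} ${dev}: ok"
  elif [ "$status" -eq 1 ] ; then
    # Do not downgrade an earlier "unusable" result.
    if [ "$worst" -ne 2 ] ; then
      worst=1
    fi
    msg="${msg} ${dev}: degraded"
  elif [ "$status" -eq 2 ] ; then
    worst=2
    msg="${msg} ${dev}: degraded - unusable"
  fi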
done

echo "mdadm:$msg"
exit $worst

which I saved as /usr/local/bin/check-mdadm.sh.

Add in a bit of snmpd.conf config (and set up sudo accordingly, of course):

...
exec   mdadm /usr/bin/sudo /usr/local/bin/check-mdadm.sh

and a small script on the nagios side (/usr/local/bin/nagios-check-mdadm):

#!/bin/sh

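# Replace YOUR-PUBLIC with your SNMP community string.
SNMP=$(snmpwalk -v1 -c YOUR-PUBLIC "$1" extOutput | grep mdadm)
TMP1=$(echo "$SNMP" | grep degraded)
TMP2=$(echo "$SNMP" | sed -e 's/^.*mdadm: //')

# Use exit, not return: return is only valid inside a function.
if [ -z "$SNMP" ]; then
  # No mdadm line came back at all, so do not claim everything is fine.
  echo "UNKNOWN: no mdadm output from $1"
  exit 3
elif [ -z "$TMP1" ]; then
  echo "OK: $TMP2"
  exit 0
else
  echo "ERROR: $TMP2"
  exit 2
fi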

Then add a bit of nagios config:

define command {
       command_name check_mdadm
       command_line /usr/local/bin/nagios-check-mdadm $HOSTADDRESS$
}
define service {
       use      defaults
       name     check_mdadm
       description   MDADM
       check_command check_mdadm
}

And voilà, nagios notifications when disks fall out of the array.


a new home server

I’ve been running an old Shuttle with a 2.4GHz Celeron CPU, 512MB of RAM and two 500GB disks in raid-1 as a home server for the past 5 years or so. Well, I upgraded the disks in May 2007; before that it had 2x 200GB in raid-1. The thing has no UPS and runs in the closet here at home. And yet:

13:40:34 up 569 days, 17:04,  2 users,  load average: 1.26, 0.94, 0.45

Yeah, home power is pretty reliable around here.

This machine serves as the central network storage for our home, and I also use it to back up a bunch of servers that live at a nearby colo facility, with the rather fantastic BackupPC. The Shuttle has served well over the years but it is getting a bit old – I was starting to expect it to fail. Its power draw is rather high: 78W while idle (that’s after applying all of powertop’s suggestions), and a whopping 100W while doing heavy disk activity.

I was running out of disk space again, so I bought two 1TB ‘green’ WD drives (WD10EADS-00L) that are rated at 5.4W active, 2.8W idle, and 0.4W standby/sleep.

Next – a replacement for the Shuttle. First I looked at a QNAP TS-219P, which is a rather awesome little NAS device. It’s based on Marvell’s Kirkwood ARM core, the same one used in the Sheevaplug, clocked at 1.2GHz. This thing is pretty fast. Its power specs are also impressive:

Sleep mode: 5W
In operation: 21W (with 2 x 500GB HDD installed)

I was of course looking to run Debian on it, which is perfectly possible. People like the firmware that the thing comes with, but it’s proprietary so I’d rather not use that. Plus, I need to be able to run BackupPC.

The major downside is price – the TS-219P costs about $400, without disks. Since the Sheevaplug costs about $100, I would have thought a price in the $200-250 range for the TS-219P would have been reasonable.

Meanwhile I came across some really good NAS reviews over at SmallNetBuilder, and in particular their price/performance NAS chart.

Looking at that chart, the MSI Wind PC performance is pretty much on par with the TS-219P, for a fraction of the price. Extra bonus: it does not come with proprietary software preinstalled, because the Wind is really a bare-bones PC. The Wind has one 3.5″ bay, and one 5.25″ bay. It also has an on-board CF adapter. It has a dual-core Intel Atom 230 (1.6GHz).

I purchased

$134.99    MSI Wind PC
 $26.99    G.SKILL 2GB 200-Pin DDR2 SO-DIMM DDR2 533
 $43.99    Transcend 16GB Compact Flash (CF) Flash Card Model TS16GCF133
  $9.99    StarTech BRACKET Metal 3.5" to 5.25" Drive Adapter Bracket

Total: $215.96 + shipping

The drive bay adaptor turned out to be not only severely overpriced, but also not practical for the Wind – I had to drill a few holes in the damn thing to make the second hard drive fit in the Wind. Don’t buy this kind, or don’t pay $10 for it!

I installed Debian on the CF card (leaving it read-only during normal operation) and use the two disks purely for data – in raid-1 of course. If I did this again I’d buy a smaller CF card – 8GB would be plenty, even 4GB would be enough for the non-volatile bits of /.

Power use, as tested: idle 27W, with heavy disk activity 33W. In other words, this will take 50-70W off our household power budget, which should work out to a savings of $7 to $10/month.
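As a rough check on that savings estimate, here is a small sketch of the arithmetic. The wattages are the measurements above; the $0.20/kWh electricity price and the 30-day month are assumptions for illustration, not figures from my bill, so plug in your own rate.

```sh
#!/bin/sh
# monthly_savings BEFORE_WATTS AFTER_WATTS PRICE_PER_KWH
# Prints the monthly cost difference in dollars for a machine running 24/7.
# Assumes a 30-day month; the price per kWh is whatever you pass in.
monthly_savings() {
  awk -v before="$1" -v after="$2" -v price="$3" 'BEGIN {
    kwh = (before - after) * 24 * 30 / 1000   # watts saved -> kWh per month
    printf "%.2f\n", kwh * price
  }'
}

# Idle: 78W before, 27W after. Heavy disk activity: 100W before, 33W after.
# 0.20 is an assumed price in dollars per kWh.
monthly_savings 78 27 0.20
monthly_savings 100 33 0.20
```

At that assumed rate the two results land in the $7 to $10 range quoted above; a cheaper rate shrinks the savings proportionally.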


over to x25-m

I bought an Intel X25-M SSD for my laptop in early June. I got the 80GB version, and this was to replace a Hitachi 7K200-160 – a 160GB 7200rpm drive. Note that this X25-M is generation 1; Intel has since released an updated, cheaper version of the drive that is slightly faster and uses a little less power. The Intel SSDs generally blow the competition out of the water in real world applications because they have extraordinary random write performance. SSDs are typically marketed with their theoretical maximum sequential write performance, which – barring a few very specific use cases – is far less important than random write performance for your average desktop or server.

The difference in performance between the old mechanical drive and this SSD is *spectacular*. The machine feels 100 times faster. In other words, you want one of these for your desktop/laptop :)

As for servers – I can’t wait until these things become big enough/cheap enough to replace mechanical drives there, too…

Here are some bonnie++ results:

Hitachi 7K200-160:

$ bonnie++ 
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
countzero        4G 14755  55 15887  20 13568  13 18775  58 31430  11  72.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 21341  83 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
countzero,4G,14755,55,15887,20,13568,13,18775,58,31430,11,72.4,0,16,21341,83,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

Intel X25-M gen 1:

$ bonnie++
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
countzero        4G 51217  98 76395  31 42853  25 43666  80 150391  45 14762  61
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
countzero,4G,51217,98,76395,31,42853,25,43666,80,150391,45,14762.2,61,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
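
To put numbers on the difference, here is a small sketch that compares the two CSV summary lines bonnie++ printed above. The field numbers are my reading of the column order in the tables (5 = block write, 11 = block read, 13 = random seeks); the speedup helper is just something made up for this comparison.

```sh
#!/bin/sh
# First 14 fields of the two CSV summary lines from the bonnie++ runs above.
hdd='countzero,4G,14755,55,15887,20,13568,13,18775,58,31430,11,72.4,0'
ssd='countzero,4G,51217,98,76395,31,42853,25,43666,80,150391,45,14762.2,61'

# speedup FIELD: ratio of the SSD value to the HDD value for one CSV field.
# 5 = block write K/sec, 11 = block read K/sec, 13 = random seeks/sec.
speedup() {
  printf '%s\n%s\n' "$hdd" "$ssd" |
    awk -F, -v f="$1" 'NR == 1 { base = $f } NR == 2 { printf "%.1f\n", $f / base }'
}

echo "block write: $(speedup 5)x"
echo "block read:  $(speedup 11)x"
echo "seeks:       $(speedup 13)x"
```

The sequential numbers improve by a bit under 5x, while random seeks improve by roughly 200x, which is where the "feels 100 times faster" impression comes from.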

capistrano, svn and webserver timeouts

A customer reported problems with Capistrano deploys that would just die like this:

 ** [XXX.XXX.XXX :: err] svn: REPORT request failed on '/!svn/vcc/default'
 ** svn: REPORT of '/!svn/vcc/default': Chunk delimiter was invalid (http://XXX.XXX.XXX)
    command finished

After disabling gzip compression on the server for text/xml documents, the error became

 ** [XXX.XXX.XXX :: err] svn: REPORT request failed on '/!svn/vcc/default'
 ** svn: REPORT of '/!svn/vcc/default': Could not read response body: connection was closed by server (http://XXX.XXX.XXX)

The server side logs said:

[Fri Jul 03 10:53:57 2009] [error] [client XX.XX.XX.XX] Provider encountered an error while streaming a REPORT response.  [500, #0]
[Fri Jul 03 10:53:57 2009] [error] [client XX.XX.XX.XX] A failure occurred while driving the update report editor  [500, #190004]

Googling was not very helpful – there are many reports of these errors going back years, and many different solutions, none of which applied to my setup. In general, these errors seem to mean that there was some sort of network problem.

I tried to reproduce the problem by running the offending svn command manually. Out of hundreds of tries, I managed to make it fail like that just once. And yet when running cap deploy, which in turn calls the svn command, it would happen much more often.
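
Retrying by hand gets old quickly, so here is a minimal sketch of a loop that runs a command a number of times and counts the failures. The count_failures helper is hypothetical, and true/false stand in for the real svn command, which I am not reproducing here.

```sh
#!/bin/sh
# count_failures TRIES COMMAND [ARGS...]
# Runs COMMAND the given number of times and prints how many runs failed.
count_failures() {
  tries=$1
  shift
  failed=0
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Output is discarded; only the exit status matters here.
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
  done
  echo "$failed of $tries runs failed"
}

# Placeholder commands; substitute the svn command that cap deploy runs.
count_failures 5 true
count_failures 5 false
```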

I finally tracked this down to an aggressive send/receive timeout in Apache’s config. It was set to 3 seconds to prevent too many inactive connections from taking up server resources. Apparently the subversion client sometimes takes a while to get back to the http server it’s talking to – in this particular situation when run via capistrano, more than 3 seconds. So the server would disconnect the svn client, which would then just fall over with that obscure error message.

In other words, check your server timeouts if you see this kind of intermittent error…
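
A quick way to spot a setting like this is to scan the config for low values. The sketch below assumes the setting in question is Apache's Timeout directive; the low_timeouts helper, the 30 second threshold and the config path in the comment are placeholders, not values from the server in question.

```sh
#!/bin/sh
# low_timeouts FILE MIN_SECONDS
# Prints every Timeout directive in FILE whose value is below MIN_SECONDS.
# Directive names are matched case-insensitively; commented-out lines are skipped
# because their first field starts with "#".
low_timeouts() {
  awk -v min="$2" 'tolower($1) == "timeout" && $2 + 0 < min + 0 {
    print FILENAME ": " $0
  }' "$1"
}

# Example call (the path is an assumption, adjust for your distribution):
#   low_timeouts /etc/apache2/apache2.conf 30
```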


monit, mongrel_rails and ENV["HOME"]

So your mongrels are humming along happily and you have monit monitoring them with a definition like this:

check process mongrel_8010 with pidfile /path/to/current/log/mongrel.8010.pid
  start program = "/usr/bin/mongrel_rails cluster::start -C /path/to/current/config/mongrel_cluster.yml --clean --only 8010"
  stop program = "/usr/bin/mongrel_rails cluster::stop -C /path/to/current/config/mongrel_cluster.yml --clean --only 8010"

  if failed host 127.0.0.1 port 8010
    with timeout 10 seconds
    then restart

  if totalmem > 128 Mb then restart
  if cpu is greater than 60% for 2 cycles then alert
  if 3 restarts within 5 cycles then timeout
  group mongrel

The start/stop commands work perfectly from the command line, but somehow not when monit’s calling them. Sure enough, you find this in the mongrel.8010.log file:

/usr/lib/ruby/gems/1.8/gems/rubyforge-1.0.3/lib/rubyforge.rb:15:in `expand_path': couldn't find HOME environment -- expanding `~' (ArgumentError)

The line in question is

  HOME        = ENV["HOME"] || ENV["HOMEPATH"] || File::expand_path("~")

Monit does not set a HOME environment variable, nor HOMEPATH.

The documentation for File::expand_path says:

Converts a pathname to an absolute pathname. Relative paths are referenced from the 
current working directory of the process unless dir_string is given, in which case it 
will be used as the starting point. The given pathname may start with a "~", which 
expands to the process owner's home directory (the environment variable HOME must 
be set correctly). "~user" expands to the named user's home directory. 

Ouch. So if HOME is not set, File::expand_path("~") is guaranteed to fail. I think that’s a bug in the rubyforge gem.

I worked around this by setting ENV["HOME"] to a fallback value before the

require 'rubygems'

line in /usr/bin/mongrel_rails.
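
An alternative that avoids editing /usr/bin/mongrel_rails would be to give the process a HOME before it starts. This is only a sketch of that idea, not what I actually deployed: run_with_home is a hypothetical helper, and /tmp is a placeholder fallback directory.

```sh
#!/bin/sh
# run_with_home FALLBACK_DIR COMMAND [ARGS...]
# Runs COMMAND with HOME set to FALLBACK_DIR, but only if HOME is unset or empty.
# The subshell keeps the caller's own environment untouched.
run_with_home() {
  fallback=$1
  shift
  (
    HOME=${HOME:-$fallback}
    export HOME
    exec "$@"
  )
}

# Demonstration with a harmless command; /tmp is a placeholder fallback.
run_with_home /tmp sh -c 'echo "HOME is $HOME"'
```

Saved as a small wrapper script, the same logic could sit in front of the mongrel_rails call in the monit start and stop lines.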

I filed a bug. It took me a while to figure out that the rubyforge gem is part of the codeforpeople project on rubyforge.


Monticello municipal fiber now really a go

I wrote about Monticello, Minnesota and its fight with the local incumbent telco TDS last fall. At the time, TDS had its lawsuit against the city thrown out for lack of merit. No big surprise, since the gist of the suit was basically “they are going to compete with us, and they are going to offer better service for less money!”. TDS was considering an appeal to the Minnesota Supreme Court.

TDS did appeal in the end, and thankfully lost. This is a win for municipal internet rollouts all across the US. It will hopefully make those greedy telcos think twice about trying to stop municipalities from providing their constituents with proper internet access.

A TDS spokesman said the decision “endangers the appropriate relationship between municipalities and private enterprise”. Presumably he means the relationship where municipalities have to allow private telcos like TDS to charge citizens an arm and a leg for sub-par broadband service, because there is no competition. To which I say – good riddance!

Construction starts in 2 weeks, and the first customers should be hooked up sometime this fall.


df and zettabytes

This is a very confused filesystem. But check it out – df supports zettabytes!

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               19G  -64Z   22G 101% /

Surprisingly, the machine is up and runs just fine.


on the G1’s bluetooth support

I got a G1 the other day, and have upgraded it to Android 1.5 (Cupcake). Turns out the bluetooth abilities of this phone are rather … limited, particularly compared to my trusty Nokia E70:

G1:

$ sdptool browse 00:22:A5:XX:XX:XX
Browsing 00:22:A5:XX:XX:XX ...
Service Name: Audio Source
Service RecHandle: 0x10000
Service Class ID List:
  "Audio Source" (0x110a)
Protocol Descriptor List:
  "L2CAP" (0x0100)
    PSM: 25
  "AVDTP" (0x0019)
    uint16: 0x100
Profile Descriptor List:
  "Advanced Audio" (0x110d)
    Version: 0x0100

Service Name: AVRCP TG
Service RecHandle: 0x10001
Service Class ID List:
  "AV Remote Target" (0x110c)
Protocol Descriptor List:
  "L2CAP" (0x0100)
    PSM: 23
  "AVCTP" (0x0017)
    uint16: 0x100
Profile Descriptor List:
  "AV Remote" (0x110e)
    Version: 0x0100

Service Name: Voice Gateway
Service RecHandle: 0x10002
Service Class ID List:
  "Headset Audio Gateway" (0x1112)
  "Generic Audio" (0x1203)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 11
Profile Descriptor List:
  "Headset" (0x1108)
    Version: 0x0100

Service Name: Voice Gateway
Service RecHandle: 0x10003
Service Class ID List:
  "Handsfree Audio Gateway" (0x111f)
  "Generic Audio" (0x1203)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 10
Profile Descriptor List:
  "Handsfree" (0x111e)
    Version: 0x0105

Nokia E70:

$ sdptool browse 00:12:D1:XX:XX:XX
Browsing 00:12:D1:XX:XX:XX ...
Service Name: AVRCP Target
Service Description: Audio Video Remote Control
Service Provider: Symbian Software Ltd.
Service RecHandle: 0x10000
Service Class ID List:
  "AV Remote" (0x110e)
Protocol Descriptor List:
  "L2CAP" (0x0100)
    PSM: 23
  "AVCTP" (0x0017)
    uint16: 0x100
    uint16: 0xf00

Service Name: Hands-Free Audio Gateway
Service RecHandle: 0x10001
Service Class ID List:
  "Handsfree Audio Gateway" (0x111f)
  "Generic Audio" (0x1203)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 28
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "Handsfree Audio Gateway" (0x111f)
    Version: 0x0101

Service Name: Headset Audio Gateway
Service RecHandle: 0x10002
Service Class ID List:
  "Headset Audio Gateway" (0x1112)
  "Generic Audio" (0x1203)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 29
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "Headset" (0x1108)
    Version: 0x0100

Service Name: SyncMLClient
Service RecHandle: 0x10003
Service Class ID List:
  UUID 128: 00000002-0000-1000-8000-0002ee000002
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 10
  "OBEX" (0x0008)
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "" (0x00000002-0000-1000-8000-0002ee000002)
    Version: 0x0100

Service Name: OBEX File Transfer
Service RecHandle: 0x10004
Service Class ID List:
  "OBEX File Transfer" (0x1106)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 11
  "OBEX" (0x0008)
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "OBEX File Transfer" (0x1106)
    Version: 0x0100

Service Name: Nokia OBEX PC Suite Services
Service RecHandle: 0x10005
Service Class ID List:
  UUID 128: 00005005-0000-1000-8000-0002ee000001
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 12
  "OBEX" (0x0008)
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "" (0x00005005-0000-1000-8000-0002ee000001)
    Version: 0x0100

Service Name: SyncML DM Client
Service RecHandle: 0x10006
Service Class ID List:
  UUID 128: 00000004-0000-1000-8000-0002ee000002
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 13
  "OBEX" (0x0008)
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "" (0x00000004-0000-1000-8000-0002ee000002)
    Version: 0x0100

Service Name: Nokia SyncML Server
Service RecHandle: 0x10007
Service Class ID List:
  UUID 128: 00005601-0000-1000-8000-0002ee000001
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 14
  "OBEX" (0x0008)
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "" (0x00005601-0000-1000-8000-0002ee000001)
    Version: 0x0100

Service Name: SIM Access
Service RecHandle: 0x10008
Service Class ID List:
  "SIM Access" (0x112d)
  "Generic Telephony" (0x1204)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 8
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "SIM Access" (0x112d)
    Version: 0x0101

Service Name: OBEX Object Push
Service RecHandle: 0x10009
Service Class ID List:
  "OBEX Object Push" (0x1105)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 9
  "OBEX" (0x0008)
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "OBEX Object Push" (0x1105)
    Version: 0x0100

Service Name: Dial-Up Networking
Service RecHandle: 0x1000a
Service Class ID List:
  "Dialup Networking" (0x1103)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 2
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "Dialup Networking" (0x1103)
    Version: 0x0100

Service RecHandle: 0x1000b
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 1

Service Name: Imaging
Service RecHandle: 0x1000c
Service Class ID List:
  "Imaging Responder" (0x111b)
Protocol Descriptor List:
  "L2CAP" (0x0100)
  "RFCOMM" (0x0003)
    Channel: 15
  "OBEX" (0x0008)
Language Base Attr List:
  code_ISO639: 0x454e
  encoding:    0x6a
  base_offset: 0x100
Profile Descriptor List:
  "Imaging" (0x111a)
    Version: 0x0100

What I’m missing in particular is the dialup access. Sometimes I still need to dial into a remote modem for out of band system access… I wonder how hard it would be to add that bluetooth profile.
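
The full dumps are long, so here is a small sketch for boiling sdptool output down to what matters for this comparison. Both helpers are hypothetical names; they only read sdptool browse output on stdin and rely on the line formats shown above.

```sh
#!/bin/sh
# service_names: reads `sdptool browse` output on stdin and prints one
# advertised service name per line.
service_names() {
  sed -n 's/^Service Name: //p'
}

# has_dun: succeeds if the output on stdin advertises the
# Dial-Up Networking service class (0x1103).
has_dun() {
  grep -q '"Dialup Networking" (0x1103)'
}

# Example calls (addresses masked as above):
#   sdptool browse 00:22:A5:XX:XX:XX | service_names
#   sdptool browse 00:22:A5:XX:XX:XX | has_dun || echo "no dial-up networking"
```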

I like most other things about the G1. Keyboard + 3G data == awesome. Battery life is not going to be nearly as good as the E70, though…


IIS taking a nosedive

Netcraft’s June 2009 Web Server Survey is very interesting. Check out the IIS line on this graph (red):

That sharp drop is a reduction from 29,049,223 (May) to 21,898,527 (June) active sites. Netcraft explains the drop like this:

A reduction in activity at Microsoft Live Spaces was responsible for the large drop in the number of Microsoft-IIS sites detected.

This makes me wonder exactly how many of those IIS-hosted active sites are actually run by Microsoft (or its partners). The fact that just one of Microsoft’s services was responsible for over 7 million “active sites” – or 25% of the total number of active sites detected as running IIS in May 2009 – calls into question how valid the IIS numbers in the webserver survey are. I think this suggests IIS use is far less prominent outside the Microsoft campus than the ‘active sites’ numbers indicate.
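
For the record, the arithmetic behind those figures, using only the two Netcraft numbers quoted above:

```sh
#!/bin/sh
# Active IIS sites reported by Netcraft, as quoted above.
may=29049223
june=21898527

# Absolute drop between the two surveys.
drop=$((may - june))
# The same drop as a percentage of the May total, one decimal place.
pct=$(awk -v d="$drop" -v m="$may" 'BEGIN { printf "%.1f", d / m * 100 }')

echo "drop: $drop sites ($pct% of the May total)"
```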


on the importance of gem cleanup

I have a monit config that tries to stop/start mongrel instances like this:

  start program = "/usr/bin/mongrel_rails cluster::start -C path-to-mongrel_cluster.yml --clean --only PORT"
  stop program = "/usr/bin/mongrel_rails cluster::stop -C path-to-mongrel_cluster.yml --clean --only PORT"

I have the latest mongrel_cluster gem installed (1.0.5), and yet mongrel_rails kept throwing errors about --clean and --only:

invalid option: --clean for command 'cluster::start'
invalid option: --only for command 'cluster::start'

Turns out I had an older mongrel_cluster gem installed as well:

$ sudo gem cleanup mongrel_cluster
Cleaning up installed gems...
:0:Warning: Gem::SourceIndex#search support for Regexp patterns is deprecated
Attempting to uninstall mongrel_cluster-0.2.1
Successfully uninstalled mongrel_cluster-0.2.1
Clean Up Complete

After running gem cleanup, the mongrel_rails commands above started working.

This kind of code behaviour irks me – it’s not intuitive. It does not help that ‘gem list’ suggests that having multiple versions of a gem installed is not a problem – and it usually is not. I guess the mongrel_cluster gem is an exception. File this one under ‘good to know’…
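
To catch this sort of thing earlier, a quick filter over the gem list output shows which gems have more than one version installed. This is a sketch that assumes the usual "name (version, version)" line format; multi_version_gems is a made-up helper name.

```sh
#!/bin/sh
# multi_version_gems: reads `gem list` output on stdin and prints only the
# gems that have more than one version installed, i.e. lines whose
# parenthesised version list contains a comma.
multi_version_gems() {
  grep -E '^[^ ]+ \(.*,.*\)$'
}

# Example call:
#   gem list | multi_version_gems
```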
