nagios, mdadm and snmp

I found this script while looking for a simple way to monitor mdadm arrays. The script is fine, but it has a subtle bug: it will never report an error, because the --detail parameter is missing from the call to mdadm. I modified the script a bit, like so:

#!/bin/sh
# (c) 2008 Jasper Spaans 

worst=0
msg=""

for dev in /dev/md?* ; do
  # -t makes mdadm's exit status reflect the state of the array
  mdadm --misc -t --detail "$dev" >/dev/null
  status=$?
  if [ "$status" -eq 0 ]; then
    msg="${msg} ${dev}: ok"
  elif [ "$status" -eq 1 ] ; then
    # keep an "unusable" result from an earlier array
    if [ "$worst" -ne 2 ] ; then
      worst=1
    fi
    msg="${msg} ${dev}: degraded"
  elif [ "$status" -eq 2 ] ; then
    worst=2
    msg="${msg} ${dev}: degraded - unusable"
  fi
done

echo "mdadm:$msg"
exit $worst

which I saved as /usr/local/bin/check-mdadm.sh.
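The mdadm man page also documents exit status 4 for --detail with -t (an error while trying to get information about the device), which the script above silently skips. A minimal sketch of a helper that covers it, assuming you want such arrays to show up in the message; describe_status is a hypothetical name, not part of the original script:

```shell
#!/bin/sh
# Sketch: translate an `mdadm --misc -t --detail` exit status into the
# message fragment used by check-mdadm.sh.
# describe_status is a hypothetical helper, not part of the original script.
describe_status() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "degraded" ;;
    2) echo "degraded - unusable" ;;
    4) echo "unknown - mdadm could not read the device" ;;
    *) echo "unknown - unexpected exit status $1" ;;
  esac
}
```

Inside the loop this could be used as msg="${msg} ${dev}: $(describe_status "$status")". Note that the nagios-side script only looks for the word "degraded", so an "unknown" entry would still be reported as OK unless that script is extended too.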

Add a line to snmpd.conf (and set up sudo accordingly, of course, since mdadm needs root privileges to query the arrays):

...
exec   mdadm /usr/bin/sudo /usr/local/bin/check-mdadm.sh
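The sudo setup is not shown here. A minimal sketch of a matching sudoers entry, assuming snmpd runs under an account called snmp (an assumption; the account name differs between distributions) and that the entry lives in its own file (the file name is hypothetical):

```
# /etc/sudoers.d/check-mdadm (hypothetical file name; edit it with visudo -f)
# "snmp" is an assumed account name; use the user your snmpd actually runs as.
snmp ALL=(root) NOPASSWD: /usr/local/bin/check-mdadm.sh
```

Restricting the rule to this one script keeps the snmpd account from running anything else as root.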

Then add a small script on the nagios side (/usr/local/bin/nagios-check-mdadm):

#!/bin/sh

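# $1 is the host to query; replace YOUR-PUBLIC with your SNMP community string
SNMP=$(snmpwalk -v1 -c YOUR-PUBLIC "$1" extOutput | grep mdadm)
TMP1=$(echo "$SNMP" | grep degraded)
TMP2=$(echo "$SNMP" | sed -e 's/^.*mdadm: //')

# use exit, not return: return is only valid inside a function
if [ -z "$TMP1" ]; then
  echo "OK: $TMP2"
  exit 0
else
  echo "ERROR: $TMP2"
  exit 2
fi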
fi
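One thing this script does not distinguish is "no answer" from "all fine": if the SNMP query returns nothing, it prints OK. A minimal sketch of the same decision as a function, with an extra UNKNOWN branch that is my addition and not part of the original; classify_mdadm is a hypothetical name, and the sample line in the usage note is an assumption about what snmpwalk prints:

```shell
#!/bin/sh
# Sketch: the decision made by nagios-check-mdadm, as a function of the
# snmpwalk output. classify_mdadm is a hypothetical name.
# Return codes follow the Nagios plugin convention: 0 OK, 2 CRITICAL, 3 UNKNOWN.
classify_mdadm() {
  line=$1
  if [ -z "$line" ]; then
    # Addition not in the original script: no data is not the same as OK.
    echo "UNKNOWN: no mdadm output received over SNMP"
    return 3
  fi
  # Strip everything up to and including the "mdadm: " prefix.
  summary=$(printf '%s\n' "$line" | sed -e 's/^.*mdadm: //')
  case "$line" in
    *degraded*) echo "ERROR: $summary"; return 2 ;;
    *)          echo "OK: $summary"; return 0 ;;
  esac
}
```

For example, classify_mdadm 'UCD-SNMP-MIB::extOutput.1 = STRING: mdadm: /dev/md0: ok' prints "OK: /dev/md0: ok" and returns 0. In the script itself it would be called as classify_mdadm "$SNMP" followed by exit $?.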

Add a bit of nagios config:

define command {
       command_name check_mdadm
       command_line /usr/local/bin/nagios-check-mdadm $HOSTADDRESS$
}
define service {
       use      defaults
       name     check_mdadm
       description   MDADM
       check_command check_mdadm
}

And voila, nagios notifications when disks fall out of the array.


One Response to nagios, mdadm and snmp

  1. I adapted a similar script (it may actually have been based on the same one) for use under Zabbix. However, I discovered that certain default mdadm installs run a full check of the array on a weekly basis, which triggers a false positive.

    if [ "$status" = 0 ] && \
    [ $(cat /sys/block/${md}/md/degraded) = 1 ] && \
    $( echo $mdadmoutput | grep -e State.*resyncing -e State.*recovering >/dev/null )

    then your device isn’t really degraded, but just doing a full check.

    For the full script, see http://support.ginsys.be/wsvn/scripts/zabbix/mdcheck.sh
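The condition in this comment can be sketched as a small standalone function. This is only one reading of the fragment, not the commenter's actual script: is_full_check is a hypothetical name, and the three arguments stand in for $status, the contents of /sys/block/<md>/md/degraded, and the text printed by mdadm --detail.

```shell
#!/bin/sh
# Sketch: true (exit status 0) when an array that looks degraded is, by the
# commenter's condition, only resyncing or recovering.
# is_full_check is a hypothetical name; the arguments are:
#   $1  the status value tested in the comment
#   $2  contents of /sys/block/<md>/md/degraded
#   $3  output of `mdadm --detail` for the array
is_full_check() {
  status=$1
  degraded=$2
  detail=$3
  [ "$status" = 0 ] &&
    [ "$degraded" = 1 ] &&
    printf '%s\n' "$detail" |
      grep -q -e 'State.*resyncing' -e 'State.*recovering'
}
```

A caller would skip the alert when the function succeeds, for example: if is_full_check "$status" "$(cat /sys/block/md0/md/degraded)" "$(mdadm --detail /dev/md0)"; then echo "check in progress"; fi (md0 is just an example device).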
