Thread: Nagios check_bioctl available

Nagios check_bioctl available
user name
2006-07-29 00:07:14
I have written a perl script that parses the output from bioctl and
returns it in a format that Nagios can use.

check_bioctl is available here:
http://openbsd.somedomain.net/nagios/check_bioctl-1.3.tar.gz

It is useful to me, and so I thought it might be useful to someone else.

I wrote this on OpenBSD 3.9 and tested on Dell PERC 3/DC controllers
using the ami driver. It should work just fine on other versions of
OpenBSD as well as with other cards and drivers. If you do run into
trouble, send me the output from bioctl on the system you are having
trouble with and I can try to help. Patches to fix problems would be
even better.


One thing I ran into is that bioctl needs to run as root to get access
to /dev/bio, even for read only access.  Is there a way to query bioctl
without needing root?


Also, in biovar.h, both a raid volume and a disk can be "Offline".
However, I am not sure what that means.  Currently it is a WARNING, but
I don't know what status it should be set to.

http://www.openbsd.org/cgi-bin/cvsweb/src/sys/dev/biovar.h?rev=1.25&content-type=text/x-cvsweb-markup

If anyone knows what the "Offline" status means, I would sure like to
know.


An additional useful feature is that you can specify multiple devices to
check in a single check:

/usr/local/libexec/nagios/check_bioctl -d ami0 -d ami1


Output is similar to below, except that with NAGIOS_OUTPUT set to 1 in
the source (as it usually is), all output is on a single line separated
with <br>, and it hides any devices that are OK because Nagios has a
limit on the length of a response.

CRITICAL (1):
   ami0 sd1 Degraded
WARNING (1):
   ami0 0:8.0 Rebuild <QUANTUM ATLAS10K2-TY184JDA40>
OK (7):
   ami0 sd0 Online
   ami0 0:0.0 Online <IBM     DMVS09M         0220>
   ami0 0:1.0 Online <IBM     DRVS09D         0140>
   ami0 0:3.0 Online <QUANTUM ATLAS10K2-TY184JDA40>
   ami0 0:4.0 Online <QUANTUM ATLAS10K2-TY184JDA40>
   ami0 0:5.0 Online <QUANTUM ATLAS10K2-TY184JDA40>
   ami0 0:2.0 Hot spare <IBM     DRVS09D         0140>
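
The two output modes described above can be sketched roughly as
follows. This is Python rather than the script's own Perl, and
format_report, its arguments, and the exact layout of each line are
assumptions made for illustration. Only the <br> separator and the
hiding of OK devices come from the description above.

```python
def format_report(results, nagios_output=True):
    """Render grouped check results as text.

    results maps a state name ("CRITICAL", "WARNING", "OK") to a list of
    device description strings. This structure is hypothetical.
    """
    lines = []
    for state in ("CRITICAL", "WARNING", "OK"):
        items = results.get(state, [])
        if not items:
            continue
        # In Nagios mode, leave out healthy devices to keep the line short.
        if nagios_output and state == "OK":
            continue
        lines.append(f"{state} ({len(items)}):")
        lines.extend(f"   {item}" for item in items)
    # Nagios mode joins everything onto one line with <br>.
    separator = "<br>" if nagios_output else "\n"
    return separator.join(lines)
```

Calling format_report(results, nagios_output=False) on the same data
would give the multi-line form shown above.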


I currently configure it something like this:

$ grep check_bioctl /etc/sudoers /etc/nrpe.cfg
/etc/sudoers:_nrpe   ALL = NOPASSWD:/usr/local/libexec/nagios/check_bioctl -d ami0
/etc/nrpe.cfg:command[check_bioctl]=/usr/bin/sudo /usr/local/libexec/nagios/check_bioctl -d ami0


Also available is check_hw_sensors for checking of sysctl hw.sensors
from Nagios.

http://openbsd.somedomain.net/nagios/

l8rZ,
-- 
andrew - ICQ# 253198 - JID: afresh1jabber.org

BOFH excuse of the day: YOU HAVE AN I/O ERROR -> Incompetent Operator error

Nagios check_bioctl available
user name
2006-07-29 02:17:28
andrew fresh wrote:
> I have written a perl script that parses the output from bioctl and
> returns it in a format that Nagios can use.

Sweet 

>
> check_bioctl is available here:
> http://openbsd.somedomain.net/nagios/check_bioctl-1.3.tar.gz
>
> It is useful to me, and so I thought it might be useful to someone else.
>
> I wrote this on OpenBSD 3.9 and tested on Dell PERC 3/DC controllers
> using the ami driver. It should work just fine on other versions of
> OpenBSD as well as with other cards and drivers. If you do run into
> trouble, send me the output from bioctl on the system you are having
> trouble with and I can try to help. Patches to fix problems would be
> even better.
>
>
> One thing I ran into is that bioctl needs to run as root to get access
> to /dev/bio, even for read only access.  Is there a way to query bioctl
> without needing root?

No!

>
>
> Also, in biovar.h, both a raid volume and a disk can be "Offline".
> However, I am not sure what that means.  Currently it is a WARNING, but
> I don't know what status it should be set to.

If 2 or more physical disks of a RAID 5 are offline, a volume will be
marked offline as well.  An offline RAID 5 is obviously a critical
event.  Hope this makes sense since I am not exactly sure what you are
asking.

>
> http://www.openbsd.org/cgi-bin/cvsweb/src/sys/dev/biovar.h?rev=1.25&content-type=text/x-cvsweb-markup
>
> If anyone knows what the "Offline" status means, I would sure like to
> know.
>
>
> An additional useful feature is that you can specify multiple devices to
> check in a single check
>
> /usr/local/libexec/nagios/check_bioctl -d ami0 -d ami1
>
>
> Output is similar to below, except with NAGIOS_OUTPUT set to 1 in the
> source (as it usually is) all output is on a single line separated with
> <br> and it hides any devices that are OK because Nagios has a limit on
> the length of a response.
>
> CRITICAL (1):
>    ami0 sd1 Degraded
> WARNING (1):
>    ami0 0:8.0 Rebuild <QUANTUM ATLAS10K2-TY184JDA40>
> OK (7):
>    ami0 sd0 Online
>    ami0 0:0.0 Online <IBM     DMVS09M         0220>
>    ami0 0:1.0 Online <IBM     DRVS09D         0140>
>    ami0 0:3.0 Online <QUANTUM ATLAS10K2-TY184JDA40>
>    ami0 0:4.0 Online <QUANTUM ATLAS10K2-TY184JDA40>
>    ami0 0:5.0 Online <QUANTUM ATLAS10K2-TY184JDA40>
>    ami0 0:2.0 Hot spare <IBM     DRVS09D         0140>
>
>
> I currently configure it something like this:
>
> $ grep check_bioctl /etc/sudoers /etc/nrpe.cfg
> /etc/sudoers:_nrpe   ALL = NOPASSWD:/usr/local/libexec/nagios/check_bioctl -d ami0
> /etc/nrpe.cfg:command[check_bioctl]=/usr/bin/sudo /usr/local/libexec/nagios/check_bioctl -d ami0
>
>
> Also available is check_hw_sensors for checking of sysctl hw.sensors
> from Nagios.
>
> http://openbsd.somedomain.net/nagios/
>
> l8rZ,

Nagios check_bioctl available
user name
2006-07-30 01:03:26
2006/7/29, andrew fresh <andrewmad-techies.org>:
> One thing I ran into is that bioctl needs to run as root to get access
> to /dev/bio, even for read only access.  Is there a way to query bioctl
> without needing root?

Well, I think you only need the status of the drives and that is
available using sysctl hw.sensors in current (you already mentioned
sysctl). A monitoring system should not use the capabilities of
bioctl, it just needs to know the status and report that.

Now that I think of it, I should add support to the upwatch monitoring
system too, but I am not that lucky to have hardware to actually test
it.

Wijnand

Nagios check_bioctl available
user name
2006-07-31 20:52:17
On Fri, Jul 28, 2006 at 09:17:28PM -0500, Marco Peereboom wrote:
> andrew fresh wrote:
> >I have written a perl script that parses the output from bioctl and
> >returns it in a format that Nagios can use.
> 
> Sweet 

Thanks!

> >One thing I ran into is that bioctl needs to run as root to get access
> >to /dev/bio, even for read only access.  Is there a way to query bioctl
> >without needing root?
> 
> No!

dang! oh well, sudo is a good enough solution then.  

> >Also, in biovar.h, both a raid volume and a disk can be "Offline".
> >However, I am not sure what that means.  Currently it is a WARNING, but
> >I don't know what status it should be set to.
>
> If 2 or more physical disks of a RAID 5 are offline a volume will be
> marked offline as well.  An offline RAID 5 is obviously a critical
> event.  Hope this makes sense since I am not exactly sure what you are
> asking.

I will change Offline to be a CRITICAL error.  

and here is the new version:
http://openbsd.somedomain.net/nagios/check_bioctl-1.4.tar.gz

However, I guess my question is what would cause a disk to be Offline?

There is a separate status for Failed, and I could see the RAID being
Offline if too many disks had Failed.


Are there any other statuses that should be different?  They seemed to be
fairly straightforward, but there may be good arguments for them to be
changed.

my %Status_Map = (
	Online      => 'OK',
	Offline     => 'CRITICAL',
	Degraded    => 'CRITICAL',
	Failed      => 'CRITICAL',
	Building    => 'WARNING',
	Rebuild     => 'WARNING',
	'Hot spare' => 'OK',
	Unused      => 'OK',
	Scrubbing   => 'WARNING',
	Invalid     => 'CRITICAL',
);
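
For illustration, a mapping like this could be reduced to one Nagios
result by letting the worst state win. The sketch below is Python
rather than the script's Perl. The worst_state helper, the severity
ordering, and the treatment of unrecognized statuses as UNKNOWN are
assumptions, not taken from check_bioctl. The exit codes 0 to 3 are
the standard Nagios plugin return codes.

```python
# Standard Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Same mapping as the Perl %Status_Map above.
STATUS_MAP = {
    "Online": "OK",
    "Offline": "CRITICAL",
    "Degraded": "CRITICAL",
    "Failed": "CRITICAL",
    "Building": "WARNING",
    "Rebuild": "WARNING",
    "Hot spare": "OK",
    "Unused": "OK",
    "Scrubbing": "WARNING",
    "Invalid": "CRITICAL",
}

# Assumed ordering from least to most severe when combining results.
SEVERITY = ["OK", "WARNING", "UNKNOWN", "CRITICAL"]


def worst_state(statuses):
    """Return the most severe Nagios state for a list of bioctl statuses."""
    # Anything not in the map is reported as UNKNOWN (an assumption).
    states = [STATUS_MAP.get(status, "UNKNOWN") for status in statuses]
    if not states:
        return "UNKNOWN"
    return max(states, key=SEVERITY.index)
```

A plugin built this way would exit with EXIT_CODES[worst_state(...)].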

l8rZ,
-- 
andrew - ICQ# 253198 - JID: afresh1jabber.org

BOFH excuse of the day: Windows 95 undocumented "feature"

Nagios check_bioctl available
user name
2006-07-31 21:09:21
On Sun, Jul 30, 2006 at 03:03:26AM +0200, Wijnand Wiersma wrote:
> 2006/7/29, andrew fresh <andrewmad-techies.org>:
> >One thing I ran into is that bioctl needs to run as root to get access
> >to /dev/bio, even for read only access.  Is there a way to query bioctl
> >without needing root?
>
> Well, I think you only need the status of the drives and that is
> available using sysctl hw.sensors in current (you already mentioned
> sysctl). A monitoring system should not use the capabilities of
> bioctl, it just needs to know the status and report that.

If that is the case, then this check will become obsolete.  That would
be nice!  I will have to go put -current on my test box and try it out.


As it is, on my 3.9-stable box, the output from sysctl, if it is
available, does not seem very reliable:

hw.sensors.29=esm0, Drive 0, drive, online
hw.sensors.30=esm0, Drive 1, drive, online
hw.sensors.31=esm0, Drive 2, drive, unknown
hw.sensors.32=esm0, Drive 3, drive, unknown
hw.sensors.33=esm0, Drive 4, drive, online
hw.sensors.34=esm0, Drive 5, drive, online
hw.sensors.35=esm0, Drive 6, drive, unknown
hw.sensors.36=esm0, Drive 7, drive, unknown

$ sudo bioctl ami0
Password:
Volume  Status     Size           Device
 ami0 0 Online         8984199168 sd0     RAID1
      0 Online         8984199168 0:0.0   safte0 <IBM     DRVS09D 0140>
      1 Online         8984199168 0:1.0   safte0 <IBM     DRVS09D 0140>
 ami0 1 Online        36234592256 sd1     RAID10
      0 Online        18117296128 0:3.0   safte0 <QUANTUM ATLAS10K2-TY184JDA40>
      1 Online        18117296128 0:4.0   safte0 <QUANTUM ATLAS10K2-TY184JDA40>
      2 Online        18117296128 0:5.0   safte0 <QUANTUM ATLAS10K2-TY184JDA40>
      3 Online        18117296128 0:8.0   safte0 <QUANTUM ATLAS10K2-TY184JDA40>
 ami0 2 Hot spare      8984199168 0:2.0   safte0 <IBM     DMVS09M 0220>
 ami0 3 Hot spare     18117296128 0:9.0   safte0 <QUANTUM ATLAS 10K 18SCA UCHD>
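
As a rough idea of how output like this can be turned into
per-device statuses, here is a small Python sketch (check_bioctl
itself is Perl and may parse things differently). The regular
expression and the parse_bioctl name are assumptions based only on
the column layout shown above, so other drivers or releases may need
adjustments.

```python
import re

LINE_RE = re.compile(
    r"^\s*(?:(\w+)\s+)?"  # optional controller name, present on volume rows
    r"(\d+)\s+"           # volume or disk index
    r"(.+?)\s+"           # status, which may contain a space ("Hot spare")
    r"(\d+)\s+"           # size
    r"(\S+)"              # device, e.g. sd0 or 0:0.0
    r"\s*(.*)$"           # remainder: RAID level, or enclosure and model
)


def parse_bioctl(text):
    """Return a list of (controller, device, status) tuples."""
    rows = []
    controller = None
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # header line, password prompt, blank lines
        name, _index, status, _size, device, _rest = match.groups()
        if name:
            controller = name  # disk rows inherit the last controller seen
        rows.append((controller, device, status))
    return rows
```

Each tuple can then be looked up in a status map to decide whether
the device is OK, WARNING, or CRITICAL.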


The rest of the sensors seem mostly correct though, and there are sure
enough of them!

$ sysctl hw.sensors | tail -1
hw.sensors.99=safte0, temp1, OK, temp, 27.78 degC / 82.00 degF
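
A check based on sysctl could pick out the drive sensors along these
lines. This is a Python sketch that assumes the four-field
"device, name, drive, value" layout of the 3.9 drive lines shown
earlier. The function names are hypothetical, and the format on
-current may well differ.

```python
def drive_sensors(sysctl_text):
    """Yield (device, name, value) for sensors whose type field is 'drive'."""
    for line in sysctl_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep or not key.startswith("hw.sensors."):
            continue
        fields = [field.strip() for field in value.split(",")]
        # Drive lines have exactly four fields with the type in the third.
        if len(fields) == 4 and fields[2] == "drive":
            yield fields[0], fields[1], fields[3]


def not_online(sysctl_text):
    """Return the drive sensors reporting anything other than 'online'."""
    return [s for s in drive_sensors(sysctl_text) if s[2] != "online"]
```

On the esm0 output above this would flag drives 2, 3, 6 and 7, which
is exactly the unreliable part.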


Also, on another box that has an external disk box connected with ses, I
don't get any status for those disks in sysctl.  The disks that are
actually in the server are using safte and those show up in sysctl.  I
don't know why, so now I have this check.


> Now that I think of it, I should add support to the upwatch monitoring
> system too, but I am not that lucky to have hardware to actually test
> it

If the information is available in sysctl in 4.0, that would be the
check to integrate.

l8rZ,
-- 
andrew - ICQ# 253198 - JID: afresh1jabber.org

BOFH excuse of the day: dynamic software linking table corrupted

Nagios check_bioctl available
user name
2006-07-31 21:51:07
dmesg please

On Mon, Jul 31, 2006 at 02:09:21PM -0700, andrew fresh wrote:
> On Sun, Jul 30, 2006 at 03:03:26AM +0200, Wijnand Wiersma wrote:
> > 2006/7/29, andrew fresh <andrewmad-techies.org>:
> > >One thing I ran into is that bioctl needs to run as root to get access
> > >to /dev/bio, even for read only access.  Is there a way to query bioctl
> > >without needing root?
> >
> > Well, I think you only need the status of the drives and that is
> > available using sysctl hw.sensors in current (you already mentioned
> > sysctl). A monitoring system should not use the capabilities of
> > bioctl, it just needs to know the status and report that.
>
> If that is the case, then this check will become obsolete.  That would
> be nice!  I will have to go put -current on my test box and try it out.
>
>
> As it is, on my 3.9-stable box, the output from sysctl, if it is
> available, does not seem very reliable:
>
> hw.sensors.29=esm0, Drive 0, drive, online
> hw.sensors.30=esm0, Drive 1, drive, online
> hw.sensors.31=esm0, Drive 2, drive, unknown
> hw.sensors.32=esm0, Drive 3, drive, unknown
> hw.sensors.33=esm0, Drive 4, drive, online
> hw.sensors.34=esm0, Drive 5, drive, online
> hw.sensors.35=esm0, Drive 6, drive, unknown
> hw.sensors.36=esm0, Drive 7, drive, unknown
>
> $ sudo bioctl ami0
> Password:
> Volume  Status     Size           Device
>  ami0 0 Online         8984199168 sd0     RAID1
>       0 Online         8984199168 0:0.0   safte0 <IBM     DRVS09D 0140>
>       1 Online         8984199168 0:1.0   safte0 <IBM     DRVS09D 0140>
>  ami0 1 Online        36234592256 sd1     RAID10
>       0 Online        18117296128 0:3.0   safte0 <QUANTUM ATLAS10K2-TY184JDA40>
>       1 Online        18117296128 0:4.0   safte0 <QUANTUM ATLAS10K2-TY184JDA40>
>       2 Online        18117296128 0:5.0   safte0 <QUANTUM ATLAS10K2-TY184JDA40>
>       3 Online        18117296128 0:8.0   safte0 <QUANTUM ATLAS10K2-TY184JDA40>
>  ami0 2 Hot spare      8984199168 0:2.0   safte0 <IBM     DMVS09M 0220>
>  ami0 3 Hot spare     18117296128 0:9.0   safte0 <QUANTUM ATLAS 10K 18SCA UCHD>
>
>
> The rest of the sensors seem mostly correct though, and there are sure
> enough of them!
>
> $ sysctl hw.sensors | tail -1
> hw.sensors.99=safte0, temp1, OK, temp, 27.78 degC / 82.00 degF
>
>
> Also, on another box that has external disk box connected with ses, I
> don't get any status for those disks in sysctl.  The disks that are
> actually in the server are using safte and those show up in sysctl.  I
> don't know why, so now I have this check
>
>
> > Now that I think of it, I should add support to the upwatch monitoring
> > system too, but I am not that lucky to have hardware to actually test
> > it
>
> If the information is available in sysctl in 4.0, that would be the
> check to integrate.
>
> l8rZ,
> --
> andrew - ICQ# 253198 - JID: afresh1jabber.org
>
> BOFH excuse of the day: dynamic software linking table corrupted
