bcnu - web based monitor

Version 1.22

Summary of Features

What is bcnu?
bcnu is a web-based system management tool which reports the status of networked systems in a simple, easy-to-use way. It uses a web browser to display information about hosts in tabular form. Coloured icons show the status of monitored conditions, and clicking on an icon brings up detail about the state of the system.
Its main features:

easy to use
simple agent based monitoring with flexible thresholds
multiple views of the same information
includes agent scheduler for total flexibility
automatic logging and resend of messages

Standard agents include:

network monitor, ping, http, ftp etc
log file scanning for messages and warnings
checking for required processes to be running
filesystem mount status and space usage
processor usage and system up time
virtual memory and paging

Special agents include:

volume management checks for Solaris, AIX and HP-UX
system availability

The bcnu display is designed to be very easy to use. It uses a web browser to graphically display the status of systems which are being monitored.

Monitor Page layout

Heading

The layout of the bcnu monitor pages is very simple. At the top of each page there is a heading area.

The status box is the most important area. It describes the status of the systems being monitored on the page. If one or more systems have logged an error, the background of the box will be red. If one or more systems have logged a warning, the background of the box will be orange. The status box will also display an information message, e.g. Errors detected or Warning.

If all systems are ok then the background will be green and the message will be No problems reported
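
The rule for the status box can be sketched as a small shell function. This is only an illustration: the function name status_box and its two count arguments are assumptions and not part of bcnu, and giving errors priority over warnings when both are present is also an assumption.

```shell
# Hypothetical helper (not part of bcnu): pick the status box colour
# and message from the number of systems reporting errors and warnings.
# Usage: status_box ERROR_COUNT WARNING_COUNT
status_box() {
    errors=$1
    warnings=$2
    if [ "$errors" -gt 0 ]; then
        # Assumption: an error outranks a warning.
        echo "red: Errors detected"
    elif [ "$warnings" -gt 0 ]; then
        echo "orange: Warning"
    else
        echo "green: No problems reported"
    fi
}
```

For example, status_box 0 3 would report the orange Warning box.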

Below the status message will be a line detailing which view is currently being monitored e.g. All systems.

Below this line is the date and time that the status page was last updated.

Main monitor area

The main monitor area normally consists of a table showing the status of all systems which are being monitored along with the system name.
Each system has an icon which indicates its status:

Green: a normal status

Red: an error condition

Amber: a warning condition

Ignored: any errors will be ignored, and the cell background will also be purple

A black background on a cell means that bcnu has been shut down on that system, i.e. the bcnu process on the system has been stopped. This normally only happens when the system itself is shut down. A cell can also be black if an agent is outwith its run hours.

A grey background on a cell means that the monitor has not recently received a report from the system, which may indicate an error.

A purple background on a cell means that any errors or warnings for that system are being ignored because they have been switched off by the administrator.
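
The colour scheme above amounts to a simple lookup, sketched here in shell. The state names (ok, error, warning, down, noreport, ignored) and the function name cell_colour are made up for the example; only the colours come from the description above.

```shell
# Hypothetical lookup (state names are assumptions, not bcnu's own):
# map a cell state to the colour described above.
cell_colour() {
    case "$1" in
        ok)       echo green ;;
        error)    echo red ;;
        warning)  echo amber ;;
        down)     echo black ;;   # bcnu stopped, or agent outwith its run hours
        noreport) echo grey ;;    # no recent report from the system
        ignored)  echo purple ;;  # errors and warnings switched off
        *)        echo unknown; return 1 ;;
    esac
}
```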

Holding the mouse pointer over the icon for a system will show the last reported status of the system. Clicking on the icon or the system name will bring up more detail about the system.

Available menu options

The dynamic version of the monitor has the most options. These are listed below:

Summary

This shows the status of all systems which are being monitored. It will show the most severe status found for a host.
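
Choosing the most severe status for a host could look roughly like the sketch below. The function names severity and worst_status, and the status words ok, warning and error, are assumptions for illustration; this is not the code bcnu uses.

```shell
# Hypothetical helper: rank a status word; higher means more severe.
severity() {
    case "$1" in
        error)   echo 2 ;;
        warning) echo 1 ;;
        *)       echo 0 ;;
    esac
}

# Print the most severe status among the arguments (the agent statuses
# for one host). With no arguments the result is "ok".
worst_status() {
    worst=ok
    for s in "$@"; do
        if [ "$(severity "$s")" -gt "$(severity "$worst")" ]; then
            worst=$s
        fi
    done
    echo "$worst"
}
```

A host whose agents report ok, warning and ok would therefore be shown as warning.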

Detail

This shows detailed information for all agents of all systems which are being monitored. It will show the most severe status found for a host.

Status Only

As its name implies, this shows only the status for all systems, i.e. Normal, Warning etc., but not all of the agent statuses or the system details. If there are any errors or warnings, it will show the affected systems.

Query (dynamic only)

This is one of the most powerful and resource intensive options. It presents a query screen which allows you to view information on either all or specific systems and agents for a specific date. It allows filtering for errors, warnings or normal messages.

All of the bcnu data logging files can be queried. For example, if I had a problem with a system two weeks ago and I keep my logs for more than two weeks, I could run a query to show all errors logged for that system on that date. If I knew which agent had logged the error, I could refine the query even further.

If the actual data files which are sent when an agent logs an error are retained then these will also be displayed automatically.
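
The kind of filtering the query screen performs can be imitated from the command line. The sketch below assumes a made-up log layout of one message per line with the fields date, host, agent and status first; the real bcnu log format is not described here, so the layout, the function name query_log and its arguments are all hypothetical.

```shell
# Hypothetical log line layout (NOT the real bcnu format):
#   YYYY-MM-DD host agent status message...
# Usage: query_log FILE DATE HOST STATUS [AGENT]
# Prints the lines matching the date, host and status, optionally
# narrowed to a single agent.
query_log() {
    awk -v d="$2" -v h="$3" -v s="$4" -v a="${5:-}" '
        $1 == d && $2 == h && $4 == s && (a == "" || $3 == a)
    ' "$1"
}
```

Under these assumptions, query_log bcnu.log 2024-01-01 hosta error fs would list the errors logged by the fs agent for hosta on that date.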

Help
Displays this document

Agents

bcnu uses simple shell scripts called agents to handle all of the checking and alerting required in the system. Some of these agents are run on the monitored (client) system. Others are run from another system and send results back to the monitor.

av

The av agent shows system availability. It is updated when the bcnu service on the box is brought up or down. It also sends an update every hour.

detail

The command detail shows a listing of the current agents and their config file. It also shows the version of the operating system installed.

net- family

The net agents are actually a family of agents, which test the availability of various network facilities.

These agents are normally run from another system to test connectivity over the network. For example, system A checks that ftp is working on system B and reports back to the monitor (system C). Meanwhile system B checks that ftp on system A is working, and so on.

In its most basic form, the net-ping agent will check that a system is responding over the network.

Other net- agents are:

net-bcnu Checks that the bcnu service is responding

net-telnet Checks that the users can login via remote terminals

net-ftp Checks that file transfers are possible

net-smtp Checks that the system will accept email

net-http Checks that a web server is responding

net-oracle Checks that Oracle SQL*net is available

net-printer Checks for remote printer service

net-squid Checks that the internet proxy is available

Normal status will be:

net-ping ok - up and available

net-nnnn ok - Service nnnn available

Error status will be:

net-nnnn error - Service nnnn not responding (critical systems)

Warning status will be:

net-nnnn warning - Service nnnn not responding (for non-critical systems)
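
The way a net- agent turns a check result into one of the status lines above might be sketched as follows. The function name net_status and its arguments are assumptions for the example, the check itself is left out, and the notes in brackets are treated as commentary rather than part of the message.

```shell
# Hypothetical helper (not bcnu code).
# Usage: net_status SERVICE RESULT CRITICAL
#   SERVICE  - the part after "net-", e.g. ftp for net-ftp
#   RESULT   - "up" or "down" (how the check is made is not shown)
#   CRITICAL - "yes" for a critical system, anything else otherwise
net_status() {
    service=$1
    result=$2
    critical=$3
    if [ "$result" = "up" ]; then
        if [ "$service" = "ping" ]; then
            echo "net-ping ok - up and available"
        else
            echo "net-$service ok - Service $service available"
        fi
    elif [ "$critical" = "yes" ]; then
        echo "net-$service error - Service $service not responding"
    else
        echo "net-$service warning - Service $service not responding"
    fi
}
```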

procs

The procs agent checks that required processes are running on the system, e.g. inetd, oracle etc.

Normally an error is sent if the process is not running. You may also specify a minimum and/or maximum number of these processes that should be running, e.g. for a web server.

Normal status will be:

procs ok - all processes running

Error status will be:

procs error - process pname not running

procs error - nn pname processes (should be at least mm)

procs error - nn pname processes (should be less than mm)

Warning status will be: None
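
The decision made for a single process can be sketched as below. This is an illustration only: the function name procs_status and its arguments are hypothetical, the process count is passed in rather than taken from ps, and treating a count equal to the maximum as an error follows the wording "should be less than" but is otherwise an assumption.

```shell
# Hypothetical helper (not bcnu code).
# Usage: procs_status PNAME COUNT [MIN] [MAX]
#   COUNT is the number of PNAME processes currently running.
#   MIN defaults to 1; MAX is unchecked when omitted.
procs_status() {
    pname=$1
    count=$2
    min=${3:-1}
    max=${4:-}
    if [ "$count" -eq 0 ]; then
        echo "procs error - process $pname not running"
    elif [ "$count" -lt "$min" ]; then
        echo "procs error - $count $pname processes (should be at least $min)"
    elif [ -n "$max" ] && [ "$count" -ge "$max" ]; then
        echo "procs error - $count $pname processes (should be less than $max)"
    else
        echo "procs ok - all processes running"
    fi
}
```

For example, procs_status httpd 2 5 20 reports that there are 2 httpd processes where at least 5 are expected.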

fs

The fs agent checks each filesystem that is specified in the agent config file. It checks that the filesystem is mounted, and compares the space used against a warning level and an error level. Each filesystem can have its own warning and error level for percentage of space used.

A listing of df output will be displayed for all conditions. When an error or warning is found there will also be a du listing to help in tracking down the cause of the problem.

Normal status will be:

fs ok - filesystems stable

Error status will be:

fs error - filesystem fsname is not mounted

fs error - filesystem fs >threshold% full (current%)

Warning status will be:

fs warning - filesystem fs >threshold% full (current%)
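
The space check for one filesystem can be sketched as a comparison against the two levels. The function name fs_status and its arguments are assumptions, the mount check is left out, and the percentage used is passed in as a whole number.

```shell
# Hypothetical helper (not bcnu code).
# Usage: fs_status FSNAME USED_PCT WARN_PCT ERROR_PCT
# All percentages are whole numbers without the % sign.
fs_status() {
    fs=$1
    used=$2
    warn=$3
    err=$4
    if [ "$used" -gt "$err" ]; then
        echo "fs error - filesystem $fs >$err% full ($used%)"
    elif [ "$used" -gt "$warn" ]; then
        echo "fs warning - filesystem $fs >$warn% full ($used%)"
    else
        echo "fs ok - filesystems stable"
    fi
}
```

One way to obtain the percentage used is from the Capacity column of POSIX df output, for example:

    fs_status /var "$(df -P /var | awk 'NR==2 { sub(/%/, "", $5); print $5 }')" 80 95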

logs

The logs agent will check log files for errors and warnings. When a message is found it will be displayed as an attachment.

Normal status will be:

logs ok - pattern not found

Error status will be:

logs error - pattern found in logfile name

Warning status will be:

logs warning - pattern found in logfile name
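
A log scan of this kind can be sketched with grep. The function name logs_status and its arguments are hypothetical, and the sketch assumes that "pattern" and "name" in the messages above stand for the matched pattern and the log file name. It does not remember how far it has already read, so every call rescans the whole file.

```shell
# Hypothetical helper (not bcnu code).
# Usage: logs_status LOGFILE ERROR_PATTERN WARNING_PATTERN
# The patterns are basic regular expressions as understood by grep.
logs_status() {
    logfile=$1
    errpat=$2
    warnpat=$3
    if grep -q -e "$errpat" "$logfile"; then
        echo "logs error - $errpat found in logfile $logfile"
    elif grep -q -e "$warnpat" "$logfile"; then
        echo "logs warning - $warnpat found in logfile $logfile"
    else
        echo "logs ok - pattern not found"
    fi
}
```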

uptime

The uptime agent will check various performance/availability figures for the system.

If the system is up for less than the number of days specified, it will report an error. This should catch times when the system crashed and rebooted itself.

If the 5 minute load average exceeds its threshold, a warning will be reported. If the 15 minute load average exceeds its threshold, an error will be reported. This is an indication of heavy processor usage and possible poor performance.

A listing of uptime output will be displayed for all conditions. When an error or warning is found there will also be a ps listing to help in tracking down the cause of the problem.

Normal status will be:

uptime ok - uptime stable

Error status will be:

uptime error - load average 15 minutes (nn hundredths)

Warning status will be:

uptime warning - load average 5 minutes (nn hundredths)
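
The load average comparison can be sketched as follows. Load averages are converted to hundredths (so 1.25 becomes 125) to allow whole-number comparison in the shell. The function names, the argument order and the reading of nn as the current value are all assumptions, and the check on days since boot is left out.

```shell
# Convert a load average such as 1.25 to hundredths (125),
# rounding to the nearest whole number.
to_hundredths() {
    awk -v v="$1" 'BEGIN { printf "%d\n", v * 100 + 0.5 }'
}

# Hypothetical helper (not bcnu code).
# Usage: uptime_status LOAD5 LOAD15 WARN5 ERR15
#   LOAD5, LOAD15 - load averages as printed by uptime, e.g. 1.25
#   WARN5, ERR15  - thresholds in hundredths, e.g. 200 for 2.00
uptime_status() {
    l5=$(to_hundredths "$1")
    l15=$(to_hundredths "$2")
    if [ "$l15" -gt "$4" ]; then
        echo "uptime error - load average 15 minutes ($l15 hundredths)"
    elif [ "$l5" -gt "$3" ]; then
        echo "uptime warning - load average 5 minutes ($l5 hundredths)"
    else
        echo "uptime ok - uptime stable"
    fi
}
```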

vmstat

The vmstat agent will check various processor/memory performance figures for the system.

Warnings here can be an indication of heavy system usage and possible poor performance.

A listing of vmstat output will be displayed for all conditions. When an error or warning is found there will also be a ps listing to help in tracking down the cause of the problem.

Normal status will be:

vmstat ok - vmstat stable

Error status will be:

vmstat error - more than swapwaitthresh processes waiting (nn)

vmstat error - more than swapthresh processes swapped (nn)

Warning status will be:

vmstat warning - less than pthreshold % processor idle (nn)

vmstat warning - more than pagethresh pages swapped (nn)

vmstat warning - less than freeswap free swap (nn)
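
The threshold tests behind these messages can be sketched as below. The threshold names are taken from the messages above, but their values, the function name vmstat_status, its arguments and the order in which the tests are made are assumptions. Which vmstat columns supply the figures differs between operating systems and is not shown.

```shell
# Thresholds, named after the placeholders in the messages above.
# The values are made up for the example.
swapwaitthresh=5
swapthresh=2
pthreshold=10
pagethresh=100
freeswap=50000

# Hypothetical helper (not bcnu code). Reports the first condition
# that fails, checking the error conditions before the warnings.
# Usage: vmstat_status WAITING SWAPPED IDLE_PCT PAGES FREE_SWAP
vmstat_status() {
    if [ "$1" -gt "$swapwaitthresh" ]; then
        echo "vmstat error - more than $swapwaitthresh processes waiting ($1)"
    elif [ "$2" -gt "$swapthresh" ]; then
        echo "vmstat error - more than $swapthresh processes swapped ($2)"
    elif [ "$3" -lt "$pthreshold" ]; then
        echo "vmstat warning - less than $pthreshold % processor idle ($3)"
    elif [ "$4" -gt "$pagethresh" ]; then
        echo "vmstat warning - more than $pagethresh pages swapped ($4)"
    elif [ "$5" -lt "$freeswap" ]; then
        echo "vmstat warning - less than $freeswap free swap ($5)"
    else
        echo "vmstat ok - vmstat stable"
    fi
}
```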

vmgt

The vmgt agent is a disk volume management monitor. It checks for error conditions in Solstice DiskSuite on Solaris, or LVM on AIX and HP-UX.

Errors could be broken or stale mirrors, hot spares in use etc.

The output of the appropriate volume management command will be listed so that the actual problem can be checked.

Normal status will be:

vmgt ok - disk volumes stable

Error status will be:

vmgt error - disk volume requires maintenance

Warning status will be: None