This is a warning condition (amber)
|
This means that any errors will be ignored; the background will also be purple
|
A black background on a cell means that the bcnu system has been shut down, i.e. the bcnu process on the system has been stopped. This normally only happens when the system is shut down. A cell can also be black if an agent is outside its run hours.
A grey background on a cell means that the monitor has not received a report from the system recently, which may indicate an error.
A purple background on a cell means that any errors or warnings for that system are being ignored because they have been switched off by the administrator.
Holding the mouse pointer over the icon for a system will show the last reported status of the system. Clicking on the icon or the system name will bring up more detail about the system.
Available menu options
The dynamic version of the monitor has the most options. These are listed below:
Summary
This shows the status of all systems which are being monitored. It will show the most severe status found for a host.
Detail
This shows detailed information for all agents of all systems which are being monitored. It will show the most severe status found for a host.
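The "most severe status" rule used by the Summary and Detail views can be sketched as follows. This is only an illustration in Python, not the monitor's own code: the severity order (error above warning above ok) and the names `SEVERITY` and `host_status` are assumptions.

```python
# Severity order is an assumption: the text only says that the
# "most severe" status is shown, so error > warning > ok is assumed.
SEVERITY = {"ok": 0, "warning": 1, "error": 2}


def host_status(agent_statuses):
    """Return the most severe status among a host's agent statuses.

    agent_statuses maps an agent name to "ok", "warning" or "error".
    Returns None when no agent has reported for the host.
    """
    if not agent_statuses:
        return None
    return max(agent_statuses.values(), key=SEVERITY.__getitem__)


print(host_status({"fs": "ok", "procs": "error", "uptime": "warning"}))  # error
```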
Status Only
As its name implies, this shows only the status for all systems, i.e. Normal, Warning etc., but not all of the agent statuses or the system details. If there are any errors or warnings, it will show the affected systems.
Query (dynamic only)
This is one of the most powerful and resource-intensive options. It presents a query screen which allows you to view information on either all or specific systems and agents for a specific date. It allows filtering for errors, warnings or normal messages.
All of the bcnu data logging files can be queried. For example, if a system had a problem 2 weeks ago and the logs are kept for more than 2 weeks, a query can show all errors logged for that system on that date. If the agent that logged the error is known, the query can be refined even further.
If the actual data files which are sent when an agent logs an error are retained then these will also be displayed automatically.
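The kind of filtering the query screen offers can be pictured with a small sketch. The real bcnu log format is not described here, so the `LogRecord` layout and the `query` function below are hypothetical and only illustrate narrowing by date, system, agent and status.

```python
from datetime import date
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Hypothetical record layout; the real bcnu log format is not shown here.
    day: date
    system: str
    agent: str
    status: str  # "ok", "warning" or "error"
    message: str


def query(records, day, system=None, agent=None, status=None):
    """Return the records for one date, optionally narrowed further.

    A filter left as None matches everything, mirroring the choice of
    "all or specific" systems and agents on the query screen.
    """
    return [
        r for r in records
        if r.day == day
        and (system is None or r.system == system)
        and (agent is None or r.agent == agent)
        and (status is None or r.status == status)
    ]
```

For example, `query(records, date(2024, 1, 8), system="alpha", status="error")` would return every error logged for the hypothetical system `alpha` on that date; adding `agent="fs"` refines it to one agent.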
Help
Displays this document
Agents
bcnu uses simple shell scripts called agents to handle all of the checking and alerting required in the system. Some of these agents are run on the monitored (client) system. Others are run from another system and send results back to the monitor.
av
The av agent shows system availability. It is updated when the bcnu service on the box is brought up or down. It also sends an update every hour.
The command detail shows a listing of the current agents and their config file. It also shows the version of the operating system installed.
net- family
The net agents are a family of agents that test the availability of various network facilities.
These agents are normally run from another system to test connectivity over the network.
E.g. system A checks that ftp is working on system B and reports back to the monitor (system C). Meanwhile system B checks that ftp on system A is working, and so on.
In its most basic form, the net-ping agent checks that a system is responding over the network.
Other net- agents are:
net-bcnu | Checks that the bcnu service is responding |
net-telnet | Checks that the users can login via remote terminals |
net-ftp | Checks that file transfers are possible |
net-smtp | Checks that the system will accept email |
net-http | Checks that a web server is responding |
net-oracle | Checks that Oracle SQL*net is available |
net-printer | Checks for remote printer service |
net-squid | Checks that the internet proxy is available |
Normal status will be:
net-ping | ok - up and available |
net-nnnn | ok - Service nnnn available |
Error status will be:
net-nnnn | error - Service nnnn not responding (critical systems) |
Warning status will be:
net-nnnn | warning - Service nnnn not responding (for non-critical systems) |
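As an illustration of how a net- check maps onto the statuses above, here is a minimal Python sketch. The real agents are shell scripts whose implementation is not shown here; a plain TCP connection is assumed to be a sufficient probe, the function names are hypothetical, and the parenthetical notes about critical systems are assumed to be documentation rather than part of the message.

```python
import socket


def port_responds(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Assumption: a successful connect counts as "responding"; a real
    agent may go further and speak the protocol itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unknown host, unreachable
        return False


def net_status(service, responding, critical=True):
    """Map a probe result onto (level, message) in the style listed above."""
    if responding:
        return "ok", f"ok - Service {service} available"
    # Critical systems report an error, non-critical ones a warning.
    level = "error" if critical else "warning"
    return level, f"{level} - Service {service} not responding"
```

A check of ftp on system B from system A would then look like `net_status("ftp", port_responds("systemB", 21))`, where `systemB` is a placeholder host name.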
procs
The procs agent checks that required processes, e.g. inetd, oracle etc., are running on the system.
Normally an error is sent if the process is not running. You may also specify a minimum and/or maximum number of these processes that should be running, e.g. for a web server.
Normal status will be:
procs | ok - all processes running |
Error status will be:
procs | error - process pname not running |
procs | error - nn pname processes (should be at least mm) |
procs | error - nn pname processes (should be less than mm) |
Warning status will be: None
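The process-count rules above can be sketched as follows. This is an assumed reading of the messages, not the agent's code: the upper limit is treated as exclusive because the message says "should be less than", and the names `check_process`, `at_least` and `less_than` are hypothetical.

```python
def check_process(pname, count, at_least=None, less_than=None):
    """Return an error message for one process, or None if it is fine.

    count is the number of running processes named pname. The limits
    are optional, as in the text above.
    """
    if count == 0:
        return f"error - process {pname} not running"
    if at_least is not None and count < at_least:
        return f"error - {count} {pname} processes (should be at least {at_least})"
    # Assumption: the upper limit is exclusive, following the message wording.
    if less_than is not None and count >= less_than:
        return f"error - {count} {pname} processes (should be less than {less_than})"
    return None


def procs_status(results):
    """Collapse per-process results into the messages the agent would report."""
    errors = [r for r in results if r is not None]
    return errors or ["ok - all processes running"]
```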
fs
The fs agent will check each filesystem that is specified in the agent config file. It checks that the filesystem is mounted and compares the space used against a warning level and an error level. Each filesystem can have a separate warning and error level for percentage of space used.
A listing of df output will be displayed for all conditions. When an error or warning is found, there will also be a du listing to help in tracking down the cause of the problem.
Normal status will be:
fs | ok - filesystems stable |
Error status will be:
fs | error - filesystem fsname is not mounted |
fs | error - filesystem fs >threshold% full (current%) |
Warning status will be:
fs | warning - filesystem fs >threshold% full (current%) |
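The two-level threshold check can be sketched like this. It is an illustration only: the function name and parameters are hypothetical, and "greater than" is assumed as the comparison because the messages use ">".

```python
def check_filesystem(fsname, mounted, used_pct, warn_pct, error_pct):
    """Return a warning or error message for one filesystem, or None if it is fine.

    used_pct is the percentage of space used (as reported by df);
    warn_pct and error_pct are that filesystem's own levels.
    """
    if not mounted:
        return f"error - filesystem {fsname} is not mounted"
    # Test the error level first so the most severe condition wins.
    if used_pct > error_pct:
        return f"error - filesystem {fsname} >{error_pct}% full ({used_pct}%)"
    if used_pct > warn_pct:
        return f"warning - filesystem {fsname} >{warn_pct}% full ({used_pct}%)"
    return None
```

When every configured filesystem returns `None`, the agent would report "ok - filesystems stable".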
logs
The logs agent will check log files for errors and warnings. When a message is found it will be displayed as an attachment.
Normal status will be:
logs | ok pattern not found |
Error status will be:
logs | error pattern found in logfile name |
Warning status will be:
logs | warning - pattern found in logfile name |
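A log scan of this kind can be sketched as below. How the real agent configures its patterns is not described here, so the regular-expression patterns, the function name and the rule that error matches take priority over warning matches are all assumptions.

```python
import re


def scan_log(lines, error_patterns=(), warning_patterns=()):
    """Return (status, matching_lines) for one log file.

    matching_lines plays the role of the attachment mentioned above.
    Error patterns are checked first; warnings are only reported when
    no error pattern matches.
    """
    errors = [ln for ln in lines if any(re.search(p, ln) for p in error_patterns)]
    if errors:
        return "error", errors
    warnings = [ln for ln in lines if any(re.search(p, ln) for p in warning_patterns)]
    if warnings:
        return "warning", warnings
    return "ok", []
```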
uptime
The uptime agent will check various performance/availability figures for the system.
If the system has been up for less than the number of days specified, it will report an error. This should catch times when the system crashed and rebooted itself.
If the threshold for the 5 minute load average is exceeded, a warning will be reported. If the threshold for the 15 minute load average is exceeded, an error will be reported. This is an indication of heavy processor usage and possible poor performance.
A listing of uptime output will be displayed for all conditions. When an error or warning is found, there will also be a ps listing to help in tracking down the cause of the problem.
Normal status will be:
uptime | ok - uptime stable |
Error status will be:
uptime | error - load average 15 minutes (nn hundredths) |
Warning status will be:
uptime | warning - load average 5 minutes (nn hundredths) |
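The three uptime checks can be sketched as follows. Load values are assumed to be in hundredths, as the messages suggest. The parameter names are hypothetical, and the wording of the recent-reboot message is invented for illustration because no message for that case is listed above.

```python
def check_uptime(days_up, load5, load15, min_days, load5_max, load15_max):
    """Return the list of messages for one uptime sample.

    load5 and load15 are load averages in hundredths (150 means 1.50).
    """
    findings = []
    if days_up < min_days:
        # Hypothetical wording: no message for this case is listed above.
        findings.append(f"error - system up for only {days_up} days")
    if load15 > load15_max:
        findings.append(f"error - load average 15 minutes ({load15} hundredths)")
    if load5 > load5_max:
        findings.append(f"warning - load average 5 minutes ({load5} hundredths)")
    return findings or ["ok - uptime stable"]
```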
vmstat
The vmstat agent will check various processor/memory performance figures for the system.
Warnings here can be an indication of heavy system usage and possible poor performance.
A listing of vmstat output will be displayed for all conditions. When an error or warning is found, there will also be a ps listing to help in tracking down the cause of the problem.
Normal status will be:
vmstat | ok - vmstat stable |
Error status will be:
vmstat | error - more than swapwaitthresh processes waiting (nn) |
vmstat | error - more than swapthresh processes swapped (nn) |
Warning status will be:
vmstat | warning - less than pthreshold % processor idle (nn) |
vmstat | warning - more than pagethresh pages swapped (nn) |
vmstat | warning - less than freeswap free swap (nn) |
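The five vmstat conditions above all compare one measured value against one threshold, so they can be written as a small rule table. This is a sketch under assumptions: the threshold names come from the messages above, but the sample keys (`waiting`, `swapped`, `idle_pct`, `pages_swapped`, `free_swap`) and the function name are hypothetical.

```python
import operator

# (level, sample key, threshold name, comparison that triggers, message template)
RULES = [
    ("error", "waiting", "swapwaitthresh", operator.gt,
     "more than {limit} processes waiting ({value})"),
    ("error", "swapped", "swapthresh", operator.gt,
     "more than {limit} processes swapped ({value})"),
    ("warning", "idle_pct", "pthreshold", operator.lt,
     "less than {limit} % processor idle ({value})"),
    ("warning", "pages_swapped", "pagethresh", operator.gt,
     "more than {limit} pages swapped ({value})"),
    ("warning", "free_swap", "freeswap", operator.lt,
     "less than {limit} free swap ({value})"),
]


def check_vmstat(sample, limits):
    """Return the messages triggered by one vmstat sample."""
    findings = [
        f"{level} - " + template.format(limit=limits[limit_key], value=sample[key])
        for level, key, limit_key, triggers, template in RULES
        if triggers(sample[key], limits[limit_key])
    ]
    return findings or ["ok - vmstat stable"]
```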
vmgt
Disk volume management monitor. It checks for error conditions in Solstice DiskSuite on Solaris, or LVM on AIX and HP-UX.
Errors could be broken or stale mirrors, hot spares in use, etc.
A listing of the appropriate command will be output to allow checking of what the problem actually is.
Normal status will be:
vmgt | ok - disk volumes stable |
Error status will be:
vmgt | error - disk volume requires maintenance |
Warning status will be: None