Warning: Can't synchronize with the repository (/nfs/projects/capforge.org/trac/cap does not appear to be a Subversion repository.). Look in the Trac log for more information.

Monitoring Trix

Monitoring tools have evolved greatly, but when mis-configured they can forget how to do intelligent, non-intrusive and accomplish just-in-time alerts. Non-intrusive collection windows arise during bootup, idle system time, before major workload events and after these events finish. Batch schedulers allow health checking during these opportune times. Most just-in-time alerts arrive via system logs and out-of-band queries that can then trigger appropriate actions. However, abusive out-of-band queries may interrupt normal operational activities.

Some vendor and open implementations have been heavyweight in watch-dogging at a brutal cost on computation as systems start scaling to thousands of nodes. Configuring tools to query intelligently during certain opportunities and running only necessary daemons helps to meet monitoring goals. These tools and daemons can include HP's hpasm, Dell's OMSA, supermon, lm_sensors, nagios, ganglia, logsurfer/syslog-ng, torque health checks. Share your monitoring stories and learn how triggers implemented scale to 4000+ node systems.

OLD

Monitoring includes a mix bag of tools, but implementation of these tools should think about the goals of being intelligent, non-intrusive and just-in-time.

Meeting all these goals can occur depending on select monitoring openings and the type of monitoring used to gether relevant information.

Toolsets have light and heavy weight daemons that

now types ...

yied in goals, then windows, then types ...

elaborate on goals, windows, then types

conclude with toolset listing to discuss

hpasm, dell omsa, supermon, lm_sensors, nagios, ganglia, logsurfer, syslog,

Some windows of opportunity include bootup, when the system is idle, before a major process starts and after this process finishes.

Most current batch schedulers implement this capability today via some order of health checks.

Should these checks fail, proper alerts and actions are implemented. While the wiz bang graphical interface is pretty, smart

types of monitoring

  1. graphical
  2. polling
  3. state/event driven
  4. syslog
    1. logsurfer triggers

Examples

  1. nagios plugin overabuse
    1. monitor
  2. pbsnodes -a monitor
  3. snmpd

old stuff

  • job
  • nagios
  • ganglia
  • scheduling health checks