
See the benchmark fix it!

Benchmarks have long been used to measure the performance of various optimization methods. However, frustration can quickly overtake an otherwise successful effort when benchmarks run on unqualified nodes and networks. To address this, open source tools like Cbench and the fixit123 break fix process evolved to help stabilize small to large scale systems.

Cbench is a straightforward collection of tests, benchmarks, applications, and utilities woven into a framework that facilitates scalable testing, benchmarking and analysis of HPC systems. Cbench is used to test and analyze interconnect performance, scalability (using common benchmarks) and server components.

The fixit123 break fix process adds scriptable functions and leverages Cbench to qualify "production" hardware. Each step has a role in detecting firmware and software configuration issues and in ensuring that single node hardening and network test results fall within a reasonable standard deviation of system performance.

Share your benchmarking stories and see how Cbench can fixit for you!

old 3

Benchmarks have long been used to measure performance through various optimization methods. However, frustration can quickly overtake any successful effort when missing node and network qualification. To achieve a successful effort, open source tools like Cbench and the fixit123 break fix process evolved to help stabilize small to large scale systems. Cbench is a straightforward collection of tests, benchmarks, applications, and utilities woven into a framework that facilitates scalable testing, benchmarking and analysis of HPC systems. Cbench’s utilization includes testing and analyses that include interconnect performance, scalability utilizing common benchmarks and server component testing.

The fixit123 break fix process adds scriptable functions and leverages Cbench to qualify "production" hardware. Each step has a role in detecting firmware and software configuration issues and ensuring single node hardening and network testing fit within a reasonable standard deviation of system performance. Share your benchmarking stories and see how Cbench can fixit for you!

old

Benchmarks have long been used to test system performance through various tuning methods and optimizations. As systems begin to scale, open source tools like the Cbench framework and the fixit123 break fix process have evolved to help stabilize small to large scale systems. Cbench strives to be a relatively straightforward collection of tests, benchmarks, applications, and utilities woven into a framework that facilitates scalable testing, benchmarking and analysis of HPC systems. Cbench’s utilization includes testing and analysis of interconnect performance, scalability utilizing common benchmarks, and node hardware.

The fixit123 break fix process adds scriptable functions and leverages Cbench to qualify "production" nodes. Each step has a role in looking at hardware/firmware/software problems, node level hardening and network testing.

In addition, STatistical Analysis of Benchmarks (STAB!) dplot statistical analysis has been integrated, and future work with other government national labs is planned to assist in data mining. Future work will also encompass integration of monitoring triggers that note at which step we left Cbench to fixit!

old

benchmarking to system harden c-bench fixit using open/closed benchmarks and gluing breakfix qualification frameworks

describe cbench and framework

integrations w/ egan stuff

STAB - STatistical Analysis of Benchmarks - dplot statistical analysis

Cbench strives to be a relatively straightforward collection of tests, benchmarks, applications, and utilities woven into a framework that facilitates scalable testing, benchmarking and analysis of Linux clusters. It grew out of frustration in reworking scripts as clusters were being integrated into production. As this toolkit has grown, it has opened doors to more sophisticated system integration, testing and characterization capabilities. Cbench’s utilization includes stress testing and analyzing:

  • cluster interconnect performance and characteristics using multiple bandwidth, latency and collective tests
  • cluster scalability utilizing common benchmarks: Linpack/HPCC/NAS Parallel Benchmarks/Intel MPI Benchmarks/IOR/YourBenchMarkHere
  • cluster file systems with varying job sizes and flavors
  • clusters after maintenance with 100s to 1000s of jobs of various sizes and flavors
  • cluster scheduler and resource manager
  • nodes for hardware stability and conformance to performance profile of homogeneous hardware

The following example scenarios demonstrate Cbench usage and will be discussed:

  • run 1000s of parallel/serial jobs ranging from 1 to N processors using different tests/benchmarks and analyze what happened upon completion
  • plot results in semi real-time (using gnuplot) for supported tests as 100s of scheduled jobs continue to run
  • for a set of supported test jobs, analyze success/failure ratios according to job size (number of processes)
  • run node-level (i.e. on a single node w/o worrying about MPI) overnight burn-in tests on 4000+ nodes, analyze the results, and generate a statistical performance and fault profile for the nodes
  • cease to worry about the hassles of changing parallel job launchers and batch schedulers
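As an illustration of the success/failure analysis by job size listed above, here is a minimal Python sketch. It is not Cbench's own code or output format: the `(num_procs, passed)` tuples and the function name `success_ratio_by_size` are assumptions made purely for this example.

```python
from collections import defaultdict


def success_ratio_by_size(results):
    """Return {num_procs: fraction of passed jobs}.

    `results` is an iterable of (num_procs, passed) tuples. This layout is
    hypothetical, chosen for illustration; it is not Cbench's output format.
    """
    counts = defaultdict(lambda: [0, 0])  # num_procs -> [passed, total]
    for num_procs, passed in results:
        counts[num_procs][1] += 1
        if passed:
            counts[num_procs][0] += 1
    return {n: p / t for n, (p, t) in sorted(counts.items())}


# Made-up sample data: (job size in processes, did the job pass?)
sample = [
    (1, True), (1, True),
    (16, True), (16, False),
    (64, False), (64, False), (64, True), (64, True),
]
ratios = success_ratio_by_size(sample)
print(ratios)  # {1: 1.0, 16: 0.5, 64: 0.5}
```

A drop in the ratio as the job size grows would point toward interconnect or scaling problems rather than a single bad node.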

describe fixit123

nodefix library

Node breakfix processes vary widely and can miss necessary repairs, checks and tests in production environments. The three step fixit framework adds scriptable functions and leverages Cbench to qualify "production" nodes. Step1 is entered when a node functions incorrectly due to hardware/firmware/software problems. Node hardware is serviced, vendor diagnostics are run and various firmware/software versions are checked and set to the correct versions. Step2 uses the Cbench node level testset, which stresses individual components and determines whether results are within acceptable bounds. Step3 requires at least 2 nodes and runs Cbench testsets across communication/interconnect networks.
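The Step2 idea of checking whether results are "within acceptable bounds" could be sketched as follows. This is only an illustration under stated assumptions: the per-node metric, the node names, the 2-sigma threshold and the function `flag_outliers` are hypothetical and are not part of Cbench or fixit123.

```python
import statistics


def flag_outliers(node_results, max_sigma=2.0):
    """Return the sorted names of nodes whose result deviates from the
    population mean by more than `max_sigma` standard deviations.

    `node_results` maps node name -> one numeric benchmark result.
    Both the mapping and the default threshold are assumptions.
    """
    values = list(node_results.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all nodes identical, nothing to flag
    return sorted(
        node for node, value in node_results.items()
        if abs(value - mean) > max_sigma * stdev
    )


# Made-up results for eight homogeneous nodes (arbitrary units)
results = {
    "n1": 100, "n2": 101, "n3": 99, "n4": 100,
    "n5": 102, "n6": 98, "n7": 100, "n8": 60,
}
suspects = flag_outliers(results)
print(suspects)  # ['n8']
```

Because a single bad node also inflates the mean and standard deviation, a median-based bound may be more robust on small populations; the simple version is shown here only because it is easy to follow.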

Step1 and Step2 can be run during the boot process to check firmware/driver versions and run a minimal node level test. Failure results in an "offlined" node. The fixit framework is used on 4000+ node systems and integrated into scheduling environments. Node breakfix can be addressed in three easy steps, just "fixit!"
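The gating behaviour described here, where a node that fails a step is offlined, might be sketched like this. Everything named in the code is an assumption for illustration: `qualify_node`, the state strings, and the placeholder checks stand in for real vendor diagnostics, version checks and Cbench testsets, and do not reflect the actual fixit123 scripts.

```python
def qualify_node(node, steps):
    """Run the ordered checks in `steps` against `node`.

    `steps` is a list of (step_name, check) pairs where check(node)
    returns True on success. The first failing step offlines the node;
    passing every step marks it ready for production.
    """
    for step_name, check in steps:
        if not check(node):
            return {"node": node, "state": "offline", "failed_step": step_name}
    return {"node": node, "state": "production", "failed_step": None}


# Placeholder checks (hypothetical): real steps would run diagnostics,
# node level testsets, and multi-node interconnect testsets.
steps = [
    ("step1", lambda node: True),          # firmware/software versions OK
    ("step2", lambda node: node != "n8"),  # node level results within bounds
    ("step3", lambda node: True),          # interconnect tests with a peer
]

good = qualify_node("n1", steps)
bad = qualify_node("n8", steps)
print(good["state"], bad["state"], bad["failed_step"])
```

Recording the failed step gives the scheduler integration something concrete to report when it offlines a node.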

next steps ... data mining

collaboration w/ other communities (gazebo tri-lab combine fixit123 w/ monitoring trix