
Welcome to CAPforge!

The Cluster Administration Package (CAP) is meant to ease integration, configuration, and systems management for clusters. It is the "glue" that lets cluster administrators leverage existing technologies within a functional framework. It does this by delivering functionality in three component categories.

The first is "Information Management." Functions in this category can generate standard Unix configuration files or produce a file arrangement suitable for other clustering technologies.
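As an illustration of the kind of function this category describes, here is a minimal Python sketch that renders /etc/hosts-style lines from a node inventory. It is not CAP's own code: the function name `render_hosts`, the `(hostname, ip)` inventory format, and the `cluster.example` domain are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def render_hosts(nodes: Iterable[Tuple[str, str]],
                 domain: str = "cluster.example") -> str:
    """Render /etc/hosts-style text from (hostname, ip) pairs.

    The inventory format and the domain are hypothetical; a real
    site would draw them from its own node database.
    """
    lines = ["127.0.0.1 localhost"]
    for hostname, ip in nodes:
        # One line per node: address, fully qualified name, short alias.
        lines.append(f"{ip} {hostname}.{domain} {hostname}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    inventory = [("mgr0", "10.0.0.1"), ("n0001", "10.0.1.1")]
    print(render_hosts(inventory), end="")
```

The same inventory could feed other generators (DHCP, console server maps, and so on), which is the point of keeping the information in one place.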

The next category is "Control." This category allows an administrator to control their cluster as one system, using the same methods one uses to control a single system, such as power and console. With the addition of management devices, a cluster administrator can also control UIDs or gather system sensor data where available.
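To make "control the cluster as one system" concrete, the sketch below fans a single power action out to many nodes by building one `ipmitool` command line per management address. This is only an assumption about how such a wrapper might look, not CAP's interface: the function name `power_commands`, the `-bmc` host names, the `admin` user, and the password file path are hypothetical. The commands are built but not executed.

```python
from typing import Iterable, List

# Power actions accepted by this sketch ("ipmitool chassis power <action>").
ALLOWED_ACTIONS = {"status", "on", "off", "cycle"}


def power_commands(hosts: Iterable[str], action: str = "status",
                   user: str = "admin",
                   password_file: str = "/root/.ipmi_pass") -> List[List[str]]:
    """Return one ipmitool argument list per management host.

    The user name and password file are placeholders for the example.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return [
        ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
         "-f", password_file, "chassis", "power", action]
        for host in hosts
    ]


if __name__ == "__main__":
    # Hypothetical management addresses for two compute nodes.
    for cmd in power_commands(["n0001-bmc", "n0002-bmc"], "status"):
        print(" ".join(cmd))
```

Each argument list could be handed to `subprocess.run` once the host names and credentials are replaced with real ones.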

The last category is "Installation." When installing a multitude of systems, a cluster administrator wants a common method or set of methods to ensure that a set of functionality is delivered to each node in their cluster. By leveraging existing technologies, CAP can adapt to fit the needs of other available installation methods.

Objective:

Deliver the right tool sets and methods for managing and maintaining a diverse set of High Performance Computing (HPC) Linux Cluster configurations and architectures.

CAP Mantra!

Many organizations have used leadership analogies in describing systems. The Sandia Cplant Team used the concept of leaders and an SMU. ANL used the concept of towns and mayors. We are going to settle on the business org chart model, but we want to give credit to those we've taken this idea from. Thanks guys!

Type                                  Business Org Chart Logical Mapping
CAP Configuration/Installation Root   Director
Scheduling/Logging                    Associate Director
Admin/Boot                            Manager
Login/Compute/Viz/IO Node             Workers

This is a logical mapping of services. It doesn't mean you need a node for every role. In general we have used the following:

Worker Node Count   Configuration
256 - 512           2 managers (one being a director as well)
1024                1 director node, 4 manager nodes
2048                1 director node, 2 associate director nodes, 8 manager nodes
4096                2 director nodes, 2 associate director nodes, 16 manager nodes

In general we try to match 1 manager to at most 256-384 workers and figure a department needs to split when approaching 1024 workers (SMU/Town ;-)). We believe this method allows us to grow into large, complex configurations that can still be run efficiently.

In general, a node that is only a manager can't tell another department what to do. If only real-world org charts worked this way!
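The sizing rule of thumb above can be sketched in a few lines of Python. This is a rough planning aid, not part of CAP: the function name `size_cluster` and its parameters are assumptions, the default of 256 workers per manager is the low end of the stated 256-384 range, and director and associate director counts are not modeled.

```python
def size_cluster(workers: int, workers_per_manager: int = 256,
                 department_limit: int = 1024) -> dict:
    """Estimate manager and department counts for a worker count.

    Applies only the rule of thumb from the text: one manager per
    `workers_per_manager` workers, and a new department for every
    `department_limit` workers.
    """
    if workers < 1 or workers_per_manager < 1 or department_limit < 1:
        raise ValueError("all arguments must be positive integers")
    # Ceiling division using integer arithmetic.
    managers = -(-workers // workers_per_manager)
    departments = -(-workers // department_limit)
    return {"managers": managers, "departments": departments}


if __name__ == "__main__":
    for count in (1024, 2048, 4096):
        print(count, size_cluster(count))
```

With the defaults, 4096 workers gives 16 managers in 4 departments, which agrees with the 4096 row of the table. For the smallest row the table lists 2 managers even at 256 workers, so treat the result as a lower bound rather than a recommendation.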

Below is a diagram that outlines how we perceive workflow for this layout.

  #!graphviz
  digraph orgchart { 
     node [shape=box]; Directors AssociateDirectors;
     node [shape=circle]; Workers Managers
     Directors -> AssociateDirectors
     Directors -> Managers
     AssociateDirectors -> Managers
     Managers -> Workers
     }

Here is a layout for a 4096 node system with:

  • 4 departments (A-D) of 1024 workers each
  • each group of 256 workers managed by one manager (4 managers to a department)
  #!graphviz
  digraph orgchart { 
     node [shape=box]; Director1 Director2 AssociateDirector1 AssociateDirector2;
     node [shape=circle]; A_Managers B_Managers C_Managers D_Managers;
     node [shape=circle]; A_Workers B_Workers C_Workers D_Workers;
     Director1 -> AssociateDirector1
     Director1 -> AssociateDirector2
     Director2 -> AssociateDirector1
     Director2 -> AssociateDirector2
     AssociateDirector1 -> A_Managers
     AssociateDirector1 -> B_Managers
     AssociateDirector1 -> C_Managers
     AssociateDirector1 -> D_Managers
     AssociateDirector2 -> A_Managers
     AssociateDirector2 -> B_Managers
     AssociateDirector2 -> C_Managers
     AssociateDirector2 -> D_Managers
     A_Managers -> A_Workers
     B_Managers -> B_Workers
     C_Managers -> C_Workers
     D_Managers -> D_Workers
     }
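For other department counts, a graph of the same shape can be generated rather than written by hand. The following Python sketch is one way to do that; the function name `orgchart_dot` and its parameters are assumptions for the example, and the node naming simply follows the diagram above.

```python
from typing import List, Sequence


def orgchart_dot(departments: Sequence[str], directors: int = 2,
                 associates: int = 2) -> str:
    """Build Graphviz DOT text for a director/manager/worker layout."""
    dirs = [f"Director{i}" for i in range(1, directors + 1)]
    assoc = [f"AssociateDirector{i}" for i in range(1, associates + 1)]
    managers = [f"{d}_Managers" for d in departments]
    workers = [f"{d}_Workers" for d in departments]

    lines: List[str] = ["digraph orgchart {"]
    lines.append("   node [shape=box]; " + " ".join(dirs + assoc) + ";")
    lines.append("   node [shape=circle]; " + " ".join(managers) + ";")
    lines.append("   node [shape=circle]; " + " ".join(workers) + ";")
    # Every director reaches every associate director.
    for d in dirs:
        for a in assoc:
            lines.append(f"   {d} -> {a}")
    # Every associate director reaches every department's managers.
    for a in assoc:
        for m in managers:
            lines.append(f"   {a} -> {m}")
    # Managers only reach workers in their own department.
    for m, w in zip(managers, workers):
        lines.append(f"   {m} -> {w}")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(orgchart_dot(["A", "B", "C", "D"]))
```

Called with departments A through D and the default counts, it emits the same set of edges as the 4096-node diagram.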

Supported Platforms

CAP is not dependent upon a specific vendor platform; however, it can take advantage of features available only from certain vendors, such as:

  1. Command-line interface (CLI) on HP Integrity servers
  2. Integrated Lights-Out (iLO) on HP ProLiant servers
  3. Dell Remote Access Card (DRAC) on Dell PowerEdge servers
  4. Advanced Systems Management Service Processor Network on IBM Netfinity servers
  5. IPMI interfaces

CAP is meant to run independently of the architecture. It has been run on the following types:

  1. x86_64: EM64T, Opteron (AMD64)
  2. IA32: Pentium II, III, 4, Xeon
  3. IA64: Itanium I, II

License

CAP is licensed under the GPL v2 license. More information is available via: http://www.gnu.org/copyleft/gpl.html

Enjoy!
The CAPforge Team
