Topology
SLURM version 2.0 can be configured to support topology-aware resource allocation to optimize job performance. There are two primary modes of operation: one to optimize performance on systems with a three-dimensional torus interconnect, and another for hierarchical interconnects.
SLURM's native mode of resource selection is to consider the nodes as a one-dimensional array. Jobs are allocated resources on a best-fit basis. For larger jobs, this minimizes the number of sets of consecutive nodes allocated to the job.
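To illustrate best-fit selection over a one-dimensional node array, the following sketch (hypothetical illustrative code, not SLURM's actual implementation) picks the smallest run of consecutive idle nodes that can hold a job:

```python
def best_fit(idle, count):
    """Pick the smallest block of consecutive idle nodes holding `count` nodes.

    `idle` is a list of booleans, one per node (True = idle).
    Returns the starting index of the chosen block, or None if no
    single block is large enough (a real scheduler would then
    allocate the job across multiple blocks).
    """
    best_start, best_len = None, None
    start = None
    for i, free in enumerate(list(idle) + [False]):  # sentinel ends last run
        if free and start is None:
            start = i                    # a run of idle nodes begins here
        elif not free and start is not None:
            run = i - start              # a run of idle nodes just ended
            if run >= count and (best_len is None or run < best_len):
                best_start, best_len = start, run
            start = None
    return best_start

# Nodes 0-1 idle, node 2 busy, nodes 3-6 idle, node 7 busy
idle = [True, True, False, True, True, True, True, False]
print(best_fit(idle, 2))  # -> 0 (the two-node block is the tightest fit)
```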
Three-dimensional Topology
Some larger computers rely upon a three-dimensional torus interconnect. IBM BlueGene computers are one example; they have a highly constrained resource allocation scheme, essentially requiring that jobs be allocated a set of nodes forming a logically rectangular shape. SLURM has a plugin written specifically for BlueGene to select appropriate nodes for jobs, change network switch routing, boot nodes, etc., as described in the BlueGene User and Administrator Guide.
The Sun Constellation and Cray XT systems also have three-dimensional torus interconnects, but do not require that jobs execute on adjacent nodes. On those systems, SLURM only needs to allocate resources to a job that are nearby on the network. SLURM accomplishes this by using a Hilbert curve to map the nodes from a three-dimensional space into a one-dimensional space. SLURM's native best-fit algorithm is thus able to achieve a high degree of locality for jobs. For more information, see SLURM's documentation for Sun Constellation and Cray XT systems.
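The mapping can be illustrated with a short sketch. SLURM's actual implementation lives in its C plugins; the Python below uses Skilling's transpose algorithm (a standard technique for computing Hilbert indices in any number of dimensions) to assign each 3-D node coordinate a distance along the curve, so that sorting nodes by that distance yields a one-dimensional ordering with good locality:

```python
def hilbert_index(coords, bits):
    """Map an n-dimensional point to its distance along a Hilbert curve.

    Uses Skilling's transpose algorithm.  Each coordinate in `coords`
    must be an integer in [0, 2**bits).
    """
    x = list(coords)
    n = len(x)
    m = 1 << (bits - 1)
    # Inverse undo excess work
    q = m
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p                 # invert low bits of x[0]
            else:
                t = (x[0] ^ x[i]) & p     # exchange low bits of x[0], x[i]
                x[0] ^= t
                x[i] ^= t
        q >>= 1
    # Gray encode
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = m
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    for i in range(n):
        x[i] ^= t
    # Interleave the bits of the transposed form into a single integer
    h = 0
    for b in range(bits - 1, -1, -1):
        for i in range(n):
            h = (h << 1) | ((x[i] >> b) & 1)
    return h

# Order the nodes of a 4x4x4 torus along the curve.  Consecutive nodes
# in this ordering are physically adjacent, so best-fit allocation of
# consecutive entries yields good network locality.
nodes = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
nodes.sort(key=lambda c: hilbert_index(c, 2))
```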
Hierarchical Networks
SLURM can also be configured to allocate resources to jobs on a hierarchical network to minimize network contention. The basic algorithm is to identify the lowest level switch in the hierarchy that can satisfy a job's request and then allocate resources on its underlying leaf switches using a best-fit algorithm. Use of this logic requires a configuration setting of TopologyPlugin=topology/tree.
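The algorithm above can be sketched in a few lines. This is hypothetical illustrative code, not the topology/tree plugin itself (which operates on C bitmaps of nodes), but it captures the descent to the lowest switch able to satisfy the request:

```python
def idle_nodes(switch):
    """Count idle nodes in the subtree rooted at `switch`.

    Each switch is a dict: leaf switches carry an "idle" node count,
    interior switches carry a "children" list (hypothetical layout).
    """
    if "children" in switch:
        return sum(idle_nodes(c) for c in switch["children"])
    return switch["idle"]

def pick_switch(switch, needed):
    """Return the lowest-level switch whose subtree can hold the job."""
    if idle_nodes(switch) < needed:
        return None                       # this subtree cannot satisfy the job
    for child in switch.get("children", []):
        found = pick_switch(child, needed)
        if found is not None:
            return found                  # a lower-level switch suffices
    return switch                         # lowest switch that fits

# Two-level tree: s4 and s5 each aggregate two leaf switches
tree = {"name": "s6", "children": [
    {"name": "s4", "children": [{"name": "s0", "idle": 1},
                                {"name": "s1", "idle": 2}]},
    {"name": "s5", "children": [{"name": "s2", "idle": 2},
                                {"name": "s3", "idle": 2}]},
]}
print(pick_switch(tree, 2)["name"])   # a single leaf switch (s1) suffices
print(pick_switch(tree, 3)["name"])   # must go one level up, to s4
```

Once the lowest sufficient switch is chosen, the actual nodes beneath it are picked with the same best-fit logic used for one-dimensional allocation.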
At some point in the future, SLURM code may be provided to gather network topology information directly. For now, the network topology information must be included in a topology.conf configuration file as shown in the examples below. The first example describes a three-level switch hierarchy in which each switch has two children. Note that the SwitchName values are arbitrary and used only for bookkeeping purposes, but a name must be specified on each line. The leaf switch descriptions contain a SwitchName field plus a Nodes field to identify the nodes connected to the switch. Higher-level switch descriptions contain a SwitchName field plus a Switches field to identify the child switches. SLURM's hostlist expression parser is used, so the node and switch names need not be consecutive (e.g. "Nodes=tux[0-3,12,18-20]" and "Switches=s[0-2,4-8,12]" will parse fine).
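To illustrate the kind of expressions the hostlist parser accepts, here is a deliberately simplified re-implementation (SLURM's real parser is in C and handles more forms, such as zero-padded numbers and multiple bracket pairs):

```python
import re

def expand_hostlist(expr):
    """Expand a simple hostlist expression such as "tux[0-3,12,18-20]".

    Simplified illustration: handles one optional [...] bracket with
    comma-separated values and ranges.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\]", expr)
    if not m:
        return [expr]                     # plain host name, nothing to expand
    prefix, body = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            hosts += [f"{prefix}{i}" for i in range(int(lo), int(hi) + 1)]
        else:
            hosts.append(f"{prefix}{part}")
    return hosts

print(expand_hostlist("tux[0-3,12]"))
# -> ['tux0', 'tux1', 'tux2', 'tux3', 'tux12']
```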
An optional LinkSpeed parameter can be used to indicate the relative performance of the link. The units are arbitrary and this information is not currently used; it may be used in the future to optimize resource allocations.
The first example shows what the topology would look like for an eight-node cluster in which all switches have only two children, as shown in the diagram (not a very realistic configuration, but useful for an example).
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-1]
SwitchName=s1 Nodes=tux[2-3]
SwitchName=s2 Nodes=tux[4-5]
SwitchName=s3 Nodes=tux[6-7]
SwitchName=s4 Switches=s[0-1]
SwitchName=s5 Switches=s[2-3]
SwitchName=s6 Switches=s[4-5]

The next example is for a network with two switch levels in which each switch has four children.
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3] LinkSpeed=900
SwitchName=s1 Nodes=tux[4-7] LinkSpeed=900
SwitchName=s2 Nodes=tux[8-11] LinkSpeed=900
SwitchName=s3 Nodes=tux[12-15] LinkSpeed=1800
SwitchName=s4 Switches=s[0-3] LinkSpeed=1800
SwitchName=s5 Switches=s[0-3] LinkSpeed=1800
SwitchName=s6 Switches=s[0-3] LinkSpeed=1800
SwitchName=s7 Switches=s[0-3] LinkSpeed=1800

Last modified 24 March 2009