Use Infiniband Network

1. What is Infiniband?

Infiniband is a network technology designed to address the performance problems associated with data movement. It minimises network communication overhead by offloading I/O interrupt handling from the CPU to the Infiniband hardware, the Infiniband HCA (host channel adapter). Eddie is deployed with InfiniPath Infiniband devices, a PathScale product providing industry-leading performance as a cluster interconnect. The 4x Infiniband network can potentially deliver a message-passing data rate of 8 Gb/s (1 GB/s) between PCIe buses on different nodes.
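The 8 Gb/s (1 GB/s) figure can be reproduced with a little arithmetic. The sketch below assumes a 4x SDR link, that is, four lanes signalling at 2.5 Gbit/s each with 8b/10b encoding; the page itself does not state the link generation, so treat these inputs as assumptions.

```shell
#!/bin/sh
# Assumption: 4x SDR Infiniband, 2.5 Gbit/s signalling per lane, 8b/10b encoding.
lanes=4
lane_signal_mbit=2500                        # raw signalling rate per lane, Mbit/s

signal_mbit=$((lanes * lane_signal_mbit))    # aggregate signalling rate
data_mbit=$((signal_mbit * 8 / 10))          # 8 data bits carried per 10 line bits
data_mbyte=$((data_mbit / 8))                # convert bits to bytes

echo "signalling rate: ${signal_mbit} Mbit/s"
echo "data rate:       ${data_mbit} Mbit/s (${data_mbyte} MB/s)"
```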

2. What is the actual performance of Infinipath on Eddie?

The OSU benchmark results show a latency of 2.11 us for synchronous message-passing operations and a bandwidth of up to 956 MB/s for asynchronous message-passing operations (about 8 times normal Ethernet bandwidth).
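As a rough illustration of what these two numbers mean for a single message, the sketch below uses the common first-order model "time = latency + size / bandwidth". The model and the helper name estimate_us are assumptions made for this example, not part of the benchmark; the 956 MB/s figure is a peak value, so real transfers of small messages will be slower than this estimate.

```shell
#!/bin/sh
# Hypothetical helper: estimate the transfer time of one message in microseconds.
# Inputs taken from the OSU results above: 2.11 us latency, 956 MB/s bandwidth
# (MB taken as 10^6 bytes, so bytes / 956 gives microseconds).
estimate_us() {
    LC_ALL=C awk -v bytes="$1" -v lat_us=2.11 -v bw_mb=956 \
        'BEGIN { printf "%.2f\n", lat_us + bytes / bw_mb }'
}

# Example: estimate for an 8-byte message and for a 1 MiB message.
estimate_us 8
estimate_us 1048576
```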

The HPL (High Performance Linpack) benchmark achieved over 98.5% parallel speedup efficiency on Eddie. Even a highly optimised Ethernet network can rarely achieve more than 80%.

3. How many nodes on Eddie have Infiniband devices?

Infiniband is available on 60 out of 128 worker nodes. In Phase 2 (late 2007) we may equip the entire cluster with Infiniband if there is sufficient demand.

Note: In light of the above, MPI jobs must be no more than 240-way to use the Infiniband network.

4. Which MPI environments on Eddie support the Infiniband network?

Infinipath MPI - QLogic's MPI implementation derived from MPICH 1.2.6. The Infinipath MPI libraries have been highly tuned for the InfiniPath interconnect and will not run over other interconnects.

We are working on building other MPI implementations (OpenMPI, MVAPICH, MVAPICH2) to support the Infiniband network.

5. Do I need to rebuild my application code to use Infiniband?

Unfortunately, yes. You must rebuild your code with the Infinipath MPI to use the Infiniband interconnect.

6. How can I identify nodes on the Infiniband network?

Use the same node names as on the Ethernet network. Infinipath MPI uses Ethernet at the very beginning of your MPI job to identify the nodes. After that, all data communication between processes is performed over the Infiniband interconnect.


The following steps show how to run an MPI job with Infiniband on Eddie.

Log in to Eddie:

ssh eddie.ecdf.ed.ac.uk

Add the Infinipath modules to your environment:

module add infinipath/core/gcc/2.1
module add infinipath/ofed/2.1

Build your application code:

mpicc -o foo foo.c
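Because the code must be built with the Infinipath MPI, it can be worth confirming which mpicc and mpirun are on your PATH before building. The sketch below is a minimal check; require_cmd is a hypothetical helper name introduced for this example, and it only reports where a command resolves, it does not verify that the command comes from the Infinipath module.

```shell
#!/bin/sh
# Hypothetical helper: report where a command resolves, or fail if it is missing.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 -> $(command -v "$1")"
    else
        echo "$1 not found; is the infinipath module loaded?" >&2
        return 1
    fi
}

# Usage, after the module add commands above:
#   require_cmd mpicc && require_cmd mpirun && mpicc -o foo foo.c
```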

Create a job script, for example from the template script.sh:

#!/bin/sh
###################################################
#                                                 #  
# A SGE batch job template for ECDF Cluster eddie #
#                                                 #
# by ECDF System Team                             # 
# ecdf-systems-team@lists.ed.ac.uk                #
#                                                 #
###################################################

# Grid Engine options

#$ -cwd
#$ -l h_rt=00:05:00
#$ -N job_name


# initialise environment module

. /etc/profile

# use Infinipath MPI with Intel (icc,ifort) compiler

#module add intel/cce
#module add intel/fce
#module add infinipath/core/intel/2.1
#module add infinipath/ofed/2.1

# use Infinipath MPI with GNU (gcc,g++,g77,gfortran) compiler

module add infinipath/core/gcc/2.1
module add infinipath/ofed/2.1

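# run MPI job, replace 'foo' with your executable file

# The -m option specifies the node(s) on which the code is run.
# The file $TMPDIR/machines is generated by Grid Engine; users
# do not need to change it.

# The job is submitted with 4 extra slots because the MPI master node is
# removed from the machine list, so start 4 fewer processes than requested.
NSLOTS=$((NSLOTS - 4))

mpirun -m "$TMPDIR/machines" -np "$NSLOTS" ./foo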

Submit the job to Grid Engine with the parallel environment option -pe infinipath <nslots>:

<nslots> must be a multiple of 4 (4, 8, 12, 16, etc.)

qsub -R y -pe infinipath <nslots> script.sh

Note: Infinipath MPI currently has a bug: MPI jobs fail to start on the MPI master node, so jobs must be run in non-local mode. The parallel environment automatically removes the MPI master node from the machine list. We suggest requesting 4 extra slots to guarantee that your job has enough slots left.

For example, to run a 16-way MPI job, submit it as follows:

qsub -pe infinipath 20 script.sh
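The slot arithmetic can be wrapped in a small helper so that the request always follows the rules on this page. This is a sketch only: slots_to_request is a hypothetical function name, and it assumes that the 240-way limit applies to the total number of slots requested, including the 4 extra ones.

```shell
#!/bin/sh
# Hypothetical helper: given the number of MPI processes you want to run,
# print the number of slots to request with -pe infinipath.
SLOTS_PER_NODE=4   # <nslots> must be a multiple of 4
MAX_SLOTS=240      # assumption: the 240-way limit applies to the requested total

slots_to_request() {
    want=$1
    case $want in
        ''|0*|*[!0-9]*)
            echo "not a positive integer: '$want'" >&2
            return 1 ;;
    esac
    if [ $((want % SLOTS_PER_NODE)) -ne 0 ]; then
        echo "process count must be a multiple of $SLOTS_PER_NODE" >&2
        return 1
    fi
    # Add 4 extra slots to make up for the MPI master node being removed.
    total=$((want + SLOTS_PER_NODE))
    if [ "$total" -gt "$MAX_SLOTS" ]; then
        echo "request of $total slots exceeds the $MAX_SLOTS-slot limit" >&2
        return 1
    fi
    echo "$total"
}

# Usage:
#   qsub -pe infinipath "$(slots_to_request 16)" script.sh
```

With the template script above, which subtracts 4 from NSLOTS before calling mpirun, a request produced by slots_to_request 16 starts 16 MPI processes.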

Use infiniband network (last edited 2007-10-05 14:50:25 by Yuan WAN)