To run on a cluster or any machine with distributed memory, DiagHam relies on MPI. MPI support has to be enabled through the --enable-mpi option of configure when installing DiagHam. By default, DiagHam will look for mpiCC as the MPI C++ compiler. If your MPI C++ compiler has a different name (such as mpic++ for Open MPI on Ubuntu), you can set it by adding the --with-mpi-cxx="mympic++compiler" configure option.
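For instance, on an Ubuntu system with Open MPI, a minimal configure line enabling MPI support could look like the following sketch (combine it with whatever other options your build needs):

```
./configure --enable-mpi --with-mpi-cxx="mpic++"
```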
Two modes are available when using MPI: an MPI-only mode and a mixed MPI/SMP mode. Notice that running DiagHam on a single machine with multiple cores or CPUs does not require MPI; in that case, the SMP mode provides the parallelism support.
Compiling with MPI
DiagHam has been extensively tested with Open MPI. There should be no issue if you use this MPI implementation.
If you use another MPI library (such as the Intel MPI library), you can force the MPI compiler name using an option such as --with-mpi-cxx="mpiicpc" when configuring DiagHam. The Intel MPI compiler seems to be pretty picky, so if you get error messages like
/afs/@cell/common/soft/intel/ics2013/impi/4.1.3/intel64/include/mpicxx.h(95): error: #error directive: "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"
#error "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"
you should prepend CPPFLAGS="-DMPICH_IGNORE_CXX_SEEK" to the configure line. A full-blown example of a configure line (including support for MKL and ScaLAPACK) could look like
CPPFLAGS="-DMPICH_IGNORE_CXX_SEEK" ../configure --enable-fqhe --enable-fti --enable-spin --enable-intelmkl --enable-mpi --enable-debug CC="icc" CXX="icpc" --enable-lapack --with-intelmkl-libdir="/afs/@cell/common/soft/intel/ics2016.2/16.0/linux/mkl/lib/intel64" --with-blas-libdir="-L/afs/@cell/common/soft/intel/ics2016.2/16.0/linux/mkl/lib/intel64" --with-blas-libs="-lmkl_blas95_ilp64" --with-lapack-libs="-lmkl_lapack95_ilp64" --with-mpi-cxx="mpiicpc" --enable-scalapack --with-scalapack-libs="-lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64"
MPI only mode
The MPI only mode can be turned on using the --mpi option.
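For example, a run distributed over eight MPI processes could be launched as in the sketch below ("SomeDiagHamProgram" is a placeholder, not an actual DiagHam binary name):

```
mpirun -np 8 SomeDiagHamProgram [program options] --mpi
```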
MPI/SMP mixed mode
The MPI/SMP mixed mode allows one to take advantage of SMP features (such as shared memory) on each cluster node. In that case, only one MPI instance has to run on each node. The MPI/SMP mode is enabled through the --mpi-smp option. This option requires a text file that describes the cluster.
Full cluster description
The simplest and most tunable case is to give the precise cluster description. The typical file that you have to provide should look like:
nostromo1 6 0.11 60000
nostromo2 6 0.15 50000
nostromo3 4 0.14 60000
The first column is the node hostname and the second column is the number of CPUs/cores that can be used on each node. The fourth column, which is optional, is the amount of memory (in MB) that can be used; it is just a way to specify a per-node --memory option (like the one of FQHECheckerboardLatticeModel). The third column is also optional: it allows one to perform static load balancing by indicating the relative performance of a single core/CPU on the given node.
Since the MPI/SMP mode assumes only one MPI instance per node, the MPI hostfile has to look like
nostromo1 slots=1
nostromo2 slots=1
nostromo3 slots=1
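Putting the pieces together, a full MPI/SMP launch could look like the following sketch, which writes the cluster description and hostfile shown above and then starts one MPI instance per node (the binary name is a placeholder, not an actual DiagHam program):

```shell
# Cluster description: hostname, usable cores, perf. index, memory in MB
cat > cluster.desc <<'EOF'
nostromo1 6 0.11 60000
nostromo2 6 0.15 50000
nostromo3 4 0.14 60000
EOF

# MPI hostfile: exactly one slot per node, since --mpi-smp runs a single
# MPI instance per node and uses SMP threads inside each node
cat > hostfile <<'EOF'
nostromo1 slots=1
nostromo2 slots=1
nostromo3 slots=1
EOF

# Hypothetical launch line ("SomeDiagHamProgram" is a placeholder);
# -np must match the number of nodes listed in the hostfile
# mpirun -np 3 --hostfile hostfile SomeDiagHamProgram --mpi-smp cluster.desc
```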
Other possible cluster descriptions
Such a way of describing the cluster might not be the most convenient in several cases. For example, if your cluster uses a batch queue, it is quite difficult to know in advance which nodes will be used. Two other options are therefore provided. A simple cluster description can be
master 1 0.1 100
default 2 0.15 400
In that case, all the nodes will have the same configuration (i.e. number of CPUs, performance index and memory) as the node described by the "default" hostname. An additional line can be provided to give a different configuration to the master node, using "master" as the hostname.
The second method allows one to tune a group of nodes sharing the same prefix. Assume the cluster has two types of machines, "nostromoX" and "discoveryY", where X and Y are unique for each node (they can be any string, integers, ...). Then the configuration can be set with a file such as
nostromo* 4 0.2 10000
discovery* 8 0.15 60000
Here, each nostromoX node will have the configuration 4 cores, 10 GB of memory and a performance index of 0.2, while each discoveryY node will have the configuration 8 cores, 60 GB of memory and a performance index of 0.15. This second method can be mixed with the more general configuration file: for example, one can have a generic configuration for the nostromo nodes and a specific configuration for each discovery node.
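For instance, a mixed description file could give a generic configuration to all nostromo nodes while listing each discovery node individually (the hostnames and figures below are purely illustrative):

```
nostromo* 4 0.2 10000
discovery1 8 0.15 60000
discovery2 6 0.18 30000
```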
DiagHam can write a log file that describes how much time is spent in any operation that relies on MPI and/or SMP. The name of the log file has to be provided using the --cluster-profil option. Such a log file looks like
number of nodes = 4.29181
cluster description :
node 0 : hostname=nostromo1 cpu=6 perf. index=0.148611 mem=60000Mb
node 1 : hostname=nostromo2 cpu=6 perf. index=0.21176 mem=50000Mb
node 2 : hostname=nostromo3 cpu=4 perf. index=0.222019 mem=60000Mb
---------------------------------------------
profiling
---------------------------------------------
node 0: VectorHamiltonianMultiply core operation done in 3.002 seconds
node 1: VectorHamiltonianMultiply core operation done in 7.273 seconds
node 2: VectorHamiltonianMultiply core operation done in 8.863 seconds
node 0: VectorHamiltonianMultiply sum operation done in 0.534 seconds
Dynamic load balancing is not yet available in DiagHam, but the information provided by such a log file allows one to optimize the static load balancing. The Perl script scripts/ClusterProfiling.pl can help to do that:
scripts/ClusterProfiling.pl cluster.log cluster.desc
where cluster.log is the profiling log file and cluster.desc the cluster description. The typical output is
VectorHamiltonianMultiply sum operation :
global total time = 469.393999999999s
total CPU time = 469.393999999999s
mean time = 0.399144557823129s
number of calls = 1176
total number of calls = 1176
per node informations :
node 0 : total time = 469.393999999999s number of calls = 1176s mean time per call = 0.399144557823129+/-0.00547376435672348s min time per call = 0.387s max time per call = 0.534s
------------
VectorHamiltonianMultiply core operation :
global total time = 15968.545s
total CPU time = 51840.2419999999s
mean time = 8.81636768707482s
number of calls = 1176
total number of calls = 5880
per node informations :
node 0 : total time = 3517.615s number of calls = 1176s mean time per call = 2.99116921768708+/-0.0516665897986875s min time per call = 2.941s max time per call = 3.587s
node 1 : total time = 8613.44299999998s number of calls = 1176s mean time per call = 7.32435629251699+/-0.181785287943176s min time per call = 7.164s max time per call = 9.276s
node 2 : total time = 10513.155s number of calls = 1176s mean time per call = 8.93975765306121+/-0.482395743143856s min time per call = 8.739s max time per call = 11.764s
------------
------------ optimized performance index ------------
node 0 : perf. index=0.361140846848426
node 1 : perf. index=0.2101556663922
node 2 : perf. index=0.180522373455297
nostromo1 6 0.361140846848426 60000
nostromo2 6 0.2101556663922 50000
nostromo3 4 0.180522373455297 60000
The final part, below "optimized performance index", can be used as a new, optimized cluster description.