"ABINIT in Parallel" tutorial : Parallelism inside ABINIT


What is this tutorial about?

There are situations where a sequential code is not enough, often because it would take too much time to get a result. There are also cases where you just want things to go as fast as your computational resources allow. To this end, it is possible to build a parallel version of the main binary, called abinit.

This tutorial offers you a little reconnaissance tour inside the complex world that emerges as soon as you want to use more than one processor. From now on, we will suppose that you are already familiar with ABINIT and that you have gone through all basic tutorials. If this is not the case, we strongly advise you to do so, in order to truly benefit from this tutorial.

Copyright (C) 2005-2010 ABINIT group (YP)
This file is distributed under the terms of the GNU General Public License, see ~abinit/COPYING or http://www.gnu.org/copyleft/gpl.txt .
For the initials of contributors, see ~abinit/doc/developers/contributors.txt .


Table of contents

1. Preliminary step
2. Relevancy of parallelism
3. Popular parallel environments
4. What parts of ABINIT are parallel?
5. Compiling abinit for a parallel environment
6. Taking benefit from the parallelism
7. Details of the implementation

1. Preliminary step

There are so many ways to set up a parallel environment that, when you look at the different solutions which have been developed here and there, it looks quite like a jungle. Not only the software, but also the hardware - and the combination of both - plays a major role in the complexity of the whole landscape.

Thus, before actually starting this tutorial, we strongly advise you to get familiar with your parallel environment. Take some time to determine how you can launch a job in parallel, what resources are available and what the limitations are, and do not hesitate to discuss with your system administrator if something is not clear to you.

We will suppose in the following that you know how to run a parallel program and that you are familiar with the peculiarities of your system. Please remember that, as there is no standard way of setting up a parallel environment, we are not able to provide you with support beyond ABINIT itself.


2. Relevancy of parallelism

Running a parallel program is not as straightforward as running a sequential one, in particular if you have never done this before. Therefore, a question worth asking yourself is: "How relevant is it for me to use the parallel version of ABINIT? Is it truly worth the effort?". Here are a few hints that take into account the present status of parallelization in ABINIT.

It will be particularly relevant for you to use abinit in parallel if:

It will not be relevant at all if:


3. Popular parallel environments

PVM

The general goals of the PVM (Parallel Virtual Machine) project are to investigate issues in, and develop solutions for, heterogeneous concurrent computing. PVM offers an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous concurrent computing framework on interconnected computers of varied architecture [PVM1].

As far as ABINIT is concerned, PVM serves a purpose similar to that of MPI (see below), and the latter was preferred. PVM has thus never been supported by ABINIT, and we do not intend to change this strategy in the near future.

[PVM1] http://www.netlib.org/pvm3/book/node17.html

OpenMP

The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer [OMP1].

As of June 2005, OpenMP is at version 2.5 of its specification [OMP2].

There is some support for OpenMP within ABINIT, but it has never been made really efficient. Although this might change in the future, we prefer not to describe it in this tutorial.

[OMP1] http://www.openmp.org
[OMP2] http://www.openmp.org/drupal/mp-documents/spec25.pdf

MPI

MPI stands for Message Passing Interface. The goal of MPI, simply stated, is to develop a widely used standard for writing message-passing programs. As such the interface attempts to establish a practical, portable, efficient, and flexible standard for message passing.

In designing MPI the MPI Forum sought to make use of the most attractive features of a number of existing message passing systems, rather than selecting one of them and adopting it as the standard. Thus, MPI has been strongly influenced by work at the IBM T. J. Watson Research Center, Intel's NX/2, Express, nCUBE's Vertex, p4, and PARMACS. Other important contributions have come from Zipcode, Chimp, PVM, Chameleon, and PICL.

The main advantages of establishing a message-passing standard are portability and ease-of-use. In a distributed memory communication environment in which the higher level routines and/or abstractions are built upon lower level message passing routines, the benefits of standardization are particularly apparent. Furthermore, the definition of a message passing standard provides vendors with a clearly defined base set of routines that they can implement efficiently, or in some cases provide hardware support for, thereby enhancing scalability [MPI1].

At some point in its history, MPI reached a critical level of popularity, and projects popped up like daisies in the grass. Now the tendency is back to gathering and merging. For instance, Open MPI is a project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) in order to build the best MPI library available. Open MPI is a completely new MPI-2-compliant implementation, offering advantages for system and software vendors, application developers and computer science researchers [MPI2].

MPI support is currently where most of our efforts on the parallel version of ABINIT are focused.

[MPI1] http://www.mpi-forum.org
[MPI2] http://www.open-mpi.org


4. What parts of ABINIT are parallel?

Parallelizing a code is a delicate and complicated task, so do not expect that things will systematically go faster just because you are using a parallel version of ABINIT. Please also keep in mind that in some situations parallelization is simply impossible, and that there is no systematic way to predict whether a part of a code may be parallelized. At present, the parts of ABINIT that have been parallelized comprise:

At this point, the available ABINIT-specific documentation is still very scarce, mainly because the parallelization is still a work in progress requiring many more developments. This tutorial is a step taken in order to provide you with useful insights and practical information. You might find interesting additional information in the ~abinit/doc/users/paral_use file, and we suggest you read it now. Another file worth reading for developers is ~abinit/doc/developers/FFT_in_parallel. We will not demonstrate the parallel FFT capabilities in this tutorial, mainly because they are still evolving and cannot be used in production. By contrast, the other parallelizations can be used in production, and even concurrently in some cases.


5. Compiling abinit for a parallel environment

INFO
 
We will now suppose that you have already successfully compiled at least the sequential binaries of ABINIT, and that the ABINIT build system is not completely unknown to you.

Finding MPI on your system

The first step before compiling abinit is to find where the components of the MPI library are located, which can be non-trivial. MPI is usually found much more easily on Unix-compatible systems than on others, because Unix defines standard locations and strict naming conventions. Moreover, most common Linux distributions (Debian, Redhat, Slackware, etc.) provide several implementations of MPI as binary packages, called mpich or lam, for instance. Specially tuned versions of BLACS (Basic Linear Algebra Communication Subprograms) and ScaLAPACK (Scalable Linear Algebra PACKage) are often provided as well. You should consult the documentation of your system for more information.

If you are using Linux, the MPI library files will likely be in a subdirectory of /usr/lib and/or /usr/include. On MACOSX, you should have a look inside /sw first. On all Unix systems, looking inside /usr/local/lib, /usr/local/include, and /opt would be a good idea. If you still don't find them, check whether MPI is installed, then maybe ask your system administrator.
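
If you are unsure where to look, a few standard Unix commands can help. The snippet below is only a sketch, and the paths it probes are examples rather than an exhaustive list:

# Quick checks to locate MPI components (the probed paths are only examples):
which mpirun mpif90 mpicc                          # compiler wrappers and launcher, if in your PATH
ls /usr/include/mpi.h /usr/include/mpif.h 2>/dev/null
ls -d /usr/lib/mpich* /usr/lib/lam* /opt/*mpi* 2>/dev/null
find /usr/local /opt -name "mpif.h" 2>/dev/null    # may take a while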

For ABINIT, the important points are:

Please write them down carefully. You might want to reuse them for other versions of ABINIT.

IMPORTANT NOTE
 
If you have several implementations of MPI installed on your machine, please take care not to mix their files.

Tuning compilation parameters

For versions of ABINIT prior to 5.0, all compilation information was stored in a file called makefile_macros. This is no longer the case, as the compilation is now driven by Autoconf.

In many situations, the MPI library is detected at configure-time. Therefore, if you have compiled ABINIT, you should already have an abinit binary inside src/main. If this did not occur, type the following commands:

$ mkdir build && cd build && ../configure --with-mpi-prefix=<mpi_prefix>
$ make

where you replace <mpi_prefix> by the directory you wrote down before, with the final "lib/" component removed. After a while, you should get abinit. If this is still not the case, please have a look at the doc/build/MPI-Tricks file to solve the issue.
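
As a purely hypothetical illustration, suppose the MPI libraries sit in /usr/local/openmpi/lib and the headers in /usr/local/openmpi/include; the prefix to give to configure is then /usr/local/openmpi:

$ mkdir build && cd build && ../configure --with-mpi-prefix=/usr/local/openmpi
$ make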

Now that you have a shiny brand-new abinit binary, let's play a bit with this new toy.


6. Taking benefit from the parallelism

Running a job

Before starting, you might consider working in a different subdirectory than for the other lessons. Why not "Work_paral"?

Copy the files file and the input file from the ~abinit/tests/tutorial directory to your work directory. They are named tparal_1.files and tparal_1.in. You can immediately start a sequential run, to have a reference CPU time. On a 2.8 GHz PC, it takes about one minute.

Contrary to the sequential case, it is worth having a look at the "files" file and modifying it for the parallel execution, as one should avoid unnecessary network communications. If every node has its own temporary or scratch directory, you can achieve this by providing a path to a local disk for the temporary files in the files file. Supposing each processor has access to a local temporary disk space named /scratch/user, you might modify the 5th line of the files file so that it becomes:

tparal_1.in
tparal_1.out
tparal_1i
tparal_1o
/scratch/user/tparal_1
../../Psps_for_tests/HGH/82pb.4.hgh

Note that determining precisely the resources you will need for your run will save you a lot of time if you are using a batch queue system.
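
For instance, on a cluster managed by a PBS-like batch system, a submission script might look like the sketch below; the directives and the mpirun syntax are assumptions that must be adapted to your own scheduler and MPI implementation:

#!/bin/bash
#PBS -N tparal_1
#PBS -l nodes=2:ppn=2                 # 2 nodes, 2 processors per node
#PBS -l walltime=00:30:00             # request 30 minutes
cd $PBS_O_WORKDIR                     # go back to the submission directory
# $PBS_NODEFILE lists the nodes allocated by the scheduler
mpirun -np 4 -machinefile $PBS_NODEFILE ../../src/main/abinit < tparal_1.files >& tparal_1.log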

Parallelism over the k-points

The most favorable case for a parallel run is to treat the k-points concurrently, since it can be done independently for each one of them.

Actually, tparal_1.in corresponds to the study of an fcc crystal of lead, which requires a large number of k-points if one wants an accurate description of the ground state. Examine this file. The cut-off is realistic, as well as the grid of k-points (giving 60 k-points in the irreducible Brillouin zone). However, the number of SCF steps, nstep, has been set to 3 only, in order to keep the CPU time reasonable for this tutorial without affecting the way parallelism over the k-points speeds things up. By now, the sequential run you started above has probably completed. Examine the timing in the output file (the last line gives the overall CPU and Wall time), and keep note of it.
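
A convenient way to keep note of this reference timing is to save a copy of the sequential output and look at its last line (the copy name tparal_1.out_seq is just a suggestion):

cp tparal_1.out tparal_1.out_seq       # keep the sequential output for later comparison
tail -1 tparal_1.out_seq               # shows the "+Overall time at end" line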

Now you should run the parallel version of ABINIT. On a PC, with the MPICH implementation of MPI, you have to set up a file with the addresses of the different CPUs. Let's suppose you call it cluster. For a bi-processor PC, this file could have only one line, like the following:

sleepy.pcpm.ucl.ac.be:2
(In this case, processor 0 did not need to be specified.) For a cluster of four machines, you might have something like:
tux0
tux1
tux2
tux3
More possibilities are mentioned in the file ~abinit/doc/users/paral_use.
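
Before launching ABINIT itself, it may be worth checking that MPI can reach every machine listed in this file; a simple sanity check (assuming an MPICH-style mpirun) is:

mpirun -np 4 -machinefile cluster hostname    # each process should print the name of its host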

Then, you have to issue the run command for your MPI implementation, and mention the number of processors you want to use, as well as the abinit command.

On a bi-processor PC, this gives the following:

mpirun -np 2 -machinefile cluster ../../src/main/abinit < tparal_1.files >& tparal_1.log &

Now, examine the corresponding output file. If you have kept the output from the sequential job, you can make a diff between the two files. You will notice that the numerical results are practically identical. You will also see that the value of mkmem is 60 in the sequential case and 30 in the parallel case: the set of 60 k-points has been split into two subsets of 30 k-points, each treated by one of the two processors.
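
Assuming you saved the sequential output as tparal_1.out_seq, as suggested above, the comparison can be done with:

diff tparal_1.out_seq tparal_1.out | less     # the numerical results should hardly differ
grep mkmem tparal_1.out_seq tparal_1.out      # 60 k-points kept in memory vs 30 per processor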

The timing can be found at the end of the file. Here is an example:

- Proc.   0 individual time (sec): cpu=         28.3  wall=         28.3

================================================================================

 Calculation completed.
 Delivered    1 WARNINGs and   1 COMMENTs to log file.
+Overall time at end (sec) : cpu=         56.6  wall=         56.6

This effectively corresponds to a speed-up of the job by a factor of two. Let's examine it. The line beginning with Proc. 0 gives the CPU and Wall clock timing seen by processor 0 (the other is number 1): 28.3 sec of CPU time, and the same amount of Wall clock time. The line that starts with +Overall time gives the sum of the CPU and Wall clock timings over all processors. The summation is quite meaningful for the CPU time, but not for the Wall clock time: the job was finished after 28.3 sec, not 56.6 sec.

Now, you might try to increase the number of processors, and see whether the CPU time is shared equally among the different processors, so that the Wall clock time seen by each processor decreases. At some point (depending on your machine and on the sequential part of ABINIT), you will not be able to decrease the Wall clock time seen by one processor any further, and it is not worth using more processors. You should get a curve similar to this one:

The red curve shows the speed-up achieved, while the green one is the "y = x" line. The shape of the red curve will vary depending on your hardware configuration.

One last remark: the number of k-points need not be a multiple of the number of processors. As an example, you might try to run the above case with 16 processors: most of the processors will treat 4 k-points, but four of them will treat only 3. The maximal speed-up will then be 15 (=60/4) instead of 16.
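
In general, the maximal k-point speed-up is nkpt divided by the number of k-points handled by the busiest processor; a small back-of-the-envelope sketch of this estimate:

# maximal k-point speed-up for nkpt k-points distributed over nproc processors
nkpt=60 ; nproc=16
kpt_busiest=$(( (nkpt + nproc - 1) / nproc ))                                       # ceil(60/16) = 4
awk -v n=$nkpt -v k=$kpt_busiest 'BEGIN { printf "maximal speed-up: %g\n", n/k }'   # prints 15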

Parallelism over the spin

The parallelization over the spin is done together with the one over the k-points, so it works exactly the same way. The files ~abinit/tests/tutorial/tparal_2.in and ~abinit/tests/tutorial/tparal_2.files treat a spin-polarized system (distorted fcc iron) with only one k-point in the irreducible Brillouin zone. This is quite unphysical; its sole purpose is to show the spin parallelism with as few as two processors: the k-point parallelism has precedence over the spin parallelism, so with 2 processors one needs at most one k-point to see the spin parallelism.
If needed, modify the files file to provide a local temporary disk space. Run this test case, in sequential, then in parallel.

While the jobs are running, read the input file and the files file. Then, look closely at the output and log files. They are quite similar. With a diff, you will see the only obvious manifestation of the parallelism in the following:

< P newkpt: treating     40 bands with npw=    2698 for ikpt=   1
< P newkpt: treating     40 bands with npw=    2698 for ikpt=   1
---
> P newkpt: treating     40 bands with npw=    2698 for ikpt=   1 by node    0
> P newkpt: treating     40 bands with npw=    2698 for ikpt=   1 by node    1
In the second case (parallelism), node 0 is taking care of the up state for k-point 1, while node 1 is taking care of the down state for k-point 1. The timing analysis is very similar to the k-point parallelism case.

If you have more than 2 processors at hand, you might increase the value of ngkpt, so that more than one k-point is available, and see that the k-point and spin parallelisms indeed work concurrently.

Parallelism over the bands

The parallelism over bands in the ground-state case is controlled by the wfoptalg and nbdblock input variables.
By contrast, for response-function jobs, the band parallelism is automatically activated when needed.

For this type of parallelism, we will use the example of a gold dimer in a supercell (non-spin-polarized, with only one k-point). You can copy the files ~abinit/tests/tutorial/Input/tparal_3.in and ~abinit/tests/tutorial/Input/tparal_3.files to your Work_paral directory. If needed, change the files file to provide local disk space. Look at the input file tparal_3.in: you will see that nbdblock=2, so that at most two processors can be used to parallelize the job.

Run the job sequentially (about one minute on a 2.8 GHz PC), save the output file, then run in parallel with two processors. Note that the results of sequential and parallel jobs are numerically identical.

However, note that the numerical results are not completely identical to those obtained with nbdblock=1, the usual default value for nbdblock. Indeed, in order to parallelize over the bands, some changes in the algorithm used to find the wavefunctions are needed, which make it less stable. Such instability can become severe when the ratio between nbdblock and nband becomes large compared to the dielectric function. For this reason, band parallelism beyond nbdblock=4 is often not very interesting.

If you now compare the sequential and parallel timings in the output files, it might happen that using two processors does not give you any speed-up. Band parallelism in the ground state requires many more communications, and also has more sequential parts, than k-point or spin parallelism. You have to test your setup to see whether you can benefit from this parallelism.

Let us examine this in more detail for a dual-processor (SMP) 2.8 GHz PC. The run with only one processor takes 59.9 sec, while with two processors it takes 39.2 sec. The speed-up is 1.53, not really close to two, but still significant. The communication is quite fast, since this is an SMP machine; thus, for this case, the speed-up of 1.53 is likely close to optimal. We can examine the timing and see that the sequential parts indeed limit the attainable speed-up, irrespective of the communication speed. The following is the timing section for the one-processor case:

- For major independent code sections, cpu and wall times (sec),
- as well as % of the total time and number of calls

- routine                 cpu     %       wall     %      number of calls
-                                                          (-1=no count)
- fourwf(pot)           37.541  62.7     37.556  62.7            480
- nonlop(apply)          6.006  10.0      6.042  10.1            480
- fourwf(den)            2.888   4.8      2.889   4.8             66
- fourdp                 1.989   3.3      1.988   3.3             21
- cgwf-O(npw)            1.971   3.3      1.957   3.3             -1
- projbd                 1.842   3.1      1.839   3.1            768
- xc:pot/=fourdp         1.764   2.9      1.763   2.9              7
- nonlop(forces)         1.213   2.0      1.206   2.0             66
- forces                 0.926   1.5      0.926   1.5              6
- vtowfk(ssdiag)         0.353   0.6      0.351   0.6             -1
- 47   others            1.875   3.1      1.878   3.1

- subtotal              58.366  97.5     58.395  97.5

================================================================================

 Calculation completed.
 Delivered    2 WARNINGs and   2 COMMENTs to log file.
+Overall time at end (sec) : cpu=         59.9  wall=         59.9

The following is the cumulative timing section for the two-processor case:

- For major independent code sections, cpu and wall times (sec),
- as well as % of the total time and number of calls

- routine                 cpu     %       wall     %      number of calls
-                                                          (-1=no count)
- fourwf(pot)           38.062  48.5     38.080  48.5            480
- nonlop(apply)          6.301   8.0      6.302   8.0            480
- cgwf-O(npw)            4.881   6.2      4.879   6.2             -2
- fourdp                 4.584   5.8      4.582   5.8             42
- xc:pot/=fourdp         3.854   4.9      3.856   4.9             14
- fourwf(den)            2.946   3.8      2.947   3.8             66
- projbd                 2.535   3.2      2.540   3.2            768
- forces                 1.989   2.5      1.991   2.5             12
- vtorho:MPI             1.546   2.0      1.550   2.0             12
- nonlop(forces)         1.256   1.6      1.253   1.6             66
- vtowfk(contrib)        1.224   1.6      1.224   1.6             12
- symrhg(no FFT)         0.856   1.1      0.856   1.1             12
- vtowfk(ssdiag)         0.851   1.1      0.847   1.1             -2
- vtorho(4)-mkrho        0.550   0.7      0.547   0.7             12
- vtorho  (1)            0.522   0.7      0.522   0.7             12
- setsym                 0.477   0.6      0.476   0.6              2
- vtorho-kpt loop        0.459   0.6      0.457   0.6             12
- 40   others            1.148   1.5      1.131   1.4

- subtotal              74.040  94.4     74.040  94.3
-
- Proc.   0 individual time (sec): cpu=         39.2  wall=         39.2

================================================================================

 Calculation completed.
 Delivered    2 WARNINGs and   2 COMMENTs to log file.
+Overall time at end (sec) : cpu=         78.5  wall=         78.5

The two most important timings, fourwf(pot) and nonlop(apply), are nearly the same in the sequential and in the cumulative parallel case, so we can infer that the load is well balanced between the two processors. However, some other timings, each taking only a few percent, namely cgwf-O(npw), fourdp and xc:pot/=fourdp, more than double in the cumulative parallel case: they cannot benefit from the band-by-band parallelization (they stay sequential, and even suffer a small overhead), and thus decrease the maximal attainable speed-up.
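
To put the two timing sections side by side yourself, you can extract them from the output files (here we assume the sequential output was saved under the hypothetical name tparal_3.out_seq):

grep -A 20 'For major independent code sections' tparal_3.out_seq > timing_seq.txt
grep -A 25 'For major independent code sections' tparal_3.out     > timing_par.txt
paste timing_seq.txt timing_par.txt | less    # compare the per-routine cpu times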

The case treated here is not one of the most favourable ones for band-by-band parallelism. Indeed, the number of bands was quite small, and hence the CPU time of the routines treating the density and potential was not so small (although less than 10 percent). For a much larger number of bands, their relative workload would be smaller, and the maximal attainable speed-up would be bigger.

Finally, note that there is still some room for optimization in this type of parallelization, including some new algorithms presently under testing, with different values of wfoptalg.

Tuning input variables

In any case, balancing the load efficiently over the processors is a task that can be done neither straightforwardly nor systematically. Always try to put as much load as possible on each processor, first by using as few processors as possible.

When using the k-point and spin parallelisms, the ideal numbers of processors divide the product nsppol*nkpt (e.g. for nsppol*nkpt=12, it is quite efficient to use 2, 3, 4, 6 or 12 processors). For the other cases, you will have to find the best compromise by hand, depending both on what you want to do and on your environment.
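
As a trivial illustration of this rule, the following loop lists the processor counts that divide nsppol*nkpt evenly (the value 12 is just the example quoted above):

n=12                                            # nsppol*nkpt
for p in $(seq 1 $n); do
  [ $(( n % p )) -eq 0 ] && echo "$p processors -> $(( n / p )) (k-point,spin) pairs each"
done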

Evidencing overhead

Beyond a certain number of processors, the efficiency of the parallelism saturates, and may even decrease. This is due to the unavoidable overhead resulting from the increasing amount of communication between the processors. The loss of efficiency is highly dependent on the implementation, and is also linked to the decreasing load on each processor.

Now take one of the previous examples and perform the calculation again, progressively increasing the number of processors. Plot the execution time versus the number of processors and observe what happens. You should get something like:

Again, the shape of the curve will depend on your hardware configuration.
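
One possible way of collecting these timings, assuming the MPICH-style mpirun and the machine file used earlier (with enough entries for the largest run), is sketched below:

# Run the same job with an increasing number of processors and record the timing.
for np in 1 2 4 8; do
  rm -f tparal_1.out                            # make sure each run writes a fresh output file
  mpirun -np $np -machinefile cluster ../../src/main/abinit < tparal_1.files >& tparal_1_np${np}.log
  cp tparal_1.out tparal_1.out_np${np}          # keep a copy of the output for this run
  # the "individual time" line gives the wall clock time seen by processor 0
  echo -n "np=${np}  " ; grep 'individual time' tparal_1.out_np${np}
done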


7. Details of the implementation

The MPI toolbox in ABINIT

Most of the ABINIT-specific MPI routines are located in ~abinit/src/01managempi. They mainly consist in:

They are now used by a wide range of routines, more extensively in ~abinit/src/53_spacepar and in ~abinit/src/79_seqpar_mpi.

You might want to have a look at the routine headers for more detailed descriptions.

How to parallelize a routine: some hints

Here we will give you some advice on how to parallelize a subroutine of ABINIT. Do not expect too much, and remember that you remain mostly on your own for most decisions. Furthermore, we will suppose that you are familiar with ABINIT internals and source code. Anyway, you can skip this section without hesitation, as it is primarily intended for skilled developers.

First, every call to an MPI routine and every purely parallel section of your subroutine must be surrounded by the following preprocessing directives:

#if defined MPI
...
#endif

The first block of this type will likely appear in the "local variables" section, where you will declare all MPI-specific variables. Please note that some of them will be used in sequential mode as well, and thus will be declared outside this block (typically am_master, master, me_loc, etc.).

The MPI communications should be initialized at the very beginning of your subroutine. To do this, we suggest the following piece of code:

!Init mpi_comm
 call xcomm_world(spaceComm)
 am_master=.true.
 master = 0

!Init ntot proc max
 call xproc_max(nproc_loc,ierr)

!Define who i am
 call xme_whoiam(me_loc)

#if defined HAVE_MPI
 if (me_loc/=0) then
  am_master=.FALSE.
 endif

 write(message, '(a,i3,a)' ) '  ',nproc_loc,' CPU synchronized'
 call wrtout(std_out,message,'COLL')
 ...
#endif

Then keep in mind that file I/O cannot currently be parallelized, so any I/O operation must be performed with great care. Use the "COLL" and "PERS" arguments of wrtout wisely, and do not forget that there cannot be nested levels of parallelization.

For more insight, we suggest you look at some examples, e.g. the TDDFT routine.

 
