[mpich-discuss] MPI_Allreduce increases the performance of MPI_Alltoallv?

Li, Lihua (UMSL-Student) ll9n8 at mail.umsl.edu
Tue Jul 3 18:50:53 CDT 2012

Hi all,

Thanks for the prompt responses. I really appreciate the help.

> The allreduce operations have barrier semantics because of the data dependencies.  It could be that these "barriers" are reducing the number of unexpected messages inside the alltoallv algorithm, which is causing the performance improvement.

> To add to this, one thing you might try is setting the environment variable "MPICH_ALLTOALL_THROTTLE" to different values.  Its default value is 4.  Setting it to 0 causes all irecvs/isends to be posted at once.

> What version of MPICH2 are you using? (hopefully something modern)


I am using the mpich2 1.4-gcc-4.1-shared build, and I run my jobs on Fusion. So I suppose it's MPICH2 1.4?

> Could you send us the Jumpshot images you refer to?


Unfortunately I deleted the Jumpshot log file I was referring to. But as I ran more small experiments, I noticed that the "irregularities" also appear in runs with MPI_Allreduce: in some iterations, all processes seem to wait on one particular process for no apparent reason. I guess that is a normal phenomenon in parallel execution, and not caused by the absence of MPI_Allreduce.

I'll try the MPICH_ALLTOALL_THROTTLE trick and see how it goes.

Thank you again.

Tom Li

On Tuesday, Jul 3, 2012, at 1:39 PM, Li, Lihua (UMSL-Student) wrote:

Dear MPICH users,

Does anyone happen to have had such an experience, where MPI_Allreduce "seems to" make MPI_Alltoallv faster? I am currently stuck on this performance issue and cannot figure out why. The project I am working on is a parallel conjugate gradient solver, which relies on MPI calls to update the vector (MPI_Alltoallv) and scalar (MPI_Allreduce) values. In the original and simplest implementation, MPI_Alltoallv calls interleave with MPI_Allreduce calls. We modified the algorithm slightly to fold the data transmitted in MPI_Allreduce into MPI_Alltoallv. The modified algorithm therefore has the same number of MPI_Alltoallv calls, the same data transmission pattern, a slightly larger chunk of data per MPI_Alltoallv (~100 bytes more, on top of an original transmission of 100+ Mbytes), and no MPI_Allreduce calls. The result is a bit surprising to us: instead of a performance gain, the modified algorithm shows a slight performance loss compared to the unmodified algorithm over repeated experiments. Further timing of different parts of the code shows the performance discrepancy lies in the MPI_Alltoallv and MPI_Allreduce calls.

We looked into the Jumpshot images trying to find the reason. The two Jumpshots look rather similar to each other, since the transmission pattern is the same. However, the Jumpshot of the unmodified algorithm looks more "regular": the time spent in MPI_Alltoallv is about the same for every process, while the MPI_Allreduce operations take very little time. The Jumpshot image of the modified version shows greater variation in MPI_Alltoallv time among processes, while there are no MPI_Allreduce operations. It gives me the weird feeling that the MPI_Allreduce operations are "regulating" the behavior of MPI_Alltoallv.

Does anyone have a hint of what's going on?

Tom Li

mpich-discuss mailing list     mpich-discuss at mcs.anl.gov

