<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Hi all,<br>
<br>
Thanks for the prompt response. I really appreciate the helps.<br>
<br>
<div style="margin-left: 80px;"><font size="2"><span style="font-size: 10pt;">> The allreduce operations have barrier semantics because of the data dependencies. It could be that these "barriers" are reducing the number of unexpected messages inside the alltoallv
algorithm, which is causing the performance improvement.</span></font><br>
<font size="2"><span style="font-size: 10pt;"></span></font><br>
<font size="2"><span style="font-size: 10pt;">To add to this, one thing you might try to do is set the environment variable "MPICH_ALLTOALL_THROTTLE" to different values. Its default value is 4. Setting it to 0 causes all irecvs/isends to be posted at once.</span></font><br>
<font size="2"><span style="font-size: 10pt;"></span></font><br>
<font size="2"><span style="font-size: 10pt;">What version of MPICH2 are you using? (hopefully something modern)</span></font><br>
<font size="2"><span style="font-size: 10pt;"></span></font><br>
<font size="2"><span style="font-size: 10pt;">-Dave<br>
<br>
</span></font></div>

I am using mpich2 1.4-gcc-4.1-shared as my compiler. I run my jobs on Fusion, so I suppose it's MPICH2 1.4?
<div style="font-family: Times New Roman; color: #000000; font-size: 16px">
<div></div>
<div>
<div style="margin-left: 80px;">Could you send us the Jumpshot images you refer to?
</div>
<div style="margin-left: 80px;"><br>
</div>
<div style="margin-left: 80px;">Thanks,</div>
<div style="margin-left: 80px;">Rusty</div>

Unfortunately, I deleted the Jumpshot log file I was referring to. But as I run more small experiments, I notice that the "irregularities" also appear in runs that do use MPI_Allreduce: in some iterations, all processes seem to be waiting for one particular process to proceed, for no apparent reason. I guess that is a normal phenomenon in parallel execution and is not caused by the absence of MPI_Allreduce.
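
For reference, since I no longer have the Jumpshot logs, the two communication patterns being compared look roughly like the sketch below. The function and variable names are only placeholders of mine, and the real solver does more work per iteration; the point is just that the modified version keeps exactly the same MPI_Alltoallv pattern, with slightly larger messages, and drops the collective in between.

/* Placeholder sketch of the two variants being compared; the names are
 * made up and the real solver does more per iteration. */
#include <mpi.h>

/* Original version: vector update exchanged with MPI_Alltoallv, the scalar
 * values reduced with an interleaved MPI_Allreduce. */
void iteration_original(double *sendv, int *scnt, int *sdsp,
                        double *recvv, int *rcnt, int *rdsp,
                        double *local_scalars, double *global_scalars,
                        int nscalars, MPI_Comm comm)
{
    MPI_Alltoallv(sendv, scnt, sdsp, MPI_DOUBLE,
                  recvv, rcnt, rdsp, MPI_DOUBLE, comm);
    /* ... local computation producing local_scalars ... */
    MPI_Allreduce(local_scalars, global_scalars, nscalars,
                  MPI_DOUBLE, MPI_SUM, comm);
    /* ... vector/scalar updates using global_scalars ... */
}

/* Modified version: the few scalar values (~100 extra bytes on top of a
 * 100 MB+ exchange) are appended to each rank's MPI_Alltoallv payload and
 * the reduction is finished locally, so there is no MPI_Allreduce at all. */
void iteration_modified(double *sendv_plus_scalars, int *scnt, int *sdsp,
                        double *recvv_plus_scalars, int *rcnt, int *rdsp,
                        MPI_Comm comm)
{
    MPI_Alltoallv(sendv_plus_scalars, scnt, sdsp, MPI_DOUBLE,
                  recvv_plus_scalars, rcnt, rdsp, MPI_DOUBLE, comm);
    /* ... sum the scalar contributions received from every rank locally,
     *     then do the same vector/scalar updates as above ... */
}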
<font style="font-family: Tahoma;" size="3">I'll try the </font><font style="font-family: Tahoma;" size="3"><span style="font-size: 10pt;">MPICH_ALLTOALL_THROTTLE trick</span></font> and see how it goes.<br>
<br>
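
For my own notes, this is how I understand the throttling that Dave describes. It is only a sketch of the idea, with made-up names, not the actual MPICH code (which, as far as I understand, also rotates the peer order by rank):

/* Sketch of the batched irecv/isend posting I understand
 * MPICH_ALLTOALL_THROTTLE to control -- illustration only. */
#include <mpi.h>
#include <stdlib.h>

static void throttled_alltoallv(char *sendbuf, const int *sendcounts,
                                const int *sdispls, char *recvbuf,
                                const int *recvcounts, const int *rdispls,
                                MPI_Comm comm, int throttle)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    if (throttle <= 0)
        throttle = nprocs;   /* throttle 0: post everything at once */

    MPI_Request *reqs = malloc(2 * (size_t)throttle * sizeof(*reqs));

    for (int start = 0; start < nprocs; start += throttle) {
        int batch = (nprocs - start < throttle) ? nprocs - start : throttle;
        int nreq = 0;

        /* Post one irecv and one isend per peer in this batch only. */
        for (int i = 0; i < batch; i++) {
            int peer = start + i;
            MPI_Irecv(recvbuf + rdispls[peer], recvcounts[peer], MPI_CHAR,
                      peer, 0, comm, &reqs[nreq++]);
            MPI_Isend(sendbuf + sdispls[peer], sendcounts[peer], MPI_CHAR,
                      peer, 0, comm, &reqs[nreq++]);
        }

        /* Finish the whole batch before posting the next one. */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }

    free(reqs);
}

If I read the description correctly, setting the variable to 0 corresponds to posting every irecv/isend up front, so it should help show whether unexpected messages are what the interleaved MPI_Allreduce was shielding us from.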

Thank you again.

Tom Li

On Tuesday, Jul 3, 2012, at 1:39 PM, Li, Lihua (UMSL-Student) wrote:

> Dear MPICH users,
>
> Does anyone happen to have had such an experience, where MPI_Allreduce
> "seems to" make MPI_Alltoallv faster? I am currently stuck on this
> performance issue and cannot figure out why. The project I am working on
> is a parallel version of a conjugate gradient solver, which relies on MPI
> calls to update the vector values (MPI_Alltoallv) and the scalar values
> (MPI_Allreduce). In the original and simplest implementation, the
> MPI_Alltoallv calls are interleaved with MPI_Allreduce calls. We modified
> the algorithm a little to fold the data transmitted in MPI_Allreduce into
> MPI_Alltoallv. The new, modified algorithm therefore has the same number
> of MPI_Alltoallv calls, the same pattern of data transmission, a slightly
> larger chunk of data transmitted in each MPI_Alltoallv (~100 bytes more,
> on top of an original transmission of 100 MB+), and no MPI_Allreduce
> calls. The result is a bit surprising to us: instead of a performance
> gain, the modified algorithm shows a slight performance loss compared to
> the unmodified algorithm in repeated experiments. Further timing of
> different parts of the code shows that the performance discrepancy lies in
> the MPI_Alltoallv and MPI_Allreduce calls.
>
> We looked into the Jumpshot images to try to find the reason. The two
> Jumpshots look rather similar to each other, since the transmission
> pattern is the same. However, the Jumpshot of the unmodified algorithm
> looks more "regular": the time spent in MPI_Alltoallv is about the same
> for every process, while the MPI_Allreduce operations take very little
> time. The Jumpshot image for the modified version shows greater variation
> in MPI_Alltoallv time among processes, while there are no MPI_Allreduce
> operations at all. It gives me the weird feeling that the MPI_Allreduce
> operations are "regulating" the behavior of MPI_Alltoallv.
>
> Does anyone have a hint of what's going on?
>
> Regards,
> Tom Li
>
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss@mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss