<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 8, 2016 at 6:45 PM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Xiangdong <<a href="mailto:epscodes@gmail.com">epscodes@gmail.com</a>> writes:<br>

<br>

> iii) since the time ratios of VecDot (2.5) and MatMult (1.5) are still<br>

> high, I reran the program with the IPM module. The IPM summary is here:<br>

> <a href="https://drive.google.com/file/d/0BxEfb1tasJxhYXI0VkV0cjlLWUU/view?usp=sharing" rel="noreferrer" target="_blank">https://drive.google.com/file/d/0BxEfb1tasJxhYXI0VkV0cjlLWUU/view?usp=sharing</a>.<br>

> From these IPM results, MPI_Allreduce takes 74% of MPI time. The<br>

> communication-by-task figure (1st figure on p4) in the above link shows that<br>

> it is not well-balanced. Is this related to the hardware and network (which<br>

> users cannot control), or can I do something in my code to improve it?<br>

<br>

</span>Here are a few functions that don't have any communication, but still<br>

have significant load imbalance.<br>

<span class=""><br>

  VecAXPY          1021815 1.0 2.2148e+01 2.1 1.89e+10 1.1 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0 207057<br>

</span><span class="">  VecMAXPY          613089 1.0 1.3276e+01 2.2 2.27e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 414499<br>

</span><span class="">  MatSOR            818390 1.0 1.9608e+02 1.5 2.00e+11 1.1 0.0e+00 0.0e+00 0.0e+00 22 40  0  0  0  22 40  0  0  0 247472<br>

<br></span></blockquote><div><br></div><div>For these functions, the flop ratios are all 1.1, while the time ratios are 1.5-2.2, so the amount of work is roughly balanced across processes. Runs on both Stampede and my group cluster behaved similarly. Given that I only use 256 cores, do you think it is likely that my job was assigned cores with different speeds? How can I test or measure this, given that the job is assigned to different nodes each time?</div><div><br></div><div>Are there any other factors I should look into to explain a flop ratio of 1.1 but time ratios of 1.5-2.2 for non-communicating functions?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

</span>You can and should improve load balance before stressing about network<br>

costs.  This could be that the nodes aren't clean (running at different<br>

speeds) or that the partition is not balancing data.<br>

</blockquote></div><br></div></div>
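<div>One way to separate "slow nodes" from "unbalanced partition" might be to time the same fixed-size local kernel on every rank, record each rank's hostname, and compare the ratios. Below is a minimal sketch of the post-processing only. The timings, flop counts, and host names are made-up placeholders, and the helper names are hypothetical. It assumes the ratio in the log is the maximum over ranks divided by the minimum.</div>

```python
from collections import defaultdict

# Hypothetical per-rank data for one non-communicating kernel (values are made up).
times = [1.0, 1.1, 2.0, 2.2]          # seconds per rank
flops = [1.0e9, 1.1e9, 1.0e9, 1.1e9]  # flops per rank
hosts = ["nodeA", "nodeA", "nodeB", "nodeB"]  # hostname of each rank


def imbalance_ratio(values):
    """Max/min across ranks (assumed to match the ratio column in the log)."""
    return max(values) / min(values)


def mean_time_by_host(rank_times, rank_hosts):
    """Average the per-rank times on each host to spot consistently slow nodes."""
    grouped = defaultdict(list)
    for t, h in zip(rank_times, rank_hosts):
        grouped[h].append(t)
    return {h: sum(ts) / len(ts) for h, ts in grouped.items()}


if __name__ == "__main__":
    print("flop ratio:", imbalance_ratio(flops))
    print("time ratio:", imbalance_ratio(times))
    # If one host stands out here while flops are balanced, the node is suspect.
    for host, mean_t in sorted(mean_time_by_host(times, hosts).items()):
        print(host, mean_t)
```

<div>If the slow ranks cluster on particular hosts across several runs, that points at the nodes. If they follow particular subdomains instead, the partition or memory access pattern is the more likely cause.</div>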