[petsc-users] log_summary time ratio and flops ratio

Wed Feb 10 08:31:23 CST 2016

On Wed, Feb 10, 2016 at 8:12 AM, Xiangdong <epscodes at gmail.com> wrote:

> On Mon, Feb 8, 2016 at 6:45 PM, Jed Brown <jed at jedbrown.org> wrote:
>
>> Xiangdong <epscodes at gmail.com> writes:
>>
>> > iii) Since the time ratios of VecDot (2.5) and MatMult (1.5) are still
>> > high, I reran the program with the IPM module. The IPM summary is here:
>> > https://drive.google.com/file/d/0BxEfb1tasJxhYXI0VkV0cjlLWUU/view?usp=sharing
>> > In these IPM results, MPI_Allreduce takes 74% of the MPI time. The
>> > communication-by-task figure (first figure on p. 4) in the link above
>> > shows that communication is not well balanced. Is this related to the
>> > hardware and network (which users cannot control), or can I change
>> > something in my code to improve it?
>>
>> Here are a few functions that don't have any communication, but still
>> have significant load imbalance.
>>
>>   VecAXPY          1021815 1.0 2.2148e+01 2.1 1.89e+10 1.1 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0 207057
>>   VecMAXPY          613089 1.0 1.3276e+01 2.2 2.27e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 414499
>>   MatSOR            818390 1.0 1.9608e+02 1.5 2.00e+11 1.1 0.0e+00 0.0e+00 0.0e+00 22 40  0  0  0  22 40  0  0  0 247472
>>
>>
> The result above is from a run with 256 cores (16 nodes * 16 cores/node).
> I did another run with 64 nodes * 4 cores/node (also 256 cores). Now these
> functions are much better balanced (a time ratio of 1.2-1.3 instead of
> 1.5-2.2).
>
> VecAXPY           987215 1.0 6.8469e+00 1.3 1.82e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 647096
> VecMAXPY          592329 1.0 6.0866e+00 1.3 2.19e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 873511
> MatSOR            790717 1.0 1.2933e+02 1.2 1.93e+11 1.1 0.0e+00 0.0e+00 0.0e+00 24 40  0  0  0  24 40  0  0  0 362525
>
> For the functions that require communication, the time ratio is about
> 1.4-1.6:
>
> VecDot            789772 1.0 8.4804e+01 1.4 1.46e+10 1.1 0.0e+00 0.0e+00 7.9e+05 14  3  0  0 40  14  3  0  0 40 41794
> VecNorm           597914 1.0 7.6259e+01 1.6 1.10e+10 1.1 0.0e+00 0.0e+00 6.0e+05 12  2  0  0 30  12  2  0  0 30 34996
>
> The full log_summary output for this new run is here:
> https://googledrive.com/host/0BxEfb1tasJxhVkZ2NHJkSmF4LUU
>
> Can we now say that the load imbalance comes from network communication
> rather than from memory bandwidth?
>

Actually, now it looks even more like what Jed was saying. With only 4 cores
per node, each core has much more memory bandwidth available.
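
As a rough check, you can compare the Mflop/s column (the last field) of the
two excerpts quoted above for the kernels that do no communication. Both runs
use 256 cores, so any gain comes from the layout alone. The sketch below only
does that arithmetic; the dictionary and function names are mine, not anything
from PETSc, and the numbers are copied from the quoted log lines.

```python
# Mflop/s (last column) copied from the two -log_summary excerpts quoted above.
# Both runs use 256 MPI ranks; only the number of ranks per node differs.
mflops_16_per_node = {"VecAXPY": 207057, "VecMAXPY": 414499, "MatSOR": 247472}
mflops_4_per_node = {"VecAXPY": 647096, "VecMAXPY": 873511, "MatSOR": 362525}


def speedup(event):
    """Ratio of the aggregate flop rate at 4 ranks/node to that at 16 ranks/node."""
    return mflops_4_per_node[event] / mflops_16_per_node[event]


# Hypothetical helper table: one speedup per communication-free event.
speedups = {event: speedup(event) for event in mflops_16_per_node}

for event, s in speedups.items():
    print(f"{event:8s} {s:.2f}x")
```

That gives roughly 3.1x for VecAXPY, 2.1x for VecMAXPY and 1.5x for MatSOR
with the same number of cores and no messages sent in any of them. The network
cannot account for that, but contention for per-node memory bandwidth at 16
ranks per node can.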

   Matt

> Thanks.
>
> Xiangdong
>
>> You can and should improve load balance before stressing about network
>> costs.  This could be that the nodes aren't clean (running at different
>> speeds) or that the partition is not balancing data.
>>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener