[petsc-users] log_summary time ratio and flops ratio

Wed Feb 10 08:35:21 CST 2016

Xiangdong <epscodes at gmail.com> writes:
>>   VecAXPY          1021815 1.0 2.2148e+01 2.1 1.89e+10 1.1 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0 207057
>>   VecMAXPY          613089 1.0 1.3276e+01 2.2 2.27e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 414499
>>   MatSOR            818390 1.0 1.9608e+02 1.5 2.00e+11 1.1 0.0e+00 0.0e+00 0.0e+00 22 40  0  0  0  22 40  0  0  0 247472
>>
>>
> The result above is from a run with 256 cores (16 nodes * 16 cores/node). I
> did another run with 64 nodes * 4 cores/node. Now these functions are much
> better balanced (a time ratio of 1.2-1.3 instead of 1.5-2.1).
>
> VecAXPY           987215 1.0 6.8469e+00 1.3 1.82e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 647096
> VecMAXPY          592329 1.0 6.0866e+00 1.3 2.19e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 873511
> MatSOR            790717 1.0 1.2933e+02 1.2 1.93e+11 1.1 0.0e+00 0.0e+00 0.0e+00 24 40  0  0  0  24 40  0  0  0 362525

So it's significantly faster in addition to being more balanced.  I
would attribute that to memory bandwidth: with 4 processes per node
instead of 16, each process gets a larger share of the node's bandwidth.
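
To put numbers on "significantly faster", here is a small sketch that divides the max times of the two quoted runs. The call counts and times are copied from the logs above; the dictionary layout and the function name are only choices made for this illustration, not anything PETSc provides.

```python
# (call count, max time in seconds) copied from the two quoted logs.
# The layout labels are just dictionary keys chosen for this sketch.
runs = {
    "16 nodes x 16 cores": {
        "VecAXPY": (1021815, 2.2148e+01),
        "VecMAXPY": (613089, 1.3276e+01),
        "MatSOR": (818390, 1.9608e+02),
    },
    "64 nodes x 4 cores": {
        "VecAXPY": (987215, 6.8469e+00),
        "VecMAXPY": (592329, 6.0866e+00),
        "MatSOR": (790717, 1.2933e+02),
    },
}


def speedups(before, after):
    """Return {event: (total-time speedup, per-call speedup)}."""
    result = {}
    for event, (count_b, time_b) in before.items():
        count_a, time_a = after[event]
        total = time_b / time_a
        # The two runs make slightly different numbers of calls,
        # so also compare the average time of a single call.
        per_call = (time_b / count_b) / (time_a / count_a)
        result[event] = (total, per_call)
    return result


table = speedups(runs["16 nodes x 16 cores"], runs["64 nodes x 4 cores"])
for event, (total, per_call) in table.items():
    print(f"{event:10s} total {total:4.2f}x  per call {per_call:4.2f}x")
```

In total time this gives roughly 3.2x for VecAXPY, 2.2x for VecMAXPY and 1.5x for MatSOR on the same 256 cores, with the simple vector kernels gaining the most from spreading the processes over more nodes.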

> For the functions that require communication, the time ratio is about 1.4-1.6:
> VecDot            789772 1.0 8.4804e+01 1.4 1.46e+10 1.1 0.0e+00 0.0e+00 7.9e+05 14  3  0  0 40  14  3  0  0 40 41794
> VecNorm           597914 1.0 7.6259e+01 1.6 1.10e+10 1.1 0.0e+00 0.0e+00 6.0e+05 12  2  0  0 30  12  2  0  0 30 34996
>
> The full log_summary for this new run is here:
> https://googledrive.com/host/0BxEfb1tasJxhVkZ2NHJkSmF4LUU
>
> Can we now say that the load imbalance comes from network communication
> rather than memory bandwidth?

It is expected that synchronizing functions like these show a higher
"load imbalance", but that doesn't necessarily mean the network is
running at different speeds on different nodes.  Rather, you've
accumulated load imbalance over previous operations and now every
process has to wait for the slowest one before anyone can continue.  So
the process that was fastest before logs the longest time for the Norm
or Dot.  I see about 100 µs per VecDot above (8.48e+01 s over 789772
calls), which is reasonable.  If you get more exact load balance in the
local computation, you might be able to improve it a bit.
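
A toy model of that effect, as a sketch only: assume each rank does some local work, then enters a blocking reduction that can finish only after the slowest rank has arrived. The per-rank work times and the 60 µs reduction cost below are made-up numbers for illustration; only the VecDot time and call count at the end come from the log above.

```python
def time_in_reduction(local_times, reduction_latency):
    """Time each rank spends inside a blocking reduction.

    local_times: local work each rank does before entering the reduction.
    reduction_latency: cost of the reduction once every rank has arrived.
    """
    finish = max(local_times) + reduction_latency
    return [finish - t for t in local_times]


# Hypothetical per-rank local work in microseconds (max/min ratio 1.2).
local = [400.0, 420.0, 450.0, 480.0]
waits = time_in_reduction(local, reduction_latency=60.0)

fastest = local.index(min(local))
ratio = max(waits) / min(waits)
print("time logged in the reduction per rank:", waits)
print("rank with the longest logged time:", waits.index(max(waits)))
print("max/min ratio seen for the reduction:", round(ratio, 2))

# Average cost of one VecDot from the quoted log (max time / call count).
per_call_us = 8.4804e+01 / 789772 * 1e6
print("VecDot per call (microseconds):", round(per_call_us, 1))
```

In this model the rank that finished its local work first logs the longest time in the reduction, and a local imbalance of 1.2 shows up as a larger ratio on the synchronizing event even though the reduction itself costs the same for everyone.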