[petsc-users] log_summary time ratio and flops ratio
Jed Brown
jed at jedbrown.org
Wed Feb 10 08:35:21 CST 2016
Xiangdong <epscodes at gmail.com> writes:
>> VecAXPY 1021815 1.0 2.2148e+01 2.1 1.89e+10 1.1 0.0e+00 0.0e+00 0.0e+00 2 4 0 0 0 2 4 0 0 0 207057
>> VecMAXPY 613089 1.0 1.3276e+01 2.2 2.27e+10 1.1 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 414499
>> MatSOR 818390 1.0 1.9608e+02 1.5 2.00e+11 1.1 0.0e+00 0.0e+00 0.0e+00 22 40 0 0 0 22 40 0 0 0 247472
>>
>>
> The result above is from a run with 256 cores (16 nodes * 16 cores/node). I
> did another run with 64 nodes * 4 cores/node. Now these functions are much
> better balanced (a factor of 1.2-1.3, instead of 1.5-2.1).
>
> VecAXPY 987215 1.0 6.8469e+00 1.3 1.82e+10 1.1 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 647096
> VecMAXPY 592329 1.0 6.0866e+00 1.3 2.19e+10 1.1 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 873511
> MatSOR 790717 1.0 1.2933e+02 1.2 1.93e+11 1.1 0.0e+00 0.0e+00 0.0e+00 24 40 0 0 0 24 40 0 0 0 362525
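So it's significantly faster in addition to being more balanced
(VecAXPY drops from 22.1 s to 6.8 s, for example). I would attribute
that to memory bandwidth: with 4 processes per node instead of 16,
fewer processes compete for each node's memory bandwidth.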
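
To put numbers on "significantly faster", here is a small sketch that
compares the two runs per call, since the call counts differ slightly.
The times and counts are copied from the log excerpts quoted in this
thread; the dictionary and function names are just illustrative, and
I'm assuming the time column is the maximum over processes.

```python
# Time (seconds) and call counts per event, copied from the two
# log_summary excerpts quoted in this thread.
run_16x16 = {  # 16 nodes * 16 cores/node
    "VecAXPY": (2.2148e+01, 1021815),
    "VecMAXPY": (1.3276e+01, 613089),
    "MatSOR": (1.9608e+02, 818390),
}
run_64x4 = {  # 64 nodes * 4 cores/node
    "VecAXPY": (6.8469e+00, 987215),
    "VecMAXPY": (6.0866e+00, 592329),
    "MatSOR": (1.2933e+02, 790717),
}


def per_call_speedup(name):
    """Ratio of time per call: 16x16 layout over 64x4 layout."""
    old_time, old_calls = run_16x16[name]
    new_time, new_calls = run_64x4[name]
    return (old_time / old_calls) / (new_time / new_calls)


speedups = {name: per_call_speedup(name) for name in run_16x16}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x faster per call with 4 cores/node")
```

The vector operations gain the most (roughly 2x to 3x per call) and
MatSOR gains less, which is at least consistent with streaming kernels
being limited by per-node memory bandwidth. It is not a proof.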
> For the functions that require communication, the time ratio is about 1.4-1.6:
> VecDot 789772 1.0 8.4804e+01 1.4 1.46e+10 1.1 0.0e+00 0.0e+00 7.9e+05 14 3 0 0 40 14 3 0 0 40 41794
> VecNorm 597914 1.0 7.6259e+01 1.6 1.10e+10 1.1 0.0e+00 0.0e+00 6.0e+05 12 2 0 0 30 12 2 0 0 30 34996
>
> The full logsummary for this new run is here:
> https://googledrive.com/host/0BxEfb1tasJxhVkZ2NHJkSmF4LUU
>
> Can we now say that the load imbalance comes from network communication
> rather than from memory bandwidth?
It is expected that synchronizing functions like these show higher "load
imbalance", but that doesn't necessarily mean the network is running at
different speeds on different nodes or some such. Rather, you've
accumulated load imbalance over the previous operations, and now every
process has to wait for the slowest one before anyone can continue. So
the process that was fastest before logs the longest time for the Norm
or Dot. I see about 100µs per VecDot above (8.48e+01 s over 789772
calls), which is reasonable. If you get more exact load balance in the
local computation, you might be able to improve it a bit.
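
A toy model of that effect, in case it helps. This is only a sketch
under made-up assumptions, not a measurement: the process count, the
100µs of local work, the 30% spread and the 20µs reduction latency are
all hypothetical, and the reduction is modeled as "wait for the slowest
process, then pay a fixed latency".

```python
import random


def simulate(num_procs=8, num_steps=1000, imbalance=0.3,
             latency=20e-6, seed=0):
    """Each step: local work, then a synchronizing reduction.

    Returns (work, time_in_reduction), both indexed by process.
    All parameters are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    # Fixed per-process cost of the local work in one step (seconds).
    work = [100e-6 * (1.0 + imbalance * rng.random())
            for _ in range(num_procs)]
    time_in_reduction = [0.0] * num_procs
    for _ in range(num_steps):
        slowest = max(work)
        for p in range(num_procs):
            # A process that finishes its local work early sits inside
            # the reduction until the slowest process arrives.
            time_in_reduction[p] += (slowest - work[p]) + latency
    return work, time_in_reduction


work, waits = simulate()
fastest = min(range(len(work)), key=work.__getitem__)
slowest = max(range(len(work)), key=work.__getitem__)
print(f"fastest process spends {waits[fastest]:.4f} s in the reduction")
print(f"slowest process spends {waits[slowest]:.4f} s in the reduction")
print(f"max/min time ratio for the reduction: {max(waits) / min(waits):.2f}")
```

In this model the network behaves identically for every process, yet
the reduction still reports a max/min time ratio well above 1, and the
largest time belongs to the process with the least local work.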