[petsc-users] MPI Communication times

Manuel Valera mvalera-w at sdsu.edu
Fri Mar 22 18:52:24 CDT 2019


This is a 3D fluid dynamics code. It uses Arakawa C-type grids and
curvilinear coordinates for the nonhydrostatic Navier-Stokes equations, and
we also add realistic stratification (temperature/density) and a
subgrid-scale model for turbulence. What we are solving here is just a
seamount with velocity forcing from one side, and it is only 5 pressure
solves, or iterations.

PETSc is used via DMDAs to set up the grids and arrays and to do (almost)
every calculation in a distributed manner; the pressure solver is implicit
and is carried out with the KSP module. I/O is still serial.
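
For reference, here is a minimal sketch of that kind of setup (not our actual
code; the grid sizes, stencil, and the placeholder identity operator below are
just for illustration):

#include <petscdmda.h>
#include <petscksp.h>

int main(int argc, char **argv)
{
  DM       da;
  Mat      A;
  Vec      p, rhs;
  KSP      ksp;
  PetscInt i, rstart, rend;

  PetscInitialize(&argc, &argv, NULL, NULL);
  /* distributed 3D structured grid with a default parallel decomposition */
  DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
               DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 64, 64, 32,
               PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, 1, 1,
               NULL, NULL, NULL, &da);
  DMSetUp(da);
  DMCreateMatrix(da, &A);             /* pressure operator, preallocated by the DMDA */
  DMCreateGlobalVector(da, &p);
  VecDuplicate(p, &rhs);
  VecSet(rhs, 1.0);

  /* placeholder operator (identity); the real curvilinear assembly is omitted */
  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) MatSetValue(A, i, i, 1.0, INSERT_VALUES);
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);             /* honors -ksp_type, -pc_type, -log_view, ... */
  KSPSolve(ksp, rhs, p);              /* the implicit pressure solve */

  KSPDestroy(&ksp); VecDestroy(&rhs); VecDestroy(&p);
  MatDestroy(&A); DMDestroy(&da);
  PetscFinalize();
  return 0;
}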

I am attaching the run outputs, named 60mNP.txt where NP is the number of
processors used. These are large files; you can read the -log_view part with
tail -n 140 [filename].

Thanks for your help,



On Fri, Mar 22, 2019 at 3:40 PM Zhang, Junchao <jczhang at mcs.anl.gov> wrote:

>
> On Fri, Mar 22, 2019 at 4:55 PM Manuel Valera <mvalera-w at sdsu.edu> wrote:
>
>> No, it is the same problem run with different numbers of processors; I
>> have data from 1 to 20 nodes in increments of 20 processors / 1 node,
>> and additionally for 1 processor.
>>
>
> That means you used strong scaling. If we combine VecScatterBegin/End, going
> from 20 to 100 to 200 cores, it took 2%, 13%, and 18% of the execution time
> respectively. It looks very unscalable, and I do not know why.
> VecScatterBegin took the same time with 100 and 200 cores. My explanation
> is that VecScatterBegin just packs data and then calls non-blocking
> MPI_Isend, whereas VecScatterEnd has to wait for the data to arrive (see the
> sketch below).
> Could you tell us more about your problem: for example, is it 2D or 3D, what
> is the communication pattern, and how many neighbors does each rank have?
> Also attach the whole -log_view log files so that we can understand the
> problem better.
> Thanks.
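>
> Roughly, the begin/end split follows this pattern (a simplified sketch, not
> PETSc's actual implementation; the function names are made up):
>
> #include <mpi.h>
>
> /* "begin": pack data into a send buffer (omitted) and post non-blocking MPI */
> void halo_begin(double *sendbuf, double *recvbuf, int n, int nbr,
>                 MPI_Request req[2])
> {
>   MPI_Irecv(recvbuf, n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &req[0]);
>   MPI_Isend(sendbuf, n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &req[1]);
> }
>
> /* "end": block until the data has actually arrived */
> void halo_end(MPI_Request req[2])
> {
>   MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  /* the waiting shows up here */
> }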
>
>>
>> On Fri, Mar 22, 2019 at 2:48 PM Zhang, Junchao <jczhang at mcs.anl.gov>
>> wrote:
>>
>>> Did you change the problem size between the different runs?
>>>
>>> On Fri, Mar 22, 2019 at 4:09 PM Manuel Valera <mvalera-w at sdsu.edu>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I repeated the timings with the -log_sync option, and now for 200
>>>> processors / 20 nodes I get:
>>>>
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ---   Total
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> VecScatterBarrie    3014 1.0 5.6771e+01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0     0
>>>> VecScatterBegin     3014 1.0 3.1684e+01 2.0 0.00e+00 0.0 4.2e+06 1.1e+06 2.8e+01  4  0 63 56  0   4  0 63 56  0     0
>>>> VecScatterEnd       2976 1.0 1.1383e+02 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 14  0  0  0  0  14  0  0  0  0     0
>>>>
>>>> With 100 processors / 10 nodes:
>>>>
>>>> VecScatterBarrie    3010 1.0 7.4430e+01 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0
>>>> VecScatterBegin     3010 1.0 3.8504e+01 2.4 0.00e+00 0.0 1.6e+06 2.0e+06 2.8e+01  4  0 71 66  0   4  0 71 66  0     0
>>>> VecScatterEnd       2972 1.0 8.5158e+01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  9  0  0  0  0   9  0  0  0  0     0
>>>>
>>>> And with 20 processors / 1 node:
>>>>
>>>> VecScatterBarrie    2596 1.0 4.0614e+01 7.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
>>>> VecScatterBegin     2596 1.0 1.4970e+01 1.3 0.00e+00 0.0 1.2e+05 4.0e+06 3.0e+01  1  0 81 61  0   1  0 81 61  0     0
>>>> VecScatterEnd       2558 1.0 1.4903e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>>
>>>> Can you help me interpret this? What I see is the End portion taking more
>>>> relative time while Begin stays about the same beyond one node; also, the
>>>> Barrier and Begin counts are the same every time. But how do I estimate
>>>> communication times from here?
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Wed, Mar 20, 2019 at 3:24 PM Zhang, Junchao <jczhang at mcs.anl.gov>
>>>> wrote:
>>>>
>>>>> Forgot to mention that long VecScatter times might also be due to local
>>>>> memory copies. If the communication pattern has a large local-to-local
>>>>> (self-to-self) scatter, which often happens thanks to locality, then the
>>>>> memory copy time is counted in VecScatter. You can analyze your code's
>>>>> communication pattern to see if that is the case.
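>>>>>
>>>>> Just for illustration (a toy sketch, not taken from your code): a scatter
>>>>> whose indices are all locally owned ends up as a plain memory copy, yet
>>>>> its time is still logged under the VecScatter events.
>>>>>
>>>>> #include <petscvec.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>   Vec        x, y;
>>>>>   IS         is;
>>>>>   VecScatter ctx;
>>>>>   PetscInt   nlocal = 1000, rstart;
>>>>>
>>>>>   PetscInitialize(&argc, &argv, NULL, NULL);
>>>>>   VecCreateMPI(PETSC_COMM_WORLD, nlocal, PETSC_DETERMINE, &x);
>>>>>   VecDuplicate(x, &y);
>>>>>   VecGetOwnershipRange(x, &rstart, NULL);
>>>>>   /* every rank scatters only its own block: no MPI messages, just memcpy */
>>>>>   ISCreateStride(PETSC_COMM_WORLD, nlocal, rstart, 1, &is);
>>>>>   VecScatterCreate(x, is, y, is, &ctx);
>>>>>   VecScatterBegin(ctx, x, y, INSERT_VALUES, SCATTER_FORWARD);
>>>>>   VecScatterEnd(ctx, x, y, INSERT_VALUES, SCATTER_FORWARD);
>>>>>   VecScatterDestroy(&ctx); ISDestroy(&is); VecDestroy(&x); VecDestroy(&y);
>>>>>   PetscFinalize();
>>>>>   return 0;
>>>>> }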
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Wed, Mar 20, 2019 at 4:44 PM Zhang, Junchao via petsc-users <
>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 20, 2019 at 4:18 PM Manuel Valera <mvalera-w at sdsu.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for your answer. So, for example, I have a log for 200 cores
>>>>>>> across 10 nodes that reads:
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ---   Total
>>>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>>> VecScatterBegin     3014 1.0 4.5550e+01 2.6 0.00e+00 0.0 4.2e+06 1.1e+06 2.8e+01  4  0 63 56  0   4  0 63 56  0     0
>>>>>>> VecScatterEnd       2976 1.0 1.2143e+02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 14  0  0  0  0  14  0  0  0  0     0
>>>>>>>
>>>>>>> While for 20 nodes at one node I have:
>>>>>>>
>>>>>>  What does that mean?
>>>>>>
>>>>>>> VecScatterBegin     2596 1.0 2.9142e+01 2.1 0.00e+00 0.0 1.2e+05 4.0e+06 3.0e+01  2  0 81 61  0   2  0 81 61  0     0
>>>>>>> VecScatterEnd       2558 1.0 8.0344e+01 7.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
>>>>>>>
>>>>>>> Where do I see the max/min ratio in here? And why is the End step all
>>>>>>> 0.0e+00 in both, yet it still grows from 3% to 14% of the total time?
>>>>>>> It seems I would need to run again with the -log_sync option; is this
>>>>>>> correct?
>>>>>>>
>>>>>> The max/min ratio is the column with, e.g., the 2.1 and 7.9 in it. MPI
>>>>>> send/recv are done in VecScatterBegin(); VecScatterEnd() only does
>>>>>> MPI_Wait, which is why it shows zero messages. Yes, run with -log_sync
>>>>>> and see what happens.
>>>>>>
>>>>>>
>>>>>>> Different question: couldn't I estimate the total communication time as
>>>>>>> a typical communication time per MPI message times the number of MPI
>>>>>>> messages reported in the log? Or does it not work like that?
>>>>>>>
>>>>>> That probably would not work, because you have multiple processes doing
>>>>>> send/recv at the same time and they might saturate the bandwidth. PETSc
>>>>>> also does computation/communication overlapping.
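>>>>>>
>>>>>> The usual overlap pattern looks schematically like this (just a sketch;
>>>>>> the DM/Vec names and the omitted compute steps are placeholders):
>>>>>>
>>>>>> #include <petscdmda.h>
>>>>>>
>>>>>> static PetscErrorCode StepWithOverlap(DM da, Vec gvec, Vec lvec)
>>>>>> {
>>>>>>   /* start the ghost exchange: pack buffers, post non-blocking MPI */
>>>>>>   DMGlobalToLocalBegin(da, gvec, INSERT_VALUES, lvec);
>>>>>>   /* ... compute on interior points that need no ghost data ... */
>>>>>>   /* finish the ghost exchange: wait for the messages to arrive */
>>>>>>   DMGlobalToLocalEnd(da, gvec, INSERT_VALUES, lvec);
>>>>>>   /* ... now stencils near the subdomain boundaries can be applied ... */
>>>>>>   return 0;
>>>>>> }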
>>>>>>
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 20, 2019 at 2:02 PM Zhang, Junchao <jczhang at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> See the "Mess  Avg len  Reduct" numbers in each log stage. Mess is the
>>>>>>>> total number of messages sent in an event over all processes, Avg len
>>>>>>>> is the average message length, and Reduct is the number of global
>>>>>>>> reductions. Each event like VecScatterBegin/End has a maximal execution
>>>>>>>> time over all processes and a max/min ratio. %T is sum(execution time
>>>>>>>> of the event on each process)/sum(execution time of the stage on each
>>>>>>>> process); it indicates how expensive the event is and is a number you
>>>>>>>> should pay attention to.
>>>>>>>> If your code is imbalanced (i.e., has a big max/min ratio), then the
>>>>>>>> performance numbers are skewed and become misleading, because some
>>>>>>>> processes are just waiting for others. In that case, besides -log_view,
>>>>>>>> you can add -log_sync, which adds an extra MPI_Barrier for each event
>>>>>>>> so that all processes start it at the same time. With that, the numbers
>>>>>>>> are easier to interpret. src/vec/vscat/examples/ex4.c is a tiny example
>>>>>>>> of VecScatter logging.
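>>>>>>>>
>>>>>>>> If you want to time a particular phase of your own code, you can also
>>>>>>>> register a user event, and it will show up as its own row in the same
>>>>>>>> table. A minimal sketch (the event and class names here are made up):
>>>>>>>>
>>>>>>>> #include <petscsys.h>
>>>>>>>>
>>>>>>>> int main(int argc, char **argv)
>>>>>>>> {
>>>>>>>>   PetscClassId  classid;
>>>>>>>>   PetscLogEvent halo_event;
>>>>>>>>
>>>>>>>>   PetscInitialize(&argc, &argv, NULL, NULL);
>>>>>>>>   PetscClassIdRegister("MyCode", &classid);
>>>>>>>>   PetscLogEventRegister("HaloExchange", classid, &halo_event);
>>>>>>>>
>>>>>>>>   PetscLogEventBegin(halo_event, 0, 0, 0, 0);
>>>>>>>>   /* ... the communication phase you want timed goes here ... */
>>>>>>>>   PetscLogEventEnd(halo_event, 0, 0, 0, 0);
>>>>>>>>
>>>>>>>>   PetscFinalize();  /* run with -log_view (and optionally -log_sync) */
>>>>>>>>   return 0;
>>>>>>>> }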
>>>>>>>>
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 20, 2019 at 2:58 PM Manuel Valera via petsc-users <
>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I am working on timing my model, which we made MPI-scalable using
>>>>>>>>> PETSc DMDAs. I want to know more about the output log and how to
>>>>>>>>> calculate total communication times for my runs. So far I see we have
>>>>>>>>> "MPI Messages" and "MPI Message Lengths" in the log, along with the
>>>>>>>>> VecScatterEnd and VecScatterBegin reports.
>>>>>>>>>
>>>>>>>>> My question is: how do I interpret these numbers to get a rough
>>>>>>>>> estimate of how much overhead comes just from MPI communication in my
>>>>>>>>> model runs?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: timing60m-032219.zip
Type: application/zip
Size: 142854 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20190322/744e0d0d/attachment-0001.zip>

