[petsc-users] Communication during MatAssemblyEnd
Ale Foggia
amfoggia at gmail.com
Mon Jun 24 06:19:29 CDT 2019
Yes, I used KNL nodes. If you can perform the test, that would be great. Could
it be that I'm not using the correct configuration for the KNL nodes? These are
the environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0
The code is at https://github.com/amfoggia/LSQuantumED and it has a readme
explaining how to compile and run it. For the test I used only 32 processes per
node, 1024 nodes in total, and nspins=38.
Thank you
On Fri, Jun 21, 2019 at 8:03 PM Zhang, Junchao (<jczhang at mcs.anl.gov>)
wrote:
> Ale,
> Did you use Intel KNL nodes? Mr. Hong (cc'ed) did experiments on KNL
> nodes one year ago. He used 32768 processors and called MatAssemblyEnd 118
> times, and it took only 1.5 seconds in total. So I guess something was
> wrong with your test. If you can share your code, I can run a test on our
> machine to see how it goes.
> Thanks.
> --Junchao Zhang
>
>
> On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang <jczhang at mcs.anl.gov>
> wrote:
>
>> MatAssembly was called once (in stage 5) and cost 2.5% of the total
>> time. Look at stage 5: MatAssemblyBegin calls BuildTwoSidedF, which does
>> a global synchronization. The high max/min ratio there means load
>> imbalance. What I do not understand is MatAssemblyEnd. Its ratio is 1.0,
>> which means the processors are already synchronized. With 32768 processors,
>> there are 1.2e+06 messages with an average length of 1.9e+06 bytes, so each
>> processor sends about 36 (1.2e+06/32768) ~2MB messages, and that takes 54
>> seconds. Another possible cause is the reductions at MatAssemblyEnd. I don't
>> know why it needs 8 reductions; in my mind, one is enough. I need to look at
>> the code.
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total     Count   %Total
>>  0:      Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  7.768e+06     13.1%  2.530e+02  13.0%
>>  1:    Create Basis: 7.9234e-02   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00      0.0%  0.000e+00   0.0%
>>  2:  Create Lattice: 8.3944e-05   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00      0.0%  0.000e+00   0.0%
>>  3:   Create Hamilt: 1.0694e+02   1.6%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00      0.0%  2.000e+00   0.1%
>>  5:         Offdiag: 1.6525e+02   2.5%  0.0000e+00   0.0%  1.188e+06   0.2%  1.942e+06      0.0%  8.000e+00   0.4%
>>  6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  7.768e+06     86.8%  1.674e+03  86.1%
>>
>> --- Event Stage 5: Offdiag
>> BuildTwoSidedF        1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0   0   0   0     0
>> MatAssemblyBegin      1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0   0   0   0     0
>> MatAssemblyEnd        1 1.0 5.3762e+01      1.0 0.00e+00 0.0 1.2e+06 1.9e+06 8.0e+00  1  0  0  0  0  33  0 100 100 100     0
>> VecSet                1 1.0 7.5533e-02      9.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0     0
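For context, stage breakdowns like "Offdiag" above come from user-registered log stages. A minimal sketch of wrapping the off-diagonal assembly in such a stage might look like the following; BuildOffdiag() is a hypothetical placeholder for the application routine that sets the locally owned entries, not code from this thread.

#include <petscmat.h>

/* Sketch: wrap a code region in its own -log_view stage so that its time,
   messages and reductions are reported separately (as the "Offdiag" stage
   is above). BuildOffdiag() is a hypothetical placeholder. */
extern PetscErrorCode BuildOffdiag(Mat A);

PetscErrorCode AssembleOffdiagWithStage(Mat A)
{
  PetscErrorCode ierr;
  PetscLogStage  stage;

  ierr = PetscLogStageRegister("Offdiag", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  ierr = BuildOffdiag(A);CHKERRQ(ierr);              /* MatSetValues on owned rows */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  return 0;
}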
>>
>>
>> --Junchao Zhang
>>
>>
>> On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <bsmith at mcs.anl.gov>
>> wrote:
>>
>>>
>>> The load balance is definitely out of whack.
>>>
>>>
>>>
>>> BuildTwoSidedF        1 1.0 1.6722e-02 41.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
>>> MatMult             138 1.0 2.6604e+02  7.4 3.19e+10 2.1 8.2e+07 7.8e+06 0.0e+00  2  4 13 13  0  15 25 100 100  0  2935476
>>> MatAssemblyBegin      1 1.0 1.6807e-02 36.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
>>> MatAssemblyEnd        1 1.0 3.5680e-01  3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
>>> VecNorm               2 1.0 4.4252e+01 74.8 1.73e+07 1.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  0   5  0   0   0  1    12780
>>> VecCopy               6 1.0 6.5655e-02  2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
>>> VecAXPY               2 1.0 1.3793e-02  2.7 1.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0 41000838
>>> VecScatterBegin     138 1.0 1.1653e+02 85.8 0.00e+00 0.0 8.2e+07 7.8e+06 0.0e+00  1  0 13 13  0   4  0 100 100  0        0
>>> VecScatterEnd       138 1.0 1.3653e+02 22.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   4  0   0   0  0        0
>>> VecSetRandom          1 1.0 9.6668e-01  2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
>>>
>>> Note that VecCopy/AXPY/SetRandom, which are all embarrassingly parallel,
>>> have a balance ratio above 2, which means some processes have more than
>>> twice the work of others. Meanwhile the ratio for anything with
>>> communication is extremely imbalanced: some processes get to the
>>> synchronization point well before other processes.
>>>
>>> The first thing I would do is worry about the load imbalance: what is
>>> its cause? Is it one process with much less work than the others (not
>>> great, but not terrible), one process with much more work than the others
>>> (terrible), or something in between? I think once you get a handle on the
>>> load balance the rest may fall into place; otherwise we still have some
>>> exploring to do. This is not expected behavior for a good machine with a
>>> good network and a well-balanced job. After you understand the load
>>> balancing you may need to use one of the parallel performance visualization
>>> tools to see why the synchronization is out of whack.
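One way to get a first handle on that imbalance is to compare per-rank row and nonzero counts. Below is a minimal sketch, not from this thread; the function name is illustrative and "A" stands for the already assembled matrix (e.g. the Hamiltonian).

#include <petscmat.h>

/* Sketch: report the spread of locally owned rows and stored nonzeros across
   ranks, to see whether one process has much more work than the others. */
PetscErrorCode ReportMatBalance(Mat A)
{
  PetscErrorCode ierr;
  PetscInt       rstart, rend;
  MatInfo        info;
  double         loc[2], mn[2], mx[2];
  MPI_Comm       comm;

  ierr = PetscObjectGetComm((PetscObject)A, &comm);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);

  loc[0] = (double)(rend - rstart);  /* locally owned rows      */
  loc[1] = (double)info.nz_used;     /* locally stored nonzeros */
  ierr = MPI_Allreduce(loc, mn, 2, MPI_DOUBLE, MPI_MIN, comm);CHKERRQ(ierr);
  ierr = MPI_Allreduce(loc, mx, 2, MPI_DOUBLE, MPI_MAX, comm);CHKERRQ(ierr);

  ierr = PetscPrintf(comm, "rows min/max: %g / %g   nnz min/max: %g / %g\n",
                     mn[0], mx[0], mn[1], mx[1]);CHKERRQ(ierr);
  return 0;
}

A large max/min gap in either number would point to the matrix distribution itself as the source of the imbalance.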
>>>
>>> Good luck
>>>
>>> Barry
>>>
>>>
>>> > On Jun 21, 2019, at 9:27 AM, Ale Foggia <amfoggia at gmail.com> wrote:
>>> >
>>> > I'm sending one with a bit less time.
>>> > I'm timing the functions also with std::chrono, and for the case of
>>> > 180 seconds the program runs out of memory (and crashes) before the PETSc
>>> > log gets printed, so I know the time only from my own timer. Anyway, in
>>> > every case, the times from std::chrono and the PETSc log match.
>>> >
>>> > (The large times are in the part "4b- Building offdiagonal part", i.e.
>>> > "Event Stage 5: Offdiag".)
>>> >
>>> > On Fri, Jun 21, 2019 at 4:09 PM Zhang, Junchao (<jczhang at mcs.anl.gov>)
>>> > wrote:
>>> >
>>> >
>>> > On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <amfoggia at gmail.com> wrote:
>>> > Thank you both for your answers,
>>> >
>>> > On Thu, Jun 20, 2019 at 10:20 PM Smith, Barry F. (<bsmith at mcs.anl.gov>)
>>> > wrote:
>>> >
>>> > Note that this is a one-time cost if the nonzero structure of the
>>> > matrix stays the same; it will not happen in future MatAssemblies.
>>> >
>>> > > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>> > >
>>> > > Those messages were used to build the MatMult communication pattern for
>>> > > the matrix. They were not the passing of matrix entries that you imagined,
>>> > > but they do indeed happen in MatAssemblyEnd. If you want to make sure
>>> > > processors do not set remote entries, you can use
>>> > > MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an
>>> > > error when an off-proc entry is set.
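For reference, a minimal sketch of setting that option (assuming an already created parallel matrix; the function name is illustrative):

#include <petscmat.h>

/* Sketch: tell PETSc that no rank will ever set entries in rows owned by
   another rank. With this option, MatSetValues() on an off-process row
   generates an error instead of buffering the value to be shipped during
   assembly. "A" is a placeholder for an already created parallel matrix. */
PetscErrorCode ForbidOffProcEntries(Mat A)
{
  PetscErrorCode ierr;

  ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
  return 0;
}

Besides catching mistakes, setting this option also lets the MatAssembly routines skip the reductions they would otherwise need for off-process values, which can matter at large process counts.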
>>> >
>>> > I started being concerned about this when I saw that the assembly was
>>> > taking a few hundred seconds in my code, like 180 seconds, which for me
>>> > is a considerable time. Do you think (or maybe you need more information
>>> > to answer this) that this time is "reasonable" for communicating the
>>> > pattern for the matrix? I already checked that I'm not setting any remote
>>> > entries.
>>> > It is not reasonable. Could you send the log view of the test with the
>>> > 180-second MatAssembly?
>>> >
>>> > Also, I see (in my code) that even though no messages are passed during
>>> > MatAssemblyBegin, it still takes time and the "ratio" is very large.
>>> >
>>> > >
>>> > >
>>> > > --Junchao Zhang
>>> > >
>>> > >
>>> > > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>> > > Hello all!
>>> > >
>>> > > During the conference I showed you a problem happening during
>>> > > MatAssemblyEnd in a particular code that I have. Now, I tried the same with
>>> > > a simple code (a symmetric problem corresponding to the Laplacian operator
>>> > > in 1D, from the SLEPc Hands-On exercises). As I understand it (and please
>>> > > correct me if I'm wrong), in this case the elements of the matrix are
>>> > > computed locally by each process, so there should not be any communication
>>> > > during the assembly. However, the log shows that messages are being passed.
>>> > > Also, the number of messages changes with the number of processes used and
>>> > > the size of the matrix. Could you please help me understand this?
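For reference, a minimal sketch of such purely local assembly of the 1D Laplacian (a generic PETSc sketch with each rank setting only the rows it owns, not the code attached to this thread; preallocation is omitted for brevity):

#include <petscmat.h>

/* Sketch: assemble the 1D Laplacian so that every rank only sets the rows it
   owns; no matrix entries need to be communicated in MatAssemblyEnd(). */
static PetscErrorCode BuildLaplacian1D(MPI_Comm comm, PetscInt n, Mat *A)
{
  PetscErrorCode ierr;
  PetscInt       i, rstart, rend;

  ierr = MatCreate(comm, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(*A);CHKERRQ(ierr);
  ierr = MatSetUp(*A);CHKERRQ(ierr);  /* default preallocation, fine for a tridiagonal sketch */

  ierr = MatGetOwnershipRange(*A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {   /* locally owned rows only */
    if (i > 0)   { ierr = MatSetValue(*A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    if (i < n-1) { ierr = MatSetValue(*A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    ierr = MatSetValue(*A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  return 0;
}

Even with purely local SetValues like this, the log can still show messages in MatAssemblyEnd; as explained earlier in the thread, those are the one-time setup of the MatMult communication pattern, not matrix entries being passed.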
>>> > >
>>> > > I attach the code I used and the log I get for a small problem.
>>> > >
>>> > > Cheers,
>>> > > Ale
>>> > >
>>> >
>>> > <log.txt>
>>>
>>>