[petsc-users] Communication during MatAssemblyEnd
Ale Foggia
amfoggia at gmail.com
Fri Jun 28 04:01:13 CDT 2019
Junchao,
I'm sorry for the late response.
On Wed, Jun 26, 2019 at 16:39, Zhang, Junchao (<jczhang at mcs.anl.gov>)
wrote:
> Ale,
> The job got a chance to run but failed with out-of-memory, "Some of your
> processes may have been killed by the cgroup out-of-memory handler."
>
I mentioned that I used 1024 nodes and 32 processes on each node because
the application needs a lot of memory. I think that for a system of size
38 one needs more than 256 nodes for sure (assuming only 32 processes per
node). I would try with 512 if possible.
> I also tried with 128 cores with ./main.x 2 ... and got a weird error
> message "The size of the basis has to be at least equal to the number
> of MPI processes used."
>
The error comes from the fact that you set a system size of only 2, which
is too small.
I can also see the problem in the assembly with system sizes smaller than
38, so you can try with, say, 30 (for which I also have a log). In that case
I ran with 64 nodes and 32 processes per node. I think the problem may also
fit on 32 nodes.
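For example, you could use the same flags as in my log, just with the smaller
system size (adjust the solver options if you prefer):
./main.x 30 -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view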
> --Junchao Zhang
>
>
> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang <jczhang at mcs.anl.gov>
> wrote:
>
>> Ale,
>> I successfully built your code and submitted a job to the NERSC Cori
>> machine, requesting 32768 KNL cores and one and a half hours; it is
>> estimated to start running in 3 days. If you also observed the same problem
>> with fewer cores, what are your input arguments? Currently, I use what is in
>> your log file:
>> ./main.x 38 -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9
>> -log_view
>> The smaller the better. Thanks.
>> --Junchao Zhang
>>
>>
>> On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia <amfoggia at gmail.com> wrote:
>>
>>> Yes, I used KNL nodes. If you can perform the test, that would be great.
>>> Could it be that I'm not using the correct configuration of the KNL nodes?
>>> These are the environment variables I set:
>>> MKL_NUM_THREADS=1
>>> OMP_NUM_THREADS=1
>>> KMP_HW_SUBSET=1t
>>> KMP_AFFINITY=compact
>>> I_MPI_PIN_DOMAIN=socket
>>> I_MPI_PIN_PROCESSOR_LIST=0-63
>>> MKL_DYNAMIC=0
>>>
>>> The code is at https://github.com/amfoggia/LSQuantumED and it has a
>>> readme explaining how to compile and run it. When I ran the test I used only
>>> 32 processes per node, 1024 nodes in total, and nspins=38.
>>> Thank you
>>>
>>> On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao (<jczhang at mcs.anl.gov>)
>>> wrote:
>>>
>>>> Ale,
>>>> Did you use Intel KNL nodes? Mr. Hong (cc'ed) did experiments on KNL
>>>> nodes one year ago. He used 32768 processors and called MatAssemblyEnd 118
>>>> times, and it took only 1.5 seconds in total. So I guess something was
>>>> wrong with your test. If you can share your code, I can run a test on our
>>>> machine to see how it goes.
>>>> Thanks.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang <jczhang at mcs.anl.gov>
>>>> wrote:
>>>>
>>>>> MatAssembly was called once (in stage 5) and cost 2.5% of the total
>>>>> time. Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF,
>>>>> which does global synchronization. The high max/min ratio means load
>>>>> imbalance. What I do not understand is MatAssemblyEnd. Its ratio is 1.0,
>>>>> which means the processors are already synchronized. With 32768 processors,
>>>>> there are 1.2e+06 messages with an average length of 1.9e+06 bytes. So each
>>>>> processor sends about 36 (1.2e+06/32768) ~2MB messages, and it takes 54
>>>>> seconds. Another possibility is the reductions at MatAssemblyEnd. I don't
>>>>> know why it needs 8 reductions; in my mind, one is enough. I need to look
>>>>> at the code.
>>>>>
>>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg       %Total    Count   %Total
>>>>>  0:      Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  7.768e+06   13.1%  2.530e+02  13.0%
>>>>>  1:    Create Basis: 7.9234e-02   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00    0.0%  0.000e+00   0.0%
>>>>>  2:  Create Lattice: 8.3944e-05   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00    0.0%  0.000e+00   0.0%
>>>>>  3:   Create Hamilt: 1.0694e+02   1.6%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00    0.0%  2.000e+00   0.1%
>>>>>  5:         Offdiag: 1.6525e+02   2.5%  0.0000e+00   0.0%  1.188e+06   0.2%  1.942e+06    0.0%  8.000e+00   0.4%
>>>>>  6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  7.768e+06   86.8%  1.674e+03  86.1%
>>>>>
>>>>> --- Event Stage 5: Offdiag
>>>>> BuildTwoSidedF       1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0  0  0  0     0
>>>>> MatAssemblyBegin     1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0  0  0  0     0
>>>>> MatAssemblyEnd       1 1.0 5.3762e+01 1.0      0.00e+00 0.0 1.2e+06 1.9e+06 8.0e+00  1  0  0  0  0  33  0 100 100 100   0
>>>>> VecSet               1 1.0 7.5533e-02 9.0      0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <bsmith at mcs.anl.gov>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> The load balance is definitely out of whack.
>>>>>>
>>>>>>
>>>>>>
>>>>>> BuildTwoSidedF       1 1.0 1.6722e-02 41.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0        0
>>>>>> MatMult            138 1.0 2.6604e+02  7.4 3.19e+10 2.1 8.2e+07 7.8e+06 0.0e+00  2  4 13 13  0  15 25 100 100  0  2935476
>>>>>> MatAssemblyBegin     1 1.0 1.6807e-02 36.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0        0
>>>>>> MatAssemblyEnd       1 1.0 3.5680e-01  3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0        0
>>>>>> VecNorm              2 1.0 4.4252e+01 74.8 1.73e+07 1.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  0   5  0  0  0  1    12780
>>>>>> VecCopy              6 1.0 6.5655e-02  2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0        0
>>>>>> VecAXPY              2 1.0 1.3793e-02  2.7 1.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
>>>>>> VecScatterBegin    138 1.0 1.1653e+02 85.8 0.00e+00 0.0 8.2e+07 7.8e+06 0.0e+00  1  0 13 13  0   4  0 100 100  0        0
>>>>>> VecScatterEnd      138 1.0 1.3653e+02 22.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   4  0  0  0  0        0
>>>>>> VecSetRandom         1 1.0 9.6668e-01  2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0        0
>>>>>>
>>>>>> Note that VecCopy/AXPY/SetRandom, which are all embarrassingly
>>>>>> parallel, have a balance ratio above 2, which means some processes have
>>>>>> more than twice the work of others. Meanwhile the ratio for anything with
>>>>>> communication is extremely imbalanced: some processes get to the
>>>>>> synchronization point well before other processes.
>>>>>>
>>>>>> The first thing I would do is worry about the load imbalance. What is
>>>>>> its cause? Is it one process with much less work than the others (not great
>>>>>> but not terrible), or one process with much more work than the others
>>>>>> (terrible), or something in between? I think once you get a handle on the
>>>>>> load balance the rest may fall into place; otherwise we still have some
>>>>>> exploring to do. This is not expected behavior for a good machine with a
>>>>>> good network and a well-balanced job. After you understand the load
>>>>>> balancing you may need to use one of the parallel performance visualization
>>>>>> tools to see why the synchronization is out of whack.
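As a first quick check of that distribution, here is a minimal sketch (an
illustration only, assuming the imbalance tracks the locally stored part of
the matrix, which may not be the case; PrintLocalLoad is a hypothetical helper,
not part of the code in this thread) that prints the per-rank rows and nonzeros:

#include <petscmat.h>

/* Print how many rows and stored nonzeros each rank holds, to get a first
   idea of how the matrix (and hence the MatMult work) is distributed. */
PetscErrorCode PrintLocalLoad(Mat A)
{
  PetscErrorCode ierr;
  PetscInt       rstart,rend;
  MatInfo        info;
  PetscMPIInt    rank;
  MPI_Comm       comm;

  ierr = PetscObjectGetComm((PetscObject)A,&comm);CHKERRQ(ierr);
  ierr = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A,&rstart,&rend);CHKERRQ(ierr);   /* locally owned rows */
  ierr = MatGetInfo(A,MAT_LOCAL,&info);CHKERRQ(ierr);           /* locally stored nonzeros */
  ierr = PetscSynchronizedPrintf(comm,"[%d] local rows %D, local nonzeros %.0f\n",rank,rend-rstart,info.nz_used);CHKERRQ(ierr);
  ierr = PetscSynchronizedFlush(comm,PETSC_STDOUT);CHKERRQ(ierr);
  return 0;
}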
>>>>>>
>>>>>> Good luck
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>> > On Jun 21, 2019, at 9:27 AM, Ale Foggia <amfoggia at gmail.com> wrote:
>>>>>> >
>>>>>> > I'm sending one with a bit less time.
>>>>>> > I'm also timing the functions with std::chrono, and for the case of
>>>>>> 180 seconds the program runs out of memory (and crashes) before the PETSc
>>>>>> log gets printed, so I know the time only from my own timing. Anyway,
>>>>>> in every case, the times from std::chrono and the PETSc log match.
>>>>>> >
>>>>>> > (The large times are in part "4b- Building offdiagonal part" or
>>>>>> "Event Stage 5: Offdiag").
>>>>>> >
>>>>>> > On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao (<
>>>>>> jczhang at mcs.anl.gov>) wrote:
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <amfoggia at gmail.com>
>>>>>> wrote:
>>>>>> > Thanks both of you for your answers,
>>>>>> >
>>>>>> > On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. (<
>>>>>> bsmith at mcs.anl.gov>) wrote:
>>>>>> >
>>>>>> > Note that this is a one-time cost if the nonzero structure of the
>>>>>> matrix stays the same. It will not happen in future MatAssemblies.
>>>>>> >
>>>>>> > > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users <
>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>> > >
>>>>>> > > Those messages were used to build the MatMult communication pattern
>>>>>> for the matrix. They were not the matrix-entry passing you imagined, but
>>>>>> they did indeed happen in MatAssemblyEnd. If you want to make sure
>>>>>> processors do not set remote entries, you can use
>>>>>> MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an
>>>>>> error when an off-process entry is set.
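A minimal sketch of where such a call could go (illustrative names only; A is
assumed to be an already created and preallocated parallel matrix):

  ierr = MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE);CHKERRQ(ierr); /* error out on remote entries */
  /* ... MatSetValues() only on locally owned rows ... */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

As far as I understand, with this option set the MatAssembly routines can also
skip the reductions used to detect off-process entries.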
>>>>>> >
>>>>>> > I started being concerned about this when I saw that the assembly
>>>>>> was taking a few hundred seconds in my code, like 180 seconds, which
>>>>>> for me is a considerable time. Do you think (or maybe you need more
>>>>>> information to answer this) that this time is "reasonable" for
>>>>>> communicating the pattern for the matrix? I already checked that I'm not
>>>>>> setting any remote entries.
>>>>>> > It is not reasonable. Could you send the log view of that test with the
>>>>>> 180-second MatAssembly?
>>>>>> >
>>>>>> > Also, I see (in my code) that even though no messages are passed
>>>>>> during MatAssemblyBegin, it still takes time and the "ratio" is very
>>>>>> big.
>>>>>> >
>>>>>> > >
>>>>>> > >
>>>>>> > > --Junchao Zhang
>>>>>> > >
>>>>>> > >
>>>>>> > > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users <
>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>> > > Hello all!
>>>>>> > >
>>>>>> > > During the conference I showed you a problem happening during
>>>>>> MatAssemblyEnd in a particular code that I have. Now, I tried the same with
>>>>>> a simple code (a symmetric problem corresponding to the Laplacian operator
>>>>>> in 1D, from the SLEPc Hands-On exercises). As I understand (and please,
>>>>>> correct me if I'm wrong), in this case the elements of the matrix are
>>>>>> computed locally by each process so there should not be any communication
>>>>>> during the assembly. However, in the log I see that there are messages
>>>>>> being passed. Also, the number of messages changes with the number of
>>>>>> processes used and the size of the matrix. Could you please help me
>>>>>> understand this?
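For reference, a minimal sketch of the kind of local-only assembly meant here,
along the lines of the SLEPc Hands-On 1D Laplacian (the attached code may
differ in details; n is the global size and A the preallocated matrix):

  ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
  for (i=Istart;i<Iend;i++) {
    /* each process inserts entries only in the rows it owns */
    if (i>0)   { ierr = MatSetValue(A,i,i-1,-1.0,INSERT_VALUES);CHKERRQ(ierr); }
    if (i<n-1) { ierr = MatSetValue(A,i,i+1,-1.0,INSERT_VALUES);CHKERRQ(ierr); }
    ierr = MatSetValue(A,i,i,2.0,INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);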
>>>>>> > >
>>>>>> > > I attach the code I used and the log I get for a small problem.
>>>>>> > >
>>>>>> > > Cheers,
>>>>>> > > Ale
>>>>>> > >
>>>>>> >
>>>>>> > <log.txt>
>>>>>>
>>>>>>