[petsc-users] Communication during MatAssemblyEnd
Ale Foggia
amfoggia at gmail.com
Mon Jul 1 04:10:03 CDT 2019
Oh, I also got the same error when I switched to the newest version of
SLEPc (using OpenBLAS), and I don't know where it is coming from.
Can you tell me which versions of SLEPc and PETSc you are using? And are
you using MKL?
Thanks for trying :)
On Fri, Jun 28, 2019 at 16:57, Zhang, Junchao (<jczhang at mcs.anl.gov>)
wrote:
> Ran with 64 nodes and 32 ranks/node, hit SLEPc errors, and did not know
> how to proceed :(
>
> [363]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [363]PETSC ERROR: Error in external library
> [363]PETSC ERROR: Error in LAPACK subroutine steqr: info=0
> [363]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [363]PETSC ERROR: Petsc Development GIT revision: v3.11.2-1052-gf1480a5c
> GIT Date: 2019-06-22 21:39:54 +0000
> [363]PETSC ERROR: /tmp/main.x on a arch-cray-xc40-knl-opt named nid03387
> by jczhang Fri Jun 28 07:26:59 2019
> [1225]PETSC ERROR: #2 DSSolve() line 586 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [1225]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [1225]PETSC ERROR: #4 EPSSolve() line 149 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
> [240]PETSC ERROR: #2 DSSolve() line 586 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [240]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [240]PETSC ERROR: #4 EPSSolve() line 149 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
>
> --Junchao Zhang
>
>
> On Fri, Jun 28, 2019 at 4:02 AM Ale Foggia <amfoggia at gmail.com> wrote:
>
>> Junchao,
>> I'm sorry for the late response.
>>
>> On Wed, Jun 26, 2019 at 16:39, Zhang, Junchao (<jczhang at mcs.anl.gov>)
>> wrote:
>>
>>> Ale,
>>> The job got a chance to run but failed with out-of-memory, "Some of your
>>> processes may have been killed by the cgroup out-of-memory handler."
>>>
>>
>> I mentioned that I used 1024 nodes and 32 processes on each node because
>> the application needs a lot of memory. I think that for a system of size
>> 38 one needs more than 256 nodes for sure (assuming only 32 procs per node).
>> I would try with 512 if possible.
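>> As a rough sketch of why (assuming the relevant basis is the
>> zero-magnetization sector of 38 spin-1/2 sites, which may not match the
>> code exactly):
>>
>>   \dim = \binom{38}{19} \approx 3.5\times 10^{10},\qquad
>>   3.5\times 10^{10} \times 8\ \mathrm{B} \approx 280\ \mathrm{GB}\ \text{per real vector},
>>
>> so ~16 Krylov vectors are already ~4.5 TB, and the Hamiltonian (tens of
>> nonzeros per row, 64-bit indices) is several times that again; with roughly
>> 96 GB of DDR per KNL node that lands in the few-hundred-node range.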
>>
>> I also tried with 128 cores with ./main.x 2 ... and got a weird error
>>> message "The size of the basis has to be at least equal to the number
>>> of MPI processes used."
>>>
>>
>> The error comes from the fact that you put a system size of only 2, which
>> is too small.
>> I can also see the problem in the assembly with system sizes smaller than
>> 38, so you can try with, say, 30 (for which I also have a log). In that case
>> I ran with 64 nodes and 32 processes per node. I think the problem may also
>> fit in 32 nodes.
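>> Something along these lines should reproduce it (the srun flags here are an
>> assumption about the Cori/Slurm setup; the solver options are the ones from
>> my log):
>>
>>   srun -N 64 --ntasks-per-node=32 ./main.x 30 -nn -j1 1.0 -d1 1.0 \
>>     -eps_type krylovschur -eps_tol 1e-9 -log_view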
>>
>> --Junchao Zhang
>>>
>>>
>>> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang <jczhang at mcs.anl.gov>
>>> wrote:
>>>
>>>> Ale,
>>>> I successfully built your code and submitted a job to the NERSC Cori
>>>> machine requesting 32768 KNL cores and one and a half hours. It is estimated
>>>> to run in 3 days. If you also observed the same problem with fewer cores,
>>>> what are your input arguments? Currently, I use what is in your log file:
>>>> ./main.x 38 -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9
>>>> -log_view
>>>> The smaller the better. Thanks.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia <amfoggia at gmail.com> wrote:
>>>>
>>>>> Yes, I used KNL nodes. If you can perform the test, that would be great.
>>>>> Could it be that I'm not using the correct configuration of the KNL nodes?
>>>>> These are the environment variables I set:
>>>>> MKL_NUM_THREADS=1
>>>>> OMP_NUM_THREADS=1
>>>>> KMP_HW_SUBSET=1t
>>>>> KMP_AFFINITY=compact
>>>>> I_MPI_PIN_DOMAIN=socket
>>>>> I_MPI_PIN_PROCESSOR_LIST=0-63
>>>>> MKL_DYNAMIC=0
>>>>>
>>>>> The code is in https://github.com/amfoggia/LSQuantumED and it has a
>>>>> readme explaining how to compile and run it. When I ran the test I used only
>>>>> 32 processes per node, 1024 nodes in total, and it was for nspins=38.
>>>>> Thank you
>>>>>
>>>>> On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao (<
>>>>> jczhang at mcs.anl.gov>) wrote:
>>>>>
>>>>>> Ale,
>>>>>> Did you use Intel KNL nodes? Mr. Hong (cc'ed) did experiments on
>>>>>> KNL nodes one year ago. He used 32768 processors and called MatAssemblyEnd
>>>>>> 118 times, and it took only 1.5 seconds in total. So I guess something was
>>>>>> wrong with your test. If you can share your code, I can run a test on our
>>>>>> machine to see how it goes.
>>>>>> Thanks.
>>>>>> --Junchao Zhang
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang <jczhang at mcs.anl.gov>
>>>>>> wrote:
>>>>>>
>>>>>>> MatAssembly was called once (in stage 5) and cost 2.5% of the total
>>>>>>> time. Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF,
>>>>>>> which does global synchronization. The high max/min ratio means load
>>>>>>> imbalance. What I do not understand is MatAssemblyEnd. The ratio is 1.0. It
>>>>>>> means processors are already synchronized. With 32768 processors, there are
>>>>>>> 1.2e+06 messages with average length 1.9e+06 bytes. So each processor sends
>>>>>>> about 36 (1.2e+06/32768) ~2MB messages, and it takes 54 seconds. Another
>>>>>>> possibility is the reduction at MatAssemblyEnd. I don't know why it needs 8
>>>>>>> reductions. In my mind, one is enough. I need to look at the code.
>>>>>>>
>>>>>>> Summary of Stages: ----- Time ------ ----- Flop ------ ---
>>>>>>> Messages --- -- Message Lengths -- -- Reductions --
>>>>>>> Avg %Total Avg %Total Count
>>>>>>> %Total Avg %Total Count %Total
>>>>>>> 0: Main Stage: 8.5045e+02 13.0% 3.0633e+15 14.0% 8.196e+07
>>>>>>> 13.1% 7.768e+06 13.1% 2.530e+02 13.0%
>>>>>>> 1: Create Basis: 7.9234e-02 0.0% 0.0000e+00 0.0% 0.000e+00
>>>>>>> 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
>>>>>>> 2: Create Lattice: 8.3944e-05 0.0% 0.0000e+00 0.0% 0.000e+00
>>>>>>> 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
>>>>>>> 3: Create Hamilt: 1.0694e+02 1.6% 0.0000e+00 0.0% 0.000e+00
>>>>>>> 0.0% 0.000e+00 0.0% 2.000e+00 0.1%
>>>>>>> 5: Offdiag: 1.6525e+02 2.5% 0.0000e+00 0.0% 1.188e+06
>>>>>>> 0.2% 1.942e+06 0.0% 8.000e+00 0.4%
>>>>>>> 6: Phys quantities: 5.4045e+03 82.8% 1.8866e+16 86.0% 5.417e+08
>>>>>>> 86.7% 7.768e+06 86.8% 1.674e+03 86.1%
>>>>>>>
>>>>>>> --- Event Stage 5: Offdiag
>>>>>>> BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0
>>>>>>> 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 28 0 0 0 0 0
>>>>>>> MatAssemblyBegin 1 1.0 7.1565e+01 127783.7 0.00e+00 0.0
>>>>>>> 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 28 0 0 0 0 0
>>>>>>> MatAssemblyEnd 1 1.0 5.3762e+01 1.0 0.00e+00 0.0
>>>>>>> 1.2e+06 1.9e+06 8.0e+00 1 0 0 0 0 33 0100100100 0
>>>>>>> VecSet 1 1.0 7.5533e-02 9.0 0.00e+00 0.0
>>>>>>> 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
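>>>>>>> (For reference, the named stages such as "Offdiag" in this log come from
>>>>>>> user-registered log stages. A minimal sketch of how such a stage is
>>>>>>> typically produced, not necessarily exactly what LSQuantumED does:
>>>>>>>
>>>>>>>   PetscLogStage stage;
>>>>>>>   ierr = PetscLogStageRegister("Offdiag", &stage);CHKERRQ(ierr);
>>>>>>>   ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>>>>>>>   /* ... assemble the off-diagonal part of the matrix ... */
>>>>>>>   ierr = PetscLogStagePop();CHKERRQ(ierr);
>>>>>>>
>>>>>>> Everything executed between the push and the pop is attributed to that
>>>>>>> stage in -log_view.)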
>>>>>>>
>>>>>>>
>>>>>>> --Junchao Zhang
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <bsmith at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> The load balance is definitely out of whack.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> MatMult 138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07
>>>>>>>> 7.8e+06 0.0e+00 2 4 13 13 0 15 25100100 0 2935476
>>>>>>>> MatAssemblyBegin 1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> VecNorm 2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00
>>>>>>>> 0.0e+00 2.0e+00 1 0 0 0 0 5 0 0 0 1 12780
>>>>>>>> VecCopy 6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> VecAXPY 2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 41000838
>>>>>>>> VecScatterBegin 138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07
>>>>>>>> 7.8e+06 0.0e+00 1 0 13 13 0 4 0100100 0 0
>>>>>>>> VecScatterEnd 138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 1 0 0 0 0 4 0 0 0 0 0
>>>>>>>> VecSetRandom 1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>>
>>>>>>>> Note that VecCopy/AXPY/SetRandom, which are all embarrassingly
>>>>>>>> parallel, have a balance ratio above 2, which means some processes have more
>>>>>>>> than twice the work of others. Meanwhile, the ratio for anything with
>>>>>>>> communication is extremely imbalanced: some processes get to the
>>>>>>>> synchronization point well before other processes.
>>>>>>>>
>>>>>>>> The first thing I would do is worry about the load imbalance. What
>>>>>>>> is its cause? Is it one process with much less work than the others (not great
>>>>>>>> but not terrible), one process with much more work than the others
>>>>>>>> (terrible), or something in between? I think once you get a handle on the
>>>>>>>> load balance the rest may fall into place; otherwise we still have some
>>>>>>>> exploring to do. This is not expected behavior for a good machine with a
>>>>>>>> good network and a well-balanced job. After you understand the load
>>>>>>>> balancing you may need to use one of the parallel performance visualization
>>>>>>>> tools to see why the synchronization is out of whack.
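>>>>>>>> One quick first look, just a sketch (the helper below is something I am
>>>>>>>> suggesting, not something already in your code): compare the per-process
>>>>>>>> row count and local nonzeros right after assembly, for example
>>>>>>>>
>>>>>>>>   #include <petscmat.h>
>>>>>>>>
>>>>>>>>   /* Hypothetical helper: call it after MatAssemblyEnd() on the matrix */
>>>>>>>>   PetscErrorCode ReportMatBalance(Mat A)
>>>>>>>>   {
>>>>>>>>     PetscErrorCode ierr;
>>>>>>>>     PetscInt       rstart, rend;
>>>>>>>>     MatInfo        info;
>>>>>>>>     PetscLogDouble vals[2], vmin[2], vmax[2];
>>>>>>>>
>>>>>>>>     PetscFunctionBeginUser;
>>>>>>>>     ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
>>>>>>>>     ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);
>>>>>>>>     vals[0] = (PetscLogDouble)(rend - rstart);  /* local rows     */
>>>>>>>>     vals[1] = info.nz_used;                     /* local nonzeros */
>>>>>>>>     ierr = MPI_Allreduce(vals, vmin, 2, MPI_DOUBLE, MPI_MIN, PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>     ierr = MPI_Allreduce(vals, vmax, 2, MPI_DOUBLE, MPI_MAX, PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,
>>>>>>>>                        "rows  min %g  max %g\nnnz   min %g  max %g\n",
>>>>>>>>                        vmin[0], vmax[0], vmin[1], vmax[1]);CHKERRQ(ierr);
>>>>>>>>     PetscFunctionReturn(0);
>>>>>>>>   }
>>>>>>>>
>>>>>>>> If either max/min is far from 1, the distribution of basis states across
>>>>>>>> processes is the first place I would look.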
>>>>>>>>
>>>>>>>> Good luck
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>
>>>>>>>> > On Jun 21, 2019, at 9:27 AM, Ale Foggia <amfoggia at gmail.com>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > I'm sending one with a bit less time.
>>>>>>>> > I'm also timing the functions with std::chrono, and for the case
>>>>>>>> of 180 seconds the program runs out of memory (and crashes) before the
>>>>>>>> PETSc log gets printed, so I know the time only from my function.
>>>>>>>> Anyway, in every case, the times from std::chrono and the PETSc log
>>>>>>>> match.
>>>>>>>> >
>>>>>>>> > (The large times are in part "4b- Building offdiagonal part" or
>>>>>>>> "Event Stage 5: Offdiag").
>>>>>>>> >
>>>>>>>> > On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao (<
>>>>>>>> jczhang at mcs.anl.gov>) wrote:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <amfoggia at gmail.com>
>>>>>>>> wrote:
>>>>>>>> > Thanks to both of you for your answers,
>>>>>>>> >
>>>>>>>> > On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. (<
>>>>>>>> bsmith at mcs.anl.gov>) wrote:
>>>>>>>> >
>>>>>>>> > Note that this is a one-time cost if the nonzero structure of
>>>>>>>> the matrix stays the same. It will not happen in future MatAssemblies.
>>>>>>>> >
>>>>>>>> > > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users <
>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>> > >
>>>>>>>> > > Those messages were used to build the MatMult communication pattern
>>>>>>>> for the matrix. They were not the passing of matrix entries that you
>>>>>>>> imagined, but they did happen in MatAssemblyEnd. If you want to make sure
>>>>>>>> processors do not set remote entries, you can use
>>>>>>>> MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an
>>>>>>>> error when an off-proc entry is set.
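>>>>>>>> A minimal sketch of what that looks like (this mirrors the 1D Laplacian
>>>>>>>> from the SLEPc hands-on exercises, not your application code, and uses the
>>>>>>>> usual ierr/CHKERRQ error checking):
>>>>>>>>
>>>>>>>>   /* Sketch only: a 1D Laplacian stand-in, not the LSQuantumED Hamiltonian */
>>>>>>>>   #include <petscmat.h>
>>>>>>>>
>>>>>>>>   int main(int argc, char **argv)
>>>>>>>>   {
>>>>>>>>     Mat            A;
>>>>>>>>     PetscInt       n = 1000, i, Istart, Iend;
>>>>>>>>     PetscErrorCode ierr;
>>>>>>>>
>>>>>>>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>>>>>>>>     ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>>>>>>>>     ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
>>>>>>>>     ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>>>>>>>>     ierr = MatSetUp(A);CHKERRQ(ierr);
>>>>>>>>     /* Promise that no rank sets entries in rows owned by another rank;
>>>>>>>>        PETSc errors out if the promise is broken and can skip the
>>>>>>>>        assembly-time reductions for stashed off-process values. */
>>>>>>>>     ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
>>>>>>>>     ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
>>>>>>>>     for (i = Istart; i < Iend; i++) {  /* only locally owned rows */
>>>>>>>>       if (i > 0)     {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>>>>       if (i < n - 1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>>>>       ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>>>>>>>>     }
>>>>>>>>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>>>>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>>>>     ierr = MatDestroy(&A);CHKERRQ(ierr);
>>>>>>>>     ierr = PetscFinalize();
>>>>>>>>     return ierr;
>>>>>>>>   }
>>>>>>>>
>>>>>>>> If a rank then tries to set a row it does not own, PETSc stops with an
>>>>>>>> error instead of silently shipping the value during assembly.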
>>>>>>>> >
>>>>>>>> > I started being concerned about this when I saw that the assembly
>>>>>>>> was taking a few hundred seconds in my code, like 180 seconds, which
>>>>>>>> for me is a considerable time. Do you think (or maybe you need more
>>>>>>>> information to answer this) that this time is "reasonable" for
>>>>>>>> communicating the pattern for the matrix? I already checked that I'm not
>>>>>>>> setting any remote entries.
>>>>>>>> > It is not reasonable. Could you send the log view of the test with
>>>>>>>> the 180-second MatAssembly?
>>>>>>>> >
>>>>>>>> > Also, I see (in my code) that even when there are no messages being
>>>>>>>> passed during MatAssemblyBegin, it takes time and the "ratio" is very
>>>>>>>> big.
>>>>>>>> >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > --Junchao Zhang
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users <
>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>> > > Hello all!
>>>>>>>> > >
>>>>>>>> > > During the conference I showed you a problem happening during
>>>>>>>> MatAssemblyEnd in a particular code that I have. Now, I tried the same with
>>>>>>>> a simple code (a symmetric problem corresponding to the Laplacian operator
>>>>>>>> in 1D, from the SLEPc Hands-On exercises). As I understand it (and please
>>>>>>>> correct me if I'm wrong), in this case the elements of the matrix are
>>>>>>>> computed locally by each process, so there should not be any communication
>>>>>>>> during the assembly. However, in the log I see that messages are
>>>>>>>> being passed. Also, the number of messages changes with the number of
>>>>>>>> processes used and the size of the matrix. Could you please help me
>>>>>>>> understand this?
>>>>>>>> > >
>>>>>>>> > > I attach the code I used and the log I get for a small problem.
>>>>>>>> > >
>>>>>>>> > > Cheers,
>>>>>>>> > > Ale
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> > <log.txt>
>>>>>>>>
>>>>>>>>