[petsc-users] Communication during MatAssemblyEnd
Ale Foggia
amfoggia at gmail.com
Mon Jul 1 04:10:03 CDT 2019
Oh, I also got the same error when I switched to the newest version of
SLEPc (using OpenBLAS), and I don't know where it is coming from.
Can you tell me which versions of SLEPc and PETSc you are using? And are
you using MKL?
Thanks for trying :)
On Fri, Jun 28, 2019 at 16:57, Zhang, Junchao (<jczhang at mcs.anl.gov>)
wrote:
> Ran with 64 nodes and 32 ranks/node, hit SLEPc errors, and did not know
> how to proceed :(
>
> [363]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [363]PETSC ERROR: Error in external library
> [363]PETSC ERROR: Error in LAPACK subroutine steqr: info=0
> [363]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [363]PETSC ERROR: Petsc Development GIT revision: v3.11.2-1052-gf1480a5c
> GIT Date: 2019-06-22 21:39:54 +0000
> [363]PETSC ERROR: /tmp/main.x on a arch-cray-xc40-knl-opt named nid03387
> by jczhang Fri Jun 28 07:26:59 2019
> [1225]PETSC ERROR: #2 DSSolve() line 586 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [1225]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [1225]PETSC ERROR: #4 EPSSolve() line 149 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
> [240]PETSC ERROR: #2 DSSolve() line 586 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [240]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [240]PETSC ERROR: #4 EPSSolve() line 149 in
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
>
> --Junchao Zhang
>
>
> On Fri, Jun 28, 2019 at 4:02 AM Ale Foggia <amfoggia at gmail.com> wrote:
>
>> Junchao,
>> I'm sorry for the late response.
>>
>> On Wed, Jun 26, 2019 at 16:39, Zhang, Junchao (<jczhang at mcs.anl.gov>)
>> wrote:
>>
>>> Ale,
>>> The job got a chance to run but failed with out-of-memory, "Some of your
>>> processes may have been killed by the cgroup out-of-memory handler."
>>>
>>
>> I mentioned that I used 1024 nodes and 32 processes on each node because
>> the application needs a lot of memory. I think that for a system of size
>> 38 one needs more than 256 nodes for sure (assuming only 32 procs per node).
>> I would try with 512 if possible.
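>> As a rough sketch of why (assuming the relevant basis is the
>> zero-magnetization sector of 38 spin-1/2 sites, which may not match the
>> code exactly):
>>
>>   \dim = \binom{38}{19} \approx 3.5\times 10^{10},\qquad
>>   3.5\times 10^{10} \times 8\ \mathrm{B} \approx 280\ \mathrm{GB}\ \text{per real vector},
>>
>> so ~16 Krylov vectors are already ~4.5 TB, and the Hamiltonian (tens of
>> nonzeros per row, 64-bit indices) is several times that again; with roughly
>> 96 GB of DDR per KNL node that lands in the few-hundred-node range.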
>>
>> I also tried with 128 cores with ./main.x 2 ... and got a weird error
>>> message "The size of the basis has to be at least equal to the number
>>> of MPI processes used."
>>>
>>
>> The error comes from the fact that you put a system size of only 2, which
>> is too small.
>> I can also see the problem in the assembly with system sizes smaller than
>> 38, so you can try with, say, 30 (for which I also have a log). In that case
>> I ran with 64 nodes and 32 processes per node. I think the problem may also
>> fit in 32 nodes.
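>> Something along these lines should reproduce it (the srun flags here are an
>> assumption about the Cori/Slurm setup; the solver options are the ones from
>> my log):
>>
>>   srun -N 64 --ntasks-per-node=32 ./main.x 30 -nn -j1 1.0 -d1 1.0 \
>>     -eps_type krylovschur -eps_tol 1e-9 -log_view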
>>
>> --Junchao Zhang
>>>
>>>
>>> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang <jczhang at mcs.anl.gov>
>>> wrote:
>>>
>>>> Ale,
>>>> I successfully built your code and submitted a job to the NERSC Cori
>>>> machine requesting 32768 KNL cores and one and a half hours. It is estimated
>>>> to run in 3 days. If you also observed the same problem with fewer cores,
>>>> what are your input arguments? Currently, I use what is in your log file:
>>>> ./main.x 38 -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9
>>>> -log_view
>>>> The smaller the better. Thanks.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia <amfoggia at gmail.com> wrote:
>>>>
>>>>> Yes, I used KNL nodes. If you can perform the test, that would be great.
>>>>> Could it be that I'm not using the correct configuration of the KNL nodes?
>>>>> These are the environment variables I set:
>>>>> MKL_NUM_THREADS=1
>>>>> OMP_NUM_THREADS=1
>>>>> KMP_HW_SUBSET=1t
>>>>> KMP_AFFINITY=compact
>>>>> I_MPI_PIN_DOMAIN=socket
>>>>> I_MPI_PIN_PROCESSOR_LIST=0-63
>>>>> MKL_DYNAMIC=0
>>>>>
>>>>> The code is in https://github.com/amfoggia/LSQuantumED and it has a
>>>>> readme explaining how to compile and run it. When I ran the test I used only
>>>>> 32 processes per node, 1024 nodes in total, and it was for nspins=38.
>>>>> Thank you
>>>>>
>>>>> On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao (<
>>>>> jczhang at mcs.anl.gov>) wrote:
>>>>>
>>>>>> Ale,
>>>>>> Did you use Intel KNL nodes? Mr. Hong (cc'ed) did experiments on
>>>>>> KNL nodes one year ago. He used 32768 processors and called MatAssemblyEnd
>>>>>> 118 times, and it took only 1.5 seconds in total. So I guess something was
>>>>>> wrong with your test. If you can share your code, I can run a test on our
>>>>>> machine to see how it goes.
>>>>>> Thanks.
>>>>>> --Junchao Zhang
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang <jczhang at mcs.anl.gov>
>>>>>> wrote:
>>>>>>
>>>>>>> MatAssembly was called once (in stage 5) and cost 2.5% of the total
>>>>>>> time. Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF,
>>>>>>> which does global synchronization. The high max/min ratio means load
>>>>>>> imbalance. What I do not understand is MatAssemblyEnd. The ratio is 1.0. It
>>>>>>> means processors are already synchronized. With 32768 processors, there are
>>>>>>> 1.2e+06 messages with average length 1.9e+06 bytes. So each processor sends
>>>>>>> about 36 (1.2e+06/32768) ~2MB messages, and it takes 54 seconds. Another
>>>>>>> possibility is the reduction at MatAssemblyEnd. I don't know why it needs 8
>>>>>>> reductions. In my mind, one is enough. I need to look at the code.
>>>>>>>
>>>>>>> Summary of Stages: ----- Time ------ ----- Flop ------ ---
>>>>>>> Messages --- -- Message Lengths -- -- Reductions --
>>>>>>> Avg %Total Avg %Total Count
>>>>>>> %Total Avg %Total Count %Total
>>>>>>> 0: Main Stage: 8.5045e+02 13.0% 3.0633e+15 14.0% 8.196e+07
>>>>>>> 13.1% 7.768e+06 13.1% 2.530e+02 13.0%
>>>>>>> 1: Create Basis: 7.9234e-02 0.0% 0.0000e+00 0.0% 0.000e+00
>>>>>>> 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
>>>>>>> 2: Create Lattice: 8.3944e-05 0.0% 0.0000e+00 0.0% 0.000e+00
>>>>>>> 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
>>>>>>> 3: Create Hamilt: 1.0694e+02 1.6% 0.0000e+00 0.0% 0.000e+00
>>>>>>> 0.0% 0.000e+00 0.0% 2.000e+00 0.1%
>>>>>>> 5: Offdiag: 1.6525e+02 2.5% 0.0000e+00 0.0% 1.188e+06
>>>>>>> 0.2% 1.942e+06 0.0% 8.000e+00 0.4%
>>>>>>> 6: Phys quantities: 5.4045e+03 82.8% 1.8866e+16 86.0% 5.417e+08
>>>>>>> 86.7% 7.768e+06 86.8% 1.674e+03 86.1%
>>>>>>>
>>>>>>> --- Event Stage 5: Offdiag
>>>>>>> BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0
>>>>>>> 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 28 0 0 0 0 0
>>>>>>> MatAssemblyBegin 1 1.0 7.1565e+01 127783.7 0.00e+00 0.0
>>>>>>> 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 28 0 0 0 0 0
>>>>>>> MatAssemblyEnd 1 1.0 5.3762e+01 1.0 0.00e+00 0.0
>>>>>>> 1.2e+06 1.9e+06 8.0e+00 1 0 0 0 0 33 0100100100 0
>>>>>>> VecSet 1 1.0 7.5533e-02 9.0 0.00e+00 0.0
>>>>>>> 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
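>>>>>>> (For reference, the named stages such as "Offdiag" in this log come from
>>>>>>> user-registered log stages. A minimal sketch of how such a stage is
>>>>>>> typically produced, not necessarily exactly what LSQuantumED does:
>>>>>>>
>>>>>>>   PetscLogStage stage;
>>>>>>>   ierr = PetscLogStageRegister("Offdiag", &stage);CHKERRQ(ierr);
>>>>>>>   ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>>>>>>>   /* ... assemble the off-diagonal part of the matrix ... */
>>>>>>>   ierr = PetscLogStagePop();CHKERRQ(ierr);
>>>>>>>
>>>>>>> Everything executed between the push and the pop is attributed to that
>>>>>>> stage in -log_view.)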
>>>>>>>
>>>>>>>
>>>>>>> --Junchao Zhang
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <bsmith at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> The load balance is definitely out of whack.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> MatMult 138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07
>>>>>>>> 7.8e+06 0.0e+00 2 4 13 13 0 15 25100100 0 2935476
>>>>>>>> MatAssemblyBegin 1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> VecNorm 2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00
>>>>>>>> 0.0e+00 2.0e+00 1 0 0 0 0 5 0 0 0 1 12780
>>>>>>>> VecCopy 6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> VecAXPY 2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 41000838
>>>>>>>> VecScatterBegin 138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07
>>>>>>>> 7.8e+06 0.0e+00 1 0 13 13 0 4 0100100 0 0
>>>>>>>> VecScatterEnd 138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 1 0 0 0 0 4 0 0 0 0 0
>>>>>>>> VecSetRandom 1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>>
>>>>>>>> Note that VecCopy/AXPY/SetRandom, which are all embarrassingly
>>>>>>>> parallel, have a balance ratio above 2, which means some processes have more
>>>>>>>> than twice the work of others. Meanwhile, the ratio for anything with
>>>>>>>> communication is extremely imbalanced: some processes get to the
>>>>>>>> synchronization point well before other processes.
>>>>>>>>
>>>>>>>> The first thing I would do is worry about the load imbalance. What
>>>>>>>> is its cause? Is it one process with much less work than the others (not great
>>>>>>>> but not terrible), one process with much more work than the others
>>>>>>>> (terrible), or something in between? I think once you get a handle on the
>>>>>>>> load balance the rest may fall into place; otherwise we still have some
>>>>>>>> exploring to do. This is not expected behavior for a good machine with a
>>>>>>>> good network and a well-balanced job. After you understand the load
>>>>>>>> balancing you may need to use one of the parallel performance visualization
>>>>>>>> tools to see why the synchronization is out of whack.
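>>>>>>>> One quick first look, just a sketch (the helper below is something I am
>>>>>>>> suggesting, not something already in your code): compare the per-process
>>>>>>>> row count and local nonzeros right after assembly, for example
>>>>>>>>
>>>>>>>>   #include <petscmat.h>
>>>>>>>>
>>>>>>>>   /* Hypothetical helper: call it after MatAssemblyEnd() on the matrix */
>>>>>>>>   PetscErrorCode ReportMatBalance(Mat A)
>>>>>>>>   {
>>>>>>>>     PetscErrorCode ierr;
>>>>>>>>     PetscInt       rstart, rend;
>>>>>>>>     MatInfo        info;
>>>>>>>>     PetscLogDouble vals[2], vmin[2], vmax[2];
>>>>>>>>
>>>>>>>>     PetscFunctionBeginUser;
>>>>>>>>     ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
>>>>>>>>     ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);
>>>>>>>>     vals[0] = (PetscLogDouble)(rend - rstart);  /* local rows     */
>>>>>>>>     vals[1] = info.nz_used;                     /* local nonzeros */
>>>>>>>>     ierr = MPI_Allreduce(vals, vmin, 2, MPI_DOUBLE, MPI_MIN, PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>     ierr = MPI_Allreduce(vals, vmax, 2, MPI_DOUBLE, MPI_MAX, PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,
>>>>>>>>                        "rows  min %g  max %g\nnnz   min %g  max %g\n",
>>>>>>>>                        vmin[0], vmax[0], vmin[1], vmax[1]);CHKERRQ(ierr);
>>>>>>>>     PetscFunctionReturn(0);
>>>>>>>>   }
>>>>>>>>
>>>>>>>> If either max/min is far from 1, the distribution of basis states across
>>>>>>>> processes is the first place I would look.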
>>>>>>>>
>>>>>>>> Good luck
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>
>>>>>>>> > On Jun 21, 2019, at 9:27 AM, Ale Foggia <amfoggia at gmail.com>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > I'm sending one with a bit less time.
>>>>>>>> > I'm also timing the functions with std::chrono, and for the case
>>>>>>>> of 180 seconds the program runs out of memory (and crashes) before the
>>>>>>>> PETSc log gets printed, so I know the time only from my function.
>>>>>>>> Anyway, in every case, the times from std::chrono and the PETSc log
>>>>>>>> match.
>>>>>>>> >
>>>>>>>> > (The large times are in part "4b- Building offdiagonal part" or
>>>>>>>> "Event Stage 5: Offdiag").
>>>>>>>> >
>>>>>>>> > On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao (<
>>>>>>>> jczhang at mcs.anl.gov>) wrote:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <amfoggia at gmail.com>
>>>>>>>> wrote:
>>>>>>>> > Thanks to both of you for your answers,
>>>>>>>> >
>>>>>>>> > On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. (<
>>>>>>>> bsmith at mcs.anl.gov>) wrote:
>>>>>>>> >
>>>>>>>> > Note that this is a one-time cost if the nonzero structure of
>>>>>>>> the matrix stays the same. It will not happen in future MatAssemblies.
>>>>>>>> >
>>>>>>>> > > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users <
>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>> > >
>>>>>>>> > > Those messages were used to build the MatMult communication pattern
>>>>>>>> for the matrix. They were not the passing of matrix entries that you
>>>>>>>> imagined, but they did happen in MatAssemblyEnd. If you want to make sure
>>>>>>>> processors do not set remote entries, you can use
>>>>>>>> MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an
>>>>>>>> error when an off-proc entry is set.
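>>>>>>>> A minimal sketch of what that looks like (this mirrors the 1D Laplacian
>>>>>>>> from the SLEPc hands-on exercises, not your application code, and uses the
>>>>>>>> usual ierr/CHKERRQ error checking):
>>>>>>>>
>>>>>>>>   /* Sketch only: a 1D Laplacian stand-in, not the LSQuantumED Hamiltonian */
>>>>>>>>   #include <petscmat.h>
>>>>>>>>
>>>>>>>>   int main(int argc, char **argv)
>>>>>>>>   {
>>>>>>>>     Mat            A;
>>>>>>>>     PetscInt       n = 1000, i, Istart, Iend;
>>>>>>>>     PetscErrorCode ierr;
>>>>>>>>
>>>>>>>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>>>>>>>>     ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>>>>>>>>     ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
>>>>>>>>     ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>>>>>>>>     ierr = MatSetUp(A);CHKERRQ(ierr);
>>>>>>>>     /* Promise that no rank sets entries in rows owned by another rank;
>>>>>>>>        PETSc errors out if the promise is broken and can skip the
>>>>>>>>        assembly-time reductions for stashed off-process values. */
>>>>>>>>     ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
>>>>>>>>     ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
>>>>>>>>     for (i = Istart; i < Iend; i++) {  /* only locally owned rows */
>>>>>>>>       if (i > 0)     {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>>>>       if (i < n - 1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>>>>       ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>>>>>>>>     }
>>>>>>>>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>>>>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>>>>     ierr = MatDestroy(&A);CHKERRQ(ierr);
>>>>>>>>     ierr = PetscFinalize();
>>>>>>>>     return ierr;
>>>>>>>>   }
>>>>>>>>
>>>>>>>> If a rank then tries to set a row it does not own, PETSc stops with an
>>>>>>>> error instead of silently shipping the value during assembly.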
>>>>>>>> >
>>>>>>>> > I started being concerned about this when I saw that the assembly
>>>>>>>> was taking a few hundred seconds in my code, like 180 seconds, which
>>>>>>>> for me is a considerable time. Do you think (or maybe you need more
>>>>>>>> information to answer this) that this time is "reasonable" for
>>>>>>>> communicating the pattern for the matrix? I already checked that I'm not
>>>>>>>> setting any remote entries.
>>>>>>>> > It is not reasonable. Could you send the log view of the test with
>>>>>>>> the 180-second MatAssembly?
>>>>>>>> >
>>>>>>>> > Also, I see (in my code) that even when there are no messages being
>>>>>>>> passed during MatAssemblyBegin, it takes time and the "ratio" is very
>>>>>>>> big.
>>>>>>>> >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > --Junchao Zhang
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users <
>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>> > > Hello all!
>>>>>>>> > >
>>>>>>>> > > During the conference I showed you a problem happening during
>>>>>>>> MatAssemblyEnd in a particular code that I have. Now, I tried the same with
>>>>>>>> a simple code (a symmetric problem corresponding to the Laplacian operator
>>>>>>>> in 1D, from the SLEPc Hands-On exercises). As I understand it (and please
>>>>>>>> correct me if I'm wrong), in this case the elements of the matrix are
>>>>>>>> computed locally by each process, so there should not be any communication
>>>>>>>> during the assembly. However, in the log I see that messages are
>>>>>>>> being passed. Also, the number of messages changes with the number of
>>>>>>>> processes used and the size of the matrix. Could you please help me
>>>>>>>> understand this?
>>>>>>>> > >
>>>>>>>> > > I attach the code I used and the log I get for a small problem.
>>>>>>>> > >
>>>>>>>> > > Cheers,
>>>>>>>> > > Ale
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> > <log.txt>
>>>>>>>>
>>>>>>>>