[petsc-users] LU factorization and solution of independent matrices does not scale, why?
Thomas Witkowski
thomas.witkowski at tu-dresden.de
Thu Dec 27 10:10:22 CST 2012
Has anyone of you tried to reproduce this problem?
Thomas
On 21.12.2012 22:05, Thomas Witkowski wrote:
> So, here it is. Just compile and run with
>
> mpiexec -np 64 ./ex10 -ksp_type preonly -pc_type lu
> -pc_factor_mat_solver_package superlu_dist -log_summary
>
> 64 cores: 0.09 seconds for solving
> 1024 cores: 2.6 seconds for solving
>
> Thomas
>
>
> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>
>> Can you reproduce this in a simpler environment so that we can report
>> it? As I understand your statement, it sounds like you could reproduce
>> it by changing src/ksp/ksp/examples/tutorials/ex10.c to create a subcomm
>> of size 4 and then using that everywhere, then comparing log_summary
>> when running on 4 cores to running on more (despite everything really
>> being independent).
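
[A minimal sketch of such a size-4 subcommunicator split, for illustration
only and not the actual ex10.c patch; the helper name CreateSizeFourSubcomm
is hypothetical. Every Mat/KSP in the example would then be created on
"subcomm" instead of PETSC_COMM_WORLD:]

    #include <petscsys.h>

    /* color = rank/4 groups ranks {0..3}, {4..7}, ... into separate comms */
    static PetscErrorCode CreateSizeFourSubcomm(MPI_Comm *subcomm)
    {
      PetscMPIInt rank;

      MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
      MPI_Comm_split(PETSC_COMM_WORLD, rank / 4, rank, subcomm);
      return 0;
    }
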
>>
>> It would also be worth using an MPI profiler to see if it's really
>> spending a lot of time in MPI_Iprobe. Since SuperLU_DIST does not use
>> MPI_Iprobe, it may be something else.
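
[If no full MPI profiler is available, a generic PMPI interposition sketch
like the following, assuming nothing beyond the standard MPI profiling
interface, can at least count MPI_Iprobe calls per rank. Compile and link
it into the application, or build it as a shared library and preload it:]

    #include <mpi.h>
    #include <stdio.h>

    static long iprobe_calls = 0;

    /* intercept MPI_Iprobe and forward to the real implementation */
    int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)
    {
      iprobe_calls++;
      return PMPI_Iprobe(source, tag, comm, flag, status);
    }

    /* report the per-rank count at shutdown */
    int MPI_Finalize(void)
    {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("[rank %d] MPI_Iprobe was called %ld times\n", rank, iprobe_calls);
      return PMPI_Finalize();
    }
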
>>
>> On Fri, Dec 21, 2012 at 8:51 AM, Thomas Witkowski <
>> Thomas.Witkowski at tu-dresden.de> wrote:
>>
>>> I use a modified MPICH version. On the system I use for these
>>> benchmarks I cannot use another MPI library.
>>>
>>> I'm not tied to MUMPS. Superlu_dist, for example, also works perfectly
>>> for this. But there is still the following problem I cannot solve: when
>>> I increase the number of coarse space matrices, there seems to be no
>>> direct solver that scales for this. Just to summarize:
>>> - one coarse space matrix is always created by one "cluster" consisting
>>> of four subdomains/MPI tasks
>>> - the four tasks are always local to one node, thus inter-node network
>>> communication is not required for computing the factorization and solve
>>> - independent of the number of clusters, the coarse space matrices are
>>> the same: they have the same number of rows and the same nnz structure,
>>> but possibly different values
>>> - there is NO load imbalance
>>> - the matrices must be factorized and there are a lot of solves (> 100)
>>> with them
>>>
>>> It should be pretty clear that computing the LU factorization and
>>> solving with it should scale perfectly. But at the moment, none of the
>>> direct solvers I tried (mumps, superlu_dist, pastix) is able to scale.
>>> The loss of scalability is really bad, as you can see from the numbers
>>> I sent before.
>>>
>>> Any ideas? Suggestions? Without a scalable solver for this kind of
>>> system, my multilevel FETI-DP code is just more or less a joke, only
>>> some orders of magnitude slower than the standard FETI-DP method :)
>>>
>>> Thomas
>>>
>>> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>>>
>>>> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
>>>> implementation have you been using? Is the behavior different with a
>>>> different implementation?
>>>>
>>>>
>>>> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski
>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>
>>>>> Okay, I did a similar benchmark now with PETSc's event logging:
>>>>>
>>>>> UMFPACK
>>>>> 16p:  Local solve 350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63 0 0 0 52 63 0 0 0 51 0
>>>>> 64p:  Local solve 350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60 0 0 0 52 60 0 0 0 51 0
>>>>> 256p: Local solve 350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49 0 0 0 52 49 0 0 0 51 1
>>>>>
>>>>> MUMPS
>>>>> 16p:  Local solve 350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75 0 0 0 52 75 0 0 0 51 0
>>>>> 64p:  Local solve 350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78 0 0 0 52 78 0 0 0 51 0
>>>>> 256p: Local solve 350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82 0 0 0 52 82 0 0 0 51 0
>>>>>
>>>>>
>>>>> As you see, the local solves with UMFPACK have nearly constant time
>>>>> with an increasing number of subdomains. This is what I expect. Then
>>>>> I replace UMFPACK with MUMPS and I see increasing time for the local
>>>>> solves. In the last columns, UMFPACK has a decreasing value from 63
>>>>> to 49, while MUMPS's column increases here from 75 to 82. What does
>>>>> this mean?
>>>>>
>>>>> Thomas
>>>>>
>>>>> On 21.12.2012 02:19, Matthew Knepley wrote:
>>>>>
>>>>>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>>>>>> <Thomas.Witkowski at tu-dresden.de> wrote:
>>>>>>
>>>>>>> I cannot use the information from log_summary, as I have three
>>>>>>> different LU factorizations and solves (local matrices and two
>>>>>>> hierarchies of coarse grids). Therefore, I use the following
>>>>>>> workaround to get the timing of the solve I'm interested in:
>>>>>>>
>>>>>> You misunderstand how to use logging. You just put these things in
>>>>>> separate stages. Stages represent parts of the code over which
>>>>>> events are aggregated.
>>>>>>
>>>>>> Matt
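
[A minimal sketch of the stage-based logging Matt describes, using the
standard PetscLogStageRegister/Push/Pop calls; the function name
CoarseSolveLogged and the stage name "CoarseSolve" are illustrative only.
With one stage per solver hierarchy, -log_summary reports the KSPSolve,
MatSolve, and MatLUFactor events of each stage separately:]

    #include <petscksp.h>

    static PetscLogStage stageCoarse = -1;  /* registered on first use */

    static PetscErrorCode CoarseSolveLogged(KSP ksp, Vec b, Vec x)
    {
      PetscErrorCode ierr;

      if (stageCoarse < 0) {
        ierr = PetscLogStageRegister("CoarseSolve", &stageCoarse);CHKERRQ(ierr);
      }
      ierr = PetscLogStagePush(stageCoarse);CHKERRQ(ierr);
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* logged inside "CoarseSolve" */
      ierr = PetscLogStagePop();CHKERRQ(ierr);
      return 0;
    }
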
>>>>>>
>>>>>>> MPI::COMM_WORLD.Barrier();
>>>>>>> wtime = MPI::Wtime();
>>>>>>> KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>>>>> FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>>>>
>>>>>>> The factorization is done explicitly before with "KSPSetUp", so I
>>>>>>> can measure the time for the LU factorization. It also does not
>>>>>>> scale! For 64 cores, it takes 0.05 seconds, for 1024 cores 1.2
>>>>>>> seconds. In all calculations, the local coarse space matrices
>>>>>>> defined on four cores have exactly the same number of rows and
>>>>>>> exactly the same number of nonzero entries. So, from my point of
>>>>>>> view, the time should be absolutely constant.
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>>>>>
>>>>>>>
>>>>>>>> Are you timing ONLY the time to factor and solve the subproblems?
>>>>>>>> Or also the time to get the data to the collection of 4 cores at
>>>>>>>> a time?
>>>>>>>>
>>>>>>>> If you are only using LU for these problems and not elsewhere in
>>>>>>>> the code, you can get the factorization and solve time from
>>>>>>>> MatLUFactor() and MatSolve(), or you can use stages to put this
>>>>>>>> calculation in its own stage and use the MatLUFactor() and
>>>>>>>> MatSolve() time from that stage.
>>>>>>>> Also look at the load balancing column for the factorization and
>>>>>>>> solve stage: is it well balanced?
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>>>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>>>>>
>>>>>>>>> In my multilevel FETI-DP code, I have localized coarse matrices,
>>>>>>>>> which are defined on only a subset of all MPI tasks, typically
>>>>>>>>> between 4 and 64 tasks. The MatAIJ and the KSP objects are both
>>>>>>>>> defined on an MPI communicator, which is a subset of
>>>>>>>>> MPI::COMM_WORLD. The LU factorization of the matrices is computed
>>>>>>>>> with either MUMPS or superlu_dist, but both show a scaling
>>>>>>>>> behavior I really wonder about: when the overall problem size is
>>>>>>>>> increased, the solve with the LU factorization of the local
>>>>>>>>> matrices does not scale! But why not? I just increase the number
>>>>>>>>> of local matrices, but all of them are independent of each other.
>>>>>>>>> An example: I use 64 cores, each coarse matrix is spanned by 4
>>>>>>>>> cores, so there are 16 MPI communicators with 16 coarse space
>>>>>>>>> matrices. The coarse space systems need to be solved 192 times,
>>>>>>>>> and together this takes 0.09 seconds. Now I increase the number
>>>>>>>>> of cores to 256, but let the local coarse space again be defined
>>>>>>>>> on only 4 cores. Again, 192 solutions with these coarse spaces
>>>>>>>>> are required, but now this takes 0.24 seconds. The same for 1024
>>>>>>>>> cores, and we are at 1.7 seconds for the local coarse space
>>>>>>>>> solver!
>>>>>>>>>
>>>>>>>>> For me, this is a total mystery! Any idea how to explain, debug,
>>>>>>>>> and eventually resolve this problem?
>>>>>>>>>
>>>>>>>>> Thomas
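
[For reference, a self-contained sketch of the kind of setup described
above, with hypothetical sizes (n = 1000) and a trivial diagonal test
matrix rather than the actual FETI-DP coarse operators: each group of 4
ranks builds its own AIJ matrix and KSP on a subcommunicator, factors it
once in KSPSetUp(), and solves 192 times. The direct solver is chosen at
run time, e.g. -pc_type lu -pc_factor_mat_solver_package mumps (or
superlu_dist):]

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      MPI_Comm       subcomm;
      PetscMPIInt    rank;
      Mat            A;
      Vec            b, x;
      KSP            ksp;
      PetscInt       i, n = 1000, rstart, rend;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
      MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
      /* groups of 4 ranks, each with its own communicator */
      MPI_Comm_split(PETSC_COMM_WORLD, rank / 4, rank, &subcomm);

      /* a trivial test matrix living only on the subcommunicator */
      ierr = MatCreate(subcomm, &A);CHKERRQ(ierr);
      ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
      ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
      ierr = MatSetUp(A);CHKERRQ(ierr);
      ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
      for (i = rstart; i < rend; i++) {
        ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
      }
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

      ierr = VecCreateMPI(subcomm, PETSC_DECIDE, n, &b);CHKERRQ(ierr);
      ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
      ierr = VecSet(b, 1.0);CHKERRQ(ierr);

      ierr = KSPCreate(subcomm, &ksp);CHKERRQ(ierr);
      ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);
      ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr); /* older PETSc adds a MatStructure flag */
      ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);     /* -pc_type lu -pc_factor_mat_solver_package ... */
      ierr = KSPSetUp(ksp);CHKERRQ(ierr);              /* the factorization happens here */

      for (i = 0; i < 192; i++) {                      /* many repeated solves, as in the FETI-DP code */
        ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
      }

      ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = VecDestroy(&b);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      MPI_Comm_free(&subcomm);
      ierr = PetscFinalize();
      return ierr;
    }
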
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which
>>>>>> their experiments lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>