[petsc-users] LU factorization and solution of independent matrices does not scale, why?

Fri Dec 21 08:08:14 CST 2012

MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
implementation have you been using? Is the behavior different with a
different implementation?

On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski <
thomas.witkowski at tu-dresden.de> wrote:

> Okay, I did a similar benchmark now with PETSc's event logging:
>
> UMFPACK
>  16p: Local solve          350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63  0  0  0 52  63  0  0  0 51     0
>  64p: Local solve          350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60  0  0  0 52  60  0  0  0 51     0
> 256p: Local solve          350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49  0  0  0 52  49  0  0  0 51     1
>
> MUMPS
>  16p: Local solve          350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75  0  0  0 52  75  0  0  0 51     0
>  64p: Local solve          350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78  0  0  0 52  78  0  0  0 51     0
> 256p: Local solve          350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82  0  0  0 52  82  0  0  0 51     0
>
>
> As you can see, the local solves with UMFPACK take nearly constant time as
> the number of subdomains increases. This is what I expect. Then I replace
> UMFPACK with MUMPS and see the time for the local solves increase. In the
> last columns, UMFPACK's value decreases from 63 to 49, while MUMPS's
> increases from 75 to 82. What does this mean?
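
For reference, the trailing integer columns of a -log_summary event line are
percentages: the first group of five (%T %F %M %L %R) is relative to the whole
run, the second group of five is relative to the enclosing stage, and the very
last column is Mflop/s. The 63 to 49 and 75 to 82 values are therefore the
share of total run time spent in the event. Below is a minimal sketch of
pulling these fields out of one line, assuming the standard layout with 20
numeric columns after the event name. The EventLine struct and the
parseEventLine function are hypothetical names for illustration and are not
part of PETSc.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the fields of interest in one event line.
struct EventLine {
  std::string name;          // event name, may contain spaces ("Local solve")
  long        count;         // max number of calls over all ranks
  double      timeMax;       // max time in seconds over all ranks
  double      timeRatio;     // max/min time ratio (load balance indicator)
  int         globalPctTime; // %T relative to the whole run
  int         stagePctTime;  // %T relative to the enclosing stage
  double      mflops;        // last column, Mflop/s
};

// Assumes the line ends with exactly 20 numeric columns:
// count, ratio, time, ratio, flops, ratio, messages, avg len, reductions,
// five global percentages, five stage percentages, Mflop/s.
// Returns false if the line is too short; std::stod throws on bad numbers.
bool parseEventLine(const std::string& line, EventLine& out) {
  std::istringstream in(line);
  std::vector<std::string> tok;
  for (std::string t; in >> t;) tok.push_back(t);

  const std::size_t kNumeric = 20;
  if (tok.size() <= kNumeric) return false;
  const std::size_t first = tok.size() - kNumeric;

  // Everything before the numeric columns is the event name.
  out.name.clear();
  for (std::size_t i = 0; i < first; ++i) {
    if (i > 0) out.name += ' ';
    out.name += tok[i];
  }

  std::vector<double> v;
  for (std::size_t i = first; i < tok.size(); ++i) v.push_back(std::stod(tok[i]));
  assert(v.size() == kNumeric);

  out.count         = static_cast<long>(v[0]);
  out.timeMax       = v[2];
  out.timeRatio     = v[3];
  out.globalPctTime = static_cast<int>(v[9]);   // first of the global group
  out.stagePctTime  = static_cast<int>(v[14]);  // first of the stage group
  out.mflops        = v[19];
  return true;
}
```

Strip any hand-added label such as "16p:" before parsing, otherwise it ends up
in the event name.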
>
> Thomas
>
> On 21.12.2012 02:19, Matthew Knepley wrote:
>
>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>> <Thomas.Witkowski at tu-dresden.de> wrote:
>>
>>> I cannot use the information from log_summary, as I have three different
>>> LU factorizations and solves (local matrices and two hierarchies of
>>> coarse grids). Therefore, I use the following workaround to get the
>>> timing of the solve I'm interested in:
>>>
>> You misunderstand how to use logging. You just put these things in
>> separate stages. Stages represent parts of the code over which events
>> are aggregated.
>>
>>     Matt
>>
>>>      MPI::COMM_WORLD.Barrier();
>>>      wtime = MPI::Wtime();
>>>      KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>      FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>
>>> The factorization is done explicitly beforehand with "KSPSetUp", so I can
>>> measure the time for the LU factorization. It also does not scale! For 64
>>> cores, it takes 0.05 seconds; for 1024 cores, 1.2 seconds. In all
>>> calculations, the local coarse space matrices defined on four cores have
>>> exactly the same number of rows and exactly the same number of nonzero
>>> entries. So, from my point of view, the time should be absolutely constant.
>>>
>>> Thomas
>>>
>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>
>>>
>>>>     Are you timing ONLY the time to factor and solve the subproblems? Or
>>>> also the time to get the data to the collection of 4 cores at a time?
>>>>
>>>>     If you are only using LU for these problems and not elsewhere in the
>>>> code, you can get the factorization and solve times from MatLUFactor() and
>>>> MatSolve(), or you can use stages to put this calculation in its own stage
>>>> and use the MatLUFactor() and MatSolve() times from that stage.
>>>> Also look at the load balancing column for the factorization and solve
>>>> stage: is it well balanced?
>>>>
>>>>     Barry
>>>>
>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>
>>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>>>> are defined on only a subset of all MPI tasks, typically between 4 and
>>>>> 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>>>> communicator, which is a subset of MPI::COMM_WORLD. The LU factorization
>>>>> of the matrices is computed with either MUMPS or superlu_dist, but both
>>>>> show a scaling behavior that puzzles me: when the overall problem size
>>>>> is increased, the solve with the LU factorization of the local matrices
>>>>> does not scale! But why not? I just increase the number of local
>>>>> matrices, but all of them are independent of each other. An example: I
>>>>> use 64 cores, and each coarse matrix is spanned by 4 cores, so there are
>>>>> 16 MPI communicators with 16 coarse space matrices. The problem requires
>>>>> 192 solves with the coarse space systems, and together these take 0.09
>>>>> seconds. Now I increase the number of cores to 256, but let the local
>>>>> coarse space again be defined on only 4 cores. Again, 192 solves with
>>>>> these coarse spaces are required, but now this takes 0.24 seconds. The
>>>>> same for 1024 cores, and we are at 1.7 seconds for the local coarse
>>>>> space solver!
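
The numbers above can be put in weak-scaling terms with a few lines of
arithmetic. This is only a sketch: the function names are made up, and the
rank-to-group mapping assumes that consecutive ranks share a coarse
communicator, which the thread does not state.

```cpp
#include <cassert>

// Number of independent coarse communicators when totalRanks ranks are
// split into groups of ranksPerGroup ranks each.
int numGroups(int totalRanks, int ranksPerGroup) {
  assert(ranksPerGroup > 0 && totalRanks % ranksPerGroup == 0);
  return totalRanks / ranksPerGroup;
}

// Group index of a rank, ASSUMING consecutive ranks form one group. A value
// like this could serve as the "color" argument of MPI_Comm_split.
int groupOf(int rank, int ranksPerGroup) {
  assert(rank >= 0 && ranksPerGroup > 0);
  return rank / ranksPerGroup;
}

// Weak-scaling efficiency: baseline time divided by the time at larger
// scale. 1.0 means constant time, which is what independent local solves
// of identical size would ideally give.
double weakScalingEfficiency(double tBase, double tScaled) {
  assert(tBase > 0.0 && tScaled > 0.0);
  return tBase / tScaled;
}

// Values taken from the timings quoted in this thread:
//   numGroups(64, 4)   == 16,  numGroups(256, 4) == 64,
//   numGroups(1024, 4) == 256
//   weakScalingEfficiency(0.09, 0.24) is about 0.375  (solve, 64 -> 256)
//   weakScalingEfficiency(0.09, 1.7)  is about 0.053  (solve, 64 -> 1024)
//   weakScalingEfficiency(0.05, 1.2)  is about 0.042  (factor, 64 -> 1024)
```

In other words, the solve time grows by roughly a factor of 19 from 64 to
1024 cores although each group of 4 cores does the same amount of work.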
>>>>>
>>>>> For me, this is a total mystery! Any idea how to explain, debug, and
>>>>> eventually resolve this problem?
>>>>>
>>>>> Thomas
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>>
>
>