[petsc-users] LU factorization and solution of independent matrices does not scale, why?

Fri Dec 21 09:51:12 CST 2012

I use a modified MPICH version. On the system I use for these  
benchmarks I cannot use another MPI library.

I'm not fixed to MUMPS. Superlu_dist, for example, works also  
perfectly for this. But there is still the following problem I cannot  
solve: When I increase the number of coarse space matrices, there  
seems to be no scaling direct solver for this. Just to summaries:
- one coarse space matrix is created always by one "cluster"  
consisting of four subdomanins/MPI tasks
- the four tasks are always local to one node, thus inter-node network  
communication is not required for computing factorization and solve
- independent of the number of cluster, the coarse space matrices are  
the same, have the same number of rows, nnz structure but possibly  
different values
- there is NO load unbalancing
- the matrices must be factorized and there are a lot of solves (>  
100) with them

It should be pretty clear, that computing LU factorization and solving  
with it should scale perfectly. But at the moment, all direct solver I  
tried (mumps, superlu_dist, pastix) are not able to scale. The loos of  
scale is really worse, as you can see from the numbers I send before.

Any ideas? Suggestions? Without a scaling solver method for these kind  
of systems, my multilevel FETI-DP code is just more or less a joke,  
only some orders of magnitude slower than standard FETI-DP method :)

Thomas

Zitat von Jed Brown <jedbrown at mcs.anl.gov>:

> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
> implementation have you been using? Is the behavior different with a
> different implementation?
>
>
> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski <
> thomas.witkowski at tu-dresden.de> wrote:
>
>> Okay, I did a similar benchmark now with PETSc's event logging:
>>
>> UMFPACK
>>  16p: Local solve          350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00
>> 0.0e+00 7.0e+02 63  0  0  0 52  63  0  0  0 51     0
>>  64p: Local solve          350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00
>> 0.0e+00 7.0e+02 60  0  0  0 52  60  0  0  0 51     0
>> 256p: Local solve          350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00
>> 0.0e+00 7.0e+02 49  0  0  0 52  49  0  0  0 51     1
>>
>> MUMPS
>>  16p: Local solve          350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00
>> 0.0e+00 7.0e+02 75  0  0  0 52  75  0  0  0 51     0
>>  64p: Local solve          350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00
>> 0.0e+00 7.0e+02 78  0  0  0 52  78  0  0  0 51     0
>> 256p: Local solve          350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00
>> 0.0e+00 7.0e+02 82  0  0  0 52  82  0  0  0 51     0
>>
>>
>> As you see, the local solves with UMFPACK have nearly constant time with
>> increasing number of subdomains. This is what I expect. The I replace
>> UMFPACK by MUMPS and I see increasing time for local solves. In the last
>> columns, UMFPACK has a decreasing value from 63 to 49, while MUMPS's column
>> increases here from 75 to 82. What does this mean?
>>
>> Thomas
>>
>> Am 21.12.2012 02:19, schrieb Matthew Knepley:
>>
>>  On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>>> <Thomas.Witkowski at tu-dresden.**de <Thomas.Witkowski at tu-dresden.de>>
>>> wrote:
>>>
>>>> I cannot use the information from log_summary, as I have three different
>>>> LU
>>>> factorizations and solve (local matrices and two hierarchies of coarse
>>>> grids). Therefore, I use the following work around to get the timing of
>>>> the
>>>> solve I'm intrested in:
>>>>
>>> You misunderstand how to use logging. You just put these thing in
>>> separate stages. Stages represent
>>> parts of the code over which events are aggregated.
>>>
>>>     Matt
>>>
>>>       MPI::COMM_WORLD.Barrier();
>>>>      wtime = MPI::Wtime();
>>>>      KSPSolve(*(data->ksp_schur_**primal_local), tmp_primal,
>>>> tmp_primal);
>>>>      FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>
>>>> The factorization is done explicitly before with "KSPSetUp", so I can
>>>> measure the time for LU factorization. It also does not scale! For 64
>>>> cores,
>>>> I takes 0.05 seconds, for 1024 cores 1.2 seconds. In all calculations,
>>>> the
>>>> local coarse space matrices defined on four cores have exactly the same
>>>> number of rows and exactly the same number of non zero entries. So, from
>>>> my
>>>> point of view, the time should be absolutely constant.
>>>>
>>>> Thomas
>>>>
>>>> Zitat von Barry Smith <bsmith at mcs.anl.gov>:
>>>>
>>>>
>>>>     Are you timing ONLY the time to factor and solve the subproblems?  Or
>>>>> also the time to get the data to the collection of 4 cores at a  time?
>>>>>
>>>>>     If you are only using LU for these problems and not elsewhere in
>>>>>  the
>>>>> code you can get the factorization and time from MatLUFactor()  and
>>>>> MatSolve() or you can use stages to put this calculation in its  own
>>>>> stage
>>>>> and use the MatLUFactor() and MatSolve() time from that  stage.
>>>>> Also look at the load balancing column for the factorization and  solve
>>>>> stage, it is well balanced?
>>>>>
>>>>>     Barry
>>>>>
>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>>> <thomas.witkowski at tu-dresden.**de <thomas.witkowski at tu-dresden.de>>
>>>>> wrote:
>>>>>
>>>>>  In my multilevel FETI-DP code, I have localized course matrices,  which
>>>>>> are defined on only a subset of all MPI tasks, typically  between 4
>>>>>> and 64
>>>>>> tasks. The MatAIJ and the KSP objects are both  defined on a MPI
>>>>>> communicator, which is a subset of  MPI::COMM_WORLD. The LU
>>>>>> factorization of
>>>>>> the matrices is computed  with either MUMPS or superlu_dist, but both
>>>>>> show
>>>>>> some scaling  property I really wonder of: When the overall problem
>>>>>> size is
>>>>>> increased, the solve with the LU factorization of the local  matrices
>>>>>> does
>>>>>> not scale! But why not? I just increase the number of  local matrices,
>>>>>> but
>>>>>> all of them are independent of each other. Some  example: I use 64
>>>>>> cores,
>>>>>> each coarse matrix is spanned by 4 cores  so there are 16 MPI
>>>>>> communicators
>>>>>> with 16 coarse space matrices.  The problem need to solve 192 times
>>>>>> with the
>>>>>> coarse space systems,  and this takes together 0.09 seconds. Now I
>>>>>> increase
>>>>>> the number of  cores to 256, but let the local coarse space be defined
>>>>>> again
>>>>>> on  only 4 cores. Again, 192 solutions with these coarse spaces are
>>>>>> required, but now this takes 0.24 seconds. The same for 1024 cores,
>>>>>>  and we
>>>>>> are at 1.7 seconds for the local coarse space solver!
>>>>>>
>>>>>> For me, this is a total mystery! Any idea how to explain, debug and
>>>>>> eventually how to resolve this problem?
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which
>>> their experiments lead.
>>> -- Norbert Wiener
>>>
>>
>>
>