[petsc-users] LU factorization and solution of independent matrices does not scale, why?

Jed Brown jedbrown at mcs.anl.gov
Sat Jan 5 22:36:31 CST 2013


I don't seem able to read this matrix file because it appears to be
corrupt. Our mail programs might be messing with it. Can you encode it as
binary or provide it some other way?

Can you compare timings with all the other ranks dropped out? That is,
instead of

MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &coarseMpiComm);

have all procs with rank >= 4 pass MPI_UNDEFINED and then skip over the
solve itself (they'll get MPI_COMM_NULL returned).
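
For example, a minimal sketch of that comparison (variable names follow the
snippet above; error checking omitted):

  int rank;
  MPI_Comm coarseMpiComm;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* Only ranks 0..3 join the coarse communicator; everyone else passes
     MPI_UNDEFINED and gets MPI_COMM_NULL back from MPI_Comm_split(). */
  int color = (rank < 4) ? 0 : MPI_UNDEFINED;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &coarseMpiComm);
  if (coarseMpiComm != MPI_COMM_NULL) {
    /* ... create the coarse matrix and KSP on coarseMpiComm and solve ... */
  }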


On Fri, Dec 21, 2012 at 3:05 PM, Thomas Witkowski <
Thomas.Witkowski at tu-dresden.de> wrote:

> So, here it is. Just compile and run with
>
> mpiexec -np 64 ./ex10 -ksp_type preonly -pc_type lu
> -pc_factor_mat_solver_package superlu_dist -log_summary
>
> 64 cores: 0.09 seconds for solving
> 1024 cores: 2.6 seconds for solving
>
>
> Thomas
>
>
> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>
>> Can you reproduce this in a simpler environment so that we can report it?
>> As I understand your statement, it sounds like you could reproduce it by
>> changing src/ksp/ksp/examples/tutorials/ex10.c to create a subcomm of size
>> 4 and then using that everywhere, then compare -log_summary running on 4
>> cores to running on more (despite everything really being independent).
>>
>> It would also be worth using an MPI profiler to see if it's really
>> spending a lot of time in MPI_Iprobe. Since SuperLU_DIST does not use
>> MPI_Iprobe, it may be something else.
>>
>> On Fri, Dec 21, 2012 at 8:51 AM, Thomas Witkowski
>> <Thomas.Witkowski at tu-dresden.de> wrote:
>>
>>> I use a modified MPICH version. On the system I use for these benchmarks
>>> I cannot use another MPI library.
>>>
>>> I'm not fixed to MUMPS. Superlu_dist, for example, also works perfectly
>>> for this. But there is still the following problem I cannot solve: when I
>>> increase the number of coarse space matrices, there seems to be no
>>> scaling direct solver for this. Just to summarize:
>>> - one coarse space matrix is always created by one "cluster" consisting
>>>   of four subdomains/MPI tasks
>>> - the four tasks are always local to one node, thus inter-node network
>>>   communication is not required for computing the factorization and solves
>>> - independent of the number of clusters, the coarse space matrices are
>>>   the same, have the same number of rows and nnz structure, but possibly
>>>   different values
>>> - there is NO load imbalance
>>> - the matrices must be factorized and there are a lot of solves (> 100)
>>>   with them
>>>
>>> It should be pretty clear that computing the LU factorization and solving
>>> with it should scale perfectly. But at the moment, all direct solvers I
>>> tried (mumps, superlu_dist, pastix) are not able to scale. The loss of
>>> scaling is really bad, as you can see from the numbers I sent before.
>>>
>>> Any ideas? Suggestions? Without a scaling solver method for these kinds
>>> of systems, my multilevel FETI-DP code is just more or less a joke, only
>>> some orders of magnitude slower than the standard FETI-DP method :)
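>>>
>>> To make this concrete, here is a rough sketch of what each cluster does
>>> (the variable names are just placeholders; the KSPSetOperators signature
>>> is the four-argument one of the PETSc release used here):
>>>
>>>   MPI_Comm coarseComm;
>>>   int rank;
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>   /* four consecutive ranks share one coarse space matrix */
>>>   MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &coarseComm);
>>>
>>>   KSP ksp;
>>>   KSPCreate(coarseComm, &ksp);
>>>   KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);
>>>   /* run with -ksp_type preonly -pc_type lu
>>>      -pc_factor_mat_solver_package superlu_dist */
>>>   KSPSetFromOptions(ksp);
>>>   KSPSetUp(ksp);                   /* LU factorization happens here */
>>>   for (int i = 0; i < nSolves; i++)
>>>     KSPSolve(ksp, rhs[i], sol[i]); /* > 100 independent solves per cluster */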
>>>
>>> Thomas
>>>
>>> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>>>
>>>> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
>>>> implementation have you been using? Is the behavior different with a
>>>> different implementation?
>>>>
>>>>
>>>> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski
>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>
>>>>> Okay, I did a similar benchmark now with PETSc's event logging:
>>>>>
>>>>>
>>>>> UMFPACK
>>>>>  16p: Local solve          350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63  0  0  0 52  63  0  0  0 51     0
>>>>>  64p: Local solve          350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60  0  0  0 52  60  0  0  0 51     0
>>>>> 256p: Local solve          350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49  0  0  0 52  49  0  0  0 51     1
>>>>>
>>>>> MUMPS
>>>>>  16p: Local solve          350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75  0  0  0 52  75  0  0  0 51     0
>>>>>  64p: Local solve          350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78  0  0  0 52  78  0  0  0 51     0
>>>>> 256p: Local solve          350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82  0  0  0 52  82  0  0  0 51     0
>>>>>
>>>>>
>>>>> As you can see, the local solves with UMFPACK take nearly constant time
>>>>> with an increasing number of subdomains. This is what I expect. Then I
>>>>> replace UMFPACK by MUMPS and I see increasing time for the local solves.
>>>>> In the last columns, UMFPACK has a decreasing value from 63 to 49, while
>>>>> MUMPS's column increases here from 75 to 82. What does this mean?
>>>>>
>>>>> Thomas
>>>>>
>>>>> On 21.12.2012 02:19, Matthew Knepley wrote:
>>>>>
>>>>>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>>>>>> <Thomas.Witkowski at tu-dresden.de> wrote:
>>>>>>
>>>>>>> I cannot use the information from log_summary, as I have three
>>>>>>> different LU factorizations and solves (local matrices and two
>>>>>>> hierarchies of coarse grids). Therefore, I use the following
>>>>>>> workaround to get the timing of the solve I'm interested in:
>>>>>>>
>>>>>> You misunderstand how to use logging. You just put these things in
>>>>>> separate stages. Stages represent parts of the code over which events
>>>>>> are aggregated.
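>>>>>>
>>>>>> For example, a minimal sketch of a separate stage around the coarse
>>>>>> solves (the stage name is arbitrary); -log_summary then reports the
>>>>>> KSPSolve/MatSolve events for that stage on their own:
>>>>>>
>>>>>>   PetscLogStage coarseStage;
>>>>>>   PetscLogStageRegister("CoarseSolve", &coarseStage);
>>>>>>   PetscLogStagePush(coarseStage);
>>>>>>   KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>>>>   PetscLogStagePop();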
>>>>>>
>>>>>>     Matt
>>>>>>
>>>>>>>       MPI::COMM_WORLD.Barrier();
>>>>>>>       wtime = MPI::Wtime();
>>>>>>>       KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>>>>>       FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>>>>
>>>>>>> The factorization is done explicitly before with "KSPSetUp", so I can
>>>>>>> measure the time for the LU factorization. It also does not scale! For
>>>>>>> 64 cores, it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all
>>>>>>> calculations, the local coarse space matrices defined on four cores
>>>>>>> have exactly the same number of rows and exactly the same number of
>>>>>>> non-zero entries. So, from my point of view, the time should be
>>>>>>> absolutely constant.
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>>>>>
>>>>>>>
>>>>>>>>    Are you timing ONLY the time to factor and solve the subproblems?
>>>>>>>> Or also the time to get the data to the collection of 4 cores at a
>>>>>>>> time?
>>>>>>>>
>>>>>>>>    If you are only using LU for these problems and not elsewhere in
>>>>>>>> the code, you can get the factorization and solve time from
>>>>>>>> MatLUFactor() and MatSolve(), or you can use stages to put this
>>>>>>>> calculation in its own stage and use the MatLUFactor() and MatSolve()
>>>>>>>> time from that stage. Also look at the load balancing column for the
>>>>>>>> factorization and solve stage; is it well balanced?
>>>>>>>>
>>>>>>>>     Barry
>>>>>>>>
>>>>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>>>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>>>>>
>>>>>>>>> In my multilevel FETI-DP code, I have localized coarse matrices,
>>>>>>>>> which are defined on only a subset of all MPI tasks, typically
>>>>>>>>> between 4 and 64 tasks. The MatAIJ and the KSP objects are both
>>>>>>>>> defined on an MPI communicator, which is a subset of
>>>>>>>>> MPI::COMM_WORLD. The LU factorization of the matrices is computed
>>>>>>>>> with either MUMPS or superlu_dist, but both show some scaling
>>>>>>>>> behavior I really wonder about: when the overall problem size is
>>>>>>>>> increased, the solve with the LU factorization of the local matrices
>>>>>>>>> does not scale! But why not? I just increase the number of local
>>>>>>>>> matrices, but all of them are independent of each other. Some
>>>>>>>>> example: I use 64 cores, each coarse matrix is spanned by 4 cores,
>>>>>>>>> so there are 16 MPI communicators with 16 coarse space matrices. The
>>>>>>>>> problem needs to be solved 192 times with the coarse space systems,
>>>>>>>>> and this takes together 0.09 seconds. Now I increase the number of
>>>>>>>>> cores to 256, but let the local coarse space again be defined on
>>>>>>>>> only 4 cores. Again, 192 solutions with these coarse spaces are
>>>>>>>>> required, but now this takes 0.24 seconds. The same for 1024 cores,
>>>>>>>>> and we are at 1.7 seconds for the local coarse space solves!
>>>>>>>>>
>>>>>>>>> For me, this is a total mystery! Any idea how to explain, debug, and
>>>>>>>>> eventually resolve this problem?
>>>>>>>>>
>>>>>>>>> Thomas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which
>>>>>> their experiments lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>