[petsc-users] LU factorization and solution of independent matrices does not scale, why?
Thomas Witkowski
thomas.witkowski at tu-dresden.de
Thu Dec 27 10:10:22 CST 2012
Has anyone of you tried to reproduce this problem?
Thomas
On 21.12.2012 22:05, Thomas Witkowski wrote:
> So, here it is. Just compile and run with
>
> mpiexec -np 64 ./ex10 -ksp_type preonly -pc_type lu
> -pc_factor_mat_solver_package superlu_dist -log_summary
>
> 64 cores: 0.09 seconds for solving
> 1024 cores: 2.6 seconds for solving
>
> Thomas
>
>
> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>
>> Can you reproduce this in a simpler environment so that we can report
>> it? As I understand your statement, it sounds like you could reproduce
>> it by changing src/ksp/ksp/examples/tutorials/ex10.c to create a subcomm
>> of size 4 and then using that everywhere, then comparing log_summary
>> when running on 4 cores to running on more (despite everything really
>> being independent).
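
[A minimal sketch of such a size-4 subcommunicator split, for illustration
only and not the actual ex10.c patch; the helper name CreateSizeFourSubcomm
is hypothetical. Every Mat/KSP in the example would then be created on
"subcomm" instead of PETSC_COMM_WORLD:]

    #include <petscsys.h>

    /* color = rank/4 groups ranks {0..3}, {4..7}, ... into separate comms */
    static PetscErrorCode CreateSizeFourSubcomm(MPI_Comm *subcomm)
    {
      PetscMPIInt rank;

      MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
      MPI_Comm_split(PETSC_COMM_WORLD, rank / 4, rank, subcomm);
      return 0;
    }
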
>>
>> It would also be worth using an MPI profiler to see if it's really
>> spending a lot of time in MPI_Iprobe. Since SuperLU_DIST does not use
>> MPI_Iprobe, it may be something else.
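
[If no full MPI profiler is available, a generic PMPI interposition sketch
like the following, assuming nothing beyond the standard MPI profiling
interface, can at least count MPI_Iprobe calls per rank. Compile and link
it into the application, or build it as a shared library and preload it:]

    #include <mpi.h>
    #include <stdio.h>

    static long iprobe_calls = 0;

    /* intercept MPI_Iprobe and forward to the real implementation */
    int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)
    {
      iprobe_calls++;
      return PMPI_Iprobe(source, tag, comm, flag, status);
    }

    /* report the per-rank count at shutdown */
    int MPI_Finalize(void)
    {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("[rank %d] MPI_Iprobe was called %ld times\n", rank, iprobe_calls);
      return PMPI_Finalize();
    }
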
>>
>> On Fri, Dec 21, 2012 at 8:51 AM, Thomas Witkowski <
>> Thomas.Witkowski at tu-dresden.de> wrote:
>>
>>> I use a modified MPICH version. On the system I use for these
>>> benchmarks I cannot use another MPI library.
>>>
>>> I'm not tied to MUMPS. Superlu_dist, for example, also works perfectly
>>> for this. But there is still the following problem I cannot solve: when
>>> I increase the number of coarse space matrices, there seems to be no
>>> direct solver that scales for this. Just to summarize:
>>> - one coarse space matrix is always created by one "cluster" consisting
>>> of four subdomains/MPI tasks
>>> - the four tasks are always local to one node, thus inter-node network
>>> communication is not required for computing the factorization and solve
>>> - independent of the number of clusters, the coarse space matrices are
>>> the same: they have the same number of rows and the same nnz structure,
>>> but possibly different values
>>> - there is NO load imbalance
>>> - the matrices must be factorized and there are a lot of solves (> 100)
>>> with them
>>>
>>> It should be pretty clear that computing the LU factorization and
>>> solving with it should scale perfectly. But at the moment, none of the
>>> direct solvers I tried (mumps, superlu_dist, pastix) is able to scale.
>>> The loss of scalability is really bad, as you can see from the numbers
>>> I sent before.
>>>
>>> Any ideas? Suggestions? Without a scalable solver for this kind of
>>> system, my multilevel FETI-DP code is just more or less a joke, only
>>> some orders of magnitude slower than the standard FETI-DP method :)
>>>
>>> Thomas
>>>
>>> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>>>
>>>> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
>>>> implementation have you been using? Is the behavior different with a
>>>> different implementation?
>>>>
>>>>
>>>> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski
>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>
>>>>> Okay, I did a similar benchmark now with PETSc's event logging:
>>>>>
>>>>> UMFPACK
>>>>> 16p:  Local solve 350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63 0 0 0 52 63 0 0 0 51 0
>>>>> 64p:  Local solve 350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60 0 0 0 52 60 0 0 0 51 0
>>>>> 256p: Local solve 350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49 0 0 0 52 49 0 0 0 51 1
>>>>>
>>>>> MUMPS
>>>>> 16p:  Local solve 350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75 0 0 0 52 75 0 0 0 51 0
>>>>> 64p:  Local solve 350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78 0 0 0 52 78 0 0 0 51 0
>>>>> 256p: Local solve 350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82 0 0 0 52 82 0 0 0 51 0
>>>>>
>>>>>
>>>>> As you see, the local solves with UMFPACK have nearly constant time
>>>>> with an increasing number of subdomains. This is what I expect. Then
>>>>> I replace UMFPACK with MUMPS and I see increasing time for the local
>>>>> solves. In the last columns, UMFPACK has a decreasing value from 63
>>>>> to 49, while MUMPS's column increases here from 75 to 82. What does
>>>>> this mean?
>>>>>
>>>>> Thomas
>>>>>
>>>>> On 21.12.2012 02:19, Matthew Knepley wrote:
>>>>>
>>>>>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>>>>>> <Thomas.Witkowski at tu-dresden.de> wrote:
>>>>>>
>>>>>>> I cannot use the information from log_summary, as I have three
>>>>>>> different LU factorizations and solves (local matrices and two
>>>>>>> hierarchies of coarse grids). Therefore, I use the following
>>>>>>> workaround to get the timing of the solve I'm interested in:
>>>>>>>
>>>>>> You misunderstand how to use logging. You just put these things in
>>>>>> separate stages. Stages represent parts of the code over which
>>>>>> events are aggregated.
>>>>>>
>>>>>> Matt
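
[A minimal sketch of the stage-based logging Matt describes, using the
standard PetscLogStageRegister/Push/Pop calls; the function name
CoarseSolveLogged and the stage name "CoarseSolve" are illustrative only.
With one stage per solver hierarchy, -log_summary reports the KSPSolve,
MatSolve, and MatLUFactor events of each stage separately:]

    #include <petscksp.h>

    static PetscLogStage stageCoarse = -1;  /* registered on first use */

    static PetscErrorCode CoarseSolveLogged(KSP ksp, Vec b, Vec x)
    {
      PetscErrorCode ierr;

      if (stageCoarse < 0) {
        ierr = PetscLogStageRegister("CoarseSolve", &stageCoarse);CHKERRQ(ierr);
      }
      ierr = PetscLogStagePush(stageCoarse);CHKERRQ(ierr);
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* logged inside "CoarseSolve" */
      ierr = PetscLogStagePop();CHKERRQ(ierr);
      return 0;
    }
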
>>>>>>
>>>>>>> MPI::COMM_WORLD.Barrier();
>>>>>>> wtime = MPI::Wtime();
>>>>>>> KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>>>>> FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>>>>
>>>>>>> The factorization is done explicitly before with "KSPSetUp", so I
>>>>>>> can measure the time for the LU factorization. It also does not
>>>>>>> scale! For 64 cores, it takes 0.05 seconds, for 1024 cores 1.2
>>>>>>> seconds. In all calculations, the local coarse space matrices
>>>>>>> defined on four cores have exactly the same number of rows and
>>>>>>> exactly the same number of nonzero entries. So, from my point of
>>>>>>> view, the time should be absolutely constant.
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>>>>>
>>>>>>>
>>>>>>>> Are you timing ONLY the time to factor and solve the subproblems?
>>>>>>>> Or also the time to get the data to the collection of 4 cores at
>>>>>>>> a time?
>>>>>>>>
>>>>>>>> If you are only using LU for these problems and not elsewhere in
>>>>>>>> the code, you can get the factorization and solve time from
>>>>>>>> MatLUFactor() and MatSolve(), or you can use stages to put this
>>>>>>>> calculation in its own stage and use the MatLUFactor() and
>>>>>>>> MatSolve() time from that stage.
>>>>>>>> Also look at the load balancing column for the factorization and
>>>>>>>> solve stage: is it well balanced?
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>>>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>>>>>
>>>>>>>>> In my multilevel FETI-DP code, I have localized coarse matrices,
>>>>>>>>> which are defined on only a subset of all MPI tasks, typically
>>>>>>>>> between 4 and 64 tasks. The MatAIJ and the KSP objects are both
>>>>>>>>> defined on an MPI communicator, which is a subset of
>>>>>>>>> MPI::COMM_WORLD. The LU factorization of the matrices is computed
>>>>>>>>> with either MUMPS or superlu_dist, but both show a scaling
>>>>>>>>> behavior I really wonder about: when the overall problem size is
>>>>>>>>> increased, the solve with the LU factorization of the local
>>>>>>>>> matrices does not scale! But why not? I just increase the number
>>>>>>>>> of local matrices, but all of them are independent of each other.
>>>>>>>>> An example: I use 64 cores, each coarse matrix is spanned by 4
>>>>>>>>> cores, so there are 16 MPI communicators with 16 coarse space
>>>>>>>>> matrices. The coarse space systems need to be solved 192 times,
>>>>>>>>> and together this takes 0.09 seconds. Now I increase the number
>>>>>>>>> of cores to 256, but let the local coarse space again be defined
>>>>>>>>> on only 4 cores. Again, 192 solutions with these coarse spaces
>>>>>>>>> are required, but now this takes 0.24 seconds. The same for 1024
>>>>>>>>> cores, and we are at 1.7 seconds for the local coarse space
>>>>>>>>> solver!
>>>>>>>>>
>>>>>>>>> For me, this is a total mystery! Any idea how to explain, debug,
>>>>>>>>> and eventually resolve this problem?
>>>>>>>>>
>>>>>>>>> Thomas
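
[For reference, a self-contained sketch of the kind of setup described
above, with hypothetical sizes (n = 1000) and a trivial diagonal test
matrix rather than the actual FETI-DP coarse operators: each group of 4
ranks builds its own AIJ matrix and KSP on a subcommunicator, factors it
once in KSPSetUp(), and solves 192 times. The direct solver is chosen at
run time, e.g. -pc_type lu -pc_factor_mat_solver_package mumps (or
superlu_dist):]

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      MPI_Comm       subcomm;
      PetscMPIInt    rank;
      Mat            A;
      Vec            b, x;
      KSP            ksp;
      PetscInt       i, n = 1000, rstart, rend;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
      MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
      /* groups of 4 ranks, each with its own communicator */
      MPI_Comm_split(PETSC_COMM_WORLD, rank / 4, rank, &subcomm);

      /* a trivial test matrix living only on the subcommunicator */
      ierr = MatCreate(subcomm, &A);CHKERRQ(ierr);
      ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
      ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
      ierr = MatSetUp(A);CHKERRQ(ierr);
      ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
      for (i = rstart; i < rend; i++) {
        ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
      }
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

      ierr = VecCreateMPI(subcomm, PETSC_DECIDE, n, &b);CHKERRQ(ierr);
      ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
      ierr = VecSet(b, 1.0);CHKERRQ(ierr);

      ierr = KSPCreate(subcomm, &ksp);CHKERRQ(ierr);
      ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);
      ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr); /* older PETSc adds a MatStructure flag */
      ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);     /* -pc_type lu -pc_factor_mat_solver_package ... */
      ierr = KSPSetUp(ksp);CHKERRQ(ierr);              /* the factorization happens here */

      for (i = 0; i < 192; i++) {                      /* many repeated solves, as in the FETI-DP code */
        ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
      }

      ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = VecDestroy(&b);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      MPI_Comm_free(&subcomm);
      ierr = PetscFinalize();
      return ierr;
    }
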
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which
>>>>>> their experiments lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>