[petsc-users] LU factorization and solution of independent matrices does not scale, why?
Matthew Knepley
knepley at gmail.com
Fri Dec 21 10:00:21 CST 2012
On Fri, Dec 21, 2012 at 10:51 AM, Thomas Witkowski
<Thomas.Witkowski at tu-dresden.de> wrote:
> I use a modified MPICH version. On the system I use for these benchmarks I
> cannot use another MPI library.
>
> I'm not fixed to MUMPS. Superlu_dist, for example, works also perfectly for
> this. But there is still the following problem I cannot solve: When I
> increase the number of coarse space matrices, there seems to be no scaling
> direct solver for this. Just to summaries:
> - one coarse space matrix is created always by one "cluster" consisting of
> four subdomanins/MPI tasks
> - the four tasks are always local to one node, thus inter-node network
> communication is not required for computing factorization and solve
> - independent of the number of cluster, the coarse space matrices are the
> same, have the same number of rows, nnz structure but possibly different
> values
> - there is NO load unbalancing
> - the matrices must be factorized and there are a lot of solves (> 100) with
> them
So the numbers you have below for UMFPACK are using one matrix per MPI rank
instead of one matrix per 4 ranks?
There seem to be two obvious sources of bugs:
1) Your parallel solver is not just using the comm with 4 ranks
2) These ranks are not clustered together on one node for that comm
Matt
> It should be pretty clear, that computing LU factorization and solving with
> it should scale perfectly. But at the moment, all direct solver I tried
> (mumps, superlu_dist, pastix) are not able to scale. The loos of scale is
> really worse, as you can see from the numbers I send before.
>
> Any ideas? Suggestions? Without a scaling solver method for these kind of
> systems, my multilevel FETI-DP code is just more or less a joke, only some
> orders of magnitude slower than standard FETI-DP method :)
>
> Thomas
>
> Zitat von Jed Brown <jedbrown at mcs.anl.gov>:
>
>> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
>> implementation have you been using? Is the behavior different with a
>> different implementation?
>>
>>
>> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski <
>> thomas.witkowski at tu-dresden.de> wrote:
>>
>>> Okay, I did a similar benchmark now with PETSc's event logging:
>>>
>>> UMFPACK
>>> 16p: Local solve 350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00
>>> 0.0e+00 7.0e+02 63 0 0 0 52 63 0 0 0 51 0
>>> 64p: Local solve 350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00
>>> 0.0e+00 7.0e+02 60 0 0 0 52 60 0 0 0 51 0
>>> 256p: Local solve 350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00
>>> 0.0e+00 7.0e+02 49 0 0 0 52 49 0 0 0 51 1
>>>
>>> MUMPS
>>> 16p: Local solve 350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00
>>> 0.0e+00 7.0e+02 75 0 0 0 52 75 0 0 0 51 0
>>> 64p: Local solve 350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00
>>> 0.0e+00 7.0e+02 78 0 0 0 52 78 0 0 0 51 0
>>> 256p: Local solve 350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00
>>> 0.0e+00 7.0e+02 82 0 0 0 52 82 0 0 0 51 0
>>>
>>>
>>> As you see, the local solves with UMFPACK have nearly constant time with
>>> increasing number of subdomains. This is what I expect. The I replace
>>> UMFPACK by MUMPS and I see increasing time for local solves. In the last
>>> columns, UMFPACK has a decreasing value from 63 to 49, while MUMPS's
>>> column
>>> increases here from 75 to 82. What does this mean?
>>>
>>> Thomas
>>>
>>> Am 21.12.2012 02:19, schrieb Matthew Knepley:
>>>
>>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>>>>
>>>> <Thomas.Witkowski at tu-dresden.**de <Thomas.Witkowski at tu-dresden.de>>
>>>>
>>>> wrote:
>>>>
>>>>> I cannot use the information from log_summary, as I have three
>>>>> different
>>>>> LU
>>>>> factorizations and solve (local matrices and two hierarchies of coarse
>>>>> grids). Therefore, I use the following work around to get the timing of
>>>>> the
>>>>> solve I'm intrested in:
>>>>>
>>>> You misunderstand how to use logging. You just put these thing in
>>>> separate stages. Stages represent
>>>> parts of the code over which events are aggregated.
>>>>
>>>> Matt
>>>>
>>>> MPI::COMM_WORLD.Barrier();
>>>>>
>>>>> wtime = MPI::Wtime();
>>>>> KSPSolve(*(data->ksp_schur_**primal_local), tmp_primal,
>>>>>
>>>>> tmp_primal);
>>>>> FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>>
>>>>> The factorization is done explicitly before with "KSPSetUp", so I can
>>>>> measure the time for LU factorization. It also does not scale! For 64
>>>>> cores,
>>>>> I takes 0.05 seconds, for 1024 cores 1.2 seconds. In all calculations,
>>>>> the
>>>>> local coarse space matrices defined on four cores have exactly the same
>>>>> number of rows and exactly the same number of non zero entries. So,
>>>>> from
>>>>> my
>>>>> point of view, the time should be absolutely constant.
>>>>>
>>>>> Thomas
>>>>>
>>>>> Zitat von Barry Smith <bsmith at mcs.anl.gov>:
>>>>>
>>>>>
>>>>> Are you timing ONLY the time to factor and solve the subproblems?
>>>>> Or
>>>>>>
>>>>>> also the time to get the data to the collection of 4 cores at a time?
>>>>>>
>>>>>> If you are only using LU for these problems and not elsewhere in
>>>>>> the
>>>>>> code you can get the factorization and time from MatLUFactor() and
>>>>>> MatSolve() or you can use stages to put this calculation in its own
>>>>>> stage
>>>>>> and use the MatLUFactor() and MatSolve() time from that stage.
>>>>>> Also look at the load balancing column for the factorization and
>>>>>> solve
>>>>>> stage, it is well balanced?
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>>>> <thomas.witkowski at tu-dresden.**de <thomas.witkowski at tu-dresden.de>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>> In my multilevel FETI-DP code, I have localized course matrices,
>>>>>> which
>>>>>>>
>>>>>>> are defined on only a subset of all MPI tasks, typically between 4
>>>>>>> and 64
>>>>>>> tasks. The MatAIJ and the KSP objects are both defined on a MPI
>>>>>>> communicator, which is a subset of MPI::COMM_WORLD. The LU
>>>>>>> factorization of
>>>>>>> the matrices is computed with either MUMPS or superlu_dist, but both
>>>>>>> show
>>>>>>> some scaling property I really wonder of: When the overall problem
>>>>>>> size is
>>>>>>> increased, the solve with the LU factorization of the local matrices
>>>>>>> does
>>>>>>> not scale! But why not? I just increase the number of local
>>>>>>> matrices,
>>>>>>> but
>>>>>>> all of them are independent of each other. Some example: I use 64
>>>>>>> cores,
>>>>>>> each coarse matrix is spanned by 4 cores so there are 16 MPI
>>>>>>> communicators
>>>>>>> with 16 coarse space matrices. The problem need to solve 192 times
>>>>>>> with the
>>>>>>> coarse space systems, and this takes together 0.09 seconds. Now I
>>>>>>> increase
>>>>>>> the number of cores to 256, but let the local coarse space be
>>>>>>> defined
>>>>>>> again
>>>>>>> on only 4 cores. Again, 192 solutions with these coarse spaces are
>>>>>>> required, but now this takes 0.24 seconds. The same for 1024 cores,
>>>>>>> and we
>>>>>>> are at 1.7 seconds for the local coarse space solver!
>>>>>>>
>>>>>>> For me, this is a total mystery! Any idea how to explain, debug and
>>>>>>> eventually how to resolve this problem?
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which
>>>> their experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>
>>>
>>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
More information about the petsc-users
mailing list