[petsc-users] LU factorization and solution of independent matrices does not scale, why?
Thomas Witkowski
thomas.witkowski at tu-dresden.de
Fri Dec 21 03:36:02 CST 2012
Okay, I have now done a similar benchmark with PETSc's event logging:
(-log_summary columns: count max/ratio, time max/ratio, flops max/ratio,
messages, avg. message length, reductions, global %T %F %M %L %R, stage
%T %F %M %L %R, MFlop/s)

UMFPACK
 16p: Local solve 350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02  63 0 0 0 52  63 0 0 0 51   0
 64p: Local solve 350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02  60 0 0 0 52  60 0 0 0 51   0
256p: Local solve 350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02  49 0 0 0 52  49 0 0 0 51   1

MUMPS
 16p: Local solve 350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02  75 0 0 0 52  75 0 0 0 51   0
 64p: Local solve 350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02  78 0 0 0 52  78 0 0 0 51   0
256p: Local solve 350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02  82 0 0 0 52  82 0 0 0 51   0
As you can see, the local solves with UMFPACK take nearly constant time as
the number of subdomains increases. This is what I expect. Then I replace
UMFPACK with MUMPS and see increasing times for the local solves. In the
%T (percent of total time) columns, UMFPACK's value decreases from 63 to 49,
while MUMPS's increases from 75 to 82. What does this mean?
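(For reference, the "Local solve" rows above come from a user-defined logging
event around the local solve; a minimal sketch of this kind of
instrumentation, with made-up variable names, looks like this:)

    PetscLogEvent LOCAL_SOLVE;

    /* register once after PetscInitialize(); the event then shows up
       as its own row in the -log_summary output */
    PetscLogEventRegister("Local solve", KSP_CLASSID, &LOCAL_SOLVE);

    /* wrap the local solve so its time is attributed to the event */
    PetscLogEventBegin(LOCAL_SOLVE, 0, 0, 0, 0);
    KSPSolve(ksp_local, rhs_local, sol_local);
    PetscLogEventEnd(LOCAL_SOLVE, 0, 0, 0, 0);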
Thomas
On 21.12.2012 02:19, Matthew Knepley wrote:
> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
> <Thomas.Witkowski at tu-dresden.de> wrote:
>> I cannot use the information from -log_summary as is, because I have three
>> different LU factorizations and solves (local matrices and two hierarchies
>> of coarse grids). Therefore, I use the following workaround to get the
>> timing of the solve I'm interested in:
> You misunderstand how to use logging. You just put these things in
> separate stages. Stages represent parts of the code over which events
> are aggregated.
>
> Matt
>
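(A minimal sketch of the stage approach Matt describes — the stage names are
made up, and one stage is pushed around each of the three solves so that
-log_summary aggregates them separately:)

    PetscLogStage stageLocal, stageCoarse1, stageCoarse2;

    PetscLogStageRegister("Local LU", &stageLocal);
    PetscLogStageRegister("Coarse LU 1", &stageCoarse1);
    PetscLogStageRegister("Coarse LU 2", &stageCoarse2);

    /* everything logged between push and pop lands in this stage */
    PetscLogStagePush(stageLocal);
    KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
    PetscLogStagePop();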
>> MPI::COMM_WORLD.Barrier();
>> wtime = MPI::Wtime();
>> KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>> FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>
>> The factorization is done explicitly beforehand with KSPSetUp(), so I can
>> measure the time for the LU factorization separately. It also does not
>> scale! For 64 cores, it takes 0.05 seconds; for 1024 cores, 1.2 seconds.
>> In all calculations, the local coarse space matrices defined on four cores
>> have exactly the same number of rows and exactly the same number of
>> nonzero entries. So, from my point of view, the time should be absolutely
>> constant.
>>
>> Thomas
>>
>> Zitat von Barry Smith <bsmith at mcs.anl.gov>:
>>
>>
>>> Are you timing ONLY the time to factor and solve the subproblems? Or
>>> also the time to get the data to the collection of 4 cores at a time?
>>>
>>> If you are only using LU for these problems and not elsewhere in the
>>> code, you can get the factorization and solve times from MatLUFactor()
>>> and MatSolve(), or you can use stages to put this calculation in its own
>>> stage and use the MatLUFactor() and MatSolve() times from that stage.
>>> Also look at the load-balancing column for the factorization and solve
>>> stage: is it well balanced?
>>>
>>> Barry
>>>
>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>
>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>>> are defined on only a subset of all MPI tasks, typically between 4 and
>>>> 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>>> communicator which is a subset of MPI::COMM_WORLD. The LU factorization
>>>> of the matrices is computed with either MUMPS or superlu_dist, but both
>>>> show scaling behavior that really puzzles me: when the overall problem
>>>> size is increased, the solves with the LU factorizations of the local
>>>> matrices do not scale! But why not? I just increase the number of local
>>>> matrices, and all of them are independent of each other. An example: I
>>>> use 64 cores, and each coarse matrix is spanned by 4 cores, so there
>>>> are 16 MPI communicators with 16 coarse space matrices. The problem
>>>> requires 192 solves with the coarse space systems, which together take
>>>> 0.09 seconds. Now I increase the number of cores to 256, but let each
>>>> local coarse space again be defined on only 4 cores. Again, 192 solves
>>>> with these coarse spaces are required, but now they take 0.24 seconds.
>>>> The same for 1024 cores, and we are at 1.7 seconds for the local coarse
>>>> space solves! (A rough sketch of this setup follows at the end of this
>>>> quoted message.)
>>>>
>>>> For me, this is a total mystery! Any idea how to explain, debug, and
>>>> eventually resolve this problem?
>>>>
>>>> Thomas
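(As a rough, self-contained sketch of the setup described in the quoted
message above — the rank grouping, the variable names, and selecting MUMPS
via the PETSc 3.3-era PCFactorSetMatSolverPackage() call are all assumptions
for illustration:)

    MPI_Comm subcomm;
    Mat      Acoarse;   /* coarse matrix, assembled elsewhere on subcomm */
    KSP      ksp;
    PC       pc;
    int      rank;

    /* hypothetical split: groups of 4 consecutive ranks share one coarse matrix */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &subcomm);

    KSPCreate(subcomm, &ksp);
    KSPSetOperators(ksp, Acoarse, Acoarse, SAME_NONZERO_PATTERN);
    KSPSetType(ksp, KSPPREONLY);            /* just apply the LU solve */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCLU);
    PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS); /* or MATSOLVERSUPERLU_DIST */
    KSPSetUp(ksp);                          /* triggers the LU factorization */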
>>>
>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener