[petsc-users] LU factorization and solution of independent matrices does not scale, why?

Thu Dec 20 19:19:45 CST 2012

On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
<Thomas.Witkowski at tu-dresden.de> wrote:
> I cannot use the information from log_summary, as I have three different LU
> factorizations and solves (local matrices and two hierarchies of coarse
> grids). Therefore, I use the following workaround to get the timing of the
> solve I'm interested in:

You misunderstand how to use logging. Just put these things in
separate stages. Stages represent parts of the code over which
events are aggregated.

   Matt

>     MPI::COMM_WORLD.Barrier();
>     wtime = MPI::Wtime();
>     KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>     FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>
> The factorization is done explicitly beforehand with "KSPSetUp", so I can
> measure the time for the LU factorization. It also does not scale! For 64
> cores, it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all
> calculations, the local coarse space matrices defined on four cores have
> exactly the same number of rows and exactly the same number of nonzero
> entries. So, from my point of view, the time should be absolutely constant.
>
> Thomas
>
> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>
>
>>
>>   Are you timing ONLY the time to factor and solve the subproblems? Or
>> also the time to get the data to the collection of 4 cores at a time?
>>
>>    If you are only using LU for these problems and not elsewhere in the
>> code, you can get the factorization and solve times from MatLUFactor() and
>> MatSolve(), or you can use stages to put this calculation in its own stage
>> and use the MatLUFactor() and MatSolve() times from that stage.
>> Also look at the load balancing column for the factorization and solve
>> stage: is it well balanced?
>>
>>    Barry
>>
>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>> <thomas.witkowski at tu-dresden.de> wrote:
>>
>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>> are defined on only a subset of all MPI tasks, typically between 4 and 64
>>> tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>> communicator, which is a subset of MPI::COMM_WORLD. The LU factorization of
>>> the matrices is computed with either MUMPS or superlu_dist, but both show
>>> a scaling behavior that puzzles me: when the overall problem size is
>>> increased, the solve with the LU factorization of the local matrices does
>>> not scale! But why not? I just increase the number of local matrices, but
>>> all of them are independent of each other. An example: I use 64 cores,
>>> each coarse matrix is spanned by 4 cores, so there are 16 MPI communicators
>>> with 16 coarse space matrices. The coarse space systems have to be solved
>>> 192 times, and this takes 0.09 seconds in total. Now I increase
>>> the number of cores to 256, but let the local coarse space be defined again
>>> on only 4 cores. Again, 192 solves with these coarse spaces are
>>> required, but now this takes 0.24 seconds. The same for 1024 cores, and we
>>> are at 1.7 seconds for the local coarse space solver!
>>>
>>> To me, this is a total mystery! Any idea how to explain, debug, and
>>> eventually resolve this problem?
>>>
>>> Thomas
>>
>>
>>
>
>

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener