[petsc-users] LU factorization and solution of independent matrices does not scale, why?

Thu Dec 20 15:07:18 CST 2012

Hi Thomas,

Assuming this is not the issue (it is probably worth explicitly measuring),
it is also important to ensure that the sparsity pattern is preserved, not
just the number of nonzeros per row. A sparse matrix with random nonzero
locations is much more expensive to factor than one with entries near the
diagonal.
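
To make the point about nonzero locations concrete, here is a minimal sketch (not taken from PETSc, MUMPS, or superlu_dist) that counts the fill-in created by symbolic Gaussian elimination in natural order on a structurally symmetric pattern. The names `countFill`, `tridiagonal`, and `arrow` are made up for illustration, and real sparse direct solvers reorder the matrix before factoring, so this only illustrates the principle.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;

// Counts the fill-in entries (lower triangle only) created by symbolic
// Gaussian elimination in natural order on a structurally symmetric
// n x n pattern. `offDiag` lists off-diagonal nonzeros; (i, j) implies (j, i).
std::size_t countFill(std::size_t n, const std::vector<Edge>& offDiag) {
    std::vector<std::set<std::size_t>> adj(n);
    for (const Edge& e : offDiag) {
        assert(e.first < n && e.second < n && e.first != e.second);
        adj[e.first].insert(e.second);
        adj[e.second].insert(e.first);
    }
    std::size_t fill = 0;
    for (std::size_t k = 0; k < n; ++k) {
        // Rows not yet eliminated that are coupled to row k.
        const std::vector<std::size_t> later(adj[k].upper_bound(k), adj[k].end());
        // Eliminating row k couples all of them pairwise; new couplings are fill.
        for (std::size_t a = 0; a < later.size(); ++a) {
            for (std::size_t b = a + 1; b < later.size(); ++b) {
                if (adj[later[a]].insert(later[b]).second) {
                    adj[later[b]].insert(later[a]);
                    ++fill;
                }
            }
        }
    }
    return fill;
}

// Pattern with entries next to the diagonal: (i, i+1) for all i.
std::vector<Edge> tridiagonal(std::size_t n) {
    std::vector<Edge> edges;
    for (std::size_t i = 0; i + 1 < n; ++i) edges.push_back({i, i + 1});
    return edges;
}

// Pattern with one dense row/column `hub`: (hub, j) for all j != hub.
std::vector<Edge> arrow(std::size_t n, std::size_t hub) {
    std::vector<Edge> edges;
    for (std::size_t j = 0; j < n; ++j)
        if (j != hub) edges.push_back({hub, j});
    return edges;
}
```

For n = 6, all three patterns have the same five off-diagonal pairs, yet `countFill` gives 0 for the tridiagonal pattern and for the arrow whose dense row comes last, and 10 for the arrow whose dense row comes first.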

Jack
On Dec 20, 2012 1:01 PM, "Thomas Witkowski" <Thomas.Witkowski at tu-dresden.de>
wrote:

> Jack, I also considered this problem. The 4 MPI tasks of each coarse space
> matrix should all run on one node (each node contains 4 dual-core CPUs).
> I'm not 100% sure, but I discussed this with the administrators of the
> system. The system should always schedule the first 8 ranks to the first
> node, and so on. And the coarse space matrices are built on ranks 0-3, 4-7,
> ...
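
The placement assumption described above can be written down as a small check. This is a sketch under the stated assumptions of block placement (8 consecutive ranks per node) and consecutive groups of 4 ranks; the function names are hypothetical. It only checks the arithmetic. Whether the scheduler really places ranks this way would still have to be confirmed on the machine, for example by comparing the output of MPI_Get_processor_name on the ranks of one group.

```cpp
#include <cassert>

// Node index of a rank, ASSUMING block placement: ranks 0..ranksPerNode-1
// live on node 0, the next ranksPerNode ranks on node 1, and so on.
int nodeOfRank(int rank, int ranksPerNode) {
    assert(rank >= 0 && ranksPerNode > 0);
    return rank / ranksPerNode;
}

// True if the group of `groupSize` consecutive ranks that contains `rank`
// lies entirely on one node under the block placement assumed above.
bool groupOnOneNode(int rank, int groupSize, int ranksPerNode) {
    assert(groupSize > 0);
    const int first = (rank / groupSize) * groupSize;
    const int last = first + groupSize - 1;
    return nodeOfRank(first, ranksPerNode) == nodeOfRank(last, ranksPerNode);
}

// True if every group in a job of `worldSize` ranks stays on one node.
bool allGroupsOnOneNode(int worldSize, int groupSize, int ranksPerNode) {
    for (int first = 0; first < worldSize; first += groupSize) {
        if (!groupOnOneNode(first, groupSize, ranksPerNode)) return false;
    }
    return true;
}
```

With groups of 4 and 8 ranks per node, `allGroupsOnOneNode(1024, 4, 8)` is true. With a hypothetical 6 ranks per node it is false, because the group of ranks 4-7 would straddle two nodes.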
>
> At the moment I'm running some benchmarks in which I switched the local LU
> factorization from UMFPACK to MUMPS. Each matrix and the
> corresponding ksp object are defined on PETSC_COMM_SELF, and the problem is
> perfectly balanced (the grid is a uniformly refined unit square). Let's
> see...
>
> Thomas
>
> Quoting Jack Poulson <jack.poulson at gmail.com>:
>
>  Hi Thomas,
>>
>> Network topology is important. Since most machines are not fully connected,
>> random subsets of four processes will become more scattered about the
>> cluster as you increase your total number of processes.
>>
>> Jack
>> On Dec 20, 2012 12:39 PM, "Thomas Witkowski" <Thomas.Witkowski at tu-dresden.de>
>> wrote:
>>
>>> I cannot use the information from log_summary, as I have three different
>>> LU factorizations and solves (local matrices and two hierarchies of coarse
>>> grids). Therefore, I use the following workaround to get the timing of the
>>> solve I'm interested in:
>>>
>>>     MPI::COMM_WORLD.Barrier();
>>>     wtime = MPI::Wtime();
>>>     KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>     FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>
>>> The factorization is done explicitly beforehand with "KSPSetUp", so I can
>>> measure the time for the LU factorization. It also does not scale! For 64
>>> cores, it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all
>>> calculations, the local coarse space matrices defined on four cores have
>>> exactly the same number of rows and exactly the same number of nonzero
>>> entries. So, from my point of view, the time should be absolutely
>>> constant.
>>>
>>> Thomas
>>>
>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>
>>>
>>>>    Are you timing ONLY the time to factor and solve the subproblems? Or
>>>> also the time to get the data to the collection of 4 cores at a time?
>>>>
>>>>    If you are only using LU for these problems and not elsewhere in the
>>>> code, you can get the factorization and solve time from MatLUFactor() and
>>>> MatSolve(), or you can use stages to put this calculation in its own stage
>>>> and use the MatLUFactor() and MatSolve() time from that stage.
>>>> Also look at the load balancing column for the factorization and solve
>>>> stage: is it well balanced?
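
As a rough illustration of what a load-balance figure expresses, the sketch below computes the ratio of the slowest to the fastest per-rank time. It assumes the per-rank timings have already been gathered into a vector; the type `Balance` and the function `loadBalance` are hypothetical names, not part of PETSc. A ratio near 1 means the work is evenly spread, while a large ratio means some ranks wait for others.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Summary of per-rank timings for one phase (e.g. factorization or solve).
struct Balance {
    double maxTime;  // slowest rank, in seconds
    double minTime;  // fastest rank, in seconds
    double ratio;    // maxTime / minTime; 1.0 means perfectly balanced
};

// Computes max, min, and max/min ratio from timings gathered from all ranks.
// Assumes at least one timing and strictly positive values.
Balance loadBalance(const std::vector<double>& perRankSeconds) {
    assert(!perRankSeconds.empty());
    const auto mm = std::minmax_element(perRankSeconds.begin(),
                                        perRankSeconds.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    assert(lo > 0.0);
    return Balance{hi, lo, hi / lo};
}
```

For made-up timings of 1.0, 2.0, and 4.0 seconds, `loadBalance` reports a ratio of 4.0; for identical timings it reports 1.0.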
>>>>
>>>>    Barry
>>>>
>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski <thomas.witkowski at tu-dresden.de>
>>>> wrote:
>>>>
>>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>>>> are defined on only a subset of all MPI tasks, typically between 4 and 64
>>>>> tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>>>> communicator, which is a subset of MPI::COMM_WORLD. The LU factorization
>>>>> of the matrices is computed with either MUMPS or superlu_dist, but both
>>>>> show a scaling behavior I really wonder about: when the overall problem
>>>>> size is increased, the solve with the LU factorization of the local
>>>>> matrices does not scale! But why not? I just increase the number of local
>>>>> matrices, but all of them are independent of each other. An example: I
>>>>> use 64 cores, and each coarse matrix is spanned by 4 cores, so there are
>>>>> 16 MPI communicators with 16 coarse space matrices. The problem needs to
>>>>> solve 192 times with the coarse space systems, and this takes 0.09 seconds
>>>>> in total. Now I increase the number of cores to 256, but let the local
>>>>> coarse space be defined again on only 4 cores. Again, 192 solves with
>>>>> these coarse spaces are required, but now this takes 0.24 seconds. The
>>>>> same for 1024 cores, and we are at 1.7 seconds for the local coarse space
>>>>> solver!
>>>>>
>>>>> For me, this is a total mystery! Any idea how to explain, debug, and
>>>>> eventually resolve this problem?
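
The numbers in this message can be turned into a weak-scaling efficiency, taking the 64-core run as the reference. Since the work per group of 4 cores is stated to be constant, the ideal efficiency would be 1. The helper names below are hypothetical, and the sketch only restates the arithmetic of the reported timings.

```cpp
#include <cassert>

// Number of independent coarse groups when each group spans
// `coresPerGroup` cores. Assumes the core count divides evenly.
int numCoarseGroups(int cores, int coresPerGroup) {
    assert(coresPerGroup > 0 && cores % coresPerGroup == 0);
    return cores / coresPerGroup;
}

// Weak-scaling efficiency: reference time divided by measured time.
// 1.0 means the time stayed constant as the job grew; smaller is worse.
double weakScalingEfficiency(double referenceSeconds, double measuredSeconds) {
    assert(referenceSeconds > 0.0 && measuredSeconds > 0.0);
    return referenceSeconds / measuredSeconds;
}
```

With the reported timings, 64 cores (16 groups, 0.09 s) is the reference, 256 cores (64 groups, 0.24 s) gives an efficiency of about 0.375, and 1024 cores (256 groups, 1.7 s) gives about 0.053.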
>>>>>
>>>>> Thomas