[petsc-users] LU factorization and solution of independent matrices does not scale, why?
Thomas Witkowski
Thomas.Witkowski at tu-dresden.de
Thu Dec 20 15:01:29 CST 2012
Jack, I also considered this problem. The 4 MPI tasks of each coarse
space matrix should all run on one node (each node contains 4 dual
core CPUs). I'm not 100% sure, but I discussed this with the system
administrators. The system should always schedule the first 8 ranks
to the first node, and so on, and the coarse space matrices are built
on ranks 0-3, 4-7, ...
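
For reference, the subcommunicators are created by grouping consecutive
world ranks, roughly as in the following sketch (variable names are just
placeholders):

int worldRank;
MPI_Comm coarseComm;
MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
/* ranks 0-3, 4-7, ... each form one group sharing a coarse space matrix */
MPI_Comm_split(MPI_COMM_WORLD, worldRank / 4, worldRank, &coarseComm);
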
I'm running some benchmarks at the moment in which I replaced UMFPACK
with MUMPS for the local LU factorization. Each matrix and the
corresponding KSP object are defined on PETSC_COMM_SELF, and the
problem is perfectly balanced (the grid is a unit square, uniformly
refined). Let's see...
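
The setup of the sequential solver is roughly the following (just a
sketch; A stands for the already assembled local matrix):

KSP ksp;
PC  pc;
KSPCreate(PETSC_COMM_SELF, &ksp);
/* older PETSc versions take a fourth MatStructure argument here */
KSPSetOperators(ksp, A, A);
KSPSetType(ksp, KSPPREONLY);   /* direct solve only, no Krylov iterations */
KSPGetPC(ksp, &pc);
PCSetType(pc, PCLU);
/* called PCFactorSetMatSolverPackage in older PETSc versions */
PCFactorSetMatSolverType(pc, MATSOLVERMUMPS);
KSPSetUp(ksp);                 /* triggers the LU factorization */
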
Thomas
Quoting Jack Poulson <jack.poulson at gmail.com>:
> Hi Thomas,
>
> Network topology is important. Since most machines are not fully connected,
> random subsets of four processes will become more scattered about the
> cluster as you increase your total number of processes.
>
> Jack
> On Dec 20, 2012 12:39 PM, "Thomas Witkowski" <Thomas.Witkowski at tu-dresden.de>
> wrote:
>
>> I cannot use the information from log_summary, as I have three different
>> LU factorizations and solves (local matrices and two hierarchies of coarse
>> grids). Therefore, I use the following workaround to get the timing of the
>> solve I'm interested in:
>>
>> MPI::COMM_WORLD.Barrier();
>> wtime = MPI::Wtime();
>> KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>> FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>
>> The factorization is done explicitly beforehand with "KSPSetUp", so I can
>> measure the time for the LU factorization. It also does not scale! For 64
>> cores, it takes 0.05 seconds; for 1024 cores, 1.2 seconds. In all
>> calculations, the local coarse space matrices defined on four cores have
>> exactly the same number of rows and exactly the same number of nonzero
>> entries. So, from my point of view, the time should be absolutely constant.
>>
>> Thomas
>>
>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>
>>
>>> Are you timing ONLY the time to factor and solve the subproblems? Or
>>> also the time to get the data to the collection of 4 cores at a time?
>>>
>>> If you are only using LU for these problems and not elsewhere in the
>>> code, you can get the factorization and solve time from MatLUFactor() and
>>> MatSolve(), or you can use stages to put this calculation in its own stage
>>> and use the MatLUFactor() and MatSolve() time from that stage.
>>> Also look at the load balancing column for the factorization and solve
>>> stage: is it well balanced?
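>>>
>>> In code, a separate stage looks roughly like this (the stage name is
>>> just an example):
>>>
>>> PetscLogStage coarseStage;
>>> PetscLogStageRegister("CoarseLU", &coarseStage);
>>> PetscLogStagePush(coarseStage);
>>> /* ... KSPSetUp()/KSPSolve() calls for the coarse problems ... */
>>> PetscLogStagePop();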
>>>
>>> Barry
>>>
>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>
>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>>> are defined on only a subset of all MPI tasks, typically between 4 and
>>>> 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>>> communicator which is a subset of MPI::COMM_WORLD. The LU factorization
>>>> of the matrices is computed with either MUMPS or superlu_dist, but both
>>>> show a scaling behavior I really wonder about: when the overall problem
>>>> size is increased, the solve with the LU factorization of the local
>>>> matrices does not scale! But why not? I just increase the number of
>>>> local matrices, but all of them are independent of each other. An
>>>> example: I use 64 cores, and each coarse matrix is spanned by 4 cores,
>>>> so there are 16 MPI communicators with 16 coarse space matrices. The
>>>> coarse space systems need to be solved 192 times, and together this
>>>> takes 0.09 seconds. Now I increase the number of cores to 256, but let
>>>> the local coarse space again be defined on only 4 cores. Again, 192
>>>> solutions with these coarse spaces are required, but now this takes
>>>> 0.24 seconds. The same for 1024 cores, and we are at 1.7 seconds for
>>>> the local coarse space solver!
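>>>>
>>>> The coarse matrices and their KSP objects are created on the subset
>>>> communicator, roughly like this (a sketch with placeholder sizes and
>>>> preallocation):
>>>>
>>>> Mat A_coarse;
>>>> KSP ksp_coarse;
>>>> /* coarseComm contains the 4 ranks of one coarse space */
>>>> MatCreateAIJ(coarseComm, PETSC_DECIDE, PETSC_DECIDE, nGlobal, nGlobal,
>>>>              30, NULL, 30, NULL, &A_coarse);
>>>> KSPCreate(coarseComm, &ksp_coarse);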
>>>>
>>>> For me, this is a total mystery! Any idea how to explain, debug, and
>>>> eventually resolve this problem?
>>>>
>>>> Thomas
>>>>