[petsc-users] LU factorization and solution of independent matrices does not scale, why?

Thomas Witkowski Thomas.Witkowski at tu-dresden.de
Thu Dec 20 15:01:29 CST 2012


Jack, I also considered this problem. The 4 MPI tasks of each coarse-space
matrix should all run on one node (each node contains 4 dual-core CPUs).
I'm not 100% sure, but I discussed this with the administrators of the
system: the scheduler should always place the first 8 ranks on the first
node, and so on, and the coarse-space matrices are built on ranks 0-3,
4-7, ...

At the moment I'm running some benchmarks in which I replaced UMFPACK with
MUMPS for the local LU factorization. Each matrix and the corresponding KSP
object are defined on PETSC_COMM_SELF, and the problem is perfectly
balanced (the grid is a unit square, uniformly refined). Let's see...
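
Roughly, the per-process setup looks like this (just a sketch, assuming a
PETSc of that era where KSPSetOperators() still takes the MatStructure flag
and the solver package is selected via PCFactorSetMatSolverPackage();
A_local stands for the already assembled sequential coarse-space matrix):

    Mat A_local;   /* sequential coarse-space matrix, assembled beforehand */
    KSP ksp;
    PC  pc;

    KSPCreate(PETSC_COMM_SELF, &ksp);
    KSPSetOperators(ksp, A_local, A_local, SAME_PRECONDITIONER);
    KSPSetType(ksp, KSPPREONLY);        /* direct solve, no Krylov iterations */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCLU);
    PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS);  /* previously MATSOLVERUMFPACK */
    KSPSetUp(ksp);                      /* triggers the LU factorization */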

Thomas

Quoting Jack Poulson <jack.poulson at gmail.com>:

> Hi Thomas,
>
> Network topology is important. Since most machines are not fully connected,
> random subsets of four processes will become more scattered about the
> cluster as you increase your total number of processes.
>
> Jack
> On Dec 20, 2012 12:39 PM, "Thomas Witkowski" <Thomas.Witkowski at tu-dresden.de>
> wrote:
>
>> I cannot use the information from log_summary, as I have three different
>> LU factorizations and solves (local matrices and two hierarchies of coarse
>> grids). Therefore, I use the following workaround to get the timing of the
>> solve I'm interested in:
>>
>>     MPI::COMM_WORLD.Barrier();
>>     wtime = MPI::Wtime();
>>     KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>     FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>
>> The factorization is done explicitly beforehand with KSPSetUp(), so I can
>> measure the time for the LU factorization. It also does not scale! For 64
>> cores it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all
>> calculations, the local coarse-space matrices defined on four cores have
>> exactly the same number of rows and exactly the same number of nonzero
>> entries. So, from my point of view, the time should be absolutely constant.
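>>
>> (Spelled out, the two timings are taken like this -- just a sketch;
>> factorTime and solveTime are placeholder accumulators analogous to
>> FetiTimings::fetiSolve03:)
>>
>>     double t;
>>
>>     /* factorization time: KSPSetUp() triggers the LU factorization */
>>     MPI::COMM_WORLD.Barrier();
>>     t = MPI::Wtime();
>>     KSPSetUp(*(data->ksp_schur_primal_local));
>>     factorTime += MPI::Wtime() - t;
>>
>>     /* solve time: only the forward/backward substitutions remain */
>>     MPI::COMM_WORLD.Barrier();
>>     t = MPI::Wtime();
>>     KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>     solveTime += MPI::Wtime() - t;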
>>
>> Thomas
>>
>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>
>>
>>>    Are you timing ONLY the time to factor and solve the subproblems? Or
>>> also the time to get the data to the collection of 4 cores at a time?
>>>
>>>    If you are only using LU for these problems and not elsewhere in the
>>> code, you can get the factorization and solve times from MatLUFactor() and
>>> MatSolve(), or you can use stages to put this calculation in its own stage
>>> and use the MatLUFactor() and MatSolve() times from that stage.
>>> Also look at the load-balancing column for the factorization and solve
>>> stage: is it well balanced?
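>>>
>>> (A minimal sketch of the stage approach; the stage name is arbitrary and
>>> ksp, b, x are placeholders:)
>>>
>>>     PetscLogStage coarseStage;
>>>
>>>     PetscLogStageRegister("CoarseLU", &coarseStage);
>>>
>>>     PetscLogStagePush(coarseStage);
>>>     KSPSetUp(ksp);        /* factorization is logged in this stage */
>>>     KSPSolve(ksp, b, x);  /* solves are logged as MatSolve in this stage */
>>>     PetscLogStagePop();
>>>
>>>     /* run with -log_summary and look at the MatLUFactor and MatSolve rows
>>>        and the ratio (load-balance) column for the "CoarseLU" stage */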
>>>
>>>    Barry
>>>
>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>
>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>>> are defined on only a subset of all MPI tasks, typically between 4 and 64
>>>> tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>>> communicator which is a subset of MPI::COMM_WORLD. The LU factorization
>>>> of the matrices is computed with either MUMPS or superlu_dist, but both
>>>> show a scaling behavior I really wonder about: when the overall problem
>>>> size is increased, the solves with the LU factorizations of the local
>>>> matrices do not scale! But why not? I just increase the number of local
>>>> matrices, and all of them are independent of each other. An example: I
>>>> use 64 cores and each coarse matrix is spanned by 4 cores, so there are
>>>> 16 MPI communicators with 16 coarse-space matrices. The problem requires
>>>> 192 solves with the coarse-space systems, and these take 0.09 seconds in
>>>> total. Now I increase the number of cores to 256, but let each local
>>>> coarse space again be defined on only 4 cores. Again, 192 solves with
>>>> these coarse spaces are required, but now they take 0.24 seconds. The
>>>> same for 1024 cores, and we are at 1.7 seconds for the local coarse-space
>>>> solves!
>>>>
>>>> For me, this is a total mystery! Any idea how to explain, debug, and
>>>> eventually resolve this problem?
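>>>>
>>>> (For reference, the setup is roughly as follows -- a sketch with
>>>> placeholder names (nLocal, A_coarse), assuming the solver package is
>>>> selected with PCFactorSetMatSolverPackage():)
>>>>
>>>>     MPI_Comm subcomm;
>>>>     Mat      A_coarse;
>>>>     KSP      ksp;
>>>>     PC       pc;
>>>>     int      rank;
>>>>
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     /* groups of 4 consecutive ranks: 0-3, 4-7, ... */
>>>>     MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &subcomm);
>>>>
>>>>     MatCreate(subcomm, &A_coarse);
>>>>     MatSetSizes(A_coarse, nLocal, nLocal, PETSC_DETERMINE, PETSC_DETERMINE);
>>>>     MatSetType(A_coarse, MATAIJ);
>>>>     /* preallocate, fill, and assemble A_coarse as usual */
>>>>
>>>>     KSPCreate(subcomm, &ksp);
>>>>     KSPSetOperators(ksp, A_coarse, A_coarse, SAME_PRECONDITIONER);
>>>>     KSPSetType(ksp, KSPPREONLY);
>>>>     KSPGetPC(ksp, &pc);
>>>>     PCSetType(pc, PCLU);
>>>>     PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS);  /* or MATSOLVERSUPERLU_DIST */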
>>>>
>>>> Thomas
>>>>
>>>
>>>
>>>
>>
>>
>



