[petsc-dev] Extract sub-matrices efficiently

Thu May 19 13:28:56 CDT 2016

> On May 19, 2016, at 7:21 AM, <victor.magri at dicea.unipd.it> <victor.magri at dicea.unipd.it> wrote:
> 
> Il 18-05-2016 21:23 Barry Smith ha scritto:
> 
>> 
>>   Are the many small matrices you are extracting consisting of parts coming from different processes? Or does each sub matrix come from the same process (and remain on the same process) or are most coming from the same process but some shared across multiple processes? 
>> 
>>    The parallel code was originally written assuming one wishes to get a small number of large sub matrices. Thus it is a bit communication heavy, for example it doesn't share any communications or all reduce across the matrices. It is, of course, possible to write custom code for your case but in order to help you to do that we need to know more about your particular use case.

    I suspect that by far the biggest cost currently is that since the sub matrices may span multiple processes each "getting" of a sub matrix requires parallel communication and synchronization. Saving time on reusing a buffer is very very minor compared to this cost.

    To make your implementation really efficient I think you need to work directly with the underlying MPIAIJ data structure so you can access the matrix entries in their native format and you have to think about how to manage efficiently that most of the time you don't need communication to get the sub matrices but sometimes you do. Calling the PETSc MatGetSubmatrices will never be efficient for your algorithm

  Barry


>> 
>>    Barry
>> 
>> 
>>> On May 18, 2016, at 11:41 AM, Jed Brown <jed at jedbrown.org> wrote:
>>> 
>>> Could you give some more information about your preconditioner?
>>> Sequential extraction of many small submatrices is not high performance,
>>> but perhaps we can restructure.
>>> 
>>> victor.magri at dicea.unipd.it writes:
>>> 
>>>> 
>>>> 
>>>> Dear PETSc developers, 
>>>> I have a matrix of the type MATMPIAIJ from which I want to extract small
>>>> sub-matrices and sub-vectors (number of rows from 1 to 50 maximum)
>>>> several times. This is a very fundamental step of a preconditioner that
>>>> I'm writing. After that, I want to solve these sub-systems locally. 
>>>> 
>>>> In my current implementation I use MatGetSubMatrices with
>>>> MAT_INITIAL_MATRIX to extract the sub-matrices and MatGetRow plus some
>>>> other artifacts to extract the sub-vectors. However, I see that this
>>>> procedure is too costly, once it creates and destroys the output
>>>> variables each time they are called. I thought about using the MAT_REUSE
>>>> flag, but I can't since the nonzero pattern of the local sub-matrix may
>>>> change from call to call. 
>>>> 
>>>> Can you give me some hint on how to do this task efficiently? I
>>>> appreciate your help! 
>>>> 
>>>> -- 
>>>> 
>>>> Victor A. P. Magri - PhD student
>>>> Dept. of Civil, Environmental and Architectural Eng.
>>>> University of Padova
>>>> Via Marzolo, 9 - 35131 Padova, Italy
> Thank you for the prompt reply, Barry and Jed.
> 
> If a communication reducing algorithm is firstly used such as the ones provided by METIS or Scotch, I would say that these sub-matrices are mostly coming from the same process and just some of them are shared across multiple processes. The preconditioner that I'm implementing is the Factorized Sparse Approximate Inverse with adaptive nonzero pattern generation. You can find it in the journal paper http://dl.acm.org/citation.cfm?doid=2732672.2629475. It relies in an explicit approximation (called matrix F) for the inverse of the lower triangular Cholesky factor.
> 
> Given a maximum number of nonzeros per row of the F matrix, we want to find the best column positions and their respective values which gives the largest reduction in the Kaporin condition number of the preconditioned matrix. This search can be done concurrently from row to row. Considering an arbitrary row of F, we have to gather a sub-matrix and a sub-vector with sizes equal to the current number of nonzero elements already found for F and solve this subsystem, this process is repeated up to when the desired number of nonzeros is reached. Since the maximum size of these subsystems is known, I would like to allocate two local buffer variables (one for the sub-matrix and the other for the sub-vector) just one time and use it along the whole process, instead of creating and destroying new variables each time the sub-systems are gathered. This should reduce a lot the time for building the preconditioner since it is a task which is done a number of times. 
> 
> Thank you for your help!
> 
> -- 
> Victor A. P. Magri - PhD student
> Dept. of Civil, Environmental and Architectural Eng.
> University of Padova
> Via Marzolo, 9 - 35131 Padova, Italy