[petsc-users] -log_summary for MatMult

Jack Poulson jack.poulson at gmail.com
Fri Jun 15 09:58:45 CDT 2012


On Fri, Jun 15, 2012 at 9:51 AM, Jack Poulson <jack.poulson at gmail.com> wrote:

> On Fri, Jun 15, 2012 at 9:14 AM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Fri, Jun 15, 2012 at 9:18 PM, Alexander Grayver <
>> agrayver at gfz-potsdam.de> wrote:
>>
>>> On 15.06.2012 14:46, Matthew Knepley wrote:
>>>
>>> On Fri, Jun 15, 2012 at 8:31 PM, Alexander Grayver <
>>> agrayver at gfz-potsdam.de> wrote:
>>>
>>>>  Matt,
>>>>
>>>> According to that code:
>>>>
>>>> PetscErrorCode MatMult_MPIDense(Mat mat,Vec xx,Vec yy)
>>>> {
>>>>   Mat_MPIDense   *mdn = (Mat_MPIDense*)mat->data;
>>>>
>>>>   VecScatterBegin(mdn->Mvctx,xx,mdn->lvec,INSERT_VALUES,SCATTER_FORWARD);
>>>>   VecScatterEnd(mdn->Mvctx,xx,mdn->lvec,INSERT_VALUES,SCATTER_FORWARD);
>>>>   MatMult_SeqDense(mdn->A,mdn->lvec,yy);
>>>>   return(0);
>>>> }
>>>>
>>>>
>>>> Each process has its own local copy of the vector?
>>>>
>>>
>>>  I am not sure what your point is. VecScatter is just an interface that
>>> has many implementations.
>>>
>>>
>>> I'm trying to estimate the amount of data that needs to be communicated
>>> across all processes during this operation.
>>> In the debugger I see that the VecScatter from the code above reduces to an
>>> MPI_Allgatherv and results in (assuming the vector is distributed uniformly)
>>>
>>> bytes_send_received = num_of_proc * ((num_of_proc - 1) * vec_size_local)
>>> * 2 * sizeof(PetscScalar)
>>>
>>> Does that look reasonable?
>>>
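A minimal sketch of that back-of-the-envelope estimate, assuming a uniformly
distributed vector; the function and variable names (allgather_bytes,
total_procs, local_size) and the example sizes are illustrative only, not
PETSc API:

    #include <stdio.h>

    /* Rough data-volume estimate for an MPI_Allgatherv-based MatMult_MPIDense:
     * each of the p processes receives the (p - 1) remote pieces of the vector,
     * and the factor 2 counts bytes sent plus bytes received. */
    static double allgather_bytes(int total_procs, int local_size, int scalar_bytes)
    {
      return (double)total_procs * (total_procs - 1) * local_size
             * 2.0 * scalar_bytes;
    }

    int main(void)
    {
      /* Example: a vector of global size 1,000,000 spread over 64 processes,
       * with 8-byte double-precision scalars. */
      int p    = 64;
      int nloc = 1000000 / p;
      printf("estimated traffic: %.3g bytes\n", allgather_bytes(p, nloc, 8));
      return 0;
    }
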
>>
>> This is not really a useful exercise, since
>>
>>   a) PETSc does not currently have an optimized parallel dense
>> implementation
>>
>>   b) We are implementing an Elemental interface this summer. You can try
>> it out in petsc-dev
>>
>>   c) Elemental is much more efficient than our simple implementation, and
>> uses a unique
>>       approach to communication (all reductions)
>>
>> I would take the comp+comm estimates from Jack's slides on Elemental
>>
>>    Matt
>>
>>
>
> Jed and I just discussed this an hour or so ago; the MatMult (Gemv in BLAS
> parlance) implementation via Elemental will, in the simplest case of a
> square n x n matrix distributed over a square sqrt(p) x sqrt(p) process
> grid, require:
>
> 1) A gather (via a padded MPI_Gather, not MPI_Gatherv) within teams of
> sqrt(p) processes, with a per-process communication volume of approximately
> n/sqrt(p), and a latency cost of log2(sqrt(p)) ~= 1/2 log2(p)
> 2) A Gemv with the local part of A against the just-gathered part of x
> (z[MC,*] := A[MC,MR] x[MR,*] in Elemental notation)
> 3) A sum-scatter within teams of sqrt(p) processes (via
> MPI_Reduce_scatter, or MPI_Reduce_scatter_block if it is available),
> requiring roughly n/sqrt(p) per-process communication volume and roughly
> log2(sqrt(p)) latency again.
>
> Thus, the total per-process communication volume will be about 2 n /
> sqrt(p), the total latency will be about log2(p), and the local computation
> is roughly the optimal value, n^2 / p.
>
> Jack
>

I just noticed that two minor corrections are needed to what I said:
1) The version that will be used by PETSc will actually use a padded
MPI_Allgather, not a padded MPI_Gather, but the cost is essentially the
same.
2) The sequential cost of a square matrix-vector multiplication is 2 n^2,
so the local computation will be roughly 2 n^2 / p flops.

Jack
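
For comparison, a minimal sketch of the cost model described above, with the
corrections folded in (square n x n matrix on a sqrt(p) x sqrt(p) process
grid); the struct and function names (gemv_cost, elemental_gemv_cost) are
illustrative only and not part of Elemental or PETSc:

    #include <math.h>
    #include <stdio.h>

    /* Back-of-the-envelope model of the Elemental-style Gemv sketched above:
     * a padded MPI_Allgather within teams of sqrt(p) processes, a local Gemv,
     * and a sum-scatter within teams of sqrt(p) processes. */
    struct gemv_cost {
      double words_per_proc; /* communication volume, in scalars        */
      double latency_steps;  /* number of message start-ups (log terms) */
      double flops_per_proc; /* local floating-point work               */
    };

    static struct gemv_cost elemental_gemv_cost(double n, double p)
    {
      struct gemv_cost c;
      double r = sqrt(p);                 /* r x r process grid            */
      c.words_per_proc = 2.0 * n / r;     /* allgather + sum-scatter       */
      c.latency_steps  = 2.0 * log2(r);   /* ~= log2(p) in total           */
      c.flops_per_proc = 2.0 * n * n / p; /* local share of the 2 n^2 Gemv */
      return c;
    }

    int main(void) /* compile with -lm */
    {
      struct gemv_cost c = elemental_gemv_cost(1.0e5, 1024.0);
      printf("words/proc %.3g, latency steps %.1f, flops/proc %.3g\n",
             c.words_per_proc, c.latency_steps, c.flops_per_proc);
      return 0;
    }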