[petsc-users] -log_summary for MatMult

Jack Poulson jack.poulson at gmail.com
Fri Jun 15 09:51:46 CDT 2012


On Fri, Jun 15, 2012 at 9:14 AM, Matthew Knepley <knepley at gmail.com> wrote:

> On Fri, Jun 15, 2012 at 9:18 PM, Alexander Grayver <
> agrayver at gfz-potsdam.de> wrote:
>
>>
>> On 15.06.2012 14:46, Matthew Knepley wrote:
>>
>> On Fri, Jun 15, 2012 at 8:31 PM, Alexander Grayver <
>> agrayver at gfz-potsdam.de> wrote:
>>
>>>  Matt,
>>>
>>> According to that code:
>>>
>>> PetscErrorCode MatMult_MPIDense(Mat mat,Vec xx,Vec yy)
>>> {
>>>   Mat_MPIDense *mdn = (Mat_MPIDense*)mat->data;
>>>   VecScatterBegin(mdn->Mvctx,xx,mdn->lvec,INSERT_VALUES,SCATTER_FORWARD);
>>>   VecScatterEnd(mdn->Mvctx,xx,mdn->lvec,INSERT_VALUES,SCATTER_FORWARD);
>>>   MatMult_SeqDense(mdn->A,mdn->lvec,yy);
>>>   return(0);
>>> }
>>>
>>>
>>> Each process has its own local copy of the vector?
>>>
>>
>>  I am not sure what your point is. VecScatter is just an interface that
>> has many implementations.
>>
>>
>> I'm trying to estimate the amount of data that needs to be communicated
>> across all processes during this operation.
>> In the debugger I see that the VecScatter from the code above reduces to an
>> MPI_Allgatherv and results in (assuming the vector is distributed uniformly)
>>
>> bytes_send_received = num_of_proc * ((num_of_proc - 1) * vec_size_local)
>> * 2 * sizeof(PetscScalar)
>>
>> Does that look reasonable?
>>
>
> This is not really a useful exercise, since
>
>   a) PETSc does not currently have an optimized parallel dense
> implementation
>
>   b) We are implementing an Elemental interface this summer. You can try
> it out in petsc-dev
>
>   c) Elemental is much more efficient than our simple implementation, and
> uses a unique
>       approach to communication (all reductions)
>
> I would take the comp+comm estimates from Jack's slides on Elemental
>
>    Matt
>
>

Jed and I just discussed this an hour or so ago; the MatMult (Gemv in BLAS
parlance) implementation via Elemental will, in the simplest case of a
square n x n matrix distributed over a square sqrt(p) x sqrt(p) process
grid, require (a rough sketch of this pattern follows the list):

1) A gather (via a padded MPI_Gather, not MPI_Gatherv) within teams of
sqrt(p) processes, with a per-process communication volume of approximately
n/sqrt(p), and a latency cost of log2(sqrt(p)) ~= 1/2 log2(p)
2) A Gemv of the local part of A against the just-gathered part of x
(z[MC,*] := A[MC,MR] x[MR,*] in Elemental notation)
3) A sum-scatter within teams of sqrt(p) processes (via MPI_Reduce_scatter,
or MPI_Reduce_scatter_block if it is available), requiring roughly
n/sqrt(p) per-process communication volume and roughly log2(sqrt(p))
latency again.
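
Here is a minimal sketch of that three-step pattern, using a plain 2D block
distribution rather than Elemental's element-cyclic [MC,MR] distribution, and
an MPI_Allgather in place of the padded MPI_Gather mentioned in step 1. The
function name and layout choices are illustrative only (not Elemental's API),
and it assumes p is a perfect square and p divides n:

#include <math.h>
#include <stdlib.h>
#include <mpi.h>

/* y := A x on a q x q grid (q = sqrt(p)); each process owns a b x b block
   of A (b = n/q, column-major) and b/q = n/p entries of x and y. */
static void gemv_2d_sketch(int n, const double *Aloc, const double *xloc,
                           double *yloc, MPI_Comm comm)
{
  int p, rank;
  MPI_Comm_size(comm, &p);
  MPI_Comm_rank(comm, &rank);
  int q   = (int)lround(sqrt((double)p));
  int b   = n / q;
  int row = rank / q, col = rank % q;

  /* One communicator per grid column and one per grid row (teams of q). */
  MPI_Comm colcomm, rowcomm;
  MPI_Comm_split(comm, col, row, &colcomm);
  MPI_Comm_split(comm, row, col, &rowcomm);

  /* 1) Gather x within the grid column: every process ends up with the b
        entries of x matching its local columns (~ n/sqrt(p) received). */
  double *xcol = malloc((size_t)b * sizeof(double));
  MPI_Allgather(xloc, b / q, MPI_DOUBLE, xcol, b / q, MPI_DOUBLE, colcomm);

  /* 2) Local Gemv with the b x b block: ~ n^2/p multiply-adds. */
  double *z = calloc((size_t)b, sizeof(double));
  for (int j = 0; j < b; ++j)
    for (int i = 0; i < b; ++i)
      z[i] += Aloc[i + (size_t)j * b] * xcol[j];

  /* 3) Sum-scatter the partial results within the grid row; each process
        keeps b/q entries of y (~ n/sqrt(p) communicated again). */
  MPI_Reduce_scatter_block(z, yloc, b / q, MPI_DOUBLE, MPI_SUM, rowcomm);

  free(z); free(xcol);
  MPI_Comm_free(&rowcomm); MPI_Comm_free(&colcomm);
}

In this sketch the output comes back aligned with the grid rows while the
input was aligned with the grid columns, mirroring the z[MC,*] vs. x[MR,*]
alignment in the notation above.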

Thus, the total per-process communication volume will be about 2 n /
sqrt(p), the total latency will be about log2(p), and the local computation
is roughly the optimal value, n^2 / p.
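
For example (purely illustrative numbers): with n = 100,000 and p = 1,024
(a 32 x 32 grid), that is roughly 2*100,000/32 = 6,250 scalars, i.e. about
50 KB per process in double precision, about log2(1024) = 10 message
start-ups of latency, and about 10^10/1024 ~= 10^7 multiply-adds per process.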

Jack