On Fri, Jun 15, 2012 at 9:51 AM, Jack Poulson <span dir="ltr"><<a href="mailto:jack.poulson@gmail.com" target="_blank">jack.poulson@gmail.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">On Fri, Jun 15, 2012 at 9:14 AM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br></div></div><div class="gmail_quote">

<div><div class="h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div><div>On Fri, Jun 15, 2012 at 9:18 PM, Alexander Grayver <span dir="ltr"><<a href="mailto:agrayver@gfz-potsdam.de" target="_blank">agrayver@gfz-potsdam.de</a>></span> wrote:<br></div></div>

<div class="gmail_quote"><div><div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<u></u>


  <div bgcolor="#ffffff" text="#000000">

    On 15.06.2012 14:46, Matthew Knepley wrote:

    <blockquote type="cite">On Fri, Jun 15, 2012 at 8:31 PM, Alexander Grayver <span dir="ltr"><<a href="mailto:agrayver@gfz-potsdam.de" target="_blank">agrayver@gfz-potsdam.de</a>></span>

      wrote:<br>

      <div class="gmail_quote">

        <blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

          <div bgcolor="#ffffff" text="#000000"> Matt,<br>

            <br>

            According to that code:<br>

            <br>

            <pre width="80"><a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_line486">486: </a><b><font color="#4169e1"><a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_MatMult_MPIDense"></a><a>PetscErrorCode</a> MatMult_MPIDense(<a>Mat</a> mat,<a>Vec</a> xx,<a>Vec</a> yy)</font></b>

<a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_line487">487: </a>{

<a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_line488">488: </a>  Mat_MPIDense   *mdn = (Mat_MPIDense*)mat->data;


<a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_line492">492: </a> <a>VecScatterBegin</a>(mdn->Mvctx,xx,mdn->lvec,<a>INSERT_VALUES</a>,<a>SCATTER_FORWARD</a>);

<a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_line493">493: </a>  <a>VecScatterEnd</a>(mdn->Mvctx,xx,mdn->lvec,<a>INSERT_VALUES</a>,<a>SCATTER_FORWARD</a>);

<a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_line494">494: </a>  MatMult_SeqDense(mdn->A,mdn->lvec,yy);

<a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_line495">495: </a>  <font color="#4169e1">return</font>(0);

<a name="137f0a0e9f3539ea_137f07f4c843e420_137f04d12db00842_137f0222fdf86feb_line496">496: </a>}


</pre>

            Each process has its own local copy of the vector? <br>

          </div>

        </blockquote>

        <div><br>

        </div>

        <div>I am not sure what your point is. VecScatter is just an

          interface that has many implementations. </div>

      </div>

    </blockquote>

    <br>

    I'm trying to estimate the amount of data needs to be communicated

    over all processes during this operation. <br>

    In debugger I see VecScatter from the code above reduces to the

    MPI_Allgatherv and results in (assuming vector is distributed

    uniformly)<br>

    <br>

    bytes_send_received = num_of_proc * ((num_of_proc - 1) *

    vec_size_local) * 2 * sizeof(PetscScalar)<br>

    <br>

    Does that look reasonable?<br></div></blockquote><div><br></div></div></div><div>This is not really a useful exercise, since</div><div><br></div><div>  a) PETSc does not currently have an optimized parallel dense implementation</div>


<div><br></div><div>  b) We are implementing an Elemental interface this summer. You can try it out in petsc-dev</div><div><br></div><div>  c) Elemental is much more efficient than our simple implementation, and uses a unique</div>


<div>      approach to communication (all reductions)</div><div><br></div><div>I would take the comp+comm estimates from Jack's slides on Elemental</div><div><br></div><div>   Matt</div><div> <br></div></div></blockquote>


</div></div><div><br>Jed and I just discussed this an hour or so ago; the MatMult (Gemv in BLAS parlance) implementation via Elemental will, in the simplest case of a square n x n matrix distributed over a square sqrt(p) x sqrt(p) process grid, require:<br>


<br>1) A gather (via a padded MPI_Gather, not MPI_Gatherv) within teams of sqrt(p) processes, with a per-process communication volume of approximately n/sqrt(p), and a latency cost of log2(sqrt(p)) ~= 1/2 log2(p)<br>2) A Gemv with the local part of A against the just gathered part of x, (z[MC,*] := A[MC,MR] x[MR,* ] in Elemental notation)<br>


3) A sum-scatter within teams of sqrt(p) processes (via MPI_Reduce_scatter, or MPI_Reduce_scatter_block if it is available), requiring roughly n/sqrt(p) per-process communication volume and roughly log2(sqrt(p)) latency again.<br>


<br>Thus, the total per-process communication volume will be about 2 n / sqrt(p), the total latency will be about log2(p), and the local computation is roughly the optimal value, n^2 / p.<span class="HOEnZb"><font color="#888888"><br>

<br>Jack<br></font></span></div></div>

</blockquote></div><br>I just noticed two minor corrections to what I just said:<br>1) The version that will be used by PETSc will actually use a padded MPI_Allgather, not a padded MPI_Gather, but the cost is essentially the same.<br>

2) The sequential cost of a square matrix-vector multiplication is 2 n^2, so the local computation will be roughly 2 n^2 / p flops.<br><br>Jack<br>