<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">Hi,<br>

      <br>

      I just had a look at the threaded version of MatMult_SeqAIJ and I
      think the flop logging might be incorrect, because the
      nonzerorows aren't counted in MatMult_SeqAIJ_Kernel. Fixing this
      in the thread kernel would require a reduction though, which could
      impact performance. Is this a known problem, or is there a better
      way to compute the flop count that doesn't require the nonzerorows?<br>

      <br>

      Alternatively, would it make sense to pre-compute the nonzerorows

      and store them in the threadcomm? This might require more of the

      AIJ data structure to be exposed to PetscLayoutSetUp /

      PetscThreadCommGetOwnershipRanges though. <br>

      <br>
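      As a rough sketch of what I mean by pre-computing (plain C on a
      CSR row-pointer array, not actual PETSc code; the function names
      are made up for illustration, and it assumes a row with n &gt; 0
      entries costs n multiplies and n - 1 additions):<br>
      <pre>
```c
/* Illustrative sketch only: plain C on a CSR row-pointer array,
 * not PETSc source. All names here are hypothetical. */

/* Count rows that hold at least one stored entry.
 * row_ptr has nrows + 1 entries; row i holds
 * row_ptr[i + 1] - row_ptr[i] entries. */
int count_nonzero_rows(const int *row_ptr, int nrows)
{
  int i, count = 0;
  for (i = 0; i < nrows; i++) {
    if (row_ptr[i + 1] > row_ptr[i]) count++;
  }
  return count;
}

/* Flop count for y = A*x under the assumption stated above:
 * a row with n > 0 entries costs n multiplies and n - 1 additions,
 * so the total is 2*nz minus the number of nonzero rows. */
long spmv_flops(const int *row_ptr, int nrows)
{
  long nz = (long)row_ptr[nrows] - (long)row_ptr[0];
  return 2 * nz - (long)count_nonzero_rows(row_ptr, nrows);
}
```
</pre>
      The count depends only on the nonzero structure, so as long as
      that structure does not change it could be computed once and
      reused, rather than reduced across threads in the kernel.<br>
      <br>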

      Regards,<br>

      Michael <br>

      <br>

      On 08/08/13 12:08, Matthew Knepley wrote:<br>

    </div>

    <blockquote

cite="mid:CAMYG4GmpxmiByzS5xnEc4rFZ7s3nSOra+RCQ8iT6A5QZCP2uxw@mail.gmail.com"

      type="cite">


      <div dir="ltr">On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange <span

          dir="ltr">&lt;<a moz-do-not-send="true"
            href="mailto:michael.lange@imperial.ac.uk" target="_blank">michael.lange@imperial.ac.uk</a>&gt;</span>

        wrote:<br>

        <div class="gmail_extra">

          <div class="gmail_quote">

            <blockquote class="gmail_quote" style="margin:0 0 0

              .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

              <br>

              We have recently been trying to re-align our OpenMP fork (<a

                moz-do-not-send="true"

                href="https://bitbucket.org/ggorman/petsc-3.3-omp"

                target="_blank">https://bitbucket.org/ggorman/petsc-3.3-omp</a>)

              with petsc/master. Much of our early work has now been

              superseded by the threadcomm implementations.

              Nevertheless, there are still a few algorithmic

              differences between the two branches:<br>

              <br>

              1) Enforcing MPI latency hiding by using task-based SpMV:<br>

              If the MPI implementation used does not actually provide

              truly asynchronous communication in hardware, performance

              can be increased by dedicating a single thread to

              overlapping MPI communication in PETSc. However, this is

              arguably a vendor-specific fix which requires significant

              code changes (i.e. the parallel section needs to be raised

              up by one level). So perhaps the strategy should be to

              give guilty vendors a hard time rather than messing up the

              current abstraction.<br>

              <br>

              2) Nonzero-based thread partitioning:<br>

              Rather than evenly dividing the number of rows among

              threads, we can partition the thread ownership ranges

              according to the number of non-zeros in each row. This

              balances the workload between threads and thus improves
              strong scalability through better bandwidth utilisation.

              In general, this optimisation should integrate well with

              threadcomms, since it only changes the thread ownership

              ranges, but it does require some structural changes since

              nnz is currently not passed to PetscLayoutSetUp. Any

              thoughts on whether people regard such a scheme as useful

              would be greatly appreciated.<br>
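              <br>
              As a rough illustration of the idea (a minimal sketch on a
              plain CSR row-pointer array, not code taken from our fork
              or from PETSc; the function and variable names are
              hypothetical), the ownership ranges could be computed
              along these lines:<br>
              <pre>
```c
/* Illustrative sketch only, not PETSc source; all names are hypothetical.
 * Splits rows [0, nrows) into nthreads contiguous ranges holding roughly
 * equal numbers of stored nonzeros, using the CSR row-pointer array.
 * On return, thread t owns rows starts[t] .. starts[t + 1] - 1.
 * starts must have room for nthreads + 1 entries. */
void partition_rows_by_nnz(const int *row_ptr, int nrows, int nthreads,
                           int *starts)
{
  long nz  = (long)row_ptr[nrows] - (long)row_ptr[0];
  int  row = 0, t;

  starts[0] = 0;
  for (t = 1; t < nthreads; t++) {
    /* Cumulative nonzero count at which thread t should begin. */
    long target = (long)row_ptr[0] + (nz * t) / nthreads;
    /* Advance to the first row whose offset reaches the target. */
    while (row < nrows) {
      if (row_ptr[row] >= target) break;
      row++;
    }
    starts[t] = row;
  }
  starts[nthreads] = nrows;
}
```
</pre>
              With row pointers {0, 6, 8, 10, 12} and two threads, this
              gives the first thread row 0 and the second thread rows 1
              to 3, six nonzeros each, whereas an even row split would
              give eight and four.<br>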

            </blockquote>

            <div><br>

            </div>

            <div>I think this should be handled by changing the AIJ data
              structure. Going all the way to "2D" partitions would also
              allow us to handle power-law matrix graphs. This would keep
              the thread implementation simple, and at the same time be
              more flexible.</div>

            <div><br>

            </div>

            <div>   Matt</div>

            <div> </div>

            <blockquote class="gmail_quote" style="margin:0 0 0

              .8ex;border-left:1px #ccc solid;padding-left:1ex">

              3) MatMult_SeqBAIJ not threaded:<br>

              Is there a reason why MatMult has not been threaded for

              BAIJ matrices, or is somebody already working on this? If

              not, I would like to prepare a pull request for this using

              the same approach as MatMult_SeqAIJ.<br>

              <br>

              We would welcome any suggestions/feedback on this, in

              particular the second point. Up-to-date benchmarking

              results for the first two methods, including BlueGene/Q,

              can be found in:<br>

              <a moz-do-not-send="true"

                href="http://arxiv.org/abs/1307.4567" target="_blank">http://arxiv.org/abs/1307.4567</a><br>

              <br>

              Kind regards,<br>

              <br>

              Michael Lange<br>

            </blockquote>

          </div>

          <br>

          <br clear="all">

          <div><br>

          </div>

          -- <br>

          What most experimenters take for granted before they begin

          their experiments is infinitely more interesting than any

          results to which their experiments lead.<br>

          -- Norbert Wiener

        </div>

      </div>

    </blockquote>

    <br>

  </body>

</html>