<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 08/20/14 16:03, Karl Rupp wrote:<br>

    </div>

    <blockquote cite="mid:53F4AABF.1070309@iue.tuwien.ac.at" type="cite">

      <br>

      <blockquote type="cite">

        <blockquote type="cite">What you could do with 4N procs for

          PETSc is to define your own matrix

          <br>

          layout, where only one out of four processes actually owns

          part of the

          <br>

          matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full

          data gets

          <br>

          correctly transferred to N procs, with the other 3*N procs

          being

          <br>

          'empty'. You should then be able to run the solver with all

          4*N

          <br>

          processors, but only N of them actually do the work on the

          GPUs.

          <br>

        </blockquote>

        OK, I understand your solution; I had already been thinking along
        those lines,
        <br>
        so thanks for confirming it. My fear, though, is that performance
        would not
        <br>
        improve. Indeed, I still don't understand (even after

        <br>

        analyzing -log_summary profiles and searching in the petsc-dev

        archives)

        <br>

        what causes the slowdown when several MPI tasks share one GPU,
        compared to

        <br>

        one MPI task working with one GPU...

        <br>

        In the proposed solution, 4*N processes will still exchange MPI

        messages

        <br>

        during a KSP iteration, and the amount of data copied between GPU
        and
        <br>
        CPU(s) will be the same, so if you could enlighten
        <br>
        me, I would be glad.

        <br>

      </blockquote>

      <br>

      One cause of the performance penalty you observe is the
      increased PCI-Express traffic: if four ranks share a single
      GPU, then each matrix-vector product requires at least 8 vector
      transfers between host and device, rather than just 2 with a
      single MPI rank. Similarly, you have four times the number of
      kernel launches. It may well be that these overheads eat up
      all the performance gains you could otherwise obtain. I haven't seen
      your profiling data, so I can't be more specific at this point.

    </blockquote>
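    <br>
    The transfer counts quoted above can be reproduced with a small
    back-of-the-envelope model, sketched here in Python. It assumes that
    each rank sharing a GPU performs one host-to-device and one
    device-to-host vector transfer per matrix-vector product, and that
    kernel launches scale linearly with the number of ranks. The function
    name matvec_overheads is made up for illustration, and the numbers
    are lower bounds under these assumptions, not measurements.<br>
    <pre>
```python
def matvec_overheads(ranks_per_gpu):
    """Per-GPU overhead of one matrix-vector product in a simple model."""
    # Assumption: every rank uploads its part of the input vector and
    # downloads its part of the result, i.e. two transfers per rank.
    transfers = 2 * ranks_per_gpu
    # Assumption: every rank launches its own kernels, so the launch count
    # is multiplied by the number of ranks sharing the GPU.
    launch_factor = ranks_per_gpu
    return transfers, launch_factor


if __name__ == "__main__":
    for ranks in (1, 4):
        transfers, launch_factor = matvec_overheads(ranks)
        print(ranks, "rank(s) per GPU:", transfers, "transfers,",
              launch_factor, "x kernel launches")
```
    </pre>
    With one rank per GPU the model gives 2 transfers, and with four
    ranks per GPU it gives 8 transfers and four times the kernel
    launches, as described above.<br>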

    <br>

    Thanks a lot, Karli, for the explanation. I am currently trying your
    solution.<br>
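    <br>
    As a minimal sketch of the suggested layout, the Python helper below
    computes how many matrix rows each rank would own when only one out of
    every four ranks holds data. The function name local_row_counts and
    the choice of the first rank in each group of four as the owner are
    assumptions made for illustration; the right owner depends on how
    ranks are placed on the nodes. In PETSc, such per-rank counts would
    be supplied as the local sizes, for instance through MatSetSizes().<br>
    <pre>
```python
def local_row_counts(global_rows, n_ranks, ranks_per_gpu=4):
    """Rows owned by each rank when one rank per GPU group owns all rows."""
    if n_ranks % ranks_per_gpu != 0:
        raise ValueError("n_ranks must be a multiple of ranks_per_gpu")
    n_owners = n_ranks // ranks_per_gpu
    base, extra = divmod(global_rows, n_owners)
    # Spread the rows evenly over the owners; the first `extra` owners
    # take one additional row each.
    owner_counts = [base + 1] * extra + [base] * (n_owners - extra)
    # Non-owner ranks stay 'empty' (zero local rows).
    counts = [0] * n_ranks
    for i, rows in enumerate(owner_counts):
        # Assumption: the first rank of each group is the one using the GPU.
        counts[i * ranks_per_gpu] = rows
    return counts


if __name__ == "__main__":
    print(local_row_counts(10, 8))
```
    </pre>
    The counts always sum to the global number of rows, so all ranks can
    take part in the solve while only the owners hold data.<br>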

    <br>

    Pierre<br>

    <blockquote cite="mid:53F4AABF.1070309@iue.tuwien.ac.at" type="cite">

      <br>

      Best regards,

      <br>

      Karli

      <br>

    </blockquote>

    <br>

    <br>

    <div class="moz-signature">-- <br>

      <b>Trio_U support team</b>

      <br>

      Marthe ROUX (01 69 08 00 02) Saclay

      <br>

      Pierre LEDAC (04 38 78 91 49) Grenoble

    </div>

  </body>

</html>