<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 08/20/14 13:14, Karl Rupp wrote:<br>
</div>
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">Hey,
<br>
<br>
>>> Is there a way to run a calculation with 4*N MPI
tasks where
<br>
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">my matrix is first built outside
PETSc, then to solve the
<br>
linear system using PETSc Mat, Vec, KSP on only N MPI
<br>
tasks to adress efficiently the N GPUs ?
<br>
</blockquote>
<br>
as far as I can tell, this should be possible with a suitable
<br>
subcommunicator. The tricky piece, however, is to select the
right MPI
<br>
ranks for this. Note that you generally have no guarantee on
how the
<br>
MPI ranks are distributed across the nodes, so be prepared for
<br>
something fairly specific to your MPI installation.
<br>
</blockquote>
Yes, I am ready to face this point too.
<br>
</blockquote>
<br>
Okay, good to know that you are aware of this.
<br>
<br>
<blockquote type="cite">I also started the work with a purely
CPU-based solve only to test, but
<br>
without success. When
<br>
I read this:
<br>
<br>
"If you wish PETSc code to run ONLY on a subcommunicator of
<br>
MPI_COMM_WORLD, create that communicator first and assign it to
<br>
PETSC_COMM_WORLD
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD></a>
<br>
BEFORE calling PetscInitialize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize></a>().
<br>
<br>
Thus if you are running a four process job and two processes
will run
<br>
PETSc and have PetscInitialize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize></a>()
<br>
and PetscFinalize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize></a>()
<br>
and two process will not, then do this. If ALL processes in
<br>
the job are using PetscInitialize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize></a>()
<br>
and PetscFinalize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize></a>()
<br>
then you don't need to do this, even if different
subcommunicators of
<br>
the job are doing different things with PETSc."
<br>
<br>
I think I am not in this special scenario: because my matrix is
initially partitioned across 4 processes, I need to call
<br>
PetscInitialize() on all 4 processes in order to build the PETSc
matrix with MatSetValues(). My goal is then to solve the linear
<br>
system on only 2 processes... So will building a sub-communicator
really do the trick? Or am I missing something?
<br>
</blockquote>
<br>
oh, then I misunderstood your question. I thought that you wanted to
run *your* code on 4N procs and never let PETSc see more than N
procs when feeding the matrix.
<br>
<br>
</blockquote>
Sorry, I was not very clear :-)<br>
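What I had understood from the PETSC_COMM_WORLD manual page was
something like the sketch below (purely illustrative on my side: the
rank selection and names are my assumptions, and error checking is
omitted):<br>
<pre>
/* Minimal sketch: PETSc only ever sees every fourth rank of
 * MPI_COMM_WORLD; the other ranks never call PetscInitialize(). */
#include &lt;petscsys.h&gt;

int main(int argc, char **argv)
{
  MPI_Comm petsc_comm;                        /* illustrative name */
  int      rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* color 0: the N ranks that will run PETSc; color 1: the other 3*N */
  MPI_Comm_split(MPI_COMM_WORLD, (rank % 4 == 0) ? 0 : 1, rank, &petsc_comm);

  if (rank % 4 == 0) {
    PETSC_COMM_WORLD = petsc_comm;              /* must be set BEFORE...  */
    PetscInitialize(&argc, &argv, NULL, NULL);  /* ...PetscInitialize()   */

    /* ... build Mat/Vec and call KSPSolve() on these N ranks only ... */

    PetscFinalize();
  }
  /* the remaining 3*N ranks never touch PETSc */

  MPI_Comm_free(&petsc_comm);
  MPI_Finalize();
  return 0;
}
</pre>
That is where I got stuck: all 4 processes already hold pieces of the
matrix, so I could not see how to restrict PETSc to a subset of ranks
without losing the distributed entries.<br>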
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">What
you could do with 4N procs for PETSc is to define your own matrix
layout, where only one out of four processes actually owns part of
the matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full
data gets correctly transferred to N procs, with the other 3*N
procs being 'empty'. You should then be able to run the solver
with all 4*N processors, but only N of them actually do the work
on the GPUs.
<br>
</blockquote>
OK, I understand your solution; I was already thinking about
something like that, so thanks for confirming it. But my fear is that
the performance will not improve. Indeed, I still don't understand (even after<br>
analyzing -log_summary profiles and searching the petsc-dev
archives) what slows things down when several MPI tasks share one
GPU, compared to one MPI task working with one GPU... <br>
In the proposed solution, 4*N processes will still exchange MPI
messages during a KSP iteration, and the amount of data copied
between the GPU and the CPU(s) will be the same, so if you could enlighten <br>
me, I would be glad.<br>
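Just to be sure I picture the layout you describe correctly, I imagine
something along the lines of the sketch below (not Trio_U code: the
sizes, names and rank selection are illustrative, and error checking
is omitted):<br>
<pre>
/* Minimal sketch of the layout Karl describes: all 4*N ranks call
 * PetscInitialize(), but only every fourth rank owns matrix rows,
 * so after assembly the solver data lives on N ranks.              */
#include &lt;petscksp.h&gt;

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, b;
  KSP         ksp;
  PetscInt    nglobal = 1000000;    /* illustrative global size */
  PetscInt    nlocal, nowners;
  PetscMPIInt rank, size;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);

  /* only ranks 0, 4, 8, ... own rows (assumes size is a multiple of 4;
     how ranks map onto nodes/GPUs is MPI-installation specific)        */
  nowners = size / 4;
  nlocal  = (rank % 4 == 0) ? nglobal / nowners : 0;
  if (rank == 0) nlocal += nglobal % nowners;   /* rank 0 takes the rest */

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, nlocal, nlocal, nglobal, nglobal);
  MatSetFromOptions(A);
  MatSetUp(A);

  /* every rank inserts the entries it assembled outside PETSc;
     off-process values are shipped to their owners at assembly time */
  /* ... MatSetValues(A, ...) ... */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  VecCreate(PETSC_COMM_WORLD, &x);
  VecSetSizes(x, nlocal, nglobal);   /* same layout as the matrix rows */
  VecSetFromOptions(x);
  VecDuplicate(x, &b);
  /* ... fill b ... */

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);   /* all 4*N ranks participate, N ranks hold data */

  KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
  PetscFinalize();
  return 0;
}
</pre>
Is that roughly what you mean?<br>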
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">
<br>
A different question is whether you actually need all 4*N MPI
ranks for the system assembly. You can make your life a lot easier
if you only run with N MPI ranks upfront, particularly if the
performance gains from N->4N procs in the assembly stage are
small relative to the time spent in the solver. </blockquote>
Indeed, but in our cases that gain is not always small...<br>
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">This
may well be the case for memory bandwidth limited applications,
where one process can utilize most of the available bandwidth.
Either way, a test run with N procs will give you a good profiling
baseline on whether you can expect any performance gains from GPUs
in the solver stage overall. It may well be that you can get
faster solver times with some fancy multigrid preconditioning
techniques on a purely CPU-based implementation, which are
unavailable on GPUs. Also, your system size needs to be
sufficiently large (100k unknowns per GPU as a rule of thumb) to
hide PCI-Express latencies.
<br>
</blockquote>
Indeed, for my app the rule of thumb seems to be 100-150k unknowns
per GPU.<br>
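For the profiling baseline you mention, I plan to compare runs along
these lines (my_app is a placeholder, and I assume a CUSP-enabled
PETSc build, so the GPU type options may differ for other
configurations):<br>
<pre>
# CPU-only baseline on N MPI ranks
mpirun -np N ./my_app -log_summary

# same solve with the GPU backends, one MPI rank per GPU
mpirun -np N ./my_app -mat_type aijcusp -vec_type cusp -log_summary
</pre>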
<br>
Thanks Karli, I really appreciate your advice,<br>
<br>
PL<br>
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">
<br>
Best regards,
<br>
Karli
<br>
<br>
</blockquote>
<br>
<br>
<div class="moz-signature">-- <br>
<b>Trio_U support team</b>
<br>
Marthe ROUX (01 69 08 00 02) Saclay
<br>
Pierre LEDAC (04 38 78 91 49) Grenoble
</div>
</body>
</html>