<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 08/20/14 13:14, Karl Rupp wrote:<br>
</div>
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">Hey,
<br>
<br>
>>> Is there a way to run a calculation with 4*N MPI
tasks where
<br>
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">my matrix is first built outside
PETSc, then to solve the
<br>
linear system using PETSc Mat, Vec, KSP on only N MPI
<br>
tasks to adress efficiently the N GPUs ?
<br>
</blockquote>
<br>
as far as I can tell, this should be possible with a suitable
<br>
subcommunicator. The tricky piece, however, is to select the
right MPI
<br>
ranks for this. Note that you generally have no guarantee on
how the
<br>
MPI ranks are distributed across the nodes, so be prepared for
<br>
something fairly specific to your MPI installation.
<br>
</blockquote>
Yes, I am ready to face this point too.
<br>
</blockquote>
<br>
Okay, good to know that you are aware of this.
<br>
<br>
<blockquote type="cite">I also started the work with a purely
CPU-based solve only to test, but
<br>
without success. When
<br>
I read this:
<br>
<br>
"If you wish PETSc code to run ONLY on a subcommunicator of
<br>
MPI_COMM_WORLD, create that communicator first and assign it to
<br>
PETSC_COMM_WORLD
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD></a>
<br>
BEFORE calling PetscInitialize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize></a>().
<br>
<br>
Thus if you are running a four process job and two processes
will run
<br>
PETSc and have PetscInitialize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize></a>()
<br>
and PetscFinalize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize></a>()
<br>
and two process will not, then do this. If ALL processes in
<br>
the job are using PetscInitialize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize></a>()
<br>
and PetscFinalize
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize"><http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize></a>()
<br>
then you don't need to do this, even if different
subcommunicators of
<br>
the job are doing different things with PETSc."
<br>
<br>
I think I am not in this special scenario: because my matrix is
initially partitioned across 4 processes, I need to call
<br>
PetscInitialize() on all 4 processes in order to build the PETSc
matrix with MatSetValues(). My goal is then to solve the linear
<br>
system on only 2 processes... So will building a sub-communicator
really do the trick? Or am I missing something?
<br>
</blockquote>
<br>
oh, then I misunderstood your question. I thought that you wanted to
run *your* code on 4N procs and never let PETSc see more than N
procs when feeding the matrix.
<br>
<br>
</blockquote>
Sorry, I was not very clear :-)<br>
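What I had understood from the PETSC_COMM_WORLD manual page was
something like the sketch below (purely illustrative on my side: the
rank selection and names are my assumptions, and error checking is
omitted):<br>
<pre>
/* Minimal sketch: PETSc only ever sees every fourth rank of
 * MPI_COMM_WORLD; the other ranks never call PetscInitialize(). */
#include &lt;petscsys.h&gt;

int main(int argc, char **argv)
{
  MPI_Comm petsc_comm;                        /* illustrative name */
  int      rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* color 0: the N ranks that will run PETSc; color 1: the other 3*N */
  MPI_Comm_split(MPI_COMM_WORLD, (rank % 4 == 0) ? 0 : 1, rank, &petsc_comm);

  if (rank % 4 == 0) {
    PETSC_COMM_WORLD = petsc_comm;              /* must be set BEFORE...  */
    PetscInitialize(&argc, &argv, NULL, NULL);  /* ...PetscInitialize()   */

    /* ... build Mat/Vec and call KSPSolve() on these N ranks only ... */

    PetscFinalize();
  }
  /* the remaining 3*N ranks never touch PETSc */

  MPI_Comm_free(&petsc_comm);
  MPI_Finalize();
  return 0;
}
</pre>
That is where I got stuck: all 4 processes already hold pieces of the
matrix, so I could not see how to restrict PETSc to a subset of ranks
without losing the distributed entries.<br>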
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">What
you could do with 4N procs for PETSc is to define your own matrix
layout, where only one out of four processes actually owns part of
the matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full
data gets correctly transferred to N procs, with the other 3*N
procs being 'empty'. You should then be able to run the solver
with all 4*N processors, but only N of them actually do the work
on the GPUs.
<br>
</blockquote>
OK, I understand your solution; I was already thinking about
something like that, so thanks for confirming it. But my fear is that
the performance will not improve. Indeed, I still don't understand (even after<br>
analyzing -log_summary profiles and searching the petsc-dev
archives) what slows things down when several MPI tasks share one
GPU, compared to one MPI task working with one GPU... <br>
In the proposed solution, 4*N processes will still exchange MPI
messages during a KSP iteration, and the amount of data copied
between the GPU and the CPU(s) will be the same, so if you could enlighten <br>
me, I would be glad.<br>
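Just to be sure I picture the layout you describe correctly, I imagine
something along the lines of the sketch below (not Trio_U code: the
sizes, names and rank selection are illustrative, and error checking
is omitted):<br>
<pre>
/* Minimal sketch of the layout Karl describes: all 4*N ranks call
 * PetscInitialize(), but only every fourth rank owns matrix rows,
 * so after assembly the solver data lives on N ranks.              */
#include &lt;petscksp.h&gt;

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, b;
  KSP         ksp;
  PetscInt    nglobal = 1000000;    /* illustrative global size */
  PetscInt    nlocal, nowners;
  PetscMPIInt rank, size;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);

  /* only ranks 0, 4, 8, ... own rows (assumes size is a multiple of 4;
     how ranks map onto nodes/GPUs is MPI-installation specific)        */
  nowners = size / 4;
  nlocal  = (rank % 4 == 0) ? nglobal / nowners : 0;
  if (rank == 0) nlocal += nglobal % nowners;   /* rank 0 takes the rest */

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, nlocal, nlocal, nglobal, nglobal);
  MatSetFromOptions(A);
  MatSetUp(A);

  /* every rank inserts the entries it assembled outside PETSc;
     off-process values are shipped to their owners at assembly time */
  /* ... MatSetValues(A, ...) ... */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  VecCreate(PETSC_COMM_WORLD, &x);
  VecSetSizes(x, nlocal, nglobal);   /* same layout as the matrix rows */
  VecSetFromOptions(x);
  VecDuplicate(x, &b);
  /* ... fill b ... */

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);   /* all 4*N ranks participate, N ranks hold data */

  KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
  PetscFinalize();
  return 0;
}
</pre>
Is that roughly what you mean?<br>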
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">
<br>
A different question is whether you actually need all 4*N MPI
ranks for the system assembly. You can make your life a lot easier
if you only run with N MPI ranks upfront, particularly if the
performance gains from N->4N procs in the assembly stage are
small relative to the time spent in the solver. </blockquote>
Indeed, but in our cases that gain is not always small...<br>
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">This
may well be the case for memory bandwidth limited applications,
where one process can utilize most of the available bandwidth.
Either way, a test run with N procs will give you a good profiling
baseline on whether you can expect any performance gains from GPUs
in the solver stage overall. It may well be that you can get
faster solver times with some fancy multigrid preconditioning
techniques on a purely CPU-based implementation, which are
unavailable on GPUs. Also, your system size needs to be
sufficiently large (100k unknowns per GPU as a rule of thumb) to
hide PCI-Express latencies.
<br>
</blockquote>
Indeed, for my app the rule of thumb seems to be 100-150k unknowns
per GPU.<br>
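For the profiling baseline you mention, I plan to compare runs along
these lines (my_app is a placeholder, and I assume a CUSP-enabled
PETSc build, so the GPU type options may differ for other
configurations):<br>
<pre>
# CPU-only baseline on N MPI ranks
mpirun -np N ./my_app -log_summary

# same solve with the GPU backends, one MPI rank per GPU
mpirun -np N ./my_app -mat_type aijcusp -vec_type cusp -log_summary
</pre>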
<br>
Thanks Karli, I really appreciate your advice,<br>
<br>
PL<br>
<blockquote cite="mid:53F48333.9030205@iue.tuwien.ac.at" type="cite">
<br>
Best regards,
<br>
Karli
<br>
<br>
</blockquote>
<br>
<br>
<div class="moz-signature">-- <br>
<b>Trio_U support team</b>
<br>
Marthe ROUX (01 69 08 00 02) Saclay
<br>
Pierre LEDAC (04 38 78 91 49) Grenoble
</div>
</body>
</html>