<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Fri, Nov 16, 2018 at 5:35 AM Karin&NiKo via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div><div>Dear PETSc team,<br><br></div>I have run the same test on the same number of processes as before (1000, 1500 and 2000) but with an increased number of nodes. The results are much better!<br></div>If I focus on the KSPSolve event, I have the following timings:<br>1000 => 1.2681e+02<br>1500 => 8.7030e+01<br>2000 => 7.8904e+01<br></div><div>The parallel efficiency between 1000 and 1500 is around 96%, but it decreases drastically when using 2000 processes. I think my problem is too small and communication costs are becoming significant.<br><br></div><div>I have an extra question: in the profiling section, what exactly is measured by "Time (sec):"? I wonder if it is the time between PetscInitialize and PetscFinalize?<br></div></div></div></div></div></blockquote><div><br></div><div>Yep. Also, your communication could get more expensive at the 2000-process level by including another cabinet or something.</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div></div><div>Thanks again for your help,<br></div><div>Nicolas<br></div><div dir="ltr"><br></div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Nov 16, 2018 at 00:24, Karin&NiKo <<a href="mailto:niko.karin@gmail.com" target="_blank">niko.karin@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>Ok. 
I will do that soon and I will let you know. </span><div>Thanks again,</div><div>Nicolas</div><div><br><div class="gmail_quote"><div dir="ltr">On Thu, Nov 15, 2018 at 20:50, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
<br>
> On Nov 15, 2018, at 1:02 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
> <br>
> There is a lot of load imbalance in VecMAXPY also. The partitioning could be bad, and if not, it's the machine.<br>
<br>
<br>
> <br>
> On Thu, Nov 15, 2018 at 1:56 PM Smith, Barry F. via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
> <br>
> Something is odd about your configuration. Just consider the time for VecMAXPY which is an embarrassingly parallel operation. On 1000 MPI processes it produces<br>
> <br>
> (relevant columns below: time in seconds and aggregate flop rate in Mflop/s)<br>
> VecMAXPY 575 1.0 8.4132e-01 1.5 1.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 1,600,021<br>
> <br>
> on 1500 processes it produces<br>
> <br>
> VecMAXPY 583 1.0 1.0786e+00 3.4 9.38e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 1,289,187<br>
> <br>
> that is, it actually takes longer (the time goes from 0.84 seconds to 1.08 seconds and the flop rate drops from 1,600,021 down to 1,289,187). You would never expect this kind of behavior,<br>
> <br>
> and on 2000 processes it produces<br>
> <br>
> VecMAXPY 583 1.0 7.1103e-01 2.7 7.03e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 1,955,563<br>
> <br>
> so it speeds up again, but not by much. This is very mysterious and not what you would expect.<br>
> <br>
> I'm inclined to believe something is out of whack on your computer. Are you sure all nodes on the computer are equivalent? Same processors, same clock speeds? What happens if you run the 1000-process case several times? Do you get very similar numbers for VecMAXPY()? You should, but I am guessing you may not.<br>
> <br>
> Barry<br>
> <br>
> Note that this performance issue doesn't really have anything to do with the preconditioner you are using.<br>
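As a quick sanity check, the speedups and aggregate rates quoted above can be recomputed from the log numbers. This is a minimal Python sketch, assuming the quoted flops column is the maximum per-process flops and the final column is the aggregate rate in Mflop/s (so the recomputed rates are approximate):

```python
# Recompute speedup and approximate aggregate flop rate for VecMAXPY
# from the -log_view numbers quoted in this thread.
runs = {  # procs: (max time in seconds, max flops per process)
    1000: (8.4132e-01, 1.36e+09),
    1500: (1.0786e+00, 9.38e+08),
    2000: (7.1103e-01, 7.03e+08),
}
t_base = runs[1000][0]
for procs, (t, flops) in sorted(runs.items()):
    mflops = flops * procs / t / 1e6  # approximate aggregate rate, Mflop/s
    speedup = t_base / t              # relative to the 1000-process run
    print(f"{procs} procs: {t:.4f} s, ~{mflops:,.0f} Mflop/s, "
          f"speedup {speedup:.2f}x vs 1000 procs")
```

The 1500-process speedup comes out below 1.0 (it is actually slower than 1000 processes), which is exactly the anomaly described above for an embarrassingly parallel operation.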
> <br>
> <br>
> <br>
> <br>
> <br>
> > On Nov 15, 2018, at 10:50 AM, Karin&NiKo via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
> > <br>
> > Dear PETSc team,<br>
> > <br>
> > I am solving a linear transient dynamic problem, discretized with finite elements. To do that, I am using FGMRES with GAMG as a preconditioner. I consider 10 time steps here.<br>
> > The problem has around 118e6 dof and I am running on 1000, 1500 and 2000 procs. So I have something like 118e3, 78e3 and 59e3 dof/proc.<br>
> > I notice that the performance deteriorates when I increase the number of processes. <br>
> > Attached you can find the log_view output of the execution and the detailed definition of the KSP.<br>
> > <br>
> > Is the problem too small to run on that number of processes or is there something wrong with my use of GAMG?<br>
> > <br>
> > I thank you in advance for your help,<br>
> > Nicolas<br>
> > <FGMRES_GAMG_1000procs.txt><FGMRES_GAMG_2000procs.txt><FGMRES_GAMG_1500procs.txt><br>
> <br>
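The strong-scaling efficiency mentioned at the top of the thread can be checked directly from the quoted KSPSolve timings. A minimal Python sketch, assuming the 1000-process run as the baseline:

```python
# Strong-scaling efficiency from the KSPSolve timings quoted in the thread.
# efficiency(p) = (T_base * p_base) / (T_p * p); 1.0 means ideal scaling.
timings = {1000: 1.2681e+02, 1500: 8.7030e+01, 2000: 7.8904e+01}
p_base = 1000
t_base = timings[p_base]
for p, t in sorted(timings.items()):
    eff = (t_base * p_base) / (t * p)
    print(f"{p} procs: KSPSolve {t:7.2f} s, efficiency {eff:5.1%}")
```

This gives roughly 97% efficiency going from 1000 to 1500 processes, but only about 80% from 1000 to 2000, matching the observation that scaling degrades at 2000 processes.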
<br>
</blockquote></div></div>
</blockquote></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>