<div dir="ltr"><br><div class="gmail_extra"><div class="gmail_quote">On Sun, Sep 9, 2018 at 6:09 AM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Tamara Dancheva <<a href="mailto:tamaradanceva19933@gmail.com">tamaradanceva19933@gmail.com</a>> writes:<br>
>
>> Hi Barry,
>>
>> I see the issue..
>>
>> In the FEM library and solver that I am working on, PETSc is used
>> throughout for data distribution, synchronization of functions, and
>> assembly. There is another UPC alternative that uses the JANPACK linear
>> algebra backend (http://www.csc.kth.se/~njansson/janpack/), which gives
>> increased performance.
>
> How do you know the JANPACK performance is better? The figures on that
> website appeared in a paper submission that was ultimately rejected
> after it was discovered that the convergence criteria actually differed
> by orders of magnitude and the reference PETSc results were uniformly
> faster. The most recent release appears to have been in 2015.

Note also that UPC is perhaps the least performance-portable thing on
which one can build a distributed-memory HPC library. Cray XE6 used the
Gemini interconnect, which is a PGAS NIC that was designed to run UPC
and related models. Naturally, the Cray UPC compiler is designed to max
out performance on Cray's PGAS NICs. There are very few other platforms
where the UPC user experience will approach this.

In contrast, MPI send-recv runs well (relative to the hardware) on
everything from token-ring Ethernet to the most expensive supercomputer
you can buy.

Thus, even if one assumes that the performance comparisons are
completely fair, they are an outlier and the relative performance on
most other machines will be nowhere near as favorable to JANPACK. The
JANPACK developers need to publish comparisons on multiple commodity
networks as well.

Jeff

>> My project is about exploring another pathway for optimization, given
>> that this software targets large-scale computations: an asynchronous
>> version of the algorithm, for which I have implemented Block-Jacobi
>> with inner Krylov solvers (the inner solves use PETSc). This version
>> aims for a speedup factor of about 1.7-2.0 (from some literature,
>> although not in exactly the same context)
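
As an aside, for anyone following along: a plain (synchronous)
block-Jacobi solve with an inner Krylov method per block is already
available in stock PETSc, configured roughly as in the sketch below.
This is only a sketch of the standard setup under my assumptions (the
function name and the already-assembled A, b, x are placeholders), not
the asynchronous variant being described.

/* Sketch only: synchronous block Jacobi with an inner Krylov solve per
   block, using stock PETSc.  Pick the inner solver at run time with
   e.g. -sub_ksp_type gmres -sub_pc_type ilu. */
#include <petscksp.h>

PetscErrorCode SolveBlockJacobi(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPCreate(PetscObjectComm((PetscObject)A), &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);   /* outer Krylov method */
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr);    /* one block per process by default */
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);      /* reads -sub_ksp_type / -sub_pc_type */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
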
>
> Could you share what literature you are basing this estimate on? It's
> important to make comparisons using a performance model. For example,
> if current PETSc results attain 70% of STREAM bandwidth, then no amount
> of latency/communication optimization will yield your desired
> improvement factors. On the other hand, if your solver is latency
> dominated due to pushing to the limit of strong scalability, then these
> optimizations might be possible (with many caveats).
>
> If you could send -log_view output for your application, it would help
> us understand the performance setting of your current solver
> configuration.
>> and it is done with the same motivation as ExaFLOW
>> (http://exaflow-project.eu/), I would say. This still requires me to
>> modify the ghost exchange routines in order to be able to advance the
>> processes out of sync. I could implement this outside of PETSc, but
>> that would significantly increase the memory footprint, since the
>> necessary data is currently fed to PETSc and then discarded. In this
>> context, since PETSc also works with and stores MPI requests, I can
>> reuse and extend that implementation, since it is close to the
>> approach I have in mind (using either a circular buffer of limited
>> size holding MPI Requests, or non-blocking collectives). I had also
>> considered not using PETSc at all to avoid all the blocking regions,
>> but given the scope of my project I deemed that it would take too long
>> to implement and validate.
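
To make sure I understand the plan: at the MPI level this sounds like
posting nonblocking sends/receives for the ghost values and then
testing, rather than waiting, so the local solve can keep going with
slightly stale ghost data. A rough sketch of that pattern is below; it
is only my reading, with placeholder neighbor lists and buffers, not
what PETSc (or your code) currently does, and a ring of several
in-flight rounds would extend it.

/* Sketch of an asynchronous ghost exchange: post nonblocking sends/recvs
   for the current iterate, then test (not wait) so the local solve can
   proceed with the most recently completed ghost data.  All names and
   sizes here are placeholders. */
#include <mpi.h>

typedef struct {
  int          nneigh;   /* number of neighbor ranks */
  const int   *neigh;    /* neighbor ranks */
  const int   *count;    /* entries exchanged with each neighbor */
  double     **sendbuf;  /* per-neighbor packed boundary values */
  double     **recvbuf;  /* per-neighbor ghost values */
  MPI_Request *req;      /* 2*nneigh outstanding requests */
} GhostExchange;

/* Start one round of exchange; does not block. */
static void ghost_exchange_start(GhostExchange *g, MPI_Comm comm)
{
  for (int i = 0; i < g->nneigh; i++) {
    MPI_Irecv(g->recvbuf[i], g->count[i], MPI_DOUBLE, g->neigh[i], 0, comm, &g->req[2*i]);
    MPI_Isend(g->sendbuf[i], g->count[i], MPI_DOUBLE, g->neigh[i], 0, comm, &g->req[2*i+1]);
  }
}

/* Returns 1 if the round finished; otherwise the caller keeps iterating
   with the previous (stale) ghost values and tries again later. */
static int ghost_exchange_test(GhostExchange *g)
{
  int done = 0;
  MPI_Testall(2 * g->nneigh, g->req, &done, MPI_STATUSES_IGNORE);
  return done;
}
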
>
> It's very reasonable to implement this in PETSc, but let's discuss the
> communication pattern first. You said you are working with a FEM model,
> but you also mention "igatherv". Is this for some sequential mesh
> processing task, or is it related to the solver? There isn't a
> neighborhood igatherv, and MPI_Igatherv isn't a pattern that should ever
> be needed in a FEM solver.
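
To illustrate Jed's last point: the FEM ghost/halo update is a sparse
neighbor exchange, which PETSc already provides (e.g. via ghosted
vectors or a VecScatter), so no gatherv variant is involved. A minimal
sketch, assuming the local size and ghost index list are already known
(the function name and arguments are placeholders):

/* Sketch: the usual FEM ghost update in PETSc is a sparse neighbor
   exchange over a ghosted vector, not any form of gatherv.  nlocal,
   nghost and ghosts[] stand in for the application's numbering. */
#include <petscvec.h>

PetscErrorCode UpdateGhosts(MPI_Comm comm, PetscInt nlocal, PetscInt nghost, const PetscInt ghosts[])
{
  Vec            v;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = VecCreateGhost(comm, nlocal, PETSC_DECIDE, nghost, ghosts, &v);CHKERRQ(ierr);
  /* ... fill the owned entries of v ... */
  ierr = VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  /* local work can overlap with the communication here */
  ierr = VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecDestroy(&v);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
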
--
Jeff Hammond
jeff.science@gmail.com
http://jeffhammond.github.io/