<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Mon, Jul 9, 2018 at 6:53 PM Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Jul 9, 2018 at 6:38 AM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><span><div dir="ltr">On Mon, Jul 9, 2018 at 9:34 AM Jeff Hammond <<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Jul 6, 2018 at 4:28 PM, Smith, Barry F. <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
Richard,<br>
<br>
The problem is that OpenMP is too large and has too many different programming models embedded in it (and it will get worse) to "support OpenMP" from PETSc.<br></blockquote><div><br></div><div>This is also true of MPI. You can write CSP, BSP, PGAS, fork-join, agent-based, etc. in MPI. Just like MPI, you don't have to use all the features. PETSc doesn't use MPI_Comm_spawn, MPI_Rget_accumulate, or MPI_Neighborhood_alltoallv, does it?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
One way to use #pragma-based optimization tools (which is one way to treat OpenMP) is to run the application code on a realistic-size problem, using the number of threads per MPI process they prefer, with profiling, and begin adding #pragmas to the most time-consuming code fragments/routines, measuring the (small) improvement in performance as they are added. This is the way I would proceed. The branch generated will not have very many pragmas in it, so it would likely be acceptable to be included in PETSc. It would also give a quantitative measure of the possible performance with the #pragma approach.<br></blockquote><div><br></div><div>This is the textbook Wrong Way to write OpenMP and the reason that the thread-scalability of DOE applications using MPI+OpenMP sucks. It leads to codes that do fork-join far too often and suffer from death by Amdahl, unless you do a second pass where you fuse all the OpenMP regions and replace the serial regions between them with critical sections or similar.</div><div><br></div><div>This isn't how you'd write MPI, is it? No, you'd figure out how to decompose your data properly to exploit locality and then implement an algorithm that minimizes communication and synchronization. Do that with OpenMP.</div></div></div></div></blockquote><div><br></div></span><div>This is the worst advice. These are ivory tower maxims that ignore any practical considerations. What you propose, while</div><div>it would produce the best-performing OpenMP, is a lot of work for absolutely no value. We already have that performance</div><div>with MPI, it would not get any better with all this work, and the shit-wits who want OpenMP do not care about performance,</div><div>they only care about ass covering and doing what other idiots in the DOE tell them to do.</div><div><div class="m_-7276025593156647979h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div></div></div></div></blockquote></div></div></div></div></blockquote><div><br></div><div>You don't have that performance with MPI when PETSc is used by applications that do not use MPI-only to saturate the execution units of the node - you only have as much performance as is captured by the MPI communicator PETSc is given by the application.</div></div></div></div></blockquote><div><br></div><div>Yep, it's always possible to wreck performance by doing something dumb.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>If PETSc were an application, it could do whatever it wanted, but it's not. If PETSc is a library that intends to meet the needs of HPC applications, it needs to support the programming models the applications are using. Or I suppose you will continue to disparage everyone who doesn't bow down and worship flat MPI on homogeneous big-core machines as a divine execution model until your users abandon you for otherwise inferior software that is willing to embrace user requirements.</div></div></div></div></blockquote><div><br></div><div>As Barry and Jed have noted, we are responding to user requirements. Everyone requesting OpenMP from PETSc uses it in the way you hate.</div><div>I saw you were able to give an example of someone doing the right thing, which is laudable but unimportant for us, since we are</div><div>responding to our users. 
You should be proud.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>It's rather ironic that you are accusing me of ivory tower maxims when you are the one clinging to MPI-only because data sharing by default is bad and OpenMP doesn't have useful context objects, while ignoring the obvious practical reality that almost every HPC application is adopting a level of parallelism besides MPI, and you are refusing to support them.</div></div></div></div></blockquote><div><br></div><div>That is an interesting point of view, unsupported, as far as I can see, by data. We get hundreds of mails a week from people not using</div><div>MPI+Anything. Perhaps we are now defining HPC Applications as those for which MPI+X smells nice.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>Not that you actually care about the problems faced by complex application software, but I'll note that despite everything that's wrong with OpenMP, NWChem benefits from it not just because of memory capacity, communication volume, and cache utilization, but because Linux I/O doesn't scale particularly well in process count, and it's a lot easier to use OpenMP in compute-bound subroutines than it is to rewrite the entire I/O back-end of NWChem to ameliorate contention in the kernel.</div></div></div></div></blockquote><div><br></div><div>I think if we stopped caring about applications, we would stop being relevant. However, caring about applications does not entail following any</div><div>batshit crazy paradigm someone chooses to adopt.</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>Jeff</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><div class="m_-7276025593156647979h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Note that for BLAS 1 operations, the correct thing to do is likely to turn on MKL BLAS threading (being careful to make sure the number of threads MKL uses matches that used by other parts of the code). This way we don't need to OpenMP-optimize many parts of PETSc's vector operations (norm, dot, scale, axpy). In fact, this is the first thing Mark should do: how much does it speed up the vector operations?<br></blockquote><div><br></div><div>BLAS1 operations are all memory-bound unless the working set fits in cache (in which case one shouldn't use threads), and compilers do a great job with them. Just put the pragmas on and let the compiler do its job.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
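<div><br></div><div>As a minimal sketch of the loop-level "#pragma optimization" style discussed above, a hypothetical BLAS1-type axpy kernel (not PETSc's actual VecAXPY implementation) annotated this way might look like the following; since the loop is memory-bandwidth bound, the speedup saturates once a few threads saturate the memory controllers:</div><pre>
/* Hypothetical BLAS1-style kernel illustrating the loop-level #pragma
 * approach: one OpenMP directive on the hot loop, nothing else changes.
 * Compiled without OpenMP, the pragma is simply ignored. */
void axpy(int n, double alpha, const double *x, double *y)
{
  #pragma omp parallel for simd schedule(static)
  for (int i = 0; i < n; i++) y[i] += alpha * x[i];
}
</pre><div><br></div>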
The problem is how many ECP applications actually use OpenMP just as a #pragma optimization tool, and how many use other features of OpenMP. For example, I remember Brian wanted to (or did) use OpenMP threads directly in BoxLib and didn't just stick to the #pragma model. If they did this, then we would need a custom PETSc to match their model.<span class="m_-7276025593156647979m_4033405631338937315m_-9069860305387534015HOEnZb"><font color="#888888"><br></font></span></blockquote><div><br></div><div>If this implies that BoxLib will use omp-parallel and then use explicit threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size and omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to write OpenMP.</div><div><br></div><div>Unfortunately, the Right Way to use OpenMP makes it hard to use MPI unless you use MPI_THREAD_MULTIPLE and endpoints. ECP projects should be pushing the MPI folks harder to ratify and implement endpoints. I don't know if the proposal is even active right now, but that doesn't prevent DOE from compelling Open-MPI and MPICH to support it.</div><div><br></div><div>To end on a positive note, OpenMP tasking is a relatively composable model and supports DAG-based parallelism. I suspect the initial results in a code like PETSc will be worse than with traditional implicit OpenMP (omp-for-simd on all the loops), but it eventually wins out because it doesn't require any unnecessary barriers and makes it much easier to fuse parallel regions.</div><div><br></div><div>Jeff</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="m_-7276025593156647979m_4033405631338937315m_-9069860305387534015HOEnZb"><font color="#888888">
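<div>A minimal sketch (hypothetical code, not BoxLib's or PETSc's) of the "explicit threading in a manner similar to MPI" style described above, where omp_get_num_threads/omp_get_thread_num play the roles of MPI_Comm_size/MPI_Comm_rank and each thread owns a contiguous slice of the data:</div><pre>
#include <omp.h>

/* SPMD-style OpenMP: one long-lived parallel region in which each thread
 * owns a contiguous block, analogous to an MPI rank owning a subdomain. */
void scale_then_axpy(int n, double alpha, const double *x, double *y)
{
  #pragma omp parallel
  {
    const int nt  = omp_get_num_threads();           /* "communicator size" */
    const int tid = omp_get_thread_num();            /* "rank" */
    const int lo  = (int)(((long)n * tid) / nt);     /* owned range [lo, hi) */
    const int hi  = (int)(((long)n * (tid + 1)) / nt);

    for (int i = lo; i < hi; i++) y[i] *= alpha;     /* kernel 1 on owned range */
    /* No barrier here: kernel 2 touches only this thread's owned range, so
     * there is no per-loop fork-join and no unnecessary synchronization. */
    for (int i = lo; i < hi; i++) y[i] += x[i];      /* kernel 2 on owned range */
  }
}
</pre><div><br></div>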
<br>
Barry<br>
</font></span><div class="m_-7276025593156647979m_4033405631338937315m_-9069860305387534015HOEnZb"><div class="m_-7276025593156647979m_4033405631338937315m_-9069860305387534015h5"><br>
<br>
> On Jul 6, 2018, at 3:07 PM, Mills, Richard Tran <<a href="mailto:rtmills@anl.gov" target="_blank">rtmills@anl.gov</a>> wrote:<br>
> <br>
> True, Barry. But, unfortunately, I think Jed's argument has something to it because the hybrid MPI + OpenMP model has become so popular. I know of a few codes where adopting this model makes some sense, though I believe that, more often, the model has been adopted simply because it is the fashionable thing to do. Regardless of good or bad reasons for its adoption, I do have some real concern that codes that use this model have a difficult time using PETSc effectively because of the lack of thread support. Like many of us, I had hoped that endpoints would make it into the MPI standard and this would provide a reasonable mechanism for integrating PETSc with codes using MPI+threads, but progress on this seems to have stagnated. I hope that the MPI endpoints effort eventually goes somewhere, but what can we do in the meantime? Within the DOE ECP program, the MPI+threads approach is being pushed really hard, and many of the ECP subprojects have adopted it. I think it's mostly idiotic, but I think it's too late to turn the tide and convince most people that pure MPI is the way to go. Meanwhile, my understanding is that we need to be able to support more of the ECP application projects to justify the substantial funding we are getting from the program. Many of these projects are dead-set on using OpenMP. (I note that I believe that the folks Mark is trying to help with PETSc and OpenMP are people affiliated with Carl Steefel's ECP subsurface project.)<br>
> <br>
> Since it looks like MPI endpoints are going to be a long time (or possibly forever) in coming, I think we need (a) stopgap plan(s) to support this crappy MPI + OpenMP model in the meantime. One possible approach is to do what Mark is trying to do with MKL: Use a third-party library that provides optimized OpenMP implementations of computationally expensive kernels. It might make sense to also consider using Karl's ViennaCL library in this manner, which we already use to support GPUs, but which I believe (Karl, please let me know if I am off-base here) we could also use to provide OpenMP-ized linear algebra operations on CPUs. Such approaches won't use threads for lots of the things that a PETSc code will do, but might be able to provide decent resource utilization for the most expensive parts for some codes.<br>
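<div>A rough sketch of that stopgap in practice, assuming a PETSc build configured with MKL and the AIJMKL matrix types Mark has been testing (exact type names and options may differ by PETSc version):</div><pre>
#include <petscmat.h>

/* Sketch: convert an assembled AIJ matrix to the MKL-backed AIJMKL type so
 * that MatMult and related kernels call MKL's sparse routines, which can run
 * threaded. Thread counts are controlled outside PETSc, e.g. via
 * OMP_NUM_THREADS or MKL_NUM_THREADS; matrices that call MatSetFromOptions()
 * can instead be switched at run time with -mat_type aijmkl. */
static PetscErrorCode UseMKLSparseKernels(Mat *A)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatConvert(*A, MATAIJMKL, MAT_INPLACE_MATRIX, A);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
</pre>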
> <br>
> Clever ideas from anyone on this list about how to use an adequate number of MPI ranks for PETSc while using only a subset of these ranks for the MPI+OpenMP application code will be appreciated, though I don't know if there are any good solutions.<br>
> <br>
> --Richard<br>
> <br>
> On Wed, Jul 4, 2018 at 11:38 PM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>
> <br>
> Jed,<br>
> <br>
> You could use the same argument to argue that PETSc should do "something" to help people who have (rightly or wrongly) chosen to code their application in High Performance Fortran or any other similarly inane parallel programming model.<br>
> <br>
> Barry<br>
> <br>
> <br>
> <br>
> > On Jul 4, 2018, at 11:51 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br>
> > <br>
> > Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> writes:<br>
> > <br>
> >> On Wed, Jul 4, 2018 at 4:51 PM Jeff Hammond <<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a>> wrote:<br>
> >> <br>
> >>> On Wed, Jul 4, 2018 at 6:31 AM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br>
> >>> <br>
> >>>> On Tue, Jul 3, 2018 at 10:32 PM Jeff Hammond <<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a>><br>
> >>>> wrote:<br>
> >>>> <br>
> >>>>> <br>
> >>>>> <br>
> >>>>> On Tue, Jul 3, 2018 at 4:35 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
> >>>>> <br>
> >>>>>> On Tue, Jul 3, 2018 at 1:00 PM Richard Tran Mills <<a href="mailto:rtmills@anl.gov" target="_blank">rtmills@anl.gov</a>><br>
> >>>>>> wrote:<br>
> >>>>>> <br>
> >>>>>>> Hi Mark,<br>
> >>>>>>> <br>
> >>>>>>> I'm glad to see you trying out the AIJMKL stuff. I think you are the<br>
> >>>>>>> first person trying to actually use it, so we are probably going to expose<br>
> >>>>>>> some bugs and also some performance issues. My somewhat limited testing has<br>
> >>>>>>> shown that the MKL sparse routines often perform worse than our own<br>
> >>>>>>> implementations in PETSc.<br>
> >>>>>>> <br>
> >>>>>> <br>
> >>>>>> My users just want OpenMP.<br>
> >>>>>> <br>
> >>>>>> <br>
> >>>>> <br>
> >>>>> Why not just add OpenMP to PETSc? I know certain developers hate it, but<br>
> >>>>> it is silly to let a principled objection stand in the way of enabling users<br>
> >>>>> <br>
> >>>> <br>
> >>>> "if that would deliver the best performance for NERSC users."<br>
> >>>> <br>
> >>>> You have answered your own question.<br>
> >>>> <br>
> >>> <br>
> >>> Please share the results of your experiments that prove OpenMP does not<br>
> >>> improve performance for Mark’s users.<br>
> >>> <br>
> >> <br>
> >> Oh God. I am supremely uninterested in minutely proving yet again that<br>
> >> OpenMP is not better than MPI.<br>
> >> There are already countless experiments. One more will not add anything of<br>
> >> merit.<br>
> > <br>
> > Jeff assumes an absurd null hypothesis, Matt selfishly believes that<br>
> > users should modify their code/execution environment to subscribe to a<br>
> > more robust and equally performant approach, and the MPI forum abdicates<br>
> > by stalling on endpoints. How do we resolve this?<br>
> > <br>
> >>>> Also we are not in the habit of fucking up our codebase in order to follow<br>
> >>>> some fad.<br>
> >>>> <br>
> >>> <br>
> >>> If you can’t use OpenMP without messing up your code base, you probably<br>
> >>> don’t know how to design software.<br>
> >>> <br>
> >> <br>
> >> That is an interesting, if wrong, opinion. It might be your contention that<br>
> >> sticking any random paradigm in a library should<br>
> >> be alright if it's "well designed"? I have never encountered such a<br>
> >> well-designed library.<br>
> >> <br>
> >> <br>
> >>> I guess if you refuse to use _Pragma because C99 is still a fad for you,<br>
> >>> it is harder, but clearly _Complex is tolerated.<br>
> >>> <br>
> >> <br>
> >> Yes, littering your code with preprocessor directives improves almost<br>
> >> everything. Doing proper resource management<br>
> >> using Pragmas, in an environment with several layers of libraries, is a<br>
> >> dream.<br>
> >> <br>
> >> <br>
> >>> More seriously, you’ve adopted OpenMP hidden behind MKL<br>
> >>> <br>
> >> <br>
> >> Nope. We can use MKL with that crap shut off.<br>
> >> <br>
> >> <br>
> >>> so I see no reason why you can’t wrap OpenMP implementations of the PETSc<br>
> >>> sparse kernels in a similar manner.<br>
> >>> <br>
> >> <br>
> >> We could, it's just a colossal waste of time and effort, as well as<br>
> >> counterproductive for the codebase :)<br>
> > <br>
> > Endpoints either need to become a thing we can depend on or we need a<br>
> > solution for users that insist on using threads (even if their decision<br>
> > to use threads is objectively bad). The problem Matt harps on is<br>
> > legitimate: OpenMP parallel regions cannot reliably cross module<br>
> > boundaries except for embarrassingly parallel operations. This means<br>
> > loop-level omp parallel, which significantly increases overhead for small<br>
> > problem sizes (e.g., slowing coarse grid solves and strong scaling<br>
> > limits). It can be done and isn't that hard, but the Imperial group<br>
> > discarded their branch after observing that it also provided no<br>
> > performance benefit. However, I'm coming around to the idea that PETSc<br>
> > should do it so that there is _a_ solution for users that insist on<br>
> > using threads in a particular way. Unless Endpoints become available<br>
> > and reliable, in which case we could do it right.<br>
> <br>
> <br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="m_-7276025593156647979m_4033405631338937315m_-9069860305387534015gmail_signature" data-smartmail="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</div></div>
</blockquote></div></div></div><br clear="all"><span><div><br></div>-- <br><div dir="ltr" class="m_-7276025593156647979m_4033405631338937315gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.caam.rice.edu/~mk51/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></span></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="m_-7276025593156647979gmail_signature" data-smartmail="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.caam.rice.edu/~mk51/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div>