<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><br class=""></div>   You might take a look at <a href="https://publications.anl.gov/anlpubs/2020/04/159190.pdf" class="">https://publications.anl.gov/anlpubs/2020/04/159190.pdf</a>. In the introduction there is a short discussion about some of the <div class="">"gotchas" when using multi-core CPUs connected to multiple GPUs. It is focused on an IBM Power/NVIDIA GPU system, but the same abstract issues</div><div class="">will arise on any similar system.</div><div class=""><br class=""></div><div class="">1) How many cores should share a GPU?   Generally 1, but there may be exceptions.</div><div class=""><br class=""></div><div class="">2) Is there special memory that can be copied to/from GPUs faster, and should it be turned on?</div><div class=""><br class=""></div><div class="">3) Is it worth doing anything with the "extra" cores that are not accessing a GPU?   
Probably not, but there may be exceptions.</div><div class=""><br class=""></div><div class="">4) How should one communicate between nodes with MPI? Can one go directly from GPU to GPU and skip the CPU memory?</div><div class=""><br class=""></div><div class="">  Barry</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Jun 9, 2020, at 7:51 PM, Junchao Zhang &lt;<a href="mailto:junchao.zhang@gmail.com" class="">junchao.zhang@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div dir="ltr" class=""><br class=""></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 9, 2020 at 7:11 PM GIBB Gordon &lt;<a href="mailto:g.gibb@epcc.ed.ac.uk" class="">g.gibb@epcc.ed.ac.uk</a>&gt; wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div style="overflow-wrap: break-word;" class="">

Hi,

<div class=""><br class="">

</div>

<div class="">First of all, my apologies if this is not the appropriate list to send these questions to.</div>

<div class=""><br class="">

</div>

<div class="">

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

I’m one of the developers of TPLS (<a href="https://sourceforge.net/projects/tpls/" target="_blank" class=""><span style="color:rgb(255,204,19)" class="">https://sourceforge.net/projects/tpls/</span></a>), a Fortran code that uses PETSc, parallelised using DM vectors.

 It uses a mix of our own solvers, and PETSc’s Krylov solvers. At present it has been run on up to 25,000 MPI processes, although larger problem sizes should be able to scale beyond that.</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

With the awareness that more and more HPC machines now have one or more GPUs per node, and that upcoming machines that approach/achieve Exascale will be heterogeneous in nature, we are investigating whether it is worth using GPUs with TPLS, and if so, how to

 best do this.</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

I see that in principle all we’d need to do is set some flags as described at <a href="https://www.mcs.anl.gov/petsc/features/gpus.html" target="_blank" class="">

<span style="color:rgb(255,204,19)" class="">https://www.mcs.anl.gov/petsc/features/gpus.html</span></a> to offload work onto the GPU. However, I have some questions about doing this in practice:</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

The GPU machine I have access to has nodes with two 20-core CPUs and 4 NVIDIA GPUs (so 10 cores per GPU). We could use CUDA or OpenCL, and may well explore both of them. With TPLS being an MPI application, we would wish to use many processes (and nodes), not

 just a single process. How would we best split this problem up? </div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

Would we have 1 MPI process per GPU (so 4 per node), and then implement our own solvers either to also work on the GPU, or use OpenMP to make use of the 10 cores per GPU? If so, how would we specify to PETSc which GPU each process would use? </div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

Would we instead just have 40 (or perhaps slightly fewer) MPI processes all sharing the GPUs? Surely this would be inefficient, and would PETSc distribute the work across all 4 GPUs, or would every process end up using a single GPU?</div></div></div></blockquote><div class="">See <a href="https://docs.olcf.ornl.gov/systems/summit_user_guide.html#volta-multi-process-service" class="">https://docs.olcf.ornl.gov/systems/summit_user_guide.html#volta-multi-process-service</a>.  In some cases, we did see better performance with multiple MPI ranks/GPU than with 1 rank/GPU. The optimal configuration depends on the code. Consider two extremes: one code that does all of its work on the GPU, and another that does all of its work on the CPU. You probably only need 1 MPI rank/node for the former, but ranks on all cores for the latter. </div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class=""><div class="">

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

Would the Krylov solvers be blocking whilst the GPUs are in use running the solvers, or would the host code be able to continue and carry out other calculations whilst waiting for the GPU code to finish? We may need to modify our algorithm to allow for this,

 but it would make sense to introduce some concurrency so that the CPUs aren’t idling whilst waiting for the GPUs to complete their work.</div></div></div></blockquote><div class="">We use asynchronous kernel launch and split-phase communication (VecScatterBegin/End). As long as there is no dependency, you can overlap computations on CPU and GPU, or computations with communications. </div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class=""><div class="">

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

Finally, I’m trying to get the OpenCL build of PETSc to work on my laptop (a MacBook Pro with a discrete AMD Radeon R9 M370X GPU). This is mostly because our GPU cluster is out of action until at least late June and I want to get a head start on experimenting with GPUs

 and TPLS. When I try to run TPLS with the ViennaCL PETSc, it reports that my GPU is unable to support double precision. I confirmed that my discrete GPU does support this; however, my integrated GPU (Intel Iris) does not. I suspect that ViennaCL is using my

 integrated GPU instead of my discrete one (it is listed as GPU 0 by OpenCL, with the AMD card as GPU 1). Is there any way of getting PETSc to report which OpenCL device is in use, or to select which device to use? I saw there was some discussion about this

 on the mailing list archives but I couldn’t find any conclusion.</div></div></div></blockquote><div class="">No experience. Karl Rupp (cc'ed) might know.</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class=""><div class="">

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

Thanks in advance for your help,</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

Regards,</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue";min-height:14px" class="">

<br class="">

</div>

<div style="margin:0px;font-stretch:normal;line-height:normal;font-family:"Helvetica Neue"" class="">

Gordon</div>

</div>

<div class=""><br class="">

</div>

<div class="">

<div class="">

<div style="overflow-wrap: break-word;" class="">

<div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;" class="">

-----------------------------------------------<br class="">

Dr Gordon P S Gibb<br class="">

EPCC, The University of Edinburgh</div>

<div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px;" class="">

Tel: +44 131 651 3459</div>

</div>

</div>

<br class="">

</div>


</div>

</blockquote></div></div>
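<div class=""><br class=""></div>
<div class="">On the question of which GPU each MPI process would use, here is a minimal sketch (in Python, for brevity) of two common ways to map a node-local rank to a GPU index. The function names are hypothetical, and the sketch assumes the node-local rank is already known, for example from the job launcher or from a communicator split with MPI_COMM_TYPE_SHARED. It only illustrates the arithmetic; it is not a description of how PETSc selects devices internally.</div>
<pre class="">
```python
def gpu_round_robin(local_rank, gpus_per_node):
    """Spread node-local ranks over the GPUs cyclically: 0, 1, 2, 3, 0, 1, ..."""
    return local_rank % gpus_per_node


def gpu_blocked(local_rank, ranks_per_node, gpus_per_node):
    """Give each GPU one contiguous block of node-local ranks."""
    # Ceiling division, so a leftover rank still maps to a valid GPU index.
    ranks_per_gpu = -(-ranks_per_node // gpus_per_node)
    return local_rank // ranks_per_gpu


# The node described in the thread: 40 cores and 4 GPUs (hypothetical usage).
for local_rank in (0, 9, 10, 39):
    print(local_rank,
          gpu_round_robin(local_rank, 4),
          gpu_blocked(local_rank, 40, 4))
```
</pre>
<div class="">With the round-robin mapping, neighbouring ranks land on different GPUs; with the blocked mapping, each group of 10 consecutive ranks shares one GPU. Which is preferable likely depends on how the cores and GPUs are wired together on the node.</div>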

</div></blockquote></div><br class=""></div></body></html>