<div dir="ltr"><div dir="ltr">On Fri, Sep 8, 2023 at 4:53 PM Chris Hewson <<a href="mailto:chris@resfrac.com">chris@resfrac.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi There,<div><br></div><div>I am trying to solve a linear problem and am having an issue where KSPSolve slows down considerably the more MPI processes I add.</div><div><br></div><div>The matrix itself is 620100 x 620100 with ~5 million non-zero entries. I am using PETSc version 3.19.5 and have tried a couple of different versions of MPICH, getting the same behavior (v4.1.2 w/ device ch4:ofi and v3.3.2 w/ ch3:sock).</div><div><br></div><div>In testing, I've noticed the following trend in the time for the KSPSolve function call:</div><div>1 core: 4042 ms<br>2 cores: 7085 ms<br>4 cores: 26573 ms<br>8 cores: 65745 ms<br>16 cores: 149283 ms<br></div><div><br></div><div>This was all done on a single-node machine w/ 16 non-hyperthreaded cores. We solve quite a few different matrices with PETSc using MPI and haven't noticed an impact like this on performance before.</div><div><br></div><div>I am very confused by this and am a little stumped at the moment as to why it is happening. I've been using the KSPBCGS solver. I have tried multiple different solvers and preconditioners (we usually don't use a preconditioner for this part of our code).</div><div><br></div><div>Using the pipelined BCGS solver did seem to improve the parallel speed slightly (maybe 15%), but it still doesn't come close to the single-core speed.</div><div><br></div><div>I'll attach a link to a folder that contains the specific A matrix and x and b vectors for this problem, as well as a main.cpp file that I was using for testing.
</div><div><br></div><div><a href="https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing" target="_blank">https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing</a><br></div><div><br></div><div>I tested this in our main code base (not included here) and observed very similar speed results to the ones above. We do use METIS for graph partitioning in our own code; I checked the vector and matrix partitioning there and it all made sense. I could be doing the partitioning incorrectly in the example (I'm not 100% sure how it works with the viewer/load functions).</div><div><br></div><div>Any insight or thoughts on this would be greatly appreciated.</div></div></blockquote><div><br></div><div>Send the output of -log_view for each case.</div><div><br></div><div>These are all memory bandwidth-limited operations. When you exhaust the bandwidth, you should</div><div>see the performance stagnate, not steeply decay as you do here. Something definitely seems wrong.</div><div>The first step is sending us the logs.</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Thanks,</div><div><div><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><b><br></b></div><div dir="ltr"><b>Chris Hewson</b><div>Senior Reservoir Simulation Engineer</div><div>ResFrac</div><div>+1.587.575.9792</div></div></div></div></div></div></div></div></div></div>
</blockquote></div><br clear="all"><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>
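<div><br></div><div>[Editor's note: the per-case logs requested above can be gathered with something like the sketch below. It assumes the test program from main.cpp is built as <code>./main</code> (a hypothetical binary name); <code>-log_view</code> and its <code>:filename</code> redirect are standard PETSc runtime options, printed at PetscFinalize().</div><div><br></div>

```shell
# Repeat the solve at each core count from the timing table and
# write PETSc's performance summary (event timings for MatMult,
# VecDot, KSPSolve, ..., plus message counts) to one file per run.
for n in 1 2 4 8 16; do
  mpiexec -n "$n" ./main -log_view ":log_${n}core.txt"
done
```

<div>Comparing the VecDot/VecNorm (reduction) and MatMult (neighbor communication) lines across the five files usually shows where the extra time is going.</div>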