[petsc-users] Slower performance using more MPI processes
Matthew Knepley
knepley at gmail.com
Fri Sep 8 15:59:31 CDT 2023
On Fri, Sep 8, 2023 at 4:53 PM Chris Hewson <chris at resfrac.com> wrote:
> Hi There,
>
> I am trying to solve a linear problem and am finding that KSPSolve slows
> down considerably as I add more MPI processes.
>
> The matrix itself is 620100 x 620100 with ~5 million non-zero entries. I
> am using PETSc version 3.19.5 and have tried a couple of different
> versions of MPICH, getting the same behavior (v4.1.2 w/ device ch4:ofi and
> v3.3.2 w/ ch3:sock).
>
> In testing, I've noticed the following trend for speed for the KSPSolve
> function call:
> 1 core: 4042 ms
> 2 core: 7085 ms
> 4 core: 26573 ms
> 8 core: 65745 ms
> 16 core: 149283 ms
>
> This was all done on a single node machine w/ 16 non-hyperthreaded cores.
> We solve quite a few different matrices with PETSc using MPI and haven't
> noticed an impact like this on performance before.
>
> I am very confused by this and am a little stumped at the moment as to why
> this is happening. I've been using the KSPBCGS solver to solve the
> problem. I have tried multiple different solvers and preconditioners
> (we usually don't use a preconditioner for this part of our code).
>
> It did seem that using the piped BCGS solver did help improve the parallel
> speed slightly (maybe 15%), but it still doesn't come close to the single
> threaded speed.
>
> I'll attach a link to a folder that contains the specific A, x and b
> matrices for this problem, as well as a main.cpp file that I was using for
> testing.
>
>
> https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing
>
> I was testing this in our main code base, which I haven't included here, and
> observed very similar timings to the ones above. We do use Metis to
> graph partition in our own code, and I checked the vector and matrix
> partitioning, which all made sense. I could be doing the partitioning
> incorrectly in the example (not 100% sure how it works with the viewer/load
> functions).
>
> Any insight or thoughts on this would be greatly appreciated.
>
Send the output of -log_view for each case.
These are all memory bandwidth-limited operations. When you exhaust the
bandwidth you should see the performance stagnate, not steeply decay as
you do here. Something definitely seems wrong.
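
To make the "stagnate vs. decay" point concrete, here is a small sketch that turns the KSPSolve timings quoted above into speedup and parallel efficiency. The timings are the ones from the report; the function name and output layout are only illustrative. A bandwidth-limited solve would typically show speedup leveling off at some plateau, whereas these numbers fall well below 1 as processes are added.

```python
# KSPSolve wall times (ms) as reported in the quoted message, keyed by MPI process count.
REPORTED_MS = {1: 4042, 2: 7085, 4: 26573, 8: 65745, 16: 149283}


def scaling_table(times_ms):
    """Return (procs, speedup, efficiency) tuples relative to the 1-process time."""
    base = times_ms[1]
    rows = []
    for procs in sorted(times_ms):
        speedup = base / times_ms[procs]   # > 1 means faster than serial
        efficiency = speedup / procs       # 1.0 would be ideal scaling
        rows.append((procs, speedup, efficiency))
    return rows


if __name__ == "__main__":
    print(f"{'procs':>5} {'speedup':>8} {'efficiency':>10}")
    for procs, speedup, efficiency in scaling_table(REPORTED_MS):
        print(f"{procs:>5} {speedup:>8.3f} {efficiency:>10.4f}")
```

Every multi-process run here has a speedup below 1, and it keeps shrinking, which is the steep decay rather than a plateau.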
The first step is sending us the logs.
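
One way to collect those logs, sketched under assumptions: the helper below only builds the command lines, one per process count from the report, and writes each `-log_view` summary to its own file using the `-log_view :filename` form. The executable name `./main`, the `mpiexec` launcher, and the log file names are placeholders, not something taken from the attached test code.

```python
# Process counts taken from the timings in the quoted message.
PROCESS_COUNTS = [1, 2, 4, 8, 16]


def log_view_commands(executable="./main", counts=PROCESS_COUNTS):
    """Build one argv list per process count, each writing -log_view to its own file.

    `executable` and the log file naming are hypothetical; adjust to the real test binary.
    """
    commands = []
    for n in counts:
        log_file = f"log_np{n}.txt"
        commands.append(
            ["mpiexec", "-n", str(n), executable, "-log_view", f":{log_file}"]
        )
    return commands


if __name__ == "__main__":
    # Print the commands rather than running them, so they can be inspected first.
    for cmd in log_view_commands():
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` as-is once the executable path and launcher match the local setup.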
Thanks,
Matt
> Thanks,
>
> *Chris Hewson*
> Senior Reservoir Simulation Engineer
> ResFrac
> +1.587.575.9792
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/