<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Junchao,<br>
<br>
Attached is a graph of total RSS from my Mac using openmpi and
mpich (installed with --download-openmpi and --download-mpich).<br>
<br>
The difference is pretty stark! The WaitAll( ) in my part of the
code fixed the runaway memory<br>
problem using openmpi but definitely not with mpich.<br>
<br>
Tomorrow I hope to get my linux box set up; unfortunately it needs
an OS update :(<br>
Then I can try to run there and reproduce the same behavior (or find out it
is a Mac quirk, though the<br>
reason I started looking at this was that a user on an HPC system
pointed it out to me).<br>
<br>
-sanjay<br>
<br>
PS: To generate the data, all I did was place a call to
PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by
an MPI_Allreduce( ) to sum across the job (4 processors).<br>
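<br>
A minimal sketch of that instrumentation (written here in C with
placeholder names for the solver objects; the actual code and call
sites differ) is:<br>
<pre>
/* Hedged sketch: ksp, b, x are placeholders, not the names used in
   the actual application code. */
#include <petscksp.h>

PetscErrorCode ReportTotalRSS(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode ierr;
  PetscLogDouble rss, rss_total;

  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);   /* per-rank RSS in bytes */
  ierr = MPI_Allreduce(&rss, &rss_total, 1, MPI_DOUBLE, MPI_SUM,
                       PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "Total RSS after KSPSolve: %g\n",
                     rss_total);CHKERRQ(ierr);
  return 0;
}
</pre>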
<pre class="moz-signature" cols="72">
</pre>
<div class="moz-cite-prefix">On 6/4/19 4:27 PM, Zhang, Junchao
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+MQGp-GEq1D1tykJ1L6VVy6K0fuzUh3eJP=ZFafbPnBh8DPVg@mail.gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div dir="ltr">Hi, Sanjay,
<div> I managed to use Valgrind massif + MPICH master + PETSc
master. I ran ex5 for 500 time steps with "mpirun -n 4 valgrind
--tool=massif --max-snapshots=200 --detailed-freq=1 ./ex5
-da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps
500 -malloc"</div>
<div> I visualized the output with massif-visualizer. From the
attached picture, we can see that the total heap size stays
constant most of the time and is NOT
monotonically increasing. We can also see that MPI only allocated
memory at initialization time and then kept it. So it is unlikely
that MPICH keeps allocating memory in each KSPSolve call.</div>
<div> From the graphs you sent, I can only see that RSS randomly
increases after KSPSolve, but that does not mean the heap size
keeps increasing. I recommend you also profile your code with
valgrind massif and visualize the result. I failed to install
massif-visualizer on MacBook and CentOS, but I easily got it
installed on Ubuntu.</div>
<div> I want you to confirm that with the MPI_Waitall fix, you
still run out of memory with MPICH (but not OpenMPI). If
needed, I can hack MPICH to get its current memory usage so
that we can calculate its difference after each KSPSolve call.</div>
<div> </div>
<div>
<div><img src="cid:part1.B91F8C7C.750093BF@berkeley.edu"
alt="massif-ex5.png" class="" width="562" height="382"><br>
</div>
</div>
<div><br>
</div>
<div><br clear="all">
<div>
<div dir="ltr" class="gmail_signature"
data-smartmail="gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Jun 3, 2019 at 6:36 PM
Sanjay Govindjee <<a href="mailto:s_g@berkeley.edu"
moz-do-not-send="true">s_g@berkeley.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">Junchao,<br>
It won't be feasible to share the code, but I will run a
similar test to the one you have done (large problem); I will<br>
try with both MPICH and OpenMPI. I also agree that deltas
are not ideal, as they do not account for latency in
the freeing of memory,<br>
etc. But I will note that when we have the memory growth issue,
latency associated with free( ) appears not to be in play,
since the total<br>
memory footprint grows monotonically.<br>
<br>
I'll also have a look at massif. If you figure out the
interface and can send me the lines to instrument the code
with, that will save me<br>
some time.<br>
-sanjay<br>
<div class="gmail-m_3580431406949325625moz-cite-prefix">On
6/3/19 3:17 PM, Zhang, Junchao wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Sanjay & Barry,
<div> Sorry, I made a mistake when I said I could
reproduce Sanjay's experiments. I found that 1) to
correctly use PetscMallocGetCurrentUsage() when PETSc
is configured without debugging, I have to add -malloc
when running the program; 2) I have to instrument the code
outside of KSPSolve(). In my case, that is
in SNESSolve_NEWTONLS. In the old experiments, I did it
inside KSPSolve. Since KSPSolve can recursively call
KSPSolve, the old results were misleading.</div>
<div> With these fixes, I measured the differences in RSS
and PETSc malloc'd memory before/after KSPSolve. I ran the
experiments on my MacBook
using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
with commands like mpirun -n 4 ./ex5 -da_grid_x 64
-da_grid_y 64 -ts_type beuler -ts_max_steps 500
-malloc.</div>
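<div> (For illustration only, the before/after measurement can be
sketched roughly as below, with made-up variable names; the real
instrumentation sits inside SNESSolve_NEWTONLS rather than in user
code.)</div>
<pre>
/* Rough sketch of the delta measurement described above. */
#include <petscksp.h>

PetscErrorCode MeasureKSPSolveDeltas(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode ierr;
  PetscLogDouble rss0, rss1, mal0, mal1, drss, dmal;

  ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);
  ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);  /* needs -malloc in optimized builds */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
  ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);
  drss = rss1 - rss0;
  dmal = mal1 - mal0;
  /* sum the per-rank deltas across the job */
  ierr = MPI_Allreduce(MPI_IN_PLACE, &drss, 1, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = MPI_Allreduce(MPI_IN_PLACE, &dmal, 1, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "RSS Delta=%g, Malloc Delta=%g, RSS End=%g\n",
                     drss, dmal, rss1);CHKERRQ(ierr);
  return 0;
}
</pre>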
<div> I find that if the grid size is small, I see a
non-zero RSS delta randomly, with either one MPI rank
or multiple ranks, and with MPICH or OpenMPI. If I
increase the grid size, e.g., -da_grid_x 256 -da_grid_y
256, I only see non-zero RSS deltas randomly in the
first few iterations (with MPICH or OpenMPI). When the
computer's workload is high because ex5-openmpi and
ex5-mpich are running simultaneously, the MPICH run shows
many more non-zero RSS deltas. But the "Malloc Delta" behavior
is stable across all runs: there is only one nonzero
malloc delta value, in the first KSPSolve call, and all the
remaining ones are zero. Something like this:</div>
<blockquote style="margin:0px 0px 0px
40px;border:none;padding:0px">
<div><font face="courier new, monospace">mpirun -n 4
./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type
beuler -ts_max_steps 500 -malloc</font></div>
<div><font face="courier new, monospace">RSS Delta=
32489472, Malloc Delta= 26290304, RSS
End= 136114176</font></div>
<div><font face="courier new, monospace">RSS Delta=
32768, Malloc Delta= 0, RSS
End= 138510336</font></div>
<div><font face="courier new, monospace">RSS Delta=
0, Malloc Delta= 0, RSS
End= 138522624</font></div>
<div><font face="courier new, monospace">RSS Delta=
0, Malloc Delta= 0, RSS
End= 138539008</font></div>
</blockquote>
<div>So I think I can conclude that there is no unfreed
memory allocated by PETSc in KSPSolve(). Has MPICH
allocated unfreed memory in KSPSolve? That is possible,
and I am trying to find a way, like
PetscMallocGetCurrentUsage(), to measure that. Also, I
think the RSS delta is not a good way to measure memory
allocation. It is dynamic and depends on the state of the
computer (swap, shared libraries loaded, etc.) when
running the code. We should focus on malloc instead.
If there were a valgrind tool, like the performance
profiling tools, that could let users measure memory
allocated but not freed in a user-specified code
segment, it would be very helpful in this case. But
I have not found one.<br>
</div>
<div><br>
</div>
<div>Sanjay, did you say that you can currently run with
OpenMPI without running out of memory, but that with MPICH you run
out of memory? Is it feasible to share your code so
that I can test with it? Thanks.</div>
<div><br>
</div>
<div>--Junchao Zhang<br>
</div>
<div><br>
</div>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Jun 1, 2019
at 3:21 AM Sanjay Govindjee <<a
href="mailto:s_g@berkeley.edu" target="_blank"
moz-do-not-send="true">s_g@berkeley.edu</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
Barry,<br>
<br>
If you look at the graphs I generated (on my Mac),
you will see that <br>
OpenMPI and MPICH have very different values (along
with the fact that <br>
MPICH does not seem to adhere<br>
to the standard for releasing MPI_Isend resources
following an MPI_Wait).<br>
<br>
-sanjay<br>
<br>
PS: I agree with Barry's assessment; this is really
not acceptable.<br>
<br>
On 6/1/19 1:00 AM, Smith, Barry F. wrote:<br>
> Junchao,<br>
><br>
> This is insane. Either the OpenMPI
library or something in the OS underneath related to
sockets and interprocess communication is grabbing
additional space for each round of MPI
communication! Does MPICH have the same values or
different values than OpenMPI? When you run on Linux,
do you get the same values as on the Mac, or different?
Same values would indicate the issue is inside
OpenMPI/MPICH; different values would indicate the problem is
more likely at the OS level. Does this happen only
with the default VecScatter that uses blocking MPI?
What happens with PetscSF under Vec? Is it somehow
related to PETSc's use of nonblocking sends and
receives? One could presumably use valgrind to see
exactly what lines in what code are causing these
increases. I don't think we can just shrug and say
this is the way it is; we need to track down and
understand the cause (and fix it if possible).<br>
><br>
> Barry<br>
><br>
><br>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao
<<a href="mailto:jczhang@mcs.anl.gov"
target="_blank" moz-do-not-send="true">jczhang@mcs.anl.gov</a>>
wrote:<br>
>><br>
>> Sanjay,<br>
>> I tried PETSc with MPICH and OpenMPI on my
MacBook. I inserted
PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage
at the beginning and end of KSPSolve and then
computed the delta and summed over processes. Then I
tested with
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c<br>
>> With OpenMPI,<br>
>> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y
128 -ts_type beuler -ts_max_steps 500 > 128.log<br>
>> grep -n -v "RSS Delta= 0, Malloc
Delta= 0" 128.log<br>
>> 1:RSS Delta= 69632, Malloc Delta=
0<br>
>> 2:RSS Delta= 69632, Malloc Delta=
0<br>
>> 3:RSS Delta= 69632, Malloc Delta=
0<br>
>> 4:RSS Delta= 69632, Malloc Delta=
0<br>
>> 9:RSS Delta=9.25286e+06, Malloc Delta=
0<br>
>> 22:RSS Delta= 49152, Malloc Delta=
0<br>
>> 44:RSS Delta= 20480, Malloc Delta=
0<br>
>> 53:RSS Delta= 49152, Malloc Delta=
0<br>
>> 66:RSS Delta= 4096, Malloc Delta=
0<br>
>> 97:RSS Delta= 16384, Malloc Delta=
0<br>
>> 119:RSS Delta= 20480, Malloc Delta=
0<br>
>> 141:RSS Delta= 53248, Malloc Delta=
0<br>
>> 176:RSS Delta= 16384, Malloc Delta=
0<br>
>> 308:RSS Delta= 16384, Malloc Delta=
0<br>
>> 352:RSS Delta= 16384, Malloc Delta=
0<br>
>> 550:RSS Delta= 16384, Malloc Delta=
0<br>
>> 572:RSS Delta= 16384, Malloc Delta=
0<br>
>> 669:RSS Delta= 40960, Malloc Delta=
0<br>
>> 924:RSS Delta= 32768, Malloc Delta=
0<br>
>> 1694:RSS Delta= 20480, Malloc Delta=
0<br>
>> 2099:RSS Delta= 16384, Malloc Delta=
0<br>
>> 2244:RSS Delta= 20480, Malloc Delta=
0<br>
>> 3001:RSS Delta= 16384, Malloc Delta=
0<br>
>> 5883:RSS Delta= 16384, Malloc Delta=
0<br>
>><br>
>> If I increased the grid<br>
>> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y
512 -ts_type beuler -ts_max_steps 500 -malloc_test
>512.log<br>
>> grep -n -v "RSS Delta= 0, Malloc
Delta= 0" 512.log<br>
>> 1:RSS Delta=1.05267e+06, Malloc Delta=
0<br>
>> 2:RSS Delta=1.05267e+06, Malloc Delta=
0<br>
>> 3:RSS Delta=1.05267e+06, Malloc Delta=
0<br>
>> 4:RSS Delta=1.05267e+06, Malloc Delta=
0<br>
>> 13:RSS Delta=1.24932e+08, Malloc Delta=
0<br>
>><br>
>> So we did see RSS increase in 4k-page increments
after KSPSolve. As long as there are no memory leaks, why do
you care about it? Is it because you run out of
memory?<br>
>><br>
>> On Thu, May 30, 2019 at 1:59 PM Smith,
Barry F. <<a href="mailto:bsmith@mcs.anl.gov"
target="_blank" moz-do-not-send="true">bsmith@mcs.anl.gov</a>>
wrote:<br>
>><br>
>> Thanks for the update. So the current
conclusions are that using the Waitall in your code<br>
>><br>
>> 1) solves the memory issue with OpenMPI in
your code<br>
>><br>
>> 2) does not solve the memory issue with
PETSc KSPSolve<br>
>><br>
>> 3) MPICH has memory issues both for your
code and PETSc KSPSolve (despite the Waitall fix)?<br>
>><br>
>> If you literally just comment out the call
to KSPSolve(), is there no growth in
memory usage with OpenMPI?<br>
>><br>
>><br>
>> Both 2 and 3 are concerning; they indicate
possible memory leak bugs in MPICH and/or a failure to free
all MPI resources in KSPSolve().<br>
>><br>
>> Junchao, can you please investigate 2 and 3
with, for example, a TS example that uses the linear
solver (like with -ts_type beuler)? Thanks<br>
>><br>
>><br>
>> Barry<br>
>><br>
>><br>
>><br>
>>> On May 30, 2019, at 1:47 PM, Sanjay
Govindjee <<a href="mailto:s_g@berkeley.edu"
target="_blank" moz-do-not-send="true">s_g@berkeley.edu</a>>
wrote:<br>
>>><br>
>>> Lawrence,<br>
>>> Thanks for taking a look! This is what
I had been wondering about -- my knowledge of MPI is
pretty minimal and<br>
>>> the origins of the routine lie with a
programmer we hired a decade+ back from NERSC. I'll
have to look into<br>
>>> VecScatter. It will be great to
dispense with our roll-your-own routines (we even
have our own reduceALL scattered around the code).<br>
>>><br>
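>>> As an illustration, a minimal VecScatter-based global-to-local
exchange (with hypothetical vector and index-set names) could look
like the following sketch:<br>
<pre>
/* Hedged sketch: gvec, lvec, and the index sets stand in for whatever
   global/ghosted layout the application actually uses. */
#include <petscvec.h>

PetscErrorCode GlobalToLocalExchange(Vec gvec, Vec lvec, IS from, IS to)
{
  PetscErrorCode ierr;
  VecScatter     ctx;

  ierr = VecScatterCreate(gvec, from, lvec, to, &ctx);CHKERRQ(ierr);
  ierr = VecScatterBegin(ctx, gvec, lvec, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(ctx, gvec, lvec, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterDestroy(&ctx);CHKERRQ(ierr);
  return 0;
}
</pre>
>>> (In practice the scatter context would be created once during setup
and reused for every exchange rather than rebuilt each time.)<br>
>>><br>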
>>> Interestingly, the MPI_Waitall has
solved the problem when using OpenMPI, but it still
persists with MPICH. Graphs attached.<br>
>>> I'm going to run with openmpi for now
(but I guess I really still need to figure out what
is wrong with MPICH and WaitALL;<br>
>>> I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all
--enable-g" later today and report back).<br>
>>><br>
>>> Regarding MPI_Barrier, it was put in
due to a problem where some processes were finishing up
their sends and receives and exiting the subroutine<br>
>>> before the receiving processes had
completed (which resulted in data loss, as the
buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed<br>
>>> to us. I don't think I can dispense
with it, but I will think about it some more.<br>
>>><br>
>>> I'm not so sure about using MPI_Irecv,
as it will require a bit of rewriting since right
now I process the received<br>
>>> data sequentially after each blocking
MPI_Recv -- clearly slower but easier to code.<br>
>>><br>
>>> Thanks again for the help.<br>
>>><br>
>>> -sanjay<br>
>>><br>
>>> On 5/30/19 4:48 AM, Lawrence Mitchell
wrote:<br>
>>>> Hi Sanjay,<br>
>>>><br>
>>>>> On 30 May 2019, at 08:58,
Sanjay Govindjee via petsc-users <<a
href="mailto:petsc-users@mcs.anl.gov"
target="_blank" moz-do-not-send="true">petsc-users@mcs.anl.gov</a>>
wrote:<br>
>>>>><br>
>>>>> The problem seems to persist
but with a different signature. Graphs attached as
before.<br>
>>>>><br>
>>>>> Totals with MPICH (NB: single
run)<br>
>>>>><br>
>>>>> For the CG/Jacobi
data_exchange_total = 41,385,984; kspsolve_total =
38,289,408<br>
>>>>> For the GMRES/BJACOBI
data_exchange_total = 41,324,544; kspsolve_total =
41,324,544<br>
>>>>><br>
>>>>> Just reading the MPI docs I am
wondering if I need some sort of
MPI_Wait/MPI_Waitall before my MPI_Barrier in the
data exchange routine?<br>
>>>>> I would have thought that with
the blocking receives and the MPI_Barrier,
everything would have fully completed and been cleaned up
before<br>
>>>>> all processes exited the
routine, but perhaps I am wrong about that.<br>
>>>> Skimming the Fortran code you sent,
you do:<br>
>>>><br>
>>>> for i in ...:<br>
>>>> call MPI_Isend(..., req, ierr)<br>
>>>><br>
>>>> for i in ...:<br>
>>>> call MPI_Recv(..., ierr)<br>
>>>><br>
>>>> But you never call MPI_Wait on the
request you got back from the Isend. So the MPI
library will never free the data structures it
created.<br>
>>>><br>
>>>> The usual pattern for these
non-blocking communications is to allocate an array
for the requests of length nsend+nrecv and then do:<br>
>>>><br>
>>>> for i in nsend:<br>
>>>> call MPI_Isend(..., req[i],
ierr)<br>
>>>> for j in nrecv:<br>
>>>> call MPI_Irecv(...,
req[nsend+j], ierr)<br>
>>>><br>
>>>> call MPI_Waitall(req, ..., ierr)<br>
>>>><br>
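>>>> Filled out a little (shown with the C bindings purely for
illustration; the buffers, counts, tags, and neighbour ranks are
placeholders), that pattern is roughly:<br>
<pre>
/* Hedged sketch of the Isend/Irecv/Waitall pattern; the buffers, counts,
   tag, and neighbour ranks are placeholders for the application's own. */
#include <mpi.h>
#include <stdlib.h>

void exchange(MPI_Comm comm, int nsend, int nrecv,
              double **sendbuf, const int *sendcount, const int *senddest,
              double **recvbuf, const int *recvcount, const int *recvsrc,
              int tag)
{
  MPI_Request *reqs = malloc((size_t)(nsend + nrecv) * sizeof(*reqs));
  int i, j;

  for (i = 0; i < nsend; i++)
    MPI_Isend(sendbuf[i], sendcount[i], MPI_DOUBLE, senddest[i], tag, comm, &reqs[i]);
  for (j = 0; j < nrecv; j++)
    MPI_Irecv(recvbuf[j], recvcount[j], MPI_DOUBLE, recvsrc[j], tag, comm, &reqs[nsend + j]);

  /* Completing every request lets the MPI library free the internal
     state it created for the Isends and Irecvs. */
  MPI_Waitall(nsend + nrecv, reqs, MPI_STATUSES_IGNORE);
  free(reqs);
}
</pre>
>>>><br>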
>>>> I note also that there's no need for the
Barrier at the end of the routine; this kind of
communication does neighbourwise synchronisation, so there is no
need to add (unnecessary) global synchronisation
too.<br>
>>>><br>
>>>> As an aside, is there a reason you
don't use PETSc's VecScatter to manage this global
to local exchange?<br>
>>>><br>
>>>> Cheers,<br>
>>>><br>
>>>> Lawrence<br>
>>>
<cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png><br>
<br>
</blockquote>
</div>
</div>
</blockquote>
<br>
</div>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>