<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi,<br>
<br>
I want to continue digging into the memory problem here.<br>
I did find a workaround in the past, which is to use fewer cores per
node so that each core has 8G of memory. However, this is inefficient
and expensive, so I hope to locate the place that uses the most memory.<br>
<br>
Here is a brief summary of the tests I did in the past:<br>
> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12<br>
Maximum (over computational time) process memory: total 7.0727e+08<br>
Current process memory: total 7.0727e+08<br>
Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11<br>
Current space PetscMalloc()ed: total 1.8275e+09<br>
<br>
> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24<br>
Maximum (over computational time) process memory: total 5.9431e+09<br>
Current process memory: total 5.9431e+09<br>
Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12<br>
Current space PetscMalloc()ed: total 5.4844e+09<br>
<br>
> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24<br>
The OOM (Out Of Memory) killer of the supercomputer terminated the
job during "KSPSolve".<br>
<br>
I attached the output of ksp_view (the third test's output is from
ksp_view_pre), the memory_view output, and the petsc options.<br>
<br>
In all the tests, each core can access about 2G of memory. In Test3,
there are 4223139840 non-zeros in the matrix, which amounts to about
1.74M per process in double precision. Even allowing some extra memory
for the integer indices, 2G per core should still be more than enough.<br>
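As a sanity check on that arithmetic (a quick sketch; the 18432-rank count comes from the 96*8*24 process mesh, and 32-bit column indices for the AIJ format are an assumption, as in Dave's estimate below):<br>

```python
# Per-rank memory estimate for the fine-grid operator in Test3.
nnz = 4223139840          # allocated non-zeros reported for Test3
ranks = 18432             # 96 * 8 * 24 MPI processes
bytes_per_value = 8       # double precision
bytes_per_index = 4       # assuming 32-bit column indices (AIJ)

values_mb = nnz * bytes_per_value / ranks / 1024**2
indices_mb = nnz * bytes_per_index / ranks / 1024**2

print(f"matrix values : {values_mb:.2f} MB per rank")
print(f"column indices: {indices_mb:.2f} MB per rank")
```

This reproduces the ~1.74M figure above, so the matrix itself is nowhere near the 2G limit.<br>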
<br>
Is there a way to find out which part of KSPSolve uses the most
memory? <br>
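For reference, collecting the diagnostics discussed in this thread amounts to adding a few options to the petsc options file (the descriptions are my paraphrase of Barry's and Dave's suggestions below):<br>

```
# Memory-diagnostic options that come up in this thread:
-memory_view     # high-water mark of PETSc-allocated and total process memory
-log_view        # per-class object counts and memory (the Vector/Matrix lines quoted below)
-info            # verbose tracing of what PETSc is doing; produces a ton of output
-ksp_view_pre    # view the solver before KSPSolve, in case the solve never completes
```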
Thank you so much.<br>
<br>
BTW, there are 4 options that remain unused and I don't understand why
they are omitted:<br>
-mg_coarse_telescope_mg_coarse_ksp_type value: preonly<br>
-mg_coarse_telescope_mg_coarse_pc_type value: bjacobi<br>
-mg_coarse_telescope_mg_levels_ksp_max_it value: 1<br>
-mg_coarse_telescope_mg_levels_ksp_type value: richardson<br>
<br>
<br>
Regards,<br>
Frank<br>
<br>
<div class="moz-cite-prefix">On 07/13/2016 05:47 PM, Dave May wrote:<br>
</div>
<blockquote
cite="mid:CAJ98EDrRQfspLSv8kOuzVsXzH5bL2dfzdwu0VnhOJM2VbaxkWA@mail.gmail.com"
type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 14 July 2016 at 01:07, frank <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> Hi Dave,<br>
<br>
Sorry for the late reply.<br>
Thank you so much for your detailed reply.<br>
<br>
I have a question about the estimation of the memory
usage. There are 4223139840 allocated non-zeros and
18432 MPI processes. Double precision is used. So the
memory per process is:<br>
4223139840 * 8bytes / 18432 / 1024 / 1024 = 1.74M ? <br>
Did I do something wrong here? This seems too small.<br>
</div>
</blockquote>
<div><br>
</div>
<div>No - I totally f***ed it up. You are correct. That'll
teach me for fumbling around with my iphone calculator and
not using my brain. (Note that to convert to MB just
divide by 1e6, not 1024^2 - although I apparently cannot
convert between units correctly....)</div>
<div><br>
</div>
<div>From the PETSc objects associated with the solver, it
looks like it _should_ run with 2GB per MPI rank. Sorry
for my mistake. Possibilities are: somewhere in your usage
of PETSc you've introduced a memory leak; PETSc is doing a
huge over allocation (e.g. as per our discussion of
MatPtAP); or in your application code there are other
objects you have forgotten to log the memory for.</div>
<div><br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
I am running this job on <a moz-do-not-send="true"
href="https://bluewaters.ncsa.illinois.edu/user-guide"
target="_blank">Blue Waters</a> </div>
</blockquote>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> I am using the 7
points FD stencil in 3D. <br>
</div>
</blockquote>
<div><br>
</div>
<div>I thought so on both counts.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
I apologize for the stupid mistake I made in computing
the memory per core. My settings mean each core can
access only 2G of memory on average, not the 8G I
mentioned in my previous email. I re-ran the job with 8G
of memory per core on average and there is no "Out Of
Memory" error. I will do more tests to see if there is
still a memory issue.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Ok. I'd still like to know where the memory was being
used since my estimates were off.</div>
<div><br>
</div>
<div><br>
</div>
<div>Thanks,</div>
<div> Dave</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
Regards,<br>
Frank
<div>
<div class="h5"><br>
<br>
<br>
<div>On 07/11/2016 01:18 PM, Dave May wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hi Frank,<br>
<br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 11 July 2016 at
19:14, frank <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:hengjiew@uci.edu"
target="_blank">hengjiew@uci.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote">
<div> Hi Dave,<br>
<br>
I re-ran the test using bjacobi as the
preconditioner on the coarse mesh of
telescope. The grid is 3072*256*768 and the
process mesh is 96*8*24. The petsc
option file is attached.<br>
I still got the "Out Of Memory" error.
The error occurred before the linear
solver finished one step, so I don't
have the full info from ksp_view. The
info from ksp_view_pre is attached.</div>
</blockquote>
<div><br>
</div>
<div>Okay - that is essentially useless
(sorry)<br>
</div>
<div> </div>
<blockquote class="gmail_quote">
<div> <br>
It seems to me that the error occurred
when the decomposition was going to be
changed.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Based on what information?<br>
</div>
<div>Running with -info would give us more
clues, but will create a ton of output.<br>
</div>
<div>Please try running the case which
failed with -info<br>
</div>
<div> </div>
<blockquote class="gmail_quote">
<div> I had another test with a grid of
1536*128*384 and the same process mesh
as above. There was no error. The
ksp_view info is attached for
comparison.<br>
Thank you.</div>
</blockquote>
<div><br>
</div>
<div><br>
[3] Here is my crude estimate of your
memory usage. <br>
I'll target the biggest memory hogs only
to get an order of magnitude estimate<br>
<br>
<div>* The Fine grid operator contains
4223139840 non-zeros --> 1.8 GB per
MPI rank assuming double precision.<br>
</div>
<div>The indices for the AIJ could amount
to another 0.3 GB (assuming 32 bit
integers)<br>
</div>
<div><br>
* You use 5 levels of coarsening, so the
other operators should represent
(collectively) <br>
2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~
300 MB per MPI rank on the communicator
with 18432 ranks.<br>
</div>
<div>The coarse grid should consume ~ 0.5
MB per MPI rank on the communicator with
18432 ranks.</div>
<div><br>
* You use a reduction factor of 64,
making the new communicator with 288 MPI
ranks. <br>
PCTelescope will first gather a
temporary matrix associated with your
coarse level operator assuming a comm
size of 288 living on the comm with size
18432. <br>
This matrix will require approximately
0.5 * 64 = 32 MB per core on the 288
ranks. <br>
This matrix is then used to form a new
MPIAIJ matrix on the subcomm, thus
require another 32 MB per rank. <br>
The temporary matrix is now destroyed.<br>
</div>
<div><br>
* Because a DMDA is detected, a
permutation matrix is assembled. <br>
This requires 2 doubles per point in the
DMDA. <br>
Your coarse DMDA contains 92 x 16 x 48
points. <br>
Thus the permutation matrix will require
< 1 MB per MPI rank on the sub-comm.<br>
<br>
</div>
<div>* Lastly, the matrix is permuted.
This uses MatPtAP(), but the resulting
operator will have the same memory
footprint as the unpermuted matrix (32
MB). At any stage in PCTelescope, only 2
operators of size 32 MB are held in
memory when the DMDA is provided.<br>
</div>
<div><br>
</div>
<div>From my rough estimates, the worst
case memory foot print for any given
core, given your options is
approximately <br>
</div>
<div>2100 MB + 300 MB + 32 MB + 32 MB + 1
MB = 2465 MB<br>
</div>
<div>This is way below 8 GB.<br>
<br>
Note this estimate completely ignores:<br>
(1) the memory required for the
restriction operator, <br>
(2) the potential growth in the number
of non-zeros per row due to Galerkin
coarsening (I wished -ksp_view_pre
reported the output from MatView so we
could see the number of non-zeros
required by the coarse level operators)<br>
</div>
<div>(3) all temporary vectors required by
the CG solver, and those required by the
smoothers.<br>
</div>
<div>(4) internal memory allocated by
MatPtAP<br>
</div>
<div>(5) memory associated with IS's used
within PCTelescope<br>
</div>
<div><br>
</div>
So either I am completely off in my
estimates, or you have not carefully
estimated the memory usage of your
application code. Hopefully others might
examine/correct my rough estimates<br>
</div>
<div>
<div><br>
Since I don't have your code I cannot
access the latter.<br>
Since I don't have access to the same
machine you are running on, I think we
need to take a step back.<br>
</div>
<br>
[1] What machine are you running on? Send
me a URL if its available<br>
</div>
<div><br>
[2] What discretization are you using? (I
am guessing a scalar 7 point FD stencil)<br>
</div>
<div>If it's a 7 point FD stencil, we should
be able to examine the memory usage of
your solver configuration using a
standard, light weight existing PETSc
example, run on your machine at the same
scale. <br>
</div>
<div>This would hopefully enable us to
correctly evaluate the actual memory usage
required by the solver configuration you
are using.<br>
</div>
<div><br>
</div>
<div>Thanks,<br>
</div>
<div> Dave<br>
</div>
<div> </div>
<blockquote class="gmail_quote">
<div><span><br>
<br>
Frank</span>
<div>
<div><br>
<br>
<br>
<br>
<div>On 07/08/2016 10:38 PM, Dave
May wrote:<br>
</div>
<blockquote type="cite"><br>
<br>
On Saturday, 9 July 2016, frank
<<a moz-do-not-send="true"
href="mailto:hengjiew@uci.edu"
target="_blank">hengjiew@uci.edu</a>>
wrote:<br>
<blockquote class="gmail_quote">Hi
Barry and Dave,<br>
<br>
Thank both of you for the
advice.<br>
<br>
@Barry<br>
I made a mistake in the file
names in last email. I attached
the correct files this time.<br>
For all three tests,
'Telescope' is used as the
coarse preconditioner.<br>
<br>
== Test1: Grid: 1536*128*384,
Process Mesh: 48*4*12<br>
Part of the memory usage:
Vector 125 124
3971904 0.<br>
Matrix 101 101
9462372 0<br>
<br>
== Test2: Grid: 1536*128*384,
Process Mesh: 96*8*24<br>
Part of the memory usage:
Vector 125 124
681672 0.<br>
Matrix 101 101
1462180 0.<br>
<br>
In theory, the memory usage in
Test1 should be 8 times that of
Test2. In my case, it is about 6
times.<br>
<br>
== Test3: Grid: 3072*256*768,
Process Mesh: 96*8*24.
Sub-domain per process: 32*32*32<br>
Here I get the out of memory
error.<br>
<br>
I tried to use -mg_coarse
jacobi. In this way, I don't
need to set -mg_coarse_ksp_type
and -mg_coarse_pc_type
explicitly, right?<br>
The linear solver didn't work in
this case. Petsc output some
errors.<br>
<br>
@Dave<br>
In test3, I use only one
instance of 'Telescope'. On the
coarse mesh of 'Telescope', I
used LU as the preconditioner
instead of SVD.<br>
If I set the levels correctly,
then on the last coarse mesh of
MG, where it calls 'Telescope',
the sub-domain per process is
2*2*2.<br>
On the last coarse mesh of
'Telescope', there is only one
grid point per process.<br>
I still got the OOM error. The
detailed petsc option file is
attached.</blockquote>
<div><br>
</div>
<div>Do you understand the
expected memory usage for the
particular parallel
LU implementation you are using?
I don't (seriously). Replace LU
with bjacobi and re-run this
test. My point about solver
debugging is still valid. </div>
<div><br>
</div>
<div>And please send the result of
KSPView so we can see what is
actually used in the
computations</div>
<div><br>
</div>
<div>Thanks</div>
<div> Dave</div>
<div> </div>
<blockquote class="gmail_quote"> <br>
<br>
Thank you so much.<br>
<br>
Frank<br>
<br>
<br>
<br>
On 07/06/2016 02:51 PM, Barry
Smith wrote:<br>
<blockquote class="gmail_quote">
<blockquote
class="gmail_quote"> On Jul
6, 2016, at 4:19 PM, frank
<<a
moz-do-not-send="true"
href="mailto:hengjiew@uci.edu"
target="_blank">hengjiew@uci.edu</a>>
wrote:<br>
<br>
Hi Barry,<br>
<br>
Thank you for you advice.<br>
I tried three test. In the
1st test, the grid is
3072*256*768 and the process
mesh is 96*8*24.<br>
The linear solver is 'cg'
the preconditioner is 'mg'
and 'telescope' is used as
the preconditioner at the
coarse mesh.<br>
The system gives me the "Out
of Memory" error before the
linear system is completely
solved.<br>
The info from
'-ksp_view_pre' is attached.
It seems to me that the error
occurs when it reaches the
coarse mesh.<br>
<br>
The 2nd test uses a grid of
1536*128*384 and process
mesh is 96*8*24. The 3rd
test uses the same grid but
a different process mesh
48*4*12.<br>
</blockquote>
Are you sure this is
right? The total matrix and
vector memory usage goes from
2nd test<br>
Vector 384
383 8,193,712
0.<br>
Matrix 103
103 11,508,688
0.<br>
to 3rd test<br>
Vector 384
383 1,590,520
0.<br>
Matrix 103
103 3,508,664
0.<br>
that is the memory usage got
smaller but if you have only
1/8th the processes and the
same grid it should have
gotten about 8 times bigger.
Did you maybe cut the grid by
a factor of 8 also? If so that
still doesn't explain it
because the memory usage
changed by a factor of 5
something for the vectors and
3 something for the matrices.<br>
<br>
<br>
<blockquote
class="gmail_quote"> The
linear solver and petsc
options in the 2nd and 3rd
tests are the same as in the
1st test. The linear solver
works fine in both tests.<br>
I attached the memory usage
of the 2nd and 3rd tests.
The memory info is from the
option '-log_summary'. I
tried to use '-momery_info'
as you suggested, but in my
case petsc treated it as an
unused option. It output
nothing about the memory. Do
I need to add something to my
code so I can use '-memory_info'?<br>
</blockquote>
Sorry, my mistake the
option is -memory_view<br>
<br>
Can you run the one case
with -memory_view and
-mg_coarse jacobi -ksp_max_it
1 (just so it doesn't iterate
forever) to see how much
memory is used without the
telescope? Also run case 2 the
same way.<br>
<br>
Barry<br>
<br>
<br>
<br>
<blockquote
class="gmail_quote"> In both
tests the memory usage is
not large.<br>
<br>
It seems to me that it might
be the 'telescope'
preconditioner that
allocated a lot of memory
and caused the error in the
1st test.<br>
Is there is a way to show
how much memory it
allocated?<br>
<br>
Frank<br>
<br>
On 07/05/2016 03:37 PM,
Barry Smith wrote:<br>
<blockquote
class="gmail_quote">
Frank,<br>
<br>
You can run with
-ksp_view_pre to have it
"view" the KSP before the
solve so hopefully it gets
that far.<br>
<br>
Please run the
problem that does fit with
-memory_info when the
problem completes it will
show the "high water mark"
for PETSc allocated memory
and total memory used. We
first want to look at
these numbers to see if it
is using more memory than
you expect. You could also
run with say half the grid
spacing to see how the
memory usage scaled with
the increase in grid
points. Make the runs also
with -log_view and send
all the output from these
options.<br>
<br>
Barry<br>
<br>
<blockquote
class="gmail_quote"> On
Jul 5, 2016, at 5:23 PM,
frank <<a
moz-do-not-send="true"
href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>>
wrote:<br>
<br>
Hi,<br>
<br>
I am using the CG ksp
solver and Multigrid
preconditioner to solve
a linear system in
parallel.<br>
I chose to use the
'Telescope' as the
preconditioner on the
coarse mesh for its good
performance.<br>
The petsc options file
is attached.<br>
<br>
The domain is a 3d box.<br>
It works well when the
grid is 1536*128*384
and the process mesh is
96*8*24. When I double
the size of the grid and
keep the same process
mesh and petsc options,
I get an "out of memory"
error from the
super-cluster I am
using.<br>
Each process has access
to at least 8G of memory,
which should be more
than enough for my
application. I am sure
that all the other parts
of my code (except the
linear solver) do not
use much memory, so I
suspect something is
wrong with the
linear solver.<br>
The error occurs before
the linear system is
completely solved, so I
don't have the info from
ksp_view. I am not able
to reproduce the error
with a smaller problem
either.<br>
In addition, I tried to
use the block jacobi as
the preconditioner with
the same grid and same
decomposition. The
linear solver runs
extremely slow but there
is no memory error.<br>
<br>
How can I diagnose what
exactly cause the error?<br>
Thank you so much.<br>
<br>
Frank<br>
<petsc_options.txt><br>
</blockquote>
</blockquote>
<ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt><br>
</blockquote>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</body>
</html>