<br><br>On Friday, 7 October 2016, frank <<a href="mailto:hengjiew@uci.edu">hengjiew@uci.edu</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<p>Dear all,</p>
<p>Thank you so much for the advice. <br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote"><span></span>
<div>All setup is done in the first solve.</div>
<span>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div> ** The time for 1st solve does not scale. <br>
In practice, I am solving a variable
coefficient Poisson equation. I need to build the
matrix every time step. Therefore, each step is
similar to the 1st solve which does not scale. Is
there a way I can improve the performance? <br>
</div>
</blockquote>
</span></div>
</div>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote"><span>
<div><br>
</div>
</span>
<div>You could use rediscretization instead of
Galerkin to produce the coarse operators.</div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>
<div>Yes, I can think of one option for improved performance, but I cannot tell whether it will be beneficial because the logging isn't sufficiently fine-grained (and there is no easy way to get the info out of petsc). <br>
<br>
I use PtAP to repartition the matrix; this could be consuming most of the setup time in Telescope with your run. Such a repartitioning could be avoided if you provided a method to create the operator on the coarse levels (what Matt is suggesting). However, this requires you to be able to define your coefficients on the coarse grid. This will most likely reduce setup time, but your coarse-grid operators (now re-discretized) are likely to be less effective than those generated via Galerkin coarsening.<br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<br>
Please correct me if I understand this incorrectly: I can define
my own restriction function and pass it to petsc instead of using
PtAP.<br>
If so, what's the interface to do that?</div></blockquote><div><br></div><div>You need to provide a method to KSPSetComputeOperators for your outer KSP:</div><div><br></div><div><a href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPSetComputeOperators.html">http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPSetComputeOperators.html</a><br></div><div><br></div><div>This method will get propagated through telescope to the KSP running in the sub-comm.</div><div><br></div><div>Note that this functionality is currently not supported from Fortran. I need to make a small modification to telescope to enable Fortran support.</div><div><br></div><div>Thanks</div><div> Dave</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><br>
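For illustration only (this sketch is mine, not code from the thread): with a DMDA-based PCMG hierarchy, the callback registered via KSPSetComputeOperators is invoked with each level's DM, so the coarse operators can be re-discretized there instead of being formed by Galerkin PtAP. The stencil entries below are placeholders for the variable-coefficient discretization, and error checking is omitted.

#include <petscksp.h>
#include <petscdmda.h>

static PetscErrorCode ComputeOperator(KSP ksp, Mat A, Mat B, void *ctx)
{
  DM            da;
  DMDALocalInfo info;
  MatStencil    row, col;
  PetscScalar   v;
  PetscInt      i, j, k;

  PetscFunctionBeginUser;
  KSPGetDM(ksp, &da);            /* the DMDA attached to *this* MG level */
  DMDAGetLocalInfo(da, &info);
  for (k = info.zs; k < info.zs + info.zm; k++) {
    for (j = info.ys; j < info.ys + info.ym; j++) {
      for (i = info.xs; i < info.xs + info.xm; i++) {
        row.i = i; row.j = j; row.k = k;
        /* Evaluate your coefficient at this level's grid spacing and insert
           the full 7-point stencil here; a diagonal placeholder keeps the
           sketch short. */
        col = row;
        v   = 1.0;
        MatSetValuesStencil(B, 1, &row, 1, &col, &v, INSERT_VALUES);
      }
    }
  }
  MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);
  if (A != B) {
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  }
  PetscFunctionReturn(0);
}

/* registered once on the outer KSP:
     KSPSetComputeOperators(ksp, ComputeOperator, NULL);   */

The same callback is what telescope would propagate to the KSP on the sub-comm, so the operator can be rebuilt there after the repartitioning.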
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_extra"><span></span>
<div class="gmail_quote">
<div>Also, you use CG/MG when FMG by itself would
probably be faster. Your smoother is likely not
strong enough, and you</div>
<div>should use something like V(2,2). There is a
lot of tuning that is possible, but difficult to
automate.</div>
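<div>(For reference, "V(2,2)" denotes a V-cycle with two pre- and two post-smoothing sweeps. As a rough illustration only, assuming the standard PCMG option prefixes -- in this setup they may need the telescope prefix prepended -- such a cycle with a stronger smoother could be requested with something like: -mg_levels_ksp_type chebyshev -mg_levels_ksp_max_it 2 -mg_levels_pc_type sor, and -pc_mg_type full for FMG.)</div>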
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Matt's completely correct. <br>
If we could automate this in a meaningful manner, we would
have done so.<br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
I am not as familiar with multigrid as you guys. It would be very kind if you could be more specific.<br>
What does V(2,2) stand for? Is there some strong smoother built into petsc that I can try?<br>
<br>
<br>
Another thing: the vector assembly and scatter take more time as I increase the number of cores:<br>
<br>
cores#                       4096      8192     16384     32768     65536<br>
VecAssemblyBegin    298  2.91E+00  2.87E+00  8.59E+00  2.75E+01  2.21E+03<br>
VecAssemblyEnd      298  3.37E-03  1.78E-03  1.78E-03  5.13E-03  1.99E-03<br>
VecScatterBegin   76303  3.82E+00  3.01E+00  2.54E+00  4.40E+00  1.32E+00<br>
VecScatterEnd     76303  3.09E+01  1.47E+01  2.23E+01  2.96E+01  2.10E+01<br>
<br>
The above data is produced by solving a constant-coefficient Poisson equation with a different rhs for 100 steps. <br>
As you can see, the time of VecAssemblyBegin increases dramatically from 32K cores to 65K. <br>
With 65K cores, it takes more time to assemble the rhs than to solve the equation. Is there a way to improve this?<br>
<br>
<br>
Thank you.<br>
<br>
Regards,<br>
Frank <br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>
<div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div>
<div><br>
<br>
<br>
<br>
<br>
<div>On
10/04/2016 12:56 PM, Dave May wrote:<br>
</div>
<blockquote type="cite"><br>
<br>
On Tuesday, 4 October 2016, frank <<a href="javascript:_e(%7B%7D,'cvml','hengjiew@uci.edu');" target="_blank">hengjiew@uci.edu</a>>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hi,</p>
This question is a follow-up of the
thread "Question about memory
usage in Multigrid
preconditioner".<br>
I used to have the "Out of
Memory(OOM)" problem when using
the CG+Telescope MG solver with
32768 cores. Adding the "-matrap
0; -matptap_scalable" option did
solve that problem. <br>
<br>
Then I test the scalability by
solving a 3d poisson eqn for 1
step. I used one sub-communicator
in all the tests. The difference
between the petsc options in those
tests are: 1 the
pc_telescope_reduction_factor; 2
the number of multigrid levels in
the up/down solver. The function
"ksp_solve" is timed. It is kind
of slow and doesn't scale at all.
<br>
<br>
Test1: 512^3 grid points<br>
Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)<br>
512      8                             4 / 3                            6.2466<br>
4096     64                            5 / 3                            0.9361<br>
32768    64                            4 / 3                            4.8914<br>
<br>
Test2: 1024^3 grid points<br>
Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)<br>
4096     64                            5 / 4                            3.4139<br>
8192     128                           5 / 4                            2.4196<br>
16384    32                            5 / 3                            5.4150<br>
32768    64                            5 / 3                            5.6067<br>
65536    128                           5 / 3                            6.5219</div>
</blockquote>
<div><br>
</div>
<div>You have to be very careful how
you interpret these numbers. Your
solver contains nested calls to
KSPSolve, and unfortunately as a
result the numbers you report
include setup time. This will remain
true even if you call KSPSetUp on
the outermost KSP. </div>
<div><br>
</div>
<div>Your email concerns scalability of the solver application, so let's focus on that issue.</div>
<div><br>
</div>
<div>The only way to clearly separate setup from solve time is to perform two identical solves. The second solve will not require any setup. You should monitor the second solve via a new PetscLogStage.</div>
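A minimal sketch of that pattern (my illustration, using the standard PETSc logging calls; the stage name is arbitrary):

  PetscLogStage stage;
  KSPSolve(ksp, b, x);                            /* 1st solve: includes all setup */
  PetscLogStageRegister("second solve", &stage);
  PetscLogStagePush(stage);
  KSPSolve(ksp, b, x);                            /* identical 2nd solve: no setup */
  PetscLogStagePop();

-log_view (-log_summary) then reports the second solve in its own stage, separate from the setup-heavy first solve.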
<div><br>
</div>
<div>This was what I did in the
telescope paper. It was the only way
to understand the setup cost (and
scaling) cf the solve time (and
scaling).</div>
<div><br>
</div>
<div>Thanks</div>
<div> Dave</div>
<div>
<div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> I guess
I didn't set the MG levels
properly. What would be the
efficient way to arrange the
MG levels?<br>
Also which preconditioner at
the coarse mesh of the 2nd
communicator should I use to
improve the performance? <br>
<br>
I attached the test code and
the petsc options file for the
1024^3 cube with 32768 cores.
<br>
<br>
Thank you.<br>
<br>
Regards,<br>
Frank<br>
<br>
<br>
<br>
<br>
<br>
<br>
<div>On 09/15/2016 03:35 AM,
Dave May wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>Hi all,<br>
<br>
</div>
<div>The only
unexpected
memory usage I
can see is
associated with
the call to
MatPtAP().<br>
</div>
<div>Here is
something you
can try
immediately.<br>
</div>
</div>
Run your code with
the additional
options<br>
-matrap 0
-matptap_scalable<br>
<br>
</div>
<div>I didn't realize
this before, but the
default behaviour of
MatPtAP in parallel
is actually to
explicitly form the
transpose of P (e.g.
assemble R = P^T)
and then compute
R.A.P. <br>
You don't want to do
this. The option
-matrap 0 resolves
this issue.<br>
</div>
<div><br>
</div>
<div>The
implementation of
P^T.A.P has two
variants. <br>
The scalable
implementation (with
respect to memory
usage) is selected
via the second
option
-matptap_scalable.</div>
<div><br>
</div>
<div>Try it out - I
see a significant
memory reduction
using these options
for particular mesh
sizes / partitions.<br>
</div>
<div><br>
</div>
I've attached a
cleaned up version of
the code you sent me.<br>
</div>
There were a number of
memory leaks and other
issues.<br>
</div>
<div>The main points being<br>
</div>
* You should call
DMDAVecGetArrayF90()
before
VecAssembly{Begin,End}<br>
* You should call
PetscFinalize(), otherwise
the option -log_summary
(-log_view) will not
display anything once the
program has completed.<br>
<div>
<div>
<div><br>
<br>
</div>
<div>Thanks,<br>
</div>
<div> Dave<br>
</div>
<div>
<div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On
15 September 2016 at
08:03, Hengjie Wang <span dir="ltr"><<a>hengjiew@uci.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
Hi Dave,<br>
<br>
Sorry, I should have
put more comment to
explain the code. <br>
The number of
processes in each
dimension is the
same: Px = Py=Pz=P.
So is the domain
size.<br>
So if you want
to run the code for
a 512^3 grid points
on 16^3 cores, you
need to set "-N 512
-P 16" in the
command line.<br>
I add more comments
and also fix an
error in the
attached code. ( The
error only affects
the accuracy of
solution but not the
memory usage. ) <br>
<div><br>
Thank you.<span><font color="#888888"><br>
Frank</font></span>
<div>
<div><br>
<br>
On 9/14/2016
9:05 PM, Dave
May wrote:<br>
</div>
</div>
</div>
<div>
<div>
<blockquote type="cite"><br>
<br>
On Thursday,
15 September
2016, Dave May
<<a>dave.mayhem23@gmail.com</a>>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
<br>
On Thursday,
15 September
2016, frank
<<a>hengjiew@uci.edu</a>>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
Hi, <br>
<br>
I write a
simple code to
re-produce the
error. I hope
this can help
to diagnose
the problem.<br>
The code just
solves a 3d
poisson
equation. </div>
</blockquote>
<div><br>
</div>
<div>Why is
the stencil
width a
runtime
parameter??
And why is the
default value
2? For 7-pnt
FD Laplace,
you only need
a stencil
width of 1. </div>
<div><br>
</div>
<div>Was this
choice made to
mimic
something in
the
real application
code?</div>
</blockquote>
<div><br>
</div>
Please ignore
- I
misunderstood your
usage of the
param set by
-P
<div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><br>
I run the code
on a 1024^3
mesh. The
process
partition is
32 * 32 * 32.
That's when I
re-produce the
OOM error.
Each core has
about 2G
memory.<br>
I also run the
code on a
512^3 mesh
with 16 * 16 *
16 processes.
The ksp solver
works fine. <br>
I attached the
code,
ksp_view_pre's
output and my
petsc option
file.<br>
<br>
Thank you.<br>
Frank<br>
<div><br>
On 09/09/2016
06:38 PM,
Hengjie Wang
wrote:<br>
</div>
<blockquote type="cite">Hi
Barry,
<div><br>
</div>
<div>I
checked. On
the
supercomputer,
I had the
option
"-ksp_view_pre"
but it is not
in file I sent
you. I am
sorry for the
confusion.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Frank<span></span><br>
<br>
On Friday,
September 9,
2016, Barry
Smith <<a>bsmith@mcs.anl.gov</a>>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
> On Sep 9,
2016, at 3:11
PM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
><br>
> Hi Barry,<br>
><br>
> I think
the first KSP
view output is
from
-ksp_view_pre.
Before I
submitted the
test, I was
not sure
whether there
would be OOM
error or not.
So I added
both
-ksp_view_pre
and -ksp_view.<br>
<br>
But the
options file
you sent
specifically
does NOT list
the
-ksp_view_pre
so how could
it be from
that?<br>
<br>
Sorry to be
pedantic but
I've spent too
much time in
the past
trying to
debug from
incorrect
information
and want to
make sure that
the
information I
have is
correct before
thinking.
Please recheck
exactly what
happened.
Rerun with the
exact input
file you
emailed if
that is
needed.<br>
<br>
Barry<br>
<br>
><br>
> Frank<br>
><br>
><br>
> On
09/09/2016
12:38 PM,
Barry Smith
wrote:<br>
>> Why
does
ksp_view2.txt
have two KSP
views in it
while
ksp_view1.txt
has only one
KSPView in it?
Did you run
two different
solves in the
2 case but not
the one?<br>
>><br>
>>
Barry<br>
>><br>
>><br>
>><br>
>>>
On Sep 9,
2016, at 10:56
AM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
>>><br>
>>>
Hi,<br>
>>><br>
>>> I
want to
continue
digging into
the memory
problem here.<br>
>>> I
did find a
work around in
the past,
which is to
use less cores
per node so
that each core
has 8G memory.
However this
is deficient
and expensive.
I hope to
locate the
place that
uses the most
memory.<br>
>>><br>
>>>
Here is a
brief summary
of the tests I
did in past:<br>
>>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12<br>
>>>
Maximum (over
computational
time) process
memory:
total
7.0727e+08<br>
>>>
Current
process
memory:
total
7.0727e+08<br>
>>>
Maximum (over
computational
time) space
PetscMalloc()ed:
total
6.3908e+11<br>
>>>
Current space
PetscMalloc()ed: total
1.8275e+09<br>
>>><br>
>>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24<br>
>>>
Maximum (over
computational
time) process
memory:
total
5.9431e+09<br>
>>>
Current
process
memory:
total
5.9431e+09<br>
>>>
Maximum (over
computational
time) space
PetscMalloc()ed:
total
5.3202e+12<br>
>>>
Current space
PetscMalloc()ed: total
5.4844e+09<br>
>>><br>
>>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24<br>
>>>
OOM( Out Of
Memory )
killer of the
supercomputer
terminated the
job during
"KSPSolve".<br>
>>><br>
>>> I
attached the
output of
ksp_view( the
third test's
output is from
ksp_view_pre
), memory_view
and also the
petsc options.<br>
>>><br>
>>>
In all the
tests, each
core can
access about
2G memory. In
test3, there
are 4223139840
non-zeros in
the matrix.
This will
consume about
1.74M, using
double
precision.
Considering
some extra
memory used to
store integer
index, 2G
memory should
still be way
enough.<br>
>>><br>
>>>
Is there a way
to find out
which part of
KSPSolve uses
the most
memory?<br>
>>>
Thank you so
much.<br>
>>><br>
>>>
BTW, there are
4 options
remains unused
and I don't
understand why
they are
omitted:<br>
>>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly<br>
>>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi<br>
>>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1<br>
>>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson<br>
>>><br>
>>><br>
>>>
Regards,<br>
>>>
Frank<br>
>>><br>
>>>
On 07/13/2016
05:47 PM, Dave
May wrote:<br>
>>>><br>
>>>> On 14 July 2016 at 01:07, frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>> Hi Dave,<br>
>>>><br>
>>>> Sorry for the late reply.<br>
>>>> Thank you so much for your detailed reply.<br>
>>>><br>
>>>> I have a question about the estimation of the memory
usage. There
are 4223139840
allocated
non-zeros and
18432 MPI
processes.
Double
precision is
used. So the
memory per
process is:<br>
>>>> 4223139840 * 8bytes / 18432 / 1024 / 1024 = 1.74M ?<br>
>>>> Did I do sth wrong here? Because this seems too small.<br>
>>>><br>
>>>> No - I totally f***ed it up. You are correct. That'll
teach me for
fumbling
around with my
iphone
calculator and
not using my
brain. (Note
that to
convert to MB
just divide by
1e6, not
1024^2 -
although I
apparently
cannot convert
between units
correctly....)<br>
>>>><br>
>>>> From the PETSc objects associated with the solver, It
looks like it
_should_ run
with 2GB per
MPI rank.
Sorry for my
mistake.
Possibilities
are: somewhere
in your usage
of PETSc
you've
introduced a
memory leak;
PETSc is doing
a huge over
allocation
(e.g. as per
our discussion
of MatPtAP);
or in your
application
code there are
other objects
you have
forgotten to
log the memory
for.<br>
>>>><br>
>>>><br>
>>>><br>
>>>> I am running this job on Bluewater<br>
>>>> I am using the 7 points FD stencil in 3D.<br>
>>>><br>
>>>> I thought so on both counts.<br>
>>>><br>
>>>> I apologize that I made a stupid mistake in computing
the memory per
core. My
settings
meant that each core could
access only 2G
memory on
average
instead of 8G
which I
mentioned in
previous
email. I
re-run the job
with 8G memory
per core on
average and
there is no
"Out Of
Memory" error.
I would do
more test to
see if there
is still some
memory issue.<br>
>>>><br>
>>>> Ok. I'd still like to know where the memory was being
used since my
estimates were
off.<br>
>>>><br>
>>>><br>
>>>> Thanks,<br>
>>>> Dave<br>
>>>><br>
>>>> Regards,<br>
>>>> Frank<br>
>>>><br>
>>>><br>
>>>><br>
>>>> On 07/11/2016 01:18 PM, Dave May wrote:<br>
>>>>> Hi Frank,<br>
>>>>><br>
>>>>><br>
>>>>> On 11 July 2016 at 19:14, frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>>> Hi Dave,<br>
>>>>><br>
>>>>> I re-run the test using bjacobi as the
preconditioner
on the coarse
mesh of
telescope. The
Grid is
3072*256*768
and process
mesh is
96*8*24. The
petsc option
file is
attached.<br>
>>>>> I still got the "Out Of Memory" error. The error
occurred
before the
linear solver
finished one
step. So I
don't have the
full info from
ksp_view. The
info from
ksp_view_pre
is attached.<br>
>>>>><br>
>>>>> Okay - that is essentially useless (sorry)<br>
>>>>><br>
>>>>> It seems to me that the error occurred when the
decomposition
was going to
be changed.<br>
>>>>><br>
>>>>> Based on what information?<br>
>>>>> Running with -info would give us more clues, but
will create a
ton of output.<br>
>>>>> Please try running the case which failed with -info<br>
>>>>> I had another test with a grid of 1536*128*384 and
the same
process mesh
as above.
There was no
error. The
ksp_view info
is attached
for
comparison.<br>
>>>>> Thank you.<br>
>>>>><br>
>>>>><br>
>>>>> [3] Here is my crude estimate of your memory usage.<br>
>>>>> I'll target the biggest memory hogs only to get an
order of
magnitude
estimate<br>
>>>>><br>
>>>>> * The Fine grid operator contains 4223139840
non-zeros
--> 1.8 GB
per MPI rank
assuming
double
precision.<br>
>>>>> The indices for the AIJ could amount to another 0.3
GB (assuming
32 bit
integers)<br>
>>>>><br>
>>>>> * You use 5 levels of coarsening, so the other
operators
should
represent
(collectively)<br>
>>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per
MPI rank on
the
communicator
with 18432
ranks.<br>
>>>>> The coarse grid should consume ~ 0.5 MB per MPI
rank on the
communicator
with 18432
ranks.<br>
>>>>><br>
>>>>> * You use a reduction factor of 64, making the new
communicator
with 288 MPI
ranks.<br>
>>>>> PCTelescope will first gather a temporary matrix
associated
with your
coarse level
operator
assuming a
comm size of
288 living on
the comm with
size 18432.<br>
>>>>> This matrix will require approximately 0.5 * 64 =
32 MB per core
on the 288
ranks.<br>
>>>>> This matrix is then used to form a new MPIAIJ
matrix on the
subcomm, thus
require
another 32 MB
per rank.<br>
>>>>> The temporary matrix is now destroyed.<br>
>>>>><br>
>>>>> * Because a DMDA is detected, a permutation matrix
is assembled.<br>
>>>>> This requires 2 doubles per point in the DMDA.<br>
>>>>> Your coarse DMDA contains 92 x 16 x 48 points.<br>
>>>>> Thus the permutation matrix will require < 1 MB
per MPI rank
on the
sub-comm.<br>
>>>>><br>
>>>>> * Lastly, the matrix is permuted. This uses
MatPtAP(), but
the resulting
operator will
have the same
memory
footprint as
the unpermuted
matrix (32
MB). At any
stage in
PCTelescope,
only 2
operators of
size 32 MB are
held in memory
when the DMDA
is provided.<br>
>>>>><br>
>>>>> From my rough estimates, the worst case memory foot
print for any
given core,
given your
options is
approximately<br>
>>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB<br>
>>>>> This is way below 8 GB.<br>
>>>>><br>
>>>>> Note this estimate completely ignores:<br>
>>>>> (1) the memory required for the restriction
operator,<br>
>>>>> (2) the potential growth in the number of non-zeros
per row due to
Galerkin
coarsening (I
wished
-ksp_view_pre
reported the
output from
MatView so we
could see the
number of
non-zeros
required by
the coarse
level
operators)<br>
>>>>> (3) all temporary vectors required by the CG
solver, and
those required
by the
smoothers.<br>
>>>>> (4) internal memory allocated by MatPtAP<br>
>>>>> (5) memory associated with IS's used within
PCTelescope<br>
>>>>><br>
>>>>> So either I am completely off in my estimates, or
you have not
carefully
estimated the
memory usage
of your
application
code.
Hopefully
others might
examine/correct
my rough
estimates<br>
>>>>><br>
>>>>> Since I don't have your code I cannot access the
latter.<br>
>>>>> Since I don't have access to the same machine you
are running
on, I think we
need to take a
step back.<br>
>>>>><br>
>>>>> [1] What machine are you running on? Send me a URL
if its
available<br>
>>>>><br>
>>>>> [2] What discretization are you using? (I am
guessing a
scalar 7 point
FD stencil)<br>
>>>>> If it's a 7 point FD stencil, we should be able to
examine the
memory usage
of your solver
configuration
using a
standard,
light weight
existing PETSc
example, run
on your
machine at the
same scale.<br>
>>>>> This would hopefully enable us to correctly
evaluate the
actual memory
usage required
by the solver
configuration
you are using.<br>
>>>>><br>
>>>>> Thanks,<br>
>>>>> Dave<br>
>>>>><br>
>>>>><br>
>>>>> Frank<br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>> On 07/08/2016 10:38 PM, Dave May wrote:<br>
>>>>>><br>
>>>>>> On Saturday, 9 July 2016, frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>>>> Hi Barry and Dave,<br>
>>>>>><br>
>>>>>> Thank both of you for the advice.<br>
>>>>>><br>
>>>>>> @Barry<br>
>>>>>> I made a mistake in the file names in last
email. I
attached the
correct files
this time.<br>
>>>>>> For all the three tests, 'Telescope' is used as
the coarse
preconditioner.<br>
>>>>>><br>
>>>>>> == Test1: Grid: 1536*128*384, Process Mesh:
48*4*12<br>
>>>>>> Part of the memory usage:  Vector  125  124  3971904  0.<br>
>>>>>>                            Matrix  101  101  9462372  0<br>
>>>>>><br>
>>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24<br>
>>>>>> Part of the memory usage:  Vector  125  124  681672  0.<br>
>>>>>>                            Matrix  101  101  1462180  0.<br>
>>>>>><br>
>>>>>> In theory, the memory usage in Test1 should be
8 times of
Test2. In my
case, it is
about 6 times.<br>
>>>>>><br>
>>>>>> == Test3: Grid: 3072*256*768, Process Mesh:
96*8*24.
Sub-domain per
process:
32*32*32<br>
>>>>>> Here I get the out of memory error.<br>
>>>>>><br>
>>>>>> I tried to use -mg_coarse jacobi. In this way,
I don't need
to set
-mg_coarse_ksp_type
and
-mg_coarse_pc_type
explicitly,
right?<br>
>>>>>> The linear solver didn't work in this case.
Petsc output
some errors.<br>
>>>>>><br>
>>>>>> @Dave<br>
>>>>>> In test3, I use only one instance of
'Telescope'.
On the coarse
mesh of
'Telescope', I
used LU as the
preconditioner
instead of
SVD.<br>
>>>>>> If my set the levels correctly, then on the
last coarse
mesh of MG
where it calls
'Telescope',
the sub-domain
per process is
2*2*2.<br>
>>>>>> On the last coarse mesh of 'Telescope', there
is only one
grid point per
process.<br>
>>>>>> I still got the OOM error. The detailed petsc
option file is
attached.<br>
>>>>>><br>
>>>>>> Do you understand the expected memory usage for
the particular
parallel LU
implementation
you are using?
I don't
(seriously).
Replace LU
with bjacobi
and re-run
this test. My
point about
solver
debugging is
still valid.<br>
>>>>>><br>
>>>>>> And please send the result of KSPView so we can
see what is
actually used
in the
computations<br>
>>>>>><br>
>>>>>> Thanks<br>
>>>>>> Dave<br>
>>>>>><br>
>>>>>><br>
>>>>>> Thank you so much.<br>
>>>>>><br>
>>>>>> Frank<br>
>>>>>><br>
>>>>>><br>
>>>>>><br>
>>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:<br>
>>>>>> On Jul 6, 2016, at 4:19 PM, frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>>>><br>
>>>>>> Hi Barry,<br>
>>>>>><br>
>>>>>> Thank you for you advice.<br>
>>>>>> I tried three test. In the 1st test, the grid
is
3072*256*768
and the
process mesh
is 96*8*24.<br>
>>>>>> The linear solver is 'cg' the preconditioner is
'mg' and
'telescope' is
used as the
preconditioner
at the coarse
mesh.<br>
>>>>>> The system gives me the "Out of Memory" error
before the
linear system
is completely
solved.<br>
>>>>>> The info from '-ksp_view_pre' is attached. I
seems to me
that the error
occurs when it
reaches the
coarse mesh.<br>
>>>>>><br>
>>>>>> The 2nd test uses a grid of 1536*128*384 and
process mesh
is 96*8*24.
The 3rd
test
uses the same
grid but a
different
process mesh
48*4*12.<br>
>>>>>> Are you sure this is right? The total
matrix and
vector memory
usage goes
from 2nd test<br>
>>>>>> Vector  384  383   8,193,712  0.<br>
>>>>>> Matrix  103  103  11,508,688  0.<br>
>>>>>> to 3rd test<br>
>>>>>> Vector  384  383   1,590,520  0.<br>
>>>>>> Matrix  103  103   3,508,664  0.<br>
>>>>>> that is the memory usage got smaller but if you
have only
1/8th the
processes and
the same grid
it should have
gotten about 8
times bigger.
Did you maybe
cut the grid
by a factor of
8 also? If so
that still
doesn't
explain it
because the
memory usage
changed by a
factor of 5
something for
the vectors
and 3
something for
the matrices.<br>
>>>>>><br>
>>>>>><br>
>>>>>> The linear solver and petsc options in 2nd and
3rd tests are
the same in
1st test. The
linear solver
works fine in
both test.<br>
>>>>>> I attached the memory usage of the 2nd and 3rd
tests. The
memory info is
from the
option
'-log_summary'.
I tried to use
'-momery_info'
as you
suggested, but
in my case
petsc treated
it as an
unused option.
It output
nothing about
the memory. Do
I need to add
sth to my code
so I can use
'-memory_info'?<br>
>>>>>> Sorry, my mistake the option is
-memory_view<br>
>>>>>><br>
>>>>>> Can you run the one case with -memory_view
and -mg_coarse
jacobi
-ksp_max_it 1
(just so it
doesn't
iterate
forever) to
</blockquote></div></blockquote></div></blockquote></blockquote></div></blockquote></div></div></div></blockquote></div></div></blockquote></div></blockquote></div></div></blockquote></div></div></div></blockquote></div></div></div></div></div></blockquote></div></div></div></blockquote></div></blockquote><div> </div>