<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Oct 6, 2016 at 7:33 PM, frank <span dir="ltr"><<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<p>Dear Dave,</p>
Following your advice, I solved the identical equation twice and timed the two solves separately. The results are below:<br>
<br>
Test: 1024^3 grid points<br>
Cores#   reduction factor   MG levels#   time of 1st solve (s)   time of 2nd solve (s)<br>
4096     64                 6 + 3        3.85                    1.75<br>
8192     128                5 + 3        5.52                    0.91<br>
16384    256                5 + 3        5.37                    0.52<br>
32768    512                5 + 4        3.03                    0.36<br>
32768    64 | 8             4 | 3 | 3    2.80                    0.43<br>
65536    1024               5 + 4        3.38                    0.59<br>
65536    32 | 32            4 | 4 | 3    2.14                    0.22<br>
<br>
I also attached the log_view info from all the runs. The files are named by the cores# + reduction factor.<br>
The ksp_view and petsc_options for the 1st run are also included. The others are similar; the only differences are the reduction factor and the MG levels.<br>
<br>
** The time for the 1st solve is generally much larger. Is this
because the ksp solver on the sub-communicator is set up during the
1st solve?<br></div></blockquote><div><br></div><div>All setup is done in the first solve.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
** The time for the 1st solve does not scale.<br>
    In practice, I am solving a variable-coefficient Poisson equation. I need to build the matrix every time step. Therefore, each step is similar to the 1st solve, which does not scale. Is there a way I can improve the performance?<br></div></blockquote><div><br></div><div>You could use rediscretization instead of Galerkin to produce the coarse operators.</div><div> </div>
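<div>As a rough illustration of that suggestion (a minimal sketch, not the exact telescope configuration; ComputeMatrix and ComputeRHS are assumed user callbacks that discretize on whatever DMDA they are handed):</div>
<div><br></div>
#include <petscksp.h><br>
#include <petscdmda.h><br>
<br>
/* Assumed user callbacks that assemble the operator / rhs on a given (possibly coarsened) DMDA. */<br>
extern PetscErrorCode ComputeMatrix(KSP, Mat, Mat, void *);<br>
extern PetscErrorCode ComputeRHS(KSP, Vec, void *);<br>
<br>
/* Sketch (error checking omitted): with a DM and a compute-operators callback attached,<br>
   PCMG calls the callback on every coarsened DMDA, i.e. it rediscretizes the coarse<br>
   operators instead of forming Galerkin products. */<br>
PetscErrorCode SolveWithRediscretizedMG(DM da)<br>
{<br>
  KSP ksp;<br>
  KSPCreate(PETSC_COMM_WORLD, &ksp);<br>
  KSPSetDM(ksp, da);<br>
  KSPSetComputeOperators(ksp, ComputeMatrix, NULL);<br>
  KSPSetComputeRHS(ksp, ComputeRHS, NULL);<br>
  KSPSetFromOptions(ksp);     /* e.g. -ksp_type cg -pc_type mg -pc_mg_levels 5 */<br>
  KSPSolve(ksp, NULL, NULL);  /* rhs and operators come from the callbacks */<br>
  /* retrieve the solution with KSPGetSolution() before KSPDestroy() */<br>
  KSPDestroy(&ksp);<br>
  return 0;<br>
}<br>
<div><br></div>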
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
** The 2nd solve scales, but not that well beyond 16384 cores.<br></div></blockquote><div><br></div><div>How well were you hoping for? This is strong scaling, which has an Amdahl's Law limit.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
It seems to me that the performance depends on how the MG levels are tuned on the sub-communicator(s).<br>
Are there any general strategies for how to distribute the levels, or for when to use multiple sub-communicators? <br></div></blockquote><div><br></div><div>Also, you use CG/MG when FMG by itself would probably be faster. Your smoother is likely not strong enough; you should use something like V(2,2). There is a lot of tuning that is possible, but it is difficult to automate.</div>
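<div>As a rough sketch of what that could look like on the command line (the option prefixes would need adjusting to wherever the MG actually lives in the nested solver, e.g. mg_coarse_telescope_...; the Chebyshev/Jacobi smoother choice is only an illustration):</div>
<div><br></div>
-ksp_type richardson<br>
-pc_type mg<br>
-pc_mg_type full<br>
-mg_levels_ksp_type chebyshev<br>
-mg_levels_ksp_max_it 2<br>
-mg_levels_pc_type jacobi<br>
<div><br></div>
<div>Here -pc_mg_type full requests FMG, and -mg_levels_ksp_max_it 2 gives two pre- and two post-smoothing applications per level, i.e. roughly a V(2,2) cycle.</div>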
<div><br></div><div> Thanks,</div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
Thank you.<br>
<br>
Regards,<br>
Frank<div><div class="h5"><br>
<br>
<br>
<br>
<br>
<div class="m_-3012109709631955293moz-cite-prefix">On 10/04/2016 12:56 PM, Dave May wrote:<br>
</div>
<blockquote type="cite"><br>
<br>
On Tuesday, 4 October 2016, frank <<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<p>Hi,</p>
This question is a follow-up to the thread "Question about memory usage in Multigrid preconditioner".<br>
I used to have the "Out of Memory" (OOM) problem when using the CG+Telescope MG solver with 32768 cores. Adding the "-matrap 0 -matptap_scalable" options did solve that problem. <br>
<br>
Then I tested the scalability by solving a 3D Poisson equation for one step. I used one sub-communicator in all the tests. The only differences between the petsc options in those tests are: (1) the pc_telescope_reduction_factor; (2) the number of multigrid levels in the up/down solver. The function "ksp_solve" is timed. It is kind of slow and doesn't scale at all. <br>
<br>
Test1: 512^3 grid points<br>
Core#    telescope_reduction_factor   MG levels# for up/down solver   Time for KSPSolve (s)<br>
512      8                            4 / 3                           6.2466<br>
4096     64                           5 / 3                           0.9361<br>
32768    64                           4 / 3                           4.8914<br>
<br>
Test2: 1024^3 grid points<br>
Core#    telescope_reduction_factor   MG levels# for up/down solver   Time for KSPSolve (s)<br>
4096     64                           5 / 4                           3.4139<br>
8192     128                          5 / 4                           2.4196<br>
16384    32                           5 / 3                           5.4150<br>
32768    64                           5 / 3                           5.6067<br>
65536    128                          5 / 3                           6.5219</div>
</blockquote>
<div><br>
</div>
<div>You have to be very careful how you interpret these numbers.
Your solver contains nested calls to KSPSolve, and unfortunately
as a result the numbers you report include setup time. This will
remain true even if you call KSPSetUp on the outermost KSP. </div>
<div><br>
</div>
<div>Your email concerns the scalability of the solver application, so
let's focus on that issue.</div>
<div><br>
</div>
<div>The only way to clearly separate setup from solve time is
to perform two identical solves. The second solve will not
require any setup. You should monitor the second solve via a new
PetscStage.</div>
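<div>A minimal sketch of that timing pattern (C names, error checking omitted; the same routines exist in the Fortran interface):</div>
<div><br></div>
#include <petscksp.h><br>
<br>
/* Run two identical solves; the second, setup-free solve gets its own logging<br>
   stage so that -log_view reports it separately. */<br>
PetscErrorCode TimedSolves(KSP ksp, Vec b, Vec x)<br>
{<br>
  PetscLogStage stage;<br>
  KSPSolve(ksp, b, x);                             /* 1st solve: includes all (nested) setup */<br>
  PetscLogStageRegister("solve-no-setup", &stage);<br>
  PetscLogStagePush(stage);<br>
  KSPSolve(ksp, b, x);                             /* 2nd identical solve: solve time only */<br>
  PetscLogStagePop();<br>
  return 0;<br>
}<br>
<div><br></div>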
<div><br>
</div>
<div>This was what I did in the telescope paper. It was the only
way to understand the setup cost (and scaling) cf the solve time
(and scaling).</div>
<div><br>
</div>
<div>Thanks</div>
<div> Dave</div>
<div>
<div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> I guess I didn't set
the MG levels properly. What would be the efficient way to
arrange the MG levels?<br>
Also which preconditionr at the coarse mesh of the 2nd
communicator should I use to improve the performance? <br>
<br>
I attached the test code and the petsc options file for
the 1024^3 cube with 32768 cores. <br>
<br>
Thank you.<br>
<br>
Regards,<br>
Frank<br>
<br>
<br>
<br>
<br>
<br>
<br>
<div>On 09/15/2016 03:35 AM, Dave May wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>Hi all,<br>
<br>
</div>
<div>The only unexpected memory usage I can see is associated with the call to MatPtAP().<br>
</div>
<div>Here is something you can try
immediately.<br>
</div>
</div>
Run your code with the additional options<br>
-matrap 0 -matptap_scalable<br>
<br>
</div>
<div>I didn't realize this before, but the default behaviour of MatPtAP in parallel is actually to explicitly form the transpose of P (i.e. assemble R = P^T) and then compute R.A.P. <br>
You don't want to do this. The option -matrap 0 resolves this issue.<br>
</div>
<div><br>
</div>
<div>The implementation of P^T.A.P has two
variants. <br>
The scalable implementation (with respect to
memory usage) is selected via the second option
-matptap_scalable.</div>
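<div>For reference, a minimal sketch of the product these options control (error checking omitted; here A is the fine-level operator and P the interpolation, which PCMG handles internally when Galerkin coarsening is used):</div>
<div><br></div>
#include <petscmat.h><br>
<br>
/* Form the Galerkin coarse operator Ac = P^T A P. The options -matrap 0 and<br>
   -matptap_scalable change how this product is computed, not its result. */<br>
PetscErrorCode FormCoarseOperator(Mat A, Mat P, Mat *Ac)<br>
{<br>
  MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, Ac);<br>
  return 0;<br>
}<br>
<div><br></div>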
<div><br>
</div>
<div>Try it out - I see a significant memory
reduction using these options for particular
mesh sizes / partitions.<br>
</div>
<div><br>
</div>
I've attached a cleaned up version of the code you
sent me.<br>
</div>
There were a number of memory leaks and other
issues.<br>
</div>
<div>The main points are:<br>
</div>
* You should call DMDAVecGetArrayF90() before
VecAssembly{Begin,End}<br>
* You should call PetscFinalize(), otherwise the
option -log_summary (-log_view) will not display
anything once the program has completed.<br>
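<div>A minimal sketch of the required structure (C names, error handling simplified; the attached code uses the F90 interface, but the ordering is the same):</div>
<div><br></div>
#include <petscsys.h><br>
<br>
int main(int argc, char **argv)<br>
{<br>
  PetscErrorCode ierr;<br>
  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;<br>
  /* ... create the DMDA, fill vectors via DMDAVecGetArray/RestoreArray (F90 variants<br>
         in Fortran) before VecAssemblyBegin/End, build the matrix, call KSPSolve ... */<br>
  ierr = PetscFinalize();  /* without this, -log_summary / -log_view print nothing */<br>
  return ierr;<br>
}<br>
<div><br></div>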
<div>
<div>
<div><br>
<br>
</div>
<div>Thanks,<br>
</div>
<div> Dave<br>
</div>
<div>
<div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 15 September 2016 at
08:03, Hengjie Wang <span dir="ltr"><<a>hengjiew@uci.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> Hi Dave,<br>
<br>
Sorry, I should have put more comments in to explain the code. <br>
The number of processes in each dimension is the same: Px = Py = Pz = P. The same holds for the domain size.<br>
So if you want to run the code on a 512^3 grid with 16^3 cores, you need to set "-N 512 -P 16" on the command line, as in the example below.<br>
I added more comments and also fixed an error in the attached code. (The error only affects the accuracy of the solution, not the memory usage.)
<br>
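<div>For example (with a hypothetical executable name), that 512^3 run on 16^3 = 4096 cores would be launched as something like:</div>
<div>mpiexec -n 4096 ./test_poisson -N 512 -P 16 -options_file petsc_options.txt<br></div>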
<div><br>
Thank you.<span><font color="#888888"><br>
Frank</font></span>
<div>
<div><br>
<br>
On 9/14/2016 9:05 PM, Dave May wrote:<br>
</div>
</div>
</div>
<div>
<div>
<blockquote type="cite"><br>
<br>
On Thursday, 15 September 2016, Dave May
<<a>dave.mayhem23@gmail.com</a>>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
<br>
On Thursday, 15 September 2016, frank
<<a>hengjiew@uci.edu</a>>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
Hi, <br>
<br>
I wrote a simple code to reproduce the error. I hope this can help diagnose the problem.<br>
The code just solves a 3D Poisson equation. </div>
</blockquote>
<div><br>
</div>
<div>Why is the stencil width a runtime
parameter?? And why is the default
value 2? For 7-pnt FD Laplace, you
only need a stencil width of 1. </div>
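<div>For instance, a minimal sketch (error checking omitted; the boundary types are only illustrative, and N and P mirror the -N/-P parameters of the test code):</div>
<div><br></div>
#include <petscdmda.h><br>
<br>
/* A 7-point FD Laplacian only needs a star stencil of width 1. */<br>
PetscErrorCode CreatePoissonDMDA(PetscInt N, PetscInt P, DM *da)<br>
{<br>
  DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,<br>
               DMDA_STENCIL_STAR, N, N, N, P, P, P,<br>
               1 /* dof */, 1 /* stencil width */, NULL, NULL, NULL, da);<br>
  return 0;<br>
}<br>
<div><br></div>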
<div><br>
</div>
<div>Was this choice made to mimic
something in the real application
code?</div>
</blockquote>
<div><br>
</div>
Please ignore - I misunderstood your usage
of the param set by -P
<div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><br>
I ran the code on a 1024^3 mesh. The process partition is 32 * 32 * 32. That's when I reproduce the OOM error. Each core has about 2G of memory.<br>
I also ran the code on a 512^3 mesh with 16 * 16 * 16 processes. The ksp solver works fine. <br>
I attached the code, ksp_view_pre's output and my petsc options file.<br>
<br>
Thank you.<br>
Frank<br>
<div><br>
On 09/09/2016 06:38 PM, Hengjie
Wang wrote:<br>
</div>
<blockquote type="cite">Hi Barry,
<div><br>
</div>
<div>I checked. On the supercomputer, I had the option "-ksp_view_pre" but it is not in the file I sent you. I am sorry for the confusion.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Frank<span></span><br>
<br>
On Friday, September 9, 2016,
Barry Smith <<a>bsmith@mcs.anl.gov</a>>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
> On Sep 9, 2016, at 3:11
PM, frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
><br>
> Hi Barry,<br>
><br>
> I think the first KSP
view output is from
-ksp_view_pre. Before I
submitted the test, I was
not sure whether there would
be OOM error or not. So I
added both -ksp_view_pre and
-ksp_view.<br>
<br>
But the options file you
sent specifically does NOT
list the -ksp_view_pre so
how could it be from that?<br>
<br>
Sorry to be pedantic but
I've spent too much time in
the past trying to debug
from incorrect information
and want to make sure that
the information I have is
correct before thinking.
Please recheck exactly what
happened. Rerun with the
exact input file you emailed
if that is needed.<br>
<br>
Barry<br>
<br>
><br>
> Frank<br>
><br>
><br>
> On 09/09/2016 12:38 PM,
Barry Smith wrote:<br>
>> Why does ksp_view2.txt have two KSP views in it while ksp_view1.txt has only one KSPView in it? Did you run two different solves in the 2nd case but not the 1st?<br>
>><br>
>> Barry<br>
>><br>
>><br>
>><br>
>>> On Sep 9, 2016,
at 10:56 AM, frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
>>><br>
>>> Hi,<br>
>>><br>
>>> I want to continue digging into the memory problem here.<br>
>>> I did find a workaround in the past, which is to use fewer cores per node so that each core has 8G of memory. However this is inefficient and expensive. I hope to locate the place that uses the most memory.<br>
>>><br>
>>> Here is a brief summary of the tests I did in the past:<br>
>>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12<br>
>>> Maximum (over computational time) process memory: total 7.0727e+08<br>
>>> Current process memory: total 7.0727e+08<br>
>>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11<br>
>>> Current space PetscMalloc()ed: total 1.8275e+09<br>
>>><br>
>>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24<br>
>>> Maximum (over computational time) process memory: total 5.9431e+09<br>
>>> Current process memory: total 5.9431e+09<br>
>>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12<br>
>>> Current space PetscMalloc()ed: total 5.4844e+09<br>
>>><br>
>>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24<br>
>>> OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".<br>
>>><br>
>>> I attached the output of ksp_view (the third test's output is from ksp_view_pre), memory_view and also the petsc options.<br>
>>> In all the tests, each core can access about 2G of memory. In test3, there are 4223139840 non-zeros in the matrix. This will consume about 1.74M per process, using double precision. Considering some extra memory used to store the integer indices, 2G of memory should still be way more than enough.<br>
>>><br>
>>> Is there a way
to find out which part of
KSPSolve uses the most
memory?<br>
>>> Thank you so
much.<br>
>>><br>
>>> BTW, there are 4 options that remain unused and I don't understand why they are omitted:<br>
>>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly<br>
>>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi<br>
>>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1<br>
>>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson<br>
>>><br>
>>><br>
>>> Regards,<br>
>>> Frank<br>
>>><br>
>>> On 07/13/2016
05:47 PM, Dave May wrote:<br>
>>>><br>
>>>> On 14 July
2016 at 01:07, frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>> Hi Dave,<br>
>>>><br>
>>>> Sorry for
the late reply.<br>
>>>> Thank you
so much for your detailed
reply.<br>
>>>><br>
>>>> I have a
question about the
estimation of the memory
usage. There are 4223139840
allocated non-zeros and
18432 MPI processes. Double
precision is used. So the
memory per process is:<br>
>>>>
4223139840 * 8bytes / 18432
/ 1024 / 1024 = 1.74M ?<br>
>>>> Did I do
sth wrong here? Because this
seems too small.<br>
>>>><br>
>>>> No - I
totally f***ed it up. You
are correct. That'll teach
me for fumbling around with
my iphone calculator and not
using my brain. (Note that
to convert to MB just divide
by 1e6, not 1024^2 -
although I apparently cannot
convert between units
correctly....)<br>
>>>><br>
>>>> From the
PETSc objects associated
with the solver, It looks
like it _should_ run with
2GB per MPI rank. Sorry for
my mistake. Possibilities
are: somewhere in your usage
of PETSc you've introduced a
memory leak; PETSc is doing
a huge over allocation (e.g.
as per our discussion of
MatPtAP); or in your
application code there are
other objects you have
forgotten to log the memory
for.<br>
>>>><br>
>>>><br>
>>>><br>
>>>> I am
running this job on
Bluewater<br>
>>>> I am using the 7-point FD stencil in 3D.<br>
>>>><br>
>>>> I thought
so on both counts.<br>
>>>><br>
>>>> I apologize that I made a stupid mistake in computing the memory per core. My settings meant each core could access only 2G of memory on average, instead of the 8G I mentioned in the previous email. I re-ran the job with 8G of memory per core on average and there is no "Out Of Memory" error. I will do more tests to see if there is still some memory issue.<br>
>>>><br>
>>>> Ok. I'd
still like to know where the
memory was being used since
my estimates were off.<br>
>>>><br>
>>>><br>
>>>> Thanks,<br>
>>>> Dave<br>
>>>><br>
>>>> Regards,<br>
>>>> Frank<br>
>>>><br>
>>>><br>
>>>><br>
>>>> On
07/11/2016 01:18 PM, Dave
May wrote:<br>
>>>>> Hi
Frank,<br>
>>>>><br>
>>>>><br>
>>>>> On 11
July 2016 at 19:14, frank
<<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>>> Hi
Dave,<br>
>>>>><br>
>>>>> I
re-run the test using
bjacobi as the
preconditioner on the coarse
mesh of telescope. The Grid
is 3072*256*768 and process
mesh is 96*8*24. The petsc
option file is attached.<br>
>>>>> I still
got the "Out Of Memory"
error. The error occurred
before the linear solver
finished one step. So I
don't have the full info
from ksp_view. The info from
ksp_view_pre is attached.<br>
>>>>><br>
>>>>> Okay -
that is essentially useless
(sorry)<br>
>>>>><br>
>>>>> It
seems to me that the error
occurred when the
decomposition was going to
be changed.<br>
>>>>><br>
>>>>> Based
on what information?<br>
>>>>> Running
with -info would give us
more clues, but will create
a ton of output.<br>
>>>>> Please
try running the case which
failed with -info<br>
>>>>> I had
another test with a grid of
1536*128*384 and the same
process mesh as above. There
was no error. The ksp_view
info is attached for
comparison.<br>
>>>>> Thank
you.<br>
>>>>><br>
>>>>><br>
>>>>> [3]
Here is my crude estimate of
your memory usage.<br>
>>>>> I'll
target the biggest memory
hogs only to get an order of
magnitude estimate<br>
>>>>><br>
>>>>> * The
Fine grid operator contains
4223139840 non-zeros -->
1.8 GB per MPI rank assuming
double precision.<br>
>>>>> The
indices for the AIJ could
amount to another 0.3 GB
(assuming 32 bit integers)<br>
>>>>><br>
>>>>> * You
use 5 levels of coarsening,
so the other operators
should represent
(collectively)<br>
>>>>> 2.1 / 8
+ 2.1/8^2 + 2.1/8^3 +
2.1/8^4 ~ 300 MB per MPI
rank on the communicator
with 18432 ranks.<br>
>>>>> The
coarse grid should consume ~
0.5 MB per MPI rank on the
communicator with 18432
ranks.<br>
>>>>><br>
>>>>> * You
use a reduction factor of
64, making the new
communicator with 288 MPI
ranks.<br>
>>>>>
PCTelescope will first
gather a temporary matrix
associated with your coarse
level operator assuming a
comm size of 288 living on
the comm with size 18432.<br>
>>>>> This
matrix will require
approximately 0.5 * 64 = 32
MB per core on the 288
ranks.<br>
>>>>> This
matrix is then used to form
a new MPIAIJ matrix on the
subcomm, thus require
another 32 MB per rank.<br>
>>>>> The
temporary matrix is now
destroyed.<br>
>>>>><br>
>>>>> *
Because a DMDA is detected,
a permutation matrix is
assembled.<br>
>>>>> This
requires 2 doubles per point
in the DMDA.<br>
>>>>> Your
coarse DMDA contains 92 x 16
x 48 points.<br>
>>>>> Thus
the permutation matrix will
require < 1 MB per MPI
rank on the sub-comm.<br>
>>>>><br>
>>>>> *
Lastly, the matrix is
permuted. This uses
MatPtAP(), but the resulting
operator will have the same
memory footprint as the
unpermuted matrix (32 MB).
At any stage in PCTelescope,
only 2 operators of size 32
MB are held in memory when
the DMDA is provided.<br>
>>>>><br>
>>>>> From my
rough estimates, the worst
case memory foot print for
any given core, given your
options is approximately<br>
>>>>> 2100 MB
+ 300 MB + 32 MB + 32 MB + 1
MB = 2465 MB<br>
>>>>> This is
way below 8 GB.<br>
>>>>><br>
>>>>> Note
this estimate completely
ignores:<br>
>>>>> (1) the
memory required for the
restriction operator,<br>
>>>>> (2) the
potential growth in the
number of non-zeros per row
due to Galerkin coarsening
(I wished -ksp_view_pre
reported the output from
MatView so we could see the
number of non-zeros required
by the coarse level
operators)<br>
>>>>> (3) all
temporary vectors required
by the CG solver, and those
required by the smoothers.<br>
>>>>> (4)
internal memory allocated by
MatPtAP<br>
>>>>> (5)
memory associated with IS's
used within PCTelescope<br>
>>>>><br>
>>>>> So
either I am completely off
in my estimates, or you have
not carefully estimated the
memory usage of your
application code. Hopefully
others might examine/correct
my rough estimates<br>
>>>>><br>
>>>>> Since I don't have your code I cannot assess the latter.<br>
>>>>> Since I
don't have access to the
same machine you are running
on, I think we need to take
a step back.<br>
>>>>><br>
>>>>> [1]
What machine are you running
on? Send me a URL if its
available<br>
>>>>><br>
>>>>> [2]
What discretization are you
using? (I am guessing a
scalar 7 point FD stencil)<br>
>>>>> If it's
a 7 point FD stencil, we
should be able to examine
the memory usage of your
solver configuration using a
standard, light weight
existing PETSc example, run
on your machine at the same
scale.<br>
>>>>> This
would hopefully enable us to
correctly evaluate the
actual memory usage required
by the solver configuration
you are using.<br>
>>>>><br>
>>>>> Thanks,<br>
>>>>> Dave<br>
>>>>><br>
>>>>><br>
>>>>> Frank<br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>> On
07/08/2016 10:38 PM, Dave
May wrote:<br>
>>>>>><br>
>>>>>> On
Saturday, 9 July 2016, frank
<<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>>>> Hi
Barry and Dave,<br>
>>>>>><br>
>>>>>>
Thank both of you for the
advice.<br>
>>>>>><br>
>>>>>>
@Barry<br>
>>>>>> I
made a mistake in the file
names in last email. I
attached the correct files
this time.<br>
>>>>>> For
all the three tests,
'Telescope' is used as the
coarse preconditioner.<br>
>>>>>><br>
>>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12<br>
>>>>>> Part of the memory usage: Vector 125 124 3971904 0.<br>
>>>>>> Matrix 101 101 9462372 0.<br>
>>>>>><br>
>>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24<br>
>>>>>> Part of the memory usage: Vector 125 124 681672 0.<br>
>>>>>> Matrix 101 101 1462180 0.<br>
>>>>>><br>
>>>>>> In
theory, the memory usage in
Test1 should be 8 times of
Test2. In my case, it is
about 6 times.<br>
>>>>>><br>
>>>>>> ==
Test3: Grid: 3072*256*768,
Process Mesh: 96*8*24.
Sub-domain per process:
32*32*32<br>
>>>>>>
Here I get the out of memory
error.<br>
>>>>>><br>
>>>>>> I
tried to use -mg_coarse
jacobi. In this way, I don't
need to set
-mg_coarse_ksp_type and
-mg_coarse_pc_type
explicitly, right?<br>
>>>>>> The
linear solver didn't work in
this case. Petsc output some
errors.<br>
>>>>>><br>
>>>>>>
@Dave<br>
>>>>>> In
test3, I use only one
instance of 'Telescope'. On
the coarse mesh of
'Telescope', I used LU as
the preconditioner instead
of SVD.<br>
>>>>>> If I set the levels correctly, then on the last coarse mesh of MG where it calls 'Telescope', the sub-domain per process is 2*2*2.<br>
>>>>>> On
the last coarse mesh of
'Telescope', there is only
one grid point per process.<br>
>>>>>> I
still got the OOM error. The
detailed petsc option file
is attached.<br>
>>>>>><br>
>>>>>> Do
you understand the expected
memory usage for the
particular parallel LU
implementation you are
using? I don't (seriously).
Replace LU with bjacobi and
re-run this test. My point
about solver debugging is
still valid.<br>
>>>>>><br>
>>>>>> And
please send the result of
KSPView so we can see what
is actually used in the
computations<br>
>>>>>><br>
>>>>>>
Thanks<br>
>>>>>>
Dave<br>
>>>>>><br>
>>>>>><br>
>>>>>>
Thank you so much.<br>
>>>>>><br>
>>>>>>
Frank<br>
>>>>>><br>
>>>>>><br>
>>>>>><br>
>>>>>> On
07/06/2016 02:51 PM, Barry
Smith wrote:<br>
>>>>>> On
Jul 6, 2016, at 4:19 PM,
frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>>>><br>
>>>>>> Hi
Barry,<br>
>>>>>><br>
>>>>>>
Thank you for you advice.<br>
>>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24.<br>
>>>>>> The
linear solver is 'cg' the
preconditioner is 'mg' and
'telescope' is used as the
preconditioner at the coarse
mesh.<br>
>>>>>> The
system gives me the "Out of
Memory" error before the
linear system is completely
solved.<br>
>>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.<br>
>>>>>><br>
>>>>>> The
2nd test uses a grid of
1536*128*384 and process
mesh is 96*8*24. The 3rd
test uses the
same grid but a different
process mesh 48*4*12.<br>
>>>>>> Are you sure this is right? The total matrix and vector memory usage goes from the 2nd test<br>
>>>>>> Vector 384 383 8,193,712 0.<br>
>>>>>> Matrix 103 103 11,508,688 0.<br>
>>>>>> to the 3rd test<br>
>>>>>> Vector 384 383 1,590,520 0.<br>
>>>>>> Matrix 103 103 3,508,664 0.<br>
>>>>>> that is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so, that still doesn't explain it because the memory usage changed by a factor of 5 something for the vectors and 3 something for the matrices.<br>
>>>>>><br>
>>>>>><br>
>>>>>> The
linear solver and petsc
options in 2nd and 3rd tests
are the same in 1st test.
The linear solver works fine
in both test.<br>
>>>>>> I
attached the memory usage of
the 2nd and 3rd tests. The
memory info is from the
option '-log_summary'. I
tried to use '-momery_info'
as you suggested, but in my
case petsc treated it as an
unused option. It output
nothing about the memory. Do
I need to add sth to my code
so I can use '-memory_info'?<br>
>>>>>> Sorry, my mistake; the option is -memory_view<br>
>>>>>><br>
>>>>>>
Can you run the one case
with -memory_view and
-mg_coarse jacobi
-ksp_max_it 1 (just so it
doesn't iterate forever) to
see how much memory is used
without the telescope? Also
run case 2 the same way.<br>
>>>>>><br>
>>>>>>
Barry<br>
>>>>>><br>
>>>>>><br>
>>>>>><br>
>>>>>> In
both tests the memory usage
is not large.<br>
>>>>>><br>
>>>>>> It
seems to me that it might be
the 'telescope'
preconditioner that
allocated a lot of memory
and caused the error in the
1st test.<br>
>>>>>> Is there a way to show how much memory it allocated?<br>
>>>>>><br>
>>>>>>
Frank<br>
>>>>>><br>
>>>>>> On
07/05/2016 03:37 PM, Barry
Smith wrote:<br>
>>>>>>
Frank,<br>
>>>>>><br>
>>>>>>
You can run with
-ksp_view_pre to have it
"view" the KSP before the
solve so hopefully it gets
that far.<br>
>>>>>><br>
>>>>>> Please run the problem that does fit with -memory_info; when the problem completes it will show the "high water mark" for PETSc allocated memory and total memory used. We first want to look at these numbers to see if it is using more memory than you expect. You could also run with, say, half the grid spacing to see how the memory usage scales with the increase in grid points. Make the runs also with -log_view and send all the output from these options.<br>
>>>>>><br>
>>>>>>
Barry<br>
>>>>>><br>
>>>>>> On
Jul 5, 2016, at 5:23 PM,
frank <<a>hengjiew@uci.edu</a>>
wrote:<br>
>>>>>><br>
>>>>>> Hi,<br>
>>>>>><br>
>>>>>> I
am using the CG ksp solver
and Multigrid
preconditioner to solve a
linear system in parallel.<br>
>>>>>> I
chose to use the 'Telescope'
as the preconditioner on the
coarse mesh for its good
performance.<br>
>>>>>> The
petsc options file is
attached.<br>
>>>>>><br>
>>>>>> The
domain is a 3d box.<br>
>>>>>> It
works well when the grid is
1536*128*384 and the process
mesh is 96*8*24. When I
double the size of grid and
keep the
same process mesh and petsc
options, I get an "out of
memory" error from the
super-cluster I am using.<br>
>>>>>>
Each process has access to
at least 8G memory, which
should be more than enough
for my application. I am
sure that all the other
parts of my code( except the
linear solver ) do not use
much memory. So I doubt if
there is something wrong
with the linear solver.<br>
>>>>>> The
error occurs before the
linear system is completely
solved so I don't have the
info from ksp view. I am not
able to re-produce the error
with a smaller problem
either.<br>
>>>>>> In
addition, I tried to use
the block jacobi as the
preconditioner with the same
grid and same decomposition.
The linear solver runs
extremely slow but there is
no memory error.<br>
>>>>>><br>
>>>>>> How can I diagnose what exactly causes the error?<br>
>>>>>>
Thank you so much.<br>
>>>>>><br>
>>>>>>
Frank<br>
>>>>>>
<petsc_options.txt><br>
>>>>>>
<ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt><br>
>>>>>><br>
>>>>><br>
>>>><br>
>>>
<ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt><br>
><br>
<br>
</blockquote>
</div>
</blockquote>
<br>
</div>
</blockquote>
<div> </div>
</blockquote>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div>
</blockquote>
<div> </div>
<div> </div>
<div> </div>
</div>
</div>
</blockquote>
<br>
</div></div></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>
</div></div>