<p>
Hello MOAB dev,
</p>
<p>
I've attached a simplified version of my program that crashes, presumably after a particular number of calls to exchange_tags.
</p>
<p>
I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16).
</p>
<p>
It performs around 524378 iterations and then crashes (error file attached).
</p>
<p>
Can you please take a look at what Scott Parker from ALCF suggests about it:
</p>
<p>
-------- Original Message --------
</p>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<th align="right" valign="baseline">Subject: </th>
<td>Re: Job exiting early [ThermHydraX]</td>
</tr>
<tr>
<th align="right" valign="baseline">Date: </th>
<td>Fri, 9 May 2014 18:48:25 -0500</td>
</tr>
<tr>
<th align="right" valign="baseline">From: </th>
<td>Scott Parker </td>
</tr>
<tr>
<th align="right" valign="baseline">To: </th>
<td></td>
</tr>
</tbody>
</table>
<br />
<br />
Anton-<br />
<br />
I took a look at the core files and from the stack trace it appears
that the code is failing in an MPI_Isend call<br />
that is called from moab::ParallelComm::recv_buffer, which is called
from moab::ParallelComm::exchange_tags,<br />
called from main(). The stack looks like:<br />
<br />
main<br />
moab::ParallelComm::exchange_tags<br />
moab::ParallelComm::recv_buffer<br />
MPI_Isend<br />
<br />
I've been able to get the same failure and error message you are
seeing by having an MPI process call MPI_Isend<br />
when there are no matching receives. After 2 million Isends the
program exits with the error you are seeing. So<br />
I'm pretty sure you're ending up with a large number of outstanding
requests and the program is failing because<br />
it can't allocate space for new MPI_Request objects.<br />
<br />
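For reference, a minimal sketch (illustrative only, not code from the attached program) of the failure mode described above: one rank keeps posting MPI_Isend calls that no other rank ever matches with a receive, so outstanding MPI_Request objects accumulate until the runtime can no longer allocate new ones.<br />
<pre>
// Illustrative sketch of unmatched-Isend request exhaustion.
// Run with at least two ranks; the sending rank is expected to abort
// once the MPI runtime can no longer allocate request objects
// (around 2 million Isends in the test described above).
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int payload = 0;
  std::vector<MPI_Request> requests;     // never waited on or freed
  if (rank == 0) {
    for (long i = 0; i < 3000000; ++i) {
      MPI_Request req;
      // No matching receive is ever posted on rank 1, so every
      // request stays outstanding.
      MPI_Isend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
      requests.push_back(req);
    }
  }

  // Only a sketch: even if the loop completed, MPI_Finalize would
  // still block on the unmatched sends.
  MPI_Finalize();
  return 0;
}
</pre>
<br />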
I'd suggest looking at how Moab is calling MPI and how many requests
might be outstanding at any one time.<br />
Since the code is running for 5 hours and looks to be executing
hundreds of thousands of iterations I wonder<br />
if there is some sort of send-receive mismatch that is letting
requests accumulate. I think your best bet is to<br />
talk to the Moab folks and see if they have any ideas about why this
might be happening. <br />
<br />
One possibility is a load imbalance between processes - if you don't
have any MPI_Barriers or other collectives in<br />
your code you could try adding a barrier to synchronize the
processes.<br />
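For illustration only, one place such a barrier could go in a generic time-stepping loop (a sketch, not the attached program; the exchange step is only indicated by a comment):<br />
<pre>
// Sketch: a synchronizing barrier at the end of each iteration of a
// generic time-stepping loop. The commented exchange step stands in
// for the program's moab::ParallelComm::exchange_tags call.
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  const long niters = 1000000;           // illustrative count
  for (long iter = 0; iter < niters; ++iter) {
    // ... local work on owned cells ...
    // ... tag exchange between processors ...

    // Keep the ranks in step so a fast process cannot run far ahead
    // and accumulate sends its slower neighbours have not yet matched.
    MPI_Barrier(MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}
</pre>
<br />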
<br />
If the Moab guys can't help you and adding a barrier doesn't help I
can work with you to instrument the code to<br />
collect more information on how MPI is being called and we could
possibly pin down the source of the problem<br />
that way.<br />
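For reference, one standard way to do that kind of instrumentation is a PMPI interposition layer. A rough sketch (not tied to the attached program; the buffer signatures may need adjusting to match the local mpi.h, and report_outstanding_requests is a hypothetical helper) that counts requests started by Isend/Irecv and retired by Waitany, the call visible in the bgq_stack trace:<br />
<pre>
// Rough sketch of a PMPI layer counting outstanding nonblocking
// requests. Compile and link it into the application; a real tool
// would also wrap MPI_Wait/MPI_Waitall/MPI_Test* and be thread-safe.
#include <mpi.h>
#include <cstdio>

static long g_outstanding = 0;

extern "C" int MPI_Isend(void *buf, int count, MPI_Datatype dt, int dest,
                         int tag, MPI_Comm comm, MPI_Request *req) {
  ++g_outstanding;                                  // request created
  return PMPI_Isend(buf, count, dt, dest, tag, comm, req);
}

extern "C" int MPI_Irecv(void *buf, int count, MPI_Datatype dt, int src,
                         int tag, MPI_Comm comm, MPI_Request *req) {
  ++g_outstanding;                                  // request created
  return PMPI_Irecv(buf, count, dt, src, tag, comm, req);
}

extern "C" int MPI_Waitany(int n, MPI_Request reqs[], int *index,
                           MPI_Status *status) {
  int rc = PMPI_Waitany(n, reqs, index, status);
  if (*index != MPI_UNDEFINED) --g_outstanding;     // request retired
  return rc;
}

// Hypothetical helper: call once per iteration and watch for drift.
extern "C" void report_outstanding_requests(long iter) {
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    std::fprintf(stderr, "iter %ld: %ld requests outstanding\n",
                 iter, g_outstanding);
}
</pre>
<br />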
<br />
<br />
Scott<br />
<br />
<br />
<div class="moz-cite-prefix">
On 5/2/14, 11:14 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
wrote:<br />
</div>
<blockquote>
<p>
Hello Scott
</p>
<p>
The dir is /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
</p>
<div>
The run that produced the core files is 253035
</div>
<div>
</div>
<div>
I did another run with the line
MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
commented out, and it stopped at the very same iteration, #524378,
just getting a few lines further
</div>
<div>
</div>
<div>
I use the MOAB library and its function for exchanging data between
processors, so I don't think I can really count the MPI requests myself
</div>
<div>
</div>
<div>
Anton
</div>
<p>
<br />
On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
<br />
Can you point me to the directory where your binary and core
files are?<br />
<br />
The stack trace you sent shows a call to MPI_Waitany; do you
know how many MPI requests<br />
the code generally has outstanding at any time?<br />
<br />
-Scott<br />
<br />
<div class="moz-cite-prefix">
On 4/28/14, 4:30 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
wrote:<br />
</div>
<blockquote>
<p>
Hello Scott,
</p>
<p>
I did a rerun with the mentioned keys. The code was freshly
compiled; the makefile is attached just in case.
</p>
<p>
I've got 1024 core files. Two of them are attached.
</p>
<p>
I ran bgq_stack for core.0 and here's what I got:
</p>
<pre>
[akanaev@miralac1 pinit]$ bgq_stack pinit core.0
------------------------------------------------------------------------
Program : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
------------------------------------------------------------------------
+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN

00000000018334c0
_ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269

000000000170da28
PAMI_Context_trylock_advancev
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554

000000000155d0dc
PMPI_Waitany
/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239

00000000010e84e4
00000072.long_branch_r2off.H5Dget_space+0
:0

0000000001042be0
00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
:0

00000000019de058
generic_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

00000000019de354
__libc_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

0000000000000000
??
??:0
</pre>
<div>
<br />
</div>
<p>
> Have these sorts of runs succeeded in the past using
the same code base with no changes and similar input data?
</p>
<p>
This is the first time I've tried to run this code for
that long
</p>
<p>
</p>
<p>
Thanks
</p>
<p>
Anton
</p>
<p>
<br />
On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1109px">
<br />
Anton-<br />
<br />
Thanks, it's aborting because of a runtime error that
appears to be in the mpich layer. <br />
<br />
Can you rerun with "--env
BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1" added to your qsub
line - that should<br />
generate some core files on which you can run bgq_stack.<br />
<br />
The system software (driver) on Mira was updated this week
and I'd like to get a clearer picture of <br />
whether that could be related to your problem, so:<br />
<br />
Has your code been recompiled since Monday? If not, can
you recompile and try running again?<br />
<br />
Have these sorts of runs succeeded in the past using the
same code base with no changes and similar input data?<br />
<br />
-Scott<br />
<br />
<br />
<br />
<div class="moz-cite-prefix">
On 4/24/14, 2:59 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
</div>
<blockquote>
<p>
Sorry about the attached files, here they are
</p>
<p>
There are no core files after the exit; it looks like it
stopped because the requested time expired, but you can see
from the cobaltlog that about 5 hours passed (10 hours were
requested) before the exit
</p>
<p>
Anton
</p>
<p>
On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker
wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1029px">
<br />
Anton-<br />
<br />
Please send these emails to <a href="mailto:support@alcf.anl.gov" class="moz-txt-link-abbreviated">support@alcf.anl.gov</a> as
I may not always be available to investigate.<br />
<br />
I don't see the cobalt or error files attached, so I
can't really say anything about why your job may be<br />
failing. Do you get core files when the job crashes? If
so I'd recommend using 'bgq_stack '<br />
to try and get the file and line number where the
failure occurred. Knowing the line may be enough<br />
to let you figure it out; if not, you'll need to dump the
values of the variables at the time of the crash<br />
to get a clearer picture of what is going on.<br />
<br />
Scott<br />
<br />
<div class="moz-cite-prefix">
On 4/24/14, 1:36 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
</div>
<blockquote>
<p>
Hello Scott,
</p>
<p>
I've tried twice to run a 10-hour, 1024-core job on
Mira in mode c16 with
</p>
<p>
qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX
pinit Lid_128x128x1p1024.h5m
</p>
<p>
Both times the job exited earlier than expected, on the
same iteration and after the same error, while executing
the following section (it's between two couts):
</p>
<pre>
...
//LOOP OVER OWNED CELLS
double r = 0;
for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
  EntityHandle ent = *it;
  int cellid = mb->id_from_handle(ent);
  double Vol_;
  double u_;
  double v_;
  double w_;
  double r1,r2,r3;
  double tmp;

  result = mb->tag_get_data(u, &ent, 1, &u_);
  PRINT_LAST_ERROR;
  result = mb->tag_get_data(v, &ent, 1, &v_);
  PRINT_LAST_ERROR;
  result = mb->tag_get_data(w, &ent, 1, &w_);
  PRINT_LAST_ERROR;

  double result;
  SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
  r1 = (sound + fabs(result))/CG[cellid][2].length;

  SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
  r2 = (sound + fabs(result))/CG[cellid][3].length;

  SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
  r3 = (sound + fabs(result))/CG[cellid][5].length;

  tmp = MAX3(r1,r2,r3);
  r = MAX(tmp,r);
}

double rmax;
MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
tau = CFL/rmax;
ttime+=tau;
...
</pre>
<p>
So it may be the Allreduce
</p>
<p>
I've attached the cobaltlog and error files of both
runs
</p>
<p>
Can you please take a look and suggest further
debugging steps?
</p>
<p>
</p>
<p>
Thanks
</p>
<p>
Anton
</p>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br />
</blockquote>
<p>
</p>
</blockquote>
<br />