<html><body><div style="color:#000; background-color:#fff; font-family:verdana, helvetica, sans-serif;font-size:10pt"><div class="" style=""><span class="" style="">Here is the test case. Please apply the patch to Makefile.am in moab/test/parallel to add the new test case, and the patch to moab/src/parallel/ParallelComm.cpp, in which I moved the call that filters out duplicate vertices into a lower-level subroutine so that both versions of assign_global_id() work the same way.</span></div><div style="color: rgb(0, 0, 0); font-size: 13px; font-family: verdana, helvetica, sans-serif; background-color: transparent; font-style: normal;" class=""><span class="" style=""><br class="" style=""></span></div><div style="color: rgb(0, 0, 0); font-size: 13px; font-family: verdana, helvetica, sans-serif; background-color: transparent; font-style: normal;" class=""><span class="" style="">After building MOAB in parallel and building the test case, run</span></div><div style="color: rgb(0, 0,
0); font-size: 13px; font-family: verdana, helvetica, sans-serif; background-color: transparent; font-style: normal;" class=""><span class="" style="">mpiexec -np 4 migrate_test</span></div><div style="color: rgb(0, 0, 0); font-size: 13px; font-family: verdana, helvetica, sans-serif; background-color: transparent; font-style: normal;" class=""><span class="" style=""><br class="" style=""></span></div><div style="color: rgb(0, 0, 0); font-size: 13px; font-family: verdana, helvetica, sans-serif; background-color: transparent; font-style: normal;" class=""><span class="" style="">and I got the following output:</span></div><div class="" style=""></div><div class="" style=""> </div><div class="" style=""><div class="" style="">~/moab/parallelbuild/test/parallel (master) >mpiexec -np 4 migrate_test</div><div class="" style=""><br class="" style=""></div><div class="" style="">Running test_create_tag ...</div><div class="" style="">Running
test_create_tag ...</div><div class="" style="">Running test_create_tag ...</div><div class="" style="">edges.size() = edges.size() = 4</div><div class="" style="">4</div><div class="" style="">edges.size() = 4</div><div class="" style="">Running test_create_tag ...</div><div class="" style="">edges.size() = 4</div><div class="" style="">rank = 2, vertice's global ids = 9<span class="" style="white-space:pre"> </span>10<span class="" style="white-space:pre"> </span>0<span class="" style="white-space:pre"> </span>9</div><div class="" style="">rank = 0, vertice's global ids = 1<span class="" style="white-space:pre"> </span>2<span class="" style="white-space:pre"> </span>3<span class="" style="white-space:pre"> </span>4</div><div class="" style="">rank = 1, vertice's global ids = 5<span class="" style="white-space:pre"> </span>6<span class="" style="white-space:pre"> </span>0<span class="" style="white-space:pre"> </span>5</div><div
class="" style="">rank = 3, vertice's global ids = 13<span class="" style="white-space:pre"> </span>14<span class="" style="white-space:pre"> </span>0<span class="" style="white-space:pre"> </span>13</div><div class="" style=""><br></div><div class="" style="">Thank you if you are looking into the global id problems.</div><div class="" style=""><br class="" style=""></div></div><div class="" style="">Jane </div><div class="" style=""><br class="" style=""><br class="" style=""></div><div class="" style="">Asst. Researcher<br class="" style="">Dept. of Engineering Physics<br class="" style="">UW @ Madison</div><br class="" style=""><div class="" style=""><br class="" style=""></div><div class="" style="">"And we know that for those who love God, that is, for those who are called according to his purpose, all things are working together for good." (Romans 8:28)</div><div class="yahoo_quoted" style="display: block;"> <div style="font-family:
verdana, helvetica, sans-serif; font-size: 10pt;" class=""> <div style="font-family: HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif; font-size: 12pt;" class=""> <div dir="ltr" class="" style=""> <font size="2" face="Arial" class="" style=""> On Wednesday, May 21, 2014 9:49 AM, Vijay S. Mahadevan <vijay.m@gmail.com> wrote:<br class="" style=""> </font> </div> <br class="" style=""><br class="" style=""> <div class="" style=""><div id="yiv5710073571" class="" style=""><div class="" style=""><div dir="ltr" class="" style="">Jane,<div class="" style=""><br clear="none" class="" style=""></div><div class="" style=""><div style="color:rgb(0,0,0);font-family:verdana, helvetica, sans-serif;font-size:13px;background-color:transparent;" class="">// Global Ids
</div>
<div style="color:rgb(0,0,0);font-family:verdana, helvetica, sans-serif;font-size:13px;background-color:transparent;" class=""> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14 </div><div style="color:rgb(0,0,0);font-family:verdana, helvetica, sans-serif;font-size:13px;background-color:transparent;" class="">
// | | | | | </div><div style="color:rgb(0,0,0);font-family:verdana, helvetica, sans-serif;font-size:13px;background-color:transparent;" class=""> // | | | | | </div>
<div style="color:rgb(0,0,0);font-family:verdana, helvetica, sans-serif;font-size:13px;background-color:transparent;" class=""> // | | | | | </div><div style="color:rgb(0,0,0);font-family:verdana, helvetica, sans-serif;font-size:13px;background-color:transparent;" class="">
// 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13</div><div style="color:rgb(0,0,0);font-family:verdana, helvetica, sans-serif;font-size:13px;background-color:transparent;" class=""> // 0.0 1.0 2.0 3.0 4.0 </div>
</div><div class="" style=""><br clear="none" class="" style=""></div><div class="" style="">This is certainly a bug and should not happen to start with. If you have this test case available, do send it to the list so that we can find out the actual reason for the misnumbering. Anton's issue might or might not be directly related to this bug, but his reference to the GID being discontinuous shows that there is an outstanding issue here.</div>
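<div class="" style="">For reference, the rank output reported in this thread starts at 1, 5, 9 and 13 on ranks 0 to 3. That is the pattern one would get if every rank counted all four vertices of its quad, shared ones included, and took its starting ID from a running sum of those counts. The following is only a minimal sketch of that arithmetic under this assumption; first_global_id_per_rank and the per-rank counts are hypothetical and are not taken from MOAB's implementation.</div>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a MOAB function): given how many vertices each
// rank counts as its own, return the first global ID each rank would hand
// out so that IDs start at 1. This is an exclusive prefix sum shifted by 1.
std::vector<int> first_global_id_per_rank(const std::vector<int>& counts) {
    std::vector<int> first(counts.size());
    int next = 1;  // global IDs in this thread start at 1
    for (std::size_t rank = 0; rank < counts.size(); ++rank) {
        first[rank] = next;      // this rank numbers its vertices from here
        next += counts[rank];    // the following rank starts after them
    }
    return first;
}
```

<div class="" style="">With counts {4, 4, 4, 4} the sketch yields starting IDs 1, 5, 9, 13, which matches the reported numbering. With counts {4, 2, 2, 2} (each shared vertex counted once, assuming the lower rank owns it) it yields 1, 5, 7, 9, which covers 1 to 10 without gaps.</div>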
<div class="" style=""><br clear="none" class="" style=""></div><div class="" style="">Vijay<br clear="none" class="" style=""><br clear="none" class="" style=""><div class="" id="yiv5710073571yqtfd95608" style=""><div class="" style="">On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <span dir="ltr" class="" style=""><<a rel="nofollow" shape="rect" ymailto="mailto:jiangtao_ma@yahoo.com" target="_blank" href="mailto:jiangtao_ma@yahoo.com" class="" style="">jiangtao_ma@yahoo.com</a>></span> wrote:<br clear="none" class="" style="">
<blockquote class="" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex;"><div class="" style=""><div style="color:rgb(0,0,0);background-color:rgb(255,255,255);font-family:verdana, helvetica, sans-serif;font-size:10pt;" class="">
<div class="" style=""><div class="" style=""><div style="color:rgb(0,0,0);background-color:rgb(255,255,255);font-family:verdana, helvetica, sans-serif;font-size:10pt;" class=""><div class="" style=""><span class="" style="">In a unit test for my project, I am trying to use ParallelComm::assign_global_id() to get the global ids for the vertices of a simple grid</span></div>
<div style="background-color:transparent;" class=""> // Mesh Ids
</div><div style="background-color:transparent;" class=""> // 2.0 6 ------- 7 -------- 8 -------- 9 --------- 10
</div><div style="background-color:transparent;" class=""> // | | | | | </div><div style="background-color:transparent;" class=""> // | 1 | 2 | 3 | 4 | </div>
<div style="background-color:transparent;" class=""> // |
| |
| | </div><div style="background-color:transparent;" class=""> // 1.0 1 ------- 2 -------- 3 -------- 4 --------- 5</div><div style="background-color:transparent;" class=""><span class="" style=""></span></div>
<div style="background-color:transparent;" class=""> // 0.0 1.0 2.0 3.0 4.0 </div><div style="background-color:transparent;" class=""><br clear="none" class="" style=""></div><div style="background-color:transparent;color:rgb(0,0,0);font-size:13px;font-family:verdana, helvetica, sans-serif;font-style:normal;" class="">
and got the following global ids:</div><div style="background-color:transparent;" class="">// Global Ids </div><div style="background-color:transparent;" class=""> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
</div><div style="background-color:transparent;" class=""> // | |
| | | </div><div style="background-color:transparent;" class=""> // | | | | | </div><div style="background-color:transparent;" class="">
// | | | | |
</div><div style="background-color:transparent;" class=""> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13</div><div style="background-color:transparent;" class=""> // 0.0 1.0 2.0 3.0 4.0 </div>
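<div class="" style="">One quick way to see the problem in the numbering above is to list the IDs that are missing from the range 1 to the largest ID. A minimal sketch in plain C++ follows; missing_global_ids is a hypothetical helper (not part of MOAB), and treating an ID of 0 as "unassigned" is an assumption.</div>

```cpp
#include <set>
#include <vector>

// Hypothetical helper: return every ID in [1, max(ids)] that never appears
// in `ids`. IDs <= 0 are assumed to mean "unassigned" and are ignored.
std::vector<int> missing_global_ids(const std::vector<int>& ids) {
    std::set<int> seen;
    for (int id : ids) {
        if (id > 0) seen.insert(id);
    }

    std::vector<int> missing;
    if (seen.empty()) return missing;

    const int max_id = *seen.rbegin();  // largest assigned ID
    for (int id = 1; id <= max_id; ++id) {
        if (seen.count(id) == 0) missing.push_back(id);
    }
    return missing;
}
```

<div class="" style="">For the ten vertex IDs shown in the grid above (4, 3, 6, 10, 14, 1, 2, 5, 9, 13) this reports 7, 8, 11 and 12 as missing.</div>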
<div style="background-color:transparent;" class=""><br clear="none" class="" style=""></div><div style="background-color:transparent;color:rgb(0,0,0);font-size:13px;font-family:verdana, helvetica, sans-serif;font-style:normal;" class="">I don't know if this is what
Anton ran into. </div><div style="background-color:transparent;color:rgb(0,0,0);font-size:13px;font-family:verdana, helvetica, sans-serif;font-style:normal;" class=""><br clear="none" class="" style=""></div><div style="background-color:transparent;color:rgb(0,0,0);font-size:13px;font-family:verdana, helvetica, sans-serif;font-style:normal;" class="">
Iulian, if you are interested in seeing the test, please let me know, and I'll send it to you.</div><div class="" style=""></div><div class="" style=""> </div><div class="" style="">Jane </div><div class="" style=""><br clear="none" class="" style=""><br clear="none" class="" style=""></div><div class="" style="">Asst. Researcher<br clear="none" class="" style="">
Dept. of Engineering Physics<br clear="none" class="" style="">
UW @ Madison</div><br clear="none" class="" style=""><div class="" style=""><br clear="none" class="" style=""></div><div class="" style="">"And we know that for those who love God, that is, for those who are called according to his purpose, all things are working together for
good." (Romans 8:28)</div><div class="" style=""><div class="" style=""><div class="" style=""><div style="display:block;" class=""> <div style="font-family:verdana, helvetica, sans-serif;font-size:10pt;" class=""> <div style="font-family:HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif;font-size:12pt;" class="">
<div dir="ltr" class="" style=""> <font face="Arial" class="" style=""> On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R." <<a rel="nofollow" shape="rect" ymailto="mailto:iulian@mcs.anl.gov" target="_blank" href="mailto:iulian@mcs.anl.gov" class="" style="">iulian@mcs.anl.gov</a>> wrote:<br clear="none" class="" style=""> </font> </div>
<br clear="none" class="" style=""><br clear="none" class="" style=""> <div class="" style=""><div class="" style=""><div class="" style="">
<div style="direction:ltr;font-family:Tahoma;color:rgb(0,0,0);font-size:10pt;" class="">hmmm,<br clear="none" class="" style="">
this is not good :(<br clear="none" class="" style="">
<br clear="none" class="" style="">
Are you running this on Mira? Do you have a small file for a laptop/workstation?<br clear="none" class="" style="">
Maybe I can create a similar one.<br clear="none" class="" style="">
<br clear="none" class="" style="">
Do you see this only on 1024 processes, or does it also happen at a lower count?<br clear="none" class="" style="">
<br clear="none" class="" style="">
What does your model look like?<br clear="none" class="" style="">
No processor should communicate with more than 64 other processes; maybe this number is reached after ghosting.<br clear="none" class="" style="">
<br clear="none" class="" style="">
Can you run a debug version of this? Maybe some asserts are not triggered in optimized mode.<br clear="none" class="" style="">
<br clear="none" class="" style="">
Is your file somewhere on Mira where I can get to it? <br clear="none" class="" style="">
<br clear="none" class="" style="">
Iulian<br clear="none" class="" style="">
<br clear="none" class="" style="">
<div class="" style=""><div style="font-family:'Times New Roman';color:rgb(0,0,0);font-size:16px;" class="">
<hr class="" style="">
<div style="direction:ltr;" class=""><font color="#000000" face="Tahoma" class="" style=""><b class="" style="">From:</b> <a rel="nofollow" shape="rect" ymailto="mailto:moab-dev-bounces@mcs.anl.gov" target="_blank" href="mailto:moab-dev-bounces@mcs.anl.gov" class="" style="">moab-dev-bounces@mcs.anl.gov</a> [<a rel="nofollow" shape="rect" ymailto="mailto:moab-dev-bounces@mcs.anl.gov" target="_blank" href="mailto:moab-dev-bounces@mcs.anl.gov" class="" style="">moab-dev-bounces@mcs.anl.gov</a>] on behalf of <a rel="nofollow" shape="rect" ymailto="mailto:kanaev@ibrae.ac.ru" target="_blank" href="mailto:kanaev@ibrae.ac.ru" class="" style="">kanaev@ibrae.ac.ru</a> [<a rel="nofollow" shape="rect" ymailto="mailto:kanaev@ibrae.ac.ru" target="_blank" href="mailto:kanaev@ibrae.ac.ru" class="" style="">kanaev@ibrae.ac.ru</a>]<br clear="none" class="" style="">
<b class="" style="">Sent:</b> Tuesday, May 20, 2014 5:05 PM<br clear="none" class="" style="">
<b class="" style="">To:</b> MOAB dev<br clear="none" class="" style="">
<b class="" style="">Subject:</b> Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]<br clear="none" class="" style="">
</font><br clear="none" class="" style="">
</div>
<div class="" style=""></div>
<div class="" style="">
<div class="" style="">The problem is still here </div>
<div class="" style="">I've made a simple program that performs a number of exchange_tags calls within a loop
</div>
<div class="" style="">If you run it on several processors with any mesh file it will eventually crash with the following message from every core:
</div>
<div class="" style=""> </div>
<div class="" style="">Fatal error in PMPI_Isend: Internal MPI error!, error stack: </div>
<div class="" style="">PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR, dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
</div>
<div class="" style="">(unknown)(): Internal MPI error! </div>
<div class="" style=""> </div>
<div class="" style="">Thanks </div>
<div class="" style="">Anton </div>
<div class="" style=""><br clear="none" class="" style="">
</div>
<div class="" style="">On Tue, 20 May 2014 04:40:03 -0400, wrote: </div>
<blockquote style="padding-left:5px;border-left-color:rgb(16,16,255);border-left-width:2px;border-left-style:solid;margin-left:5px;width:100%;" class="">
<div class="" style="">Please disregard that; the global_id space for Quads was discontinuous in my mesh file
</div>
<div class="" style="">Will check back with the correct mesh </div>
<div class="" style="">Anton<br clear="none" class="" style="">
<br clear="none" class="" style="">
On Mon, 19 May 2014 15:17:14 -0400, wrote: </div>
<blockquote style="padding-left:5px;border-left-color:rgb(16,16,255);border-left-width:2px;border-left-style:solid;margin-left:5px;width:100%;" class="">
<div class="" style="">Hello MOAB dev, </div>
<div class="" style="">I've attached a simplified version of my program that crashes, presumably after a particular number of exchange_tags calls
</div>
<div class="" style="">I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16) </div>
<div class="" style="">It performs around 524378 iterations and then crashes (error file attached) </div>
<div class="" style="">Can you please take a look at what Scott Parker from ALCF suggests about it: </div>
<div class="" style="">-------- Original Message -------- </div>
<table border="0" cellpadding="0" cellspacing="0" class="" style=""><tbody class="" style=""><tr class="" style=""><th align="right" colspan="1" rowspan="1" valign="baseline" class="" style="">Subject: </th><td colspan="1" rowspan="1" class="" style="">Re: Job exiting early [ThermHydraX]</td></tr><tr class="" style=""><th align="right" colspan="1" rowspan="1" valign="baseline" class="" style="">
Date: </th><td colspan="1" rowspan="1" class="" style="">Fri, 9 May 2014 18:48:25 -0500</td></tr><tr class="" style=""><th align="right" colspan="1" rowspan="1" valign="baseline" class="" style="">From: </th><td colspan="1" rowspan="1" class="" style="">Scott Parker </td></tr><tr class="" style=""><th align="right" colspan="1" rowspan="1" valign="baseline" class="" style="">
To: </th><td colspan="1" rowspan="1" class="" style=""> </td></tr></tbody></table>
<br clear="none" class="" style="">
<br clear="none" class="" style="">
Anton-<br clear="none" class="" style="">
<br clear="none" class="" style="">
I took a look at the core files and from the stack trace it appears that the code is failing in an MPI_Isend call<br clear="none" class="" style="">
that is called from moab.ParallelComm::recv_buffer which is called from moab::ParallelComm::exchange_tags<br clear="none" class="" style="">
called from main(). The stack looks like:<br clear="none" class="" style="">
<br clear="none" class="" style="">
main<br clear="none" class="" style="">
moab::ParallelComm::exchange_tags<br clear="none" class="" style="">
moab.ParallelComm::recv_buffer<br clear="none" class="" style="">
MPI_Isend<br clear="none" class="" style="">
<br clear="none" class="" style="">
I've been able to get the same failure and error message you are seeing by having an MPI process call MPI_Isend<br clear="none" class="" style="">
when there are no matching receives. After 2 million Isends the program exits with the error you are seeing. So<br clear="none" class="" style="">
I'm pretty sure you're ending up with a large number of outstanding requests and the program is failing because<br clear="none" class="" style="">
it can't allocate space for new MPI_Request objects.<br clear="none" class="" style="">
<br clear="none" class="" style="">
I'd suggest looking at how Moab is calling MPI and how many requests might be outstanding at any one time.<br clear="none" class="" style="">
Since the code is running for 5 hours and looks to be executing hundreds of thousands of iterations I wonder<br clear="none" class="" style="">
if there is some sort of send-receive mismatch that is letting requests accumulate. I think your best bet is to<br clear="none" class="" style="">
talk to the Moab folks and see if they have any ideas about why this might be happening.
<br clear="none" class="" style="">
<br clear="none" class="" style="">
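<div class="" style="">The failure mode described above can be illustrated with a toy model in plain C++. This is a sketch under stated assumptions, not MPI and not MOAB code: RequestPool, its capacity, and first_failing_iteration are hypothetical names, and the real limit on outstanding MPI_Request objects depends on the MPI implementation.</div>

```cpp
#include <cstddef>

// Toy model of a fixed pool of request slots (this is not MPI itself).
// `capacity` is a hypothetical limit on simultaneously outstanding requests.
class RequestPool {
public:
    explicit RequestPool(std::size_t capacity) : capacity_(capacity) {}

    // Models posting a nonblocking send: takes one slot, fails if none is left.
    bool post() {
        if (outstanding_ == capacity_) return false;
        ++outstanding_;
        return true;
    }

    // Models completing one request (a matching receive plus a wait).
    void complete() {
        if (outstanding_ > 0) --outstanding_;
    }

private:
    std::size_t capacity_;
    std::size_t outstanding_ = 0;
};

// Post one request per iteration. If `complete_each_step` is false, requests
// are never completed and pile up. Returns the 1-based iteration at which
// posting first fails, or 0 if every post succeeded.
std::size_t first_failing_iteration(std::size_t capacity,
                                    std::size_t iterations,
                                    bool complete_each_step) {
    RequestPool pool(capacity);
    for (std::size_t i = 1; i <= iterations; ++i) {
        if (!pool.post()) return i;
        if (complete_each_step) pool.complete();
    }
    return 0;
}
```

<div class="" style="">In this model a run that never completes its requests fails at iteration capacity + 1 regardless of what else the loop does, while a run that completes one request per iteration never fails. That would be consistent with a crash that always occurs at the same iteration count.</div>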
One possibility is a load imbalance between processes - if you don't have any MPI_Barriers or other collectives in<br clear="none" class="" style="">
your code you could try adding a barrier to synchronize the processes.<br clear="none" class="" style="">
<br clear="none" class="" style="">
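<div class="" style="">The effect of a periodic barrier on such an imbalance can be sketched with another toy model in plain C++. The function peak_backlog and all of its parameters are hypothetical, and the model assumes that a synchronization point lets the slower side fully catch up, which a real MPI_Barrier does not guarantee by itself.</div>

```cpp
#include <algorithm>
#include <cstddef>

// Toy model: each step the faster process posts `sent_per_step` messages and
// the slower one matches at most `matched_per_step`. Every `sync_every` steps
// (0 means never) a barrier-like synchronization is assumed to let the slower
// side catch up, draining the backlog. Returns the peak number of unmatched
// messages seen during the run.
std::size_t peak_backlog(std::size_t steps, std::size_t sent_per_step,
                         std::size_t matched_per_step, std::size_t sync_every) {
    std::size_t backlog = 0;
    std::size_t peak = 0;
    for (std::size_t step = 1; step <= steps; ++step) {
        backlog += sent_per_step;
        backlog -= std::min(backlog, matched_per_step);
        peak = std::max(peak, backlog);
        if (sync_every != 0 && step % sync_every == 0) backlog = 0;
    }
    return peak;
}
```

<div class="" style="">With 2 messages sent and 1 matched per step, the backlog grows to 1000 after 1000 steps when there is no synchronization, but peaks at 10 when the processes synchronize every 10 steps.</div>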
If the Moab guys can't help you and adding a barrier doesn't help I can work with you to instrument the code to<br clear="none" class="" style="">
collect more information on how MPI is being called and we could possibly pin down the source of the problem<br clear="none" class="" style="">
that way.<br clear="none" class="" style="">
<br clear="none" class="" style="">
<br clear="none" class="" style="">
Scott<br clear="none" class="" style="">
<br clear="none" class="" style="">
<br clear="none" class="" style="">
<div class="" style="">On 5/2/14, 11:14 PM, <a rel="nofollow" shape="rect" ymailto="mailto:kanaev@ibrae.ac.ru" target="_blank" href="mailto:kanaev@ibrae.ac.ru" class="" style="">
kanaev@ibrae.ac.ru</a> wrote:<br clear="none" class="" style="">
</div>
<blockquote class="" style="">
<div class="" style="">Hello Scott </div>
<div class="" style="">The dir is cd /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit </div>
<div class="" style="">The run that produced core files is 253035 </div>
<div class="" style=""> </div>
<div class="" style="">I did another run with the line MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out, and it stopped at the very same time iteration, #524378; it just got a few lines further
</div>
<div class="" style=""> </div>
<div class="" style="">I use the MOAB library and its functions for exchanging data between processors, so I think I cannot really count MPI requests
</div>
<div class="" style=""> </div>
<div class="" style="">Anton </div>
<div class="" style=""><br clear="none" class="" style="">
On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote: </div>
<blockquote style="padding-left:5px;border-left-color:rgb(16,16,255);border-left-width:2px;border-left-style:solid;margin-left:5px;width:100%;" class="">
<br clear="none" class="" style="">
Can you point me to the directory where your binary and core files are?<br clear="none" class="" style="">
<br clear="none" class="" style="">
The stack trace you sent shows a call to MPI_Waitany, do you know how many MPI requests<br clear="none" class="" style="">
the code generally has outstanding at any time?<br clear="none" class="" style="">
<br clear="none" class="" style="">
-Scott<br clear="none" class="" style="">
<br clear="none" class="" style="">
<div class="" style="">On 4/28/14, 4:30 PM, <a rel="nofollow" shape="rect" ymailto="mailto:kanaev@ibrae.ac.ru" target="_blank" href="mailto:kanaev@ibrae.ac.ru" class="" style="">
kanaev@ibrae.ac.ru</a> wrote:<br clear="none" class="" style="">
</div>
<blockquote class="" style="">
<div class="" style="">Hello Scott, </div>
<div class="" style="">I reran with the mentioned keys. The code was freshly compiled with the attached makefile, just in case.
</div>
<div class="" style="">I've got 1024 core files. Two of them are attached. </div>
<div class="" style="">I ran <span style="font-family:'Lucida Grande', Verdana, Arial, Helvetica, sans-serif;" class="">bgq_stack on core.0 and here's what I got:</span>
</div>
<div class="" style=""><span style="font-family:'Lucida Grande', Verdana, Arial, Helvetica, sans-serif;" class=""></span> [akanaev@miralac1 pinit]$bgq_stack pinit core.0
</div>
<div class="" style="">------------------------------------------------------------------------ </div>
<div class="" style="">Program : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit </div>
<div class="" style="">------------------------------------------------------------------------ </div>
<div class="" style="">+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN </div>
<div class="" style=""> </div>
<div class="" style="">00000000018334c0 </div>
<div class="" style="">_ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm </div>
<div class="" style="">/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
</div>
<div class="" style=""> </div>
<div class="" style="">000000000170da28 </div>
<div class="" style="">PAMI_Context_trylock_advancev </div>
<div class="" style="">/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554 </div>
<div class="" style=""> </div>
<div class="" style="">000000000155d0dc </div>
<div class="" style="">PMPI_Waitany </div>
<div class="" style="">/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
</div>
<div class="" style=""> </div>
<div class="" style="">00000000010e84e4 </div>
<div class="" style="">00000072.long_branch_r2off.H5Dget_space+0 </div>
<div class="" style="">:0 </div>
<div class="" style=""> </div>
<div class="" style="">0000000001042be0 </div>
<div class="" style="">00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
</div>
<div class="" style="">:0 </div>
<div class="" style=""> </div>
<div class="" style="">00000000019de058 </div>
<div class="" style="">generic_start_main </div>
<div class="" style="">/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
</div>
<div class="" style=""> </div>
<div class="" style="">00000000019de354 </div>
<div class="" style="">__libc_start_main </div>
<div class="" style="">/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
</div>
<div class="" style=""> </div>
<div class="" style="">0000000000000000 </div>
<div class="" style="">?? </div>
<div class="" style="">??:0 </div>
<div class="" style=""><br clear="none" class="" style="">
</div>
<div class="" style="">> Have these sort of runs succeeded in the past using the same code base with no changes and similar input data?
</div>
<div class="" style="">This is the first time I'm trying to run this code for that long </div>
<div class="" style=""> </div>
<div class="" style="">Thanks </div>
<div class="" style="">Anton </div>
<div class="" style=""><br clear="none" class="" style="">
On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote: </div>
<blockquote style="padding-left:5px;border-left-color:rgb(16,16,255);border-left-width:2px;border-left-style:solid;margin-left:5px;width:1109px;" class="">
<br clear="none" class="" style="">
Anton-<br clear="none" class="" style="">
<br clear="none" class="" style="">
Thanks, it's aborting because of a runtime error that appears to be in the mpich layer. <br clear="none" class="" style="">
<br clear="none" class="" style="">
Can you rerun with "--env BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1" added to your qsub line - that should<br clear="none" class="" style="">
generate some core files on which you can run bgq_stack.<br clear="none" class="" style="">
<br clear="none" class="" style="">
The system software (driver) on Mira was updated this week and I'd like to get a clearer picture of <br clear="none" class="" style="">
whether that could be related to your problem, so<br clear="none" class="" style="">
<br clear="none" class="" style="">
Has your code been recompiled since Monday? If not, can you recompile and try running again?<br clear="none" class="" style="">
<br clear="none" class="" style="">
Have these sort of runs succeeded in the past using the same code base with no changes and similar input data?<br clear="none" class="" style="">
<br clear="none" class="" style="">
-Scott<br clear="none" class="" style="">
<br clear="none" class="" style="">
<br clear="none" class="" style="">
<br clear="none" class="" style="">
<div class="" style="">On 4/24/14, 2:59 PM, <a rel="nofollow" shape="rect" ymailto="mailto:kanaev@ibrae.ac.ru" target="_blank" href="mailto:kanaev@ibrae.ac.ru" class="" style="">kanaev@ibrae.ac.ru</a> wrote:<br clear="none" class="" style="">
</div>
<blockquote class="" style="">
<div class="" style="">Sorry about the attached files, here they are </div>
<div class="" style="">There are no core files after exiting. It looks like the job stopped because the requested time expired, but you can see from the cobaltlog that only about 5 hours had passed (10 hours were requested) before the exit
</div>
<div class="" style="">Anton </div>
<div class="" style="">On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker wrote: </div>
<blockquote style="padding-left:5px;border-left-color:rgb(16,16,255);border-left-width:2px;border-left-style:solid;margin-left:5px;width:1029px;" class="">
<br clear="none" class="" style="">
Anton-<br clear="none" class="" style="">
<br clear="none" class="" style="">
Please send these emails to <a rel="nofollow" shape="rect" ymailto="mailto:support@alcf.anl.gov" target="_blank" href="mailto:support@alcf.anl.gov" class="" style="">support@alcf.anl.gov</a> as I may not always be available to investigate.<br clear="none" class="" style="">
<br clear="none" class="" style="">
I don't see the cobalt or error files attached, so I can't really say anything about why your job may be<br clear="none" class="" style="">
failing. Do you get core files when the job crashes? If so I'd recommend using 'bgq_stack '<br clear="none" class="" style="">
to try and get the file and line number where the failure occurred. Knowing the line may be enough<br clear="none" class="" style="">
to let you figure it out, if not you'll need to dump the values of the variables at the time of the crash<br clear="none" class="" style="">
to get a clearer picture of what is going on.<br clear="none" class="" style="">
<br clear="none" class="" style="">
Scott<br clear="none" class="" style="">
<br clear="none" class="" style="">
<div class="" style="">On 4/24/14, 1:36 PM, <a rel="nofollow" shape="rect" ymailto="mailto:kanaev@ibrae.ac.ru" target="_blank" href="mailto:kanaev@ibrae.ac.ru" class="" style="">kanaev@ibrae.ac.ru</a> wrote:<br clear="none" class="" style="">
</div>
<blockquote class="" style="">
<div class="" style="">Hello Scott, </div>
<div class="" style="">I've tried twice to run a 10-hour, 1024-core job on Mira in mode c16 with </div>
<div class="" style="">qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit Lid_128x128x1p1024.h5m </div>
<div class="" style="">Both times the job exited earlier than expected, on the same iteration and after the same error, while executing the following section (it's between two couts):
</div>
<div class="" style=""> ... </div>
<div class="" style="">//LOOP OVER OWNED CELLS </div>
<div class="" style=""> double r = 0; </div>
<div class="" style=""> for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
</div>
<div class="" style=""> EntityHandle ent = *it; </div>
<div class="" style=""> int cellid = mb->id_from_handle(ent); </div>
<div class="" style=""> double Vol_; </div>
<div class="" style=""> double u_; </div>
<div class="" style=""> double v_; </div>
<div class="" style=""> double w_; </div>
<div class="" style=""> double r1,r2,r3; </div>
<div class="" style=""> double tmp; </div>
<div class="" style=""> </div>
<div class="" style=""> result = mb->tag_get_data(u, &ent, 1, &u_); </div>
<div class="" style=""> PRINT_LAST_ERROR; </div>
<div class="" style=""> result = mb->tag_get_data(v, &ent, 1, &v_); </div>
<div class="" style=""> PRINT_LAST_ERROR; </div>
<div class="" style=""> result = mb->tag_get_data(w, &ent, 1, &w_); </div>
<div class="" style=""> PRINT_LAST_ERROR; </div>
<div class="" style=""> </div>
<div class="" style=""> double result; </div>
<div class="" style=""> SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
</div>
<div class="" style=""> </div>
<div class="" style=""> r1 = (sound + fabs(result))/CG[cellid][2].length; </div>
<div class="" style=""> </div>
<div class="" style=""> SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
</div>
<div class="" style=""> </div>
<div class="" style=""> r2 = (sound + fabs(result))/CG[cellid][3].length; </div>
<div class="" style=""> </div>
<div class="" style=""> SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
</div>
<div class="" style=""> </div>
<div class="" style=""> r3 = (sound + fabs(result))/CG[cellid][5].length; </div>
<div class="" style=""> </div>
<div class="" style=""> tmp = MAX3(r1,r2,r3); </div>
<div class="" style=""> </div>
<div class="" style=""> r = MAX(tmp,r); </div>
<div class="" style=""> } </div>
<div class="" style=""> </div>
<div class="" style=""> double rmax; </div>
<div class="" style=""> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); </div>
<div class="" style=""> tau = CFL/rmax; </div>
<div class="" style=""> ttime+=tau; </div>
<div class="" style="">... </div>
<div class="" style=""> </div>
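<div class="" style="">The time-step arithmetic in the section above can be sketched as standalone C++ functions. This is only an illustration under assumptions: the SCALAR_PRODUCT macro is not shown in the snippet and is assumed here to be an ordinary dot product, and Vec3, dot, direction_rate and time_step are hypothetical names.</div>

```cpp
#include <cmath>

// Hypothetical 3-component vector used only for this sketch.
struct Vec3 {
    double x, y, z;
};

// Assumed meaning of SCALAR_PRODUCT in the snippet: an ordinary dot product.
double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Rate for one direction: (sound + |velocity . direction|) / length,
// mirroring the r1, r2 and r3 expressions in the snippet.
double direction_rate(double sound, const Vec3& velocity,
                      const Vec3& direction, double length) {
    return (sound + std::fabs(dot(velocity, direction))) / length;
}

// Time step from the largest rate over all cells: tau = CFL / rmax.
double time_step(double cfl, double rmax) {
    return cfl / rmax;
}
```

<div class="" style="">In the snippet, the largest of the three per-direction rates is taken per cell, the maximum over owned cells is accumulated in r, and MPI_Allreduce with MPI_MAX turns that into the global rmax that is used for tau.</div>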
<div class="" style="">So it may be Allreduce </div>
<div class="" style="">I've attached cobaltlog and error files of both runs </div>
<div class="" style="">Can you please take a look and suggest further debugging steps? </div>
<div class="" style=""> </div>
<div class="" style="">Thanks </div>
<div class="" style="">Anton </div>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br clear="none" class="" style="">
</blockquote>
<div class="" style=""> </div>
</blockquote>
<br clear="none" class="" style="">
</blockquote>
<div class="" style=""> </div>
</blockquote>
<div class="" style=""> </div>
</div>
</div></div>
</div>
</div></div><br clear="none" class="" style=""><br clear="none" class="" style=""></div> </div> </div> </div></div> </div></div></div></div></div></div></div></blockquote></div><br clear="none" class="" style=""></div></div></div></div></div><br class="" style=""><br class="" style=""></div> </div> </div> </div> </div></body></html>