<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>

</head>

<body ocsi="0" fpstyle="1">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">hmmm,<br>

this is not good :(<br>

<br>

Are you running this on Mira? Do you have a small file that runs on a laptop/workstation?<br>

Maybe I can create one similar.<br>

<br>

Do you see this only on 1024 processes, or does it also happen at lower counts?<br>

<br>

What does your model look like?<br>

No processor should communicate with more than 64 other processes; maybe this number is reached after ghosting.<br>

<br>

Can you run a debug version of this? Maybe some asserts are not triggered in optimized mode.<br>

<br>

Is your file somewhere on Mira where I can get to it? <br>

<br>

Iulian<br>

<br>

<div style="font-family: Times New Roman; color: #000000; font-size: 16px">

<hr tabindex="-1">

<div style="direction: ltr;" id="divRpF370222"><font color="#000000" face="Tahoma" size="2"><b>From:</b> moab-dev-bounces@mcs.anl.gov [moab-dev-bounces@mcs.anl.gov] on behalf of kanaev@ibrae.ac.ru [kanaev@ibrae.ac.ru]<br>

<b>Sent:</b> Tuesday, May 20, 2014 5:05 PM<br>

<b>To:</b> MOAB dev<br>

<b>Subject:</b> Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]<br>

</font><br>

</div>

<div></div>

<div>

<p>The problem is still here </p>

<p>I've made a simple program that performs a number of exchange_tags calls in a loop.

</p>

<p>If you run it on several processors with any mesh file it will eventually crash with the following message from every core:

</p>

<p>  </p>

<p>Fatal error in PMPI_Isend: Internal MPI error!, error stack: </p>

<p>PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR, dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed

</p>

<p>(unknown)(): Internal MPI error! </p>

<p>  </p>

<p>Thanks </p>

<p>Anton  </p>

<div><br>

</div>

<p>On Tue, 20 May 2014 04:40:03 -0400, wrote: </p>

<blockquote style="padding-left:5px; border-left-color:#1010ff; border-left-width:2px; border-left-style:solid; margin-left:5px; width:100%">

<p>Please disregard that; the global_id space for Quads was not contiguous in my mesh file.

</p>

<p>Will check back with the correct mesh. </p>

<p>Anton<br>

<br>

On Mon, 19 May 2014 15:17:14 -0400, wrote: </p>

<blockquote style="padding-left:5px; border-left-color:#1010ff; border-left-width:2px; border-left-style:solid; margin-left:5px; width:100%">

<p>Hello MOAB dev, </p>

<p>I've attached a simplified version of my program that crashes, presumably after a particular number of exchange_tags calls.

</p>

<p>I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16). </p>

<p>It performs around 524378 iterations and then crashes (error file attached). </p>

<p>Can you please take a look at what Scott Parker from ALCF suggests about it:  </p>

<p>-------- Original Message -------- </p>

<table border="0" cellpadding="0" cellspacing="0">

<tbody>

<tr>

<th align="right" valign="baseline">Subject: </th>

<td>Re: Job exiting early [ThermHydraX]</td>

</tr>

<tr>

<th align="right" valign="baseline">Date: </th>

<td>Fri, 9 May 2014 18:48:25 -0500</td>

</tr>

<tr>

<th align="right" valign="baseline">From: </th>

<td>Scott Parker </td>

</tr>

<tr>

<th align="right" valign="baseline">To: </th>

<td> </td>

</tr>

</tbody>

</table>

<br>

<br>

Anton-<br>

<br>

I took a look at the core files and from the stack trace it appears that the code is failing in an MPI_Isend call<br>

that is called from moab::ParallelComm::recv_buffer which is called from moab::ParallelComm::exchange_tags<br>

called from main(). The stack looks like:<br>

   <br>

  main<br>

    moab::ParallelComm::exchange_tags<br>

      moab.ParallelComm::recv_buffer<br>

        MPI_Isend<br>

<br>

I've been able to get the same failure and error message you are seeing by having an MPI process call MPI_Isend<br>

when there are no matching receives. After 2 million Isends the program exits with the error you are seeing. So<br>

I'm pretty sure you're ending up with a large number of outstanding requests and the program is failing because<br>

it can't allocate space for new MPI_Request objects.<br>
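<p>The failure mode described above can be illustrated without MPI at all. The following is a minimal sketch, assuming a toy <code>RequestPool</code> class with a made-up capacity; it is not the MPICH or PAMI implementation, and <code>post</code>/<code>complete</code> only stand in for posting a nonblocking send and for completing it with a wait or test call.</p>

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a library's finite pool of request objects.
// Hypothetical: the real MPI implementation on Mira is not modeled here.
class RequestPool {
public:
    explicit RequestPool(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Models posting a nonblocking send: takes one slot, fails when none are left.
    bool post() {
        if (in_use_ == capacity_) return false;  // pool exhausted
        ++in_use_;
        return true;
    }

    // Models completing one request: its slot goes back to the pool.
    void complete() {
        if (in_use_ > 0) --in_use_;
    }

    std::size_t in_use() const { return in_use_; }

private:
    std::size_t capacity_;
    std::size_t in_use_ = 0;
};

// Posts up to `iterations` sends, completing each one only if `complete_each`
// is true. Returns the iteration at which posting first failed, or
// `iterations` if it never failed.
std::size_t run_until_failure(RequestPool& pool, std::size_t iterations,
                              bool complete_each) {
    for (std::size_t i = 0; i < iterations; ++i) {
        if (!pool.post()) return i;
        if (complete_each) pool.complete();
    }
    return iterations;
}
```

<p>With a capacity of 1000, a loop that never completes its requests fails at iteration 1000, while a loop that completes every request runs for any number of iterations. A job that leaks even one request per time step would fail in the same way, only much later.</p>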

<br>

I'd suggest looking at how Moab is calling MPI and how many requests might be outstanding at any one time.<br>

Since the code is running for 5 hours and looks to be executing hundreds of thousands of iterations I wonder<br>

if there is some sort of send-receive mismatch that is letting requests accumulate. I think your best bet is to<br>

talk to the Moab folks and see if they have any ideas about why this might be happening.

<br>

<br>

One possibility is a load imbalance between processes - if you don't have any MPI_Barriers or other collectives in<br>

your code you could try adding a barrier to synchronize the processes.<br>

<br>

If the Moab guys can't help you and adding a barrier doesn't help I can work with you to instrument the code to<br>

collect more information on how MPI is being called and we could possibly pin down the source of the problem<br>

that way.<br>
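<p>One way such instrumentation might look is sketched below, assuming a hypothetical <code>RequestLedger</code> helper that is not part of MPI or MOAB. The idea is to call <code>on_post</code> wherever a nonblocking operation is started and <code>on_complete</code> wherever it is waited on, then print the per-peer counts every few thousand iterations.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical bookkeeping helper: counts requests posted to each peer rank
// that have not been completed yet, plus the overall high-water mark.
class RequestLedger {
public:
    // Record that a nonblocking operation involving `peer` was started.
    void on_post(int peer) {
        assert(peer >= 0);
        ++outstanding_[peer];
        ++total_;
        if (total_ > high_water_) high_water_ = total_;
    }

    // Record that one operation involving `peer` was completed.
    // Completions with no matching post are ignored.
    void on_complete(int peer) {
        auto it = outstanding_.find(peer);
        if (it == outstanding_.end() || it->second == 0) return;
        --it->second;
        --total_;
    }

    // Requests to `peer` that are still outstanding.
    std::size_t outstanding(int peer) const {
        auto it = outstanding_.find(peer);
        return it == outstanding_.end() ? 0 : it->second;
    }

    std::size_t total() const { return total_; }
    std::size_t high_water() const { return high_water_; }

private:
    std::map<int, std::size_t> outstanding_;
    std::size_t total_ = 0;
    std::size_t high_water_ = 0;
};
```

<p>A per-peer count that grows steadily with the iteration number would point to a send-receive mismatch with that particular rank, whereas a bounded high-water mark would suggest the problem is elsewhere.</p>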

<br>

<br>

Scott<br>

<br>

<br>

<div class="moz-cite-prefix">On 5/2/14, 11:14 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated" target="_blank">

kanaev@ibrae.ac.ru</a> wrote:<br>

</div>

<blockquote>

<p>Hello Scott </p>

<p>The dir is cd /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit </p>

<div>The run that produced core files is 253035. </div>

<div>  </div>

<div>I did another run with the line MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out, and it stopped at the very same time iteration, #524378; it just got a few lines further.

</div>

<div>   </div>

<div>I use the MOAB library and its functions for exchanging data between processors, so I think I cannot really count MPI requests.

</div>

<div>  </div>

<div>Anton  </div>

<p><br>

On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote: </p>

<blockquote style="padding-left:5px; border-left-color:#1010ff; border-left-width:2px; border-left-style:solid; margin-left:5px; width:100%">

<br>

Can you point me to the directory where your binary and core files are?<br>

<br>

The stack trace you sent shows a call to MPI_Waitany, do you know how many MPI requests<br>

the code generally has outstanding at any time?<br>

<br>

-Scott<br>

<br>

<div class="moz-cite-prefix">On 4/28/14, 4:30 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated" target="_blank">

kanaev@ibrae.ac.ru</a> wrote:<br>

</div>

<blockquote>

<p>Hello Scott, </p>

<p>I reran with the flags you mentioned. The code was freshly compiled with the attached makefile, just in case.

</p>

<p>I've got 1024 core files. Two of them are attached. </p>

<p>I ran <span style="font-family:'Lucida Grande',Verdana,Arial,Helvetica,sans-serif">bgq_stack on core.0 and here's what I got:</span>

</p>

<p><span style="font-family:'Lucida Grande',Verdana,Arial,Helvetica,sans-serif"></span> [akanaev@miralac1 pinit]$bgq_stack pinit core.0

</p>

<p>------------------------------------------------------------------------ </p>

<p>Program   : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit </p>

<p>------------------------------------------------------------------------ </p>

<p>+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN </p>

<p>  </p>

<p>00000000018334c0 </p>

<p>_ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm </p>

<p>/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269

</p>

<p>  </p>

<p>000000000170da28 </p>

<p>PAMI_Context_trylock_advancev </p>

<p>/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554 </p>

<p>  </p>

<p>000000000155d0dc </p>

<p>PMPI_Waitany </p>

<p>/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239

</p>

<p>  </p>

<p>00000000010e84e4 </p>

<p>00000072.long_branch_r2off.H5Dget_space+0 </p>

<p>:0 </p>

<p>  </p>

<p>0000000001042be0 </p>

<p>00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0

</p>

<p>:0 </p>

<p>  </p>

<p>00000000019de058 </p>

<p>generic_start_main </p>

<p>/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

</p>

<p>  </p>

<p>00000000019de354 </p>

<p>__libc_start_main </p>

<p>/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

</p>

<p>  </p>

<p>0000000000000000 </p>

<p>?? </p>

<p>??:0 </p>

<div><br>

</div>

<p>>   Have these sort of runs succeeded in the past using the same code base with no changes and similar input data? 

</p>

<p>This is the first time I'm trying to run this code for that long. </p>

<p>  </p>

<p>Thanks </p>

<p>Anton  </p>

<p><br>

On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote: </p>

<blockquote style="padding-left:5px; border-left-color:#1010ff; border-left-width:2px; border-left-style:solid; margin-left:5px; width:1109px">

<br>

Anton-<br>

<br>

Thanks, it's aborting because of a runtime error that appears to be in the mpich layer. <br>

<br>

Can you rerun with  "--env  BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1" added to your qsub line - that should<br>

generate some core files on which you can run bgq_stack.<br>

<br>

The system software (driver) on Mira was updated this week and I'd like to get a clearer picture of <br>

whether that could be related to your problem, so<br>

<br>

   Has your code been recompiled since Monday? If not can you recompile and try running again<br>

   <br>

   Have these sort of runs succeeded in the past using the same code base with no changes and similar input data?<br>

<br>

-Scott<br>

<br>

  <br>

<br>

<div class="moz-cite-prefix">On 4/24/14, 2:59 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated" target="_blank">kanaev@ibrae.ac.ru</a> wrote:<br>

</div>

<blockquote>

<p>Sorry about the attached files, here they are. </p>

<p>There are no core files after the exit. It looks like the job stopped because the requested time expired, but you can see from the cobaltlog that only about 5 hours had passed (10 hours were requested) before it exited.

</p>

<p>Anton  </p>

<p>On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker wrote: </p>

<blockquote style="padding-left:5px; border-left-color:#1010ff; border-left-width:2px; border-left-style:solid; margin-left:5px; width:1029px">

<br>

Anton-<br>

<br>

Please send these emails to <a href="mailto:support@alcf.anl.gov" class="moz-txt-link-abbreviated" target="_blank">support@alcf.anl.gov</a> as I may not always be available to investigate.<br>

<br>

I don't see the cobalt or error files attached, so I can't really say anything about why your job may be<br>

failing. Do you get core files when the job crashes? If so I'd recommend using 'bgq_stack '<br>

to try and get the file and line number where the failure occurred. Knowing the line may be enough<br>

to let you figure it out, if not you'll need to dump the values of the variables at the time of the crash<br>

to get a clearer picture of what is going on.<br>

<br>

Scott<br>

<br>

<div class="moz-cite-prefix">On 4/24/14, 1:36 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated" target="_blank">kanaev@ibrae.ac.ru</a> wrote:<br>

</div>

<blockquote>

<p>Hello Scott, </p>

<p>I've tried twice to run a 10-hour, 1024-core job on Mira in mode c16 with </p>

<p>qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit Lid_128x128x1p1024.h5m </p>

<p>Both times the job exited earlier than expected, on the same iteration and with the same error, while executing the following section (it's between two couts):

</p>

<p> ... </p>

<p>//LOOP OVER OWNED CELLS </p>

<p>     double r = 0; </p>

<p>     for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {

</p>

<p>      EntityHandle ent = *it; </p>

<p>      int cellid = mb->id_from_handle(ent); </p>

<p>      double Vol_; </p>

<p>      double u_; </p>

<p>      double v_; </p>

<p>      double w_; </p>

<p>      double r1,r2,r3; </p>

<p>      double tmp; </p>

<p>  </p>

<p>      result = mb->tag_get_data(u, &ent, 1, &u_); </p>

<p>      PRINT_LAST_ERROR; </p>

<p>      result = mb->tag_get_data(v, &ent, 1, &v_); </p>

<p>      PRINT_LAST_ERROR; </p>

<p>      result = mb->tag_get_data(w, &ent, 1, &w_); </p>

<p>      PRINT_LAST_ERROR;  </p>

<p>  </p>

<p>      double result; </p>

<p>      SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);

</p>

<p>  </p>

<p>      r1 = (sound + fabs(result))/CG[cellid][2].length; </p>

<p>  </p>

<p>      SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);

</p>

<p>  </p>

<p>      r2 = (sound + fabs(result))/CG[cellid][3].length;      </p>

<p>  </p>

<p>      SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);

</p>

<p>  </p>

<p>      r3 = (sound + fabs(result))/CG[cellid][5].length; </p>

<p>  </p>

<p>      tmp = MAX3(r1,r2,r3); </p>

<p>  </p>

<p>      r = MAX(tmp,r); </p>

<p> } </p>

<p>  </p>

<p> double rmax; </p>

<p> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); </p>

<p> tau = CFL/rmax; </p>

<p> ttime+=tau;   </p>

<p>... </p>
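<p>For reference, the time-step logic in the excerpt above can be written as a standalone sketch. The <code>Direction</code> and <code>Cell</code> structs and the function names are assumptions made for illustration; the original code reads the velocities from MOAB tags and the direction data from the <code>CG</code> array, and the global maximum comes from <code>MPI_Allreduce</code>, which is left out here.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-ins for the entries of CG[cellid][k] in the excerpt.
struct Direction {
    double lx, ly, lz;  // direction components
    double length;      // cell extent along that direction
};

struct Cell {
    double u, v, w;               // velocity components
    std::vector<Direction> dirs;  // directions to check (three in the excerpt)
};

// Largest (sound + |velocity . direction|) / length over the cell's directions.
double cell_rate(const Cell& c, double sound) {
    double r = 0.0;
    for (const Direction& d : c.dirs) {
        assert(d.length > 0.0);
        const double proj = c.u * d.lx + c.v * d.ly + c.w * d.lz;
        r = std::max(r, (sound + std::fabs(proj)) / d.length);
    }
    return r;
}

// Maximum rate over the locally owned cells (the value passed to the
// reduction in the original code).
double local_max_rate(const std::vector<Cell>& cells, double sound) {
    double r = 0.0;
    for (const Cell& c : cells) r = std::max(r, cell_rate(c, sound));
    return r;
}

// tau = CFL / rmax, as in the excerpt.
double time_step(double cfl, double rmax) {
    assert(rmax > 0.0);
    return cfl / rmax;
}
```

<p>Nothing in this part posts nonblocking requests, which is consistent with the later finding that the run stops at the same iteration even with the <code>MPI_Allreduce</code> line commented out.</p>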

<p>  </p>

<p>So it may be Allreduce </p>

<p>I've attached cobaltlog and error files of both runs </p>

<p>Can you please take a look and suggest further debugging steps? </p>

<p>  </p>

<p>Thanks </p>

<p>Anton  </p>

</blockquote>

</blockquote>

</blockquote>

</blockquote>

</blockquote>

<br>

</blockquote>

<p>  </p>

</blockquote>

<br>

</blockquote>

<p>  </p>

</blockquote>

<p>  </p>

</div>

</div>

</div>

</body>

</html>