<p>
Hello MOAB dev,
</p>
<p>
I've attached a simplified version of my program that crashes, presumably after a particular number of calls to exchange_tags.
</p>
<p>
I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16).
</p>
<p>
It performs around 524378 iterations and then crashes (the error file is attached).
</p>
<p>
Can you please take a look at what Scott Parker from ALCF suggests about it: 
</p>
<p>
-------- Original Message --------
</p>
<table border="0" cellspacing="0" cellpadding="0">
        <tbody>
                <tr>
                        <th align="right" valign="baseline">Subject: </th>
                        <td>Re: Job exiting early [ThermHydraX]</td>
                </tr>
                <tr>
                        <th align="right" valign="baseline">Date: </th>
                        <td>Fri, 9 May 2014 18:48:25 -0500</td>
                </tr>
                <tr>
                        <th align="right" valign="baseline">From: </th>
                        <td>Scott Parker </td>
                </tr>
                <tr>
                        <th align="right" valign="baseline">To: </th>
                        <td></td>
                </tr>
        </tbody>
</table>
<br />
<br />
Anton-<br />
<br />
I took a look at the core files, and from the stack trace it appears
that the code is failing in an MPI_Isend call<br />
made from moab::ParallelComm::recv_buffer, which is called
from moab::ParallelComm::exchange_tags,<br />
called from main(). The stack looks like:<br />
   <br />
  main<br />
    moab::ParallelComm::exchange_tags<br />
      moab::ParallelComm::recv_buffer<br />
        MPI_Isend<br />
<br />
I've been able to get the same failure and error message you are
seeing by having an MPI process call MPI_Isend<br />
when there are no matching receives. After 2 million Isends the
program exits with the error you are seeing. So<br />
I'm pretty sure you're ending up with a large number of outstanding
requests, and the program is failing because<br />
it can't allocate space for new MPI_Request objects.<br />
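<br />
For illustration, a minimal hypothetical reproducer of that failure mode (a sketch, not the MOAB code): one rank keeps posting sends that are never received or completed, so each iteration leaves one more request outstanding.<br />
<pre>
// Hypothetical reproducer: MPI_Isend is posted repeatedly with no
// matching receive, and the requests are never waited on or freed,
// until the MPI layer can no longer allocate MPI_Request objects.
// Run with at least 2 ranks.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double payload = 0.0;
    std::vector<MPI_Request> reqs;
    if (rank == 0) {
        for (;;) {  // no rank ever posts a matching receive
            MPI_Request req;
            MPI_Isend(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            reqs.push_back(req);  // never completed: the request leaks
        }
    }
    MPI_Finalize();
    return 0;
}
</pre>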
<br />
I'd suggest looking at how Moab is calling MPI and how many requests
might be outstanding at any one time.<br />
Since the code is running for 5 hours and looks to be executing
hundreds of thousands of iterations I wonder<br />
if there is some sort of send-receive mismatch that is letting
requests accumulate. I think your best bet is to<br />
talk to the Moab folks and see if they have any ideas about why this
might be happening. <br />
<br />
One possibility is a load imbalance between processes: if you don't
have any MPI_Barriers or other collectives in<br />
your code, you could try adding a barrier to synchronize the
processes.<br />
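<br />
As a sketch, such a barrier could go at the end of each iteration of your main loop (placement is up to you):<br />
<pre>
// Hypothetical placement: once per outer iteration, after the tag
// exchange, so no rank can run far ahead of the others and
// accumulate unmatched sends.
MPI_Barrier(MPI_COMM_WORLD);
</pre>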
<br />
If the Moab guys can't help you and adding a barrier doesn't help I
can work with you to instrument the code to<br />
collect more information on how MPI is being called and we could
possibly pin down the source of the problem<br />
that way.<br />
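<br />
One standard way to collect that information without modifying MOAB itself is to interpose on the PMPI profiling layer. A rough sketch follows (counter names and report interval are illustrative; other completion calls such as MPI_Waitall would need the same treatment for a complete count):<br />
<pre>
// Sketch: count requests started by MPI_Isend and completed via
// MPI_Waitany by intercepting the calls and forwarding to the
// PMPI_ entry points. Link this file into the application.
#include <mpi.h>
#include <atomic>
#include <cstdio>

static std::atomic<long> g_started(0);
static std::atomic<long> g_completed(0);

extern "C" int MPI_Isend(const void* buf, int count, MPI_Datatype type,
                         int dest, int tag, MPI_Comm comm, MPI_Request* req) {
    ++g_started;
    return PMPI_Isend(buf, count, type, dest, tag, comm, req);
}

extern "C" int MPI_Waitany(int count, MPI_Request reqs[], int* index,
                           MPI_Status* status) {
    int rc = PMPI_Waitany(count, reqs, index, status);
    ++g_completed;
    long outstanding = g_started.load() - g_completed.load();
    if (outstanding > 0 && outstanding % 100000 == 0)
        std::fprintf(stderr, "[mpi-count] ~%ld requests outstanding\n",
                     outstanding);
    return rc;
}
</pre>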
<br />
<br />
Scott<br />
<br />
<br />
<div class="moz-cite-prefix">
On 5/2/14, 11:14 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
wrote:<br />
</div>
<blockquote>
        <p>
        Hello Scott
        </p>
        <p>
The dir is /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
        </p>
        <div>
The run that produced the core files is 253035
        </div>
        <div>
         
        </div>
        <div>
        I took another run with the line
        MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
        commented out, and it stopped at the very same iteration, #524378;
        it just got a few lines further.
        </div>
        <div>
          
        </div>
        <div>
        I use the MOAB library and its functions for exchanging data between
        processors, so I don't think I can really count MPI requests.
        </div>
        <div>
         
        </div>
        <div>
        Anton 
        </div>
        <p>
        <br />
        On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
        </p>
        <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
                 <br />
                Can you point me to the directory where your binary and core
                files are?<br />
                <br />
                The stack trace you sent shows a call to MPI_Waitany; do you
                know how many MPI requests<br />
                the code generally has outstanding at any one time?<br />
                <br />
                -Scott<br />
                <br />
                <div class="moz-cite-prefix">
                 On 4/28/14, 4:30 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
                wrote:<br />
                </div>
                <blockquote>
                        <p>
                         Hello Scott, 
                        </p>
                        <p>
                         I did a rerun with the mentioned keys. The code was freshly
                        compiled; the makefile is attached just in case. 
                        </p>
                        <p>
                         I've got 1024 core files. Two of them are attached. 
                        </p>
                        <p>
                         I ran bgq_stack for core.0 and here's what I got: 
                        </p>
                        <pre>
[akanaev@miralac1 pinit]$bgq_stack pinit core.0
------------------------------------------------------------------------
Program   : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
------------------------------------------------------------------------
+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN

00000000018334c0
_ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269

000000000170da28
PAMI_Context_trylock_advancev
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554

000000000155d0dc
PMPI_Waitany
/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239

00000000010e84e4
00000072.long_branch_r2off.H5Dget_space+0
:0

0000000001042be0
00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
:0

00000000019de058
generic_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

00000000019de354
__libc_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

0000000000000000
??
??:0
</pre>
                        <div>
                         <br />
                        </div>
                        <p>
                         >   Have these sorts of runs succeeded in the past using
                        the same code base with no changes and similar input data? 
                        </p>
                        <p>
                         This is the first time I'm trying to run this code for
                        that long. 
                        </p>
                        <p>
                         
                        </p>
                        <p>
                         Thanks 
                        </p>
                        <p>
                         Anton  
                        </p>
                        <p>
                         <br />
                        On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote: 
                        </p>
                        <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1109px">
                                 <br />
                                Anton-<br />
                                <br />
                                Thanks, it's aborting because of a runtime error that
                                appears to be in the mpich layer. <br />
                                <br />
                                Can you rerun with  "--env 
                                BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1" added to your qsub
                                line - that should<br />
                                generate some core files on which you can run bgq_stack.<br />
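                                <br />
                                For example, with the qsub line you've been using, that would look like:<br />
                                <pre>
qsub -n 64 -t 10:00:00 --mode c16 \
     --env BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1 \
     -A ThermHydraX pinit Lid_128x128x1p1024.h5m
</pre>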
                                <br />
                                The system software (driver) on Mira was updated this week
                                and I'd like to get a clearer picture of <br />
                                whether that could be related to your problem, so:<br />
                                <br />
                                   Has your code been recompiled since Monday? If not, can
                                you recompile and try running again?<br />
                                   <br />
                                   Have these sorts of runs succeeded in the past using the
                                same code base with no changes and similar input data?<br />
                                <br />
                                -Scott<br />
                                <br />
                                  <br />
                                <br />
                                <div class="moz-cite-prefix">
                                 On 4/24/14, 2:59 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
                                </div>
                                <blockquote>
                                        <p>
                                         Sorry about the attached files; here they are. 
                                        </p>
                                        <p>
                                         There are no core files after exiting; it looks like it
                                        stopped because the requested time expired, but you can
                                        see from the cobaltlog that about 5 hours passed (10 hours
                                        were requested) before the exit. 
                                        </p>
                                        <p>
                                         Anton  
                                        </p>
                                        <p>
                                         On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker
                                        wrote: 
                                        </p>
                                        <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1029px">
                                                 <br />
                                                Anton-<br />
                                                <br />
                                                Please send these emails to <a href="mailto:support@alcf.anl.gov" class="moz-txt-link-abbreviated">support@alcf.anl.gov</a> as
                                                I may not always be available to investigate.<br />
                                                <br />
                                                I don't see the cobalt or error files attached, so I
                                                can't really say anything about why your job may be<br />
                                                failing. Do you get core files when the job crashes? If
                                                so, I'd recommend using 'bgq_stack '<br />
                                                to try to get the file and line number where the
                                                failure occurred. Knowing the line may be enough<br />
                                                to let you figure it out; if not, you'll need to dump the
                                                values of the variables at the time of the crash<br />
                                                to get a clearer picture of what is going on.<br />
                                                <br />
                                                Scott<br />
                                                <br />
                                                <div class="moz-cite-prefix">
                                                 On 4/24/14, 1:36 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
                                                </div>
                                                <blockquote>
                                                        <p>
                                                         Hello Scott, 
                                                        </p>
                                                        <p>
                                                         I've tried twice to run a 10-hour, 1024-core job on
                                                        Mira in mode c16 with 
                                                        </p>
                                                        <p>
                                                         qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX
                                                        pinit Lid_128x128x1p1024.h5m 
                                                        </p>
                                                        <p>
                                                         Both times the job exited earlier than expected, on the
                                                        same iteration, after the same error, while executing
                                                        the following section (it's between two couts): 
                                                        </p>
                                                        <pre>
  ...
  //LOOP OVER OWNED CELLS
  double r = 0;
  for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
    EntityHandle ent = *it;
    int cellid = mb->id_from_handle(ent);
    double Vol_;
    double u_;
    double v_;
    double w_;
    double r1,r2,r3;
    double tmp;

    result = mb->tag_get_data(u, &ent, 1, &u_);
    PRINT_LAST_ERROR;
    result = mb->tag_get_data(v, &ent, 1, &v_);
    PRINT_LAST_ERROR;
    result = mb->tag_get_data(w, &ent, 1, &w_);
    PRINT_LAST_ERROR;

    double result;
    SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
    r1 = (sound + fabs(result))/CG[cellid][2].length;

    SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
    r2 = (sound + fabs(result))/CG[cellid][3].length;

    SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
    r3 = (sound + fabs(result))/CG[cellid][5].length;

    tmp = MAX3(r1,r2,r3);
    r = MAX(tmp,r);
  }

  double rmax;
  MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
  tau = CFL/rmax;
  ttime+=tau;
  ...
</pre>
                                                        <p>
                                                         So it may be the Allreduce. 
                                                        </p>
                                                        <p>
                                                         I've attached the cobaltlog and error files of both
                                                        runs. 
                                                        </p>
                                                        <p>
                                                         Can you please take a look and suggest further
                                                        debugging? 
                                                        </p>
                                                        <p>
                                                         
                                                        </p>
                                                        <p>
                                                         Thanks 
                                                        </p>
                                                        <p>
                                                         Anton  
                                                        </p>
                                                </blockquote>
                                        </blockquote>
                                </blockquote>
                        </blockquote>
                </blockquote>
                <br />
        </blockquote>
        <p>
         
        </p>
</blockquote>
<br />