<p>
Please disregard that; the global_id space for Quads was non-contiguous in my mesh file.
</p>
<p>
I will check back with a corrected mesh.
</p>
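<p>
A minimal sketch of how a list of global IDs could be checked for gaps before a run, assuming the IDs have already been read from the mesh into a std::vector of int. The helper name ids_are_contiguous is hypothetical and is not part of MOAB:
</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a MOAB call): returns true when the IDs, once
// sorted, form a run first, first+1, ..., last with no gaps and no duplicates.
// The vector is taken by value so the caller's ordering is left untouched.
bool ids_are_contiguous(std::vector<int> ids) {
    if (ids.empty()) return true;
    std::sort(ids.begin(), ids.end());
    for (std::size_t i = 1; i < ids.size(); ++i) {
        // Any step other than +1 means a missing or repeated ID.
        if (ids[i] != ids[i - 1] + 1) return false;
    }
    return true;
}
```

<p>
With this helper, the IDs 3, 1, 2, 4 count as contiguous, while 1, 2, 4, 5 do not, because 3 is missing.
</p>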
<p>
Anton<br />
<br />
On Mon, 19 May 2014 15:17:14 -0400,  wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
        <p>
        Hello MOAB dev,
        </p>
        <p>
        I've attached a simplified version of my program that crashes, presumably after a particular number of calls to exchange_tags.
        </p>
        <p>
        I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16).
        </p>
        <p>
        It performs around 524378 iterations and then crashes (error file attached).
        </p>
        <p>
        Can you please take a look at what Scott Parker from ALCF suggests about it: 
        </p>
        <p>
        -------- Original Message --------
        </p>
        <table border="0" cellspacing="0" cellpadding="0">
                <tbody>
                        <tr>
                                <th align="right" valign="baseline">Subject: </th>
                                <td>Re: Job exiting early [ThermHydraX]</td>
                        </tr>
                        <tr>
                                <th align="right" valign="baseline">Date: </th>
                                <td>Fri, 9 May 2014 18:48:25 -0500</td>
                        </tr>
                        <tr>
                                <th align="right" valign="baseline">From: </th>
                                <td>Scott Parker </td>
                        </tr>
                        <tr>
                                <th align="right" valign="baseline">To: </th>
                                <td> </td>
                        </tr>
                </tbody>
        </table>
        <br />
        <br />
        Anton-<br />
        <br />
        I took a look at the core files and from the stack trace it appears
        that the code is failing in an MPI_Isend call<br />
        that is called from moab::ParallelComm::recv_buffer which is called
        from moab::ParallelComm::exchange_tags<br />
        called from main(). The stack looks like:<br />
           <br />
          main<br />
            moab::ParallelComm::exchange_tags<br />
              moab::ParallelComm::recv_buffer<br />
                MPI_Isend<br />
        <br />
        I've been able to get the same failure and error message you are
        seeing by having an MPI process call MPI_Isend<br />
        when there are no matching receives. After 2 million Isends the
        program exits with the error you are seeing. So<br />
        I'm pretty sure you're ending up with a large number of outstanding
        requests and the program is failing because<br />
        it can't allocate space for new MPI_Request objects.<br />
        <br />
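        <p>
        The accumulation described above can be modeled without MPI. This is a minimal sketch, assuming a fixed pool of request slots. The class RequestPool, the function run_steps, and the capacity value are hypothetical illustrations, not MPI or MOAB APIs, and the real limit on Mira is not established in this thread:
        </p>

```cpp
#include <cstddef>

// Toy model of a fixed pool of request slots. The capacity is an
// illustrative parameter, not a documented limit of any MPI library.
class RequestPool {
public:
    explicit RequestPool(std::size_t capacity) : capacity_(capacity) {}

    // Models posting a nonblocking operation: takes one slot, or fails
    // when every slot is still held by an uncompleted request.
    bool post() {
        if (outstanding_ == capacity_) return false;
        ++outstanding_;
        if (outstanding_ > peak_) peak_ = outstanding_;
        return true;
    }

    // Models completing a request (a successful wait or test): frees one slot.
    void complete() {
        if (outstanding_ > 0) --outstanding_;
    }

    std::size_t outstanding() const { return outstanding_; }
    std::size_t peak() const { return peak_; }

private:
    std::size_t capacity_;
    std::size_t outstanding_ = 0;
    std::size_t peak_ = 0;
};

// Runs up to `iterations` steps; each step posts one request and completes
// it only when `complete_each_step` is true. Returns the number of steps
// that succeeded before the pool ran out of slots.
std::size_t run_steps(RequestPool& pool, std::size_t iterations,
                      bool complete_each_step) {
    for (std::size_t i = 0; i < iterations; ++i) {
        if (!pool.post()) return i;
        if (complete_each_step) pool.complete();
    }
    return iterations;
}
```

        <p>
        When every posted request is completed, the number of outstanding requests stays at one and the loop can run indefinitely. When completions never happen, the loop fails after exactly as many steps as there are slots, however long the run had been healthy before that.
        </p>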
        I'd suggest looking at how Moab is calling MPI and how many requests
        might be outstanding at any one time.<br />
        Since the code is running for 5 hours and looks to be executing
        hundreds of thousands of iterations I wonder<br />
        if there is some sort of send-receive mismatch that is letting
        requests accumulate. I think your best bet is to<br />
        talk to the Moab folks and see if they have any ideas about why this
        might be happening. <br />
        <br />
        One possibility is a load imbalance between processes - if you don't
        have any MPI_Barriers or other collectives in<br />
        your code you could try adding a barrier to synchronize the
        processes.<br />
        <br />
        If the Moab guys can't help you and adding a barrier doesn't help I
        can work with you to instrument the code to<br />
        collect more information on how MPI is being called and we could
        possibly pin down the source of the problem<br />
        that way.<br />
        <br />
        <br />
        Scott<br />
        <br />
        <br />
        <div class="moz-cite-prefix">
        On 5/2/14, 11:14 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
        wrote:<br />
        </div>
        <blockquote>
                <p>
                Hello Scott
                </p>
                <p>
                The dir is cd /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
                </p>
                <div>
                The run that produced core files is 253035
                </div>
                <div>
                 
                </div>
                <div>
                I did another run with the line
                 MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
                commented out and it stopped at the very same time iteration, #524378;
                 it just got a few lines further
                </div>
                <div>
                  
                </div>
                <div>
                I use the MOAB library and its functions for exchanging data between
                processors, so I think I cannot really count MPI requests
                </div>
                <div>
                 
                </div>
                <div>
                Anton 
                </div>
                <p>
                <br />
                On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
                </p>
                <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
                        <br />
                        Can you point me to the directory where your binary and core
                        files are?<br />
                        <br />
                        The stack trace you sent shows a call to MPI_Waitany, do you
                        know how many MPI requests<br />
                        the code generally has outstanding at any time?<br />
                        <br />
                        -Scott<br />
                        <br />
                        <div class="moz-cite-prefix">
                        On 4/28/14, 4:30 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
                        wrote:<br />
                        </div>
                        <blockquote>
                                <p>
                                Hello Scott, 
                                </p>
                                <p>
                                I did a rerun with the mentioned keys. The code was freshly
                                compiled with the makefile (attached, just in case). 
                                </p>
                                <p>
                                I've got 1024 core files. Two of them are attached. 
                                </p>
                                <p>
                                I ran <span style="font-family: 'Lucida Grande', Verdana, Arial, Helvetica, sans-serif">bgq_stack on core.0 and
                                here's what I got:</span> 
                                </p>
                                <p>
                                <span style="font-family: 'Lucida Grande', Verdana, Arial, Helvetica, sans-serif"></span> [akanaev@miralac1
                                pinit]$bgq_stack pinit core.0 
                                </p>
                                <p>
                                ------------------------------------------------------------------------
                                </p>
                                <p>
                                Program   :
                                /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit 
                                </p>
                                <p>
                                ------------------------------------------------------------------------
                                </p>
                                <p>
                                +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                00000000018334c0 
                                </p>
                                <p>
                                _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
                                </p>
                                <p>
                                /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                000000000170da28 
                                </p>
                                <p>
                                PAMI_Context_trylock_advancev 
                                </p>
                                <p>
                                /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                000000000155d0dc 
                                </p>
                                <p>
                                PMPI_Waitany 
                                </p>
                                <p>
                                /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                00000000010e84e4 
                                </p>
                                <p>
                                00000072.long_branch_r2off.H5Dget_space+0 
                                </p>
                                <p>
                                :0 
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                0000000001042be0 
                                </p>
                                <p>
                                00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
                                </p>
                                <p>
                                :0 
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                00000000019de058 
                                </p>
                                <p>
                                generic_start_main 
                                </p>
                                <p>
                                /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                00000000019de354 
                                </p>
                                <p>
                                __libc_start_main 
                                </p>
                                <p>
                                /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                0000000000000000 
                                </p>
                                <p>
                                ?? 
                                </p>
                                <p>
                                ??:0 
                                </p>
                                <div>
                                <br />
                                </div>
                                <p>
                                >   Have these sort of runs succeeded in the past using
                                the same code base with no changes and similar input data? 
                                </p>
                                <p>
                                This is the first time I'm trying to run this code for
                                that long 
                                </p>
                                <p>
                                 
                                </p>
                                <p>
                                Thanks 
                                </p>
                                <p>
                                Anton  
                                </p>
                                <p>
                                <br />
                                On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote: 
                                </p>
                                <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1109px">
                                        <br />
                                        Anton-<br />
                                        <br />
                                        Thanks, it's aborting because of a runtime error that
                                        appears to be in the mpich layer. <br />
                                        <br />
                                        Can you rerun with  "--env 
                                        BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1" added to your qsub
                                        line - that should<br />
                                        generate some core files on which you can run bgq_stack.<br />
                                        <br />
                                        The system software (driver) on Mira was updated this week
                                        and I'd like to get a clearer picture of <br />
                                        whether that could be related to your problem, so<br />
                                        <br />
                                           Has your code been recompiled since Monday? If not can
                                        you recompile and try running again<br />
                                           <br />
                                           Have these sort of runs succeeded in the past using the
                                        same code base with no changes and similar input data?<br />
                                        <br />
                                        -Scott<br />
                                        <br />
                                          <br />
                                        <br />
                                        <div class="moz-cite-prefix">
                                        On 4/24/14, 2:59 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
                                        </div>
                                        <blockquote>
                                                <p>
                                                Sorry about the attached files, here they are 
                                                </p>
                                                <p>
                                                There are no core files after exiting. It looks like
                                                the job stopped because the requested time expired, but
                                                you can see from the cobaltlog that only about 5 hours
                                                had passed (10 hours were requested) before the exit 
                                                </p>
                                                <p>
                                                Anton  
                                                </p>
                                                <p>
                                                On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker
                                                wrote: 
                                                </p>
                                                <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1029px">
                                                        <br />
                                                        Anton-<br />
                                                        <br />
                                                        Please send these emails to <a href="mailto:support@alcf.anl.gov" class="moz-txt-link-abbreviated">support@alcf.anl.gov</a> as
                                                        I may not always be available to investigate.<br />
                                                        <br />
                                                        I don't see the cobalt or error files attached, so I
                                                        can't really say anything about why your job may be<br />
                                                        failing. Do you get core files when the job crashes? If
                                                        so I'd recommend using 'bgq_stack '<br />
                                                        to try and get the file and line number where the
                                                        failure occurred. Knowing the line may be enough<br />
                                                        to let you figure it out, if not you'll need to dump the
                                                        values of the variables at the time of the crash<br />
                                                        to get a clearer picture of what is going on.<br />
                                                        <br />
                                                        Scott<br />
                                                        <br />
                                                        <div class="moz-cite-prefix">
                                                        On 4/24/14, 1:36 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
                                                        </div>
                                                        <blockquote>
                                                                <p>
                                                                Hello Scott, 
                                                                </p>
                                                                <p>
                                                                I've tried twice to run a 10-hour, 1024-core job on
                                                                Mira in mode c16 with 
                                                                </p>
                                                                <p>
                                                                qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX
                                                                pinit Lid_128x128x1p1024.h5m 
                                                                </p>
                                                                <p>
                                                                Both times the job exited earlier than expected, on the
                                                                same iteration and with the same error, while executing
                                                                the following section (it's between two couts): 
                                                                </p>
                                                                <p>
                                                                 ... 
                                                                </p>
                                                                <p>
                                                                //LOOP OVER OWNED CELLS 
                                                                </p>
                                                                <p>
                                                                     double r = 0; 
                                                                </p>
                                                                <p>
                                                                     for (moab::Range::iterator it =
                                                                owned_ents.begin(); it != owned_ents.end(); it++) {
                                                                </p>
                                                                <p>
                                                                      EntityHandle ent = *it; 
                                                                </p>
                                                                <p>
                                                                      int cellid = mb->id_from_handle(ent); 
                                                                </p>
                                                                <p>
                                                                      double Vol_; 
                                                                </p>
                                                                <p>
                                                                      double u_; 
                                                                </p>
                                                                <p>
                                                                      double v_; 
                                                                </p>
                                                                <p>
                                                                      double w_; 
                                                                </p>
                                                                <p>
                                                                      double r1,r2,r3; 
                                                                </p>
                                                                <p>
                                                                      double tmp; 
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                      result = mb->tag_get_data(u, &ent, 1,
                                                                &u_); 
                                                                </p>
                                                                <p>
                                                                      PRINT_LAST_ERROR; 
                                                                </p>
                                                                <p>
                                                                      result = mb->tag_get_data(v, &ent, 1,
                                                                &v_); 
                                                                </p>
                                                                <p>
                                                                      PRINT_LAST_ERROR; 
                                                                </p>
                                                                <p>
                                                                      result = mb->tag_get_data(w, &ent, 1,
                                                                &w_); 
                                                                </p>
                                                                <p>
                                                                      PRINT_LAST_ERROR;  
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                      double result; 
                                                                </p>
                                                                <p>
                                                                     
                                                                SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                      r1 = (sound +
                                                                fabs(result))/CG[cellid][2].length; 
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                     
                                                                SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                      r2 = (sound +
                                                                fabs(result))/CG[cellid][3].length;      
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                     
                                                                SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                      r3 = (sound +
                                                                fabs(result))/CG[cellid][5].length; 
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                      tmp = MAX3(r1,r2,r3); 
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                      r = MAX(tmp,r); 
                                                                </p>
                                                                <p>
                                                                 } 
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                 double rmax; 
                                                                </p>
                                                                <p>
                                                                 MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
                                                                </p>
                                                                <p>
                                                                 tau = CFL/rmax; 
                                                                </p>
                                                                <p>
                                                                 ttime+=tau;   
                                                                </p>
                                                                <p>
                                                                ... 
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                So it may be Allreduce 
                                                                </p>
                                                                <p>
                                                                I've attached cobaltlog and error files of both
                                                                runs 
                                                                </p>
                                                                <p>
                                                                Can you please take a look and suggest further
                                                                debugging steps? 
                                                                </p>
                                                                <p>
                                                                 
                                                                </p>
                                                                <p>
                                                                Thanks 
                                                                </p>
                                                                <p>
                                                                Anton  
                                                                </p>
                                                        </blockquote>
                                                </blockquote>
                                        </blockquote>
                                </blockquote>
                        </blockquote>
                        <br />
                </blockquote>
                <p>
                 
                </p>
        </blockquote>
        <br />
</blockquote>
<p>
 
</p>