<p>
The problem is still here
</p>
<p>
I've made a simple program that performs a certain number of exchange_tags calls within a loop
</p>
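<p>
A minimal sketch of the kind of loop I mean is below (the mesh-loading options, the tag name "u", and the single-tag exchange_tags overload shown here are illustrative, not my exact code):
</p>
<pre>
// Illustrative sketch only: load a mesh in parallel, create one dense double
// tag, and call ParallelComm::exchange_tags repeatedly until it fails.
#include "moab/Core.hpp"
#include "moab/ParallelComm.hpp"
#include "mpi.h"
using namespace moab;

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  Core mb;

  // read a pre-partitioned mesh file given on the command line
  ErrorCode rval = mb.load_file(argv[1], 0,
      "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;PARALLEL_RESOLVE_SHARED_ENTS");
  if (MB_SUCCESS != rval) return 1;

  // the ParallelComm instance created by the parallel read
  ParallelComm *pcomm = ParallelComm::get_pcomm(&mb, 0);

  Range quads;
  mb.get_entities_by_dimension(0, 2, quads);

  // one dense double tag with a default value, exchanged over and over
  double def = 0.0;
  Tag t;
  mb.tag_get_handle("u", 1, MB_TYPE_DOUBLE, t, MB_TAG_DENSE | MB_TAG_CREAT, &def);

  // repeated exchange of the tag on shared/ghost entities; this is the
  // call that eventually aborts inside MPI_Isend
  for (long iter = 0; iter < 1000000; iter++) {
    rval = pcomm->exchange_tags(t, quads);
    if (MB_SUCCESS != rval) return 2;
  }

  MPI_Finalize();
  return 0;
}
</pre>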
<p>
If you run it on several processors with any mesh file, it will eventually crash with the following message from every core:
</p>
<p>
 
</p>
<p>
Fatal error in PMPI_Isend: Internal MPI error!, error stack:
</p>
<p>
PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR, dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
</p>
<p>
(unknown)(): Internal MPI error!
</p>
<p>
 
</p>
<p>
Thanks
</p>
<p>
Anton 
</p>
<div>
<br />
</div>
<p>
On Tue, 20 May 2014 04:40:03 -0400,  wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
        <p>
Please disregard that; the global_id space for Quads was not contiguous in my mesh file
        </p>
        <p>
I will check back with a correct mesh
        </p>
        <p>
        Anton<br />
        <br />
        On Mon, 19 May 2014 15:17:14 -0400,  wrote:
        </p>
        <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
                <p>
                Hello MOAB dev,
                </p>
                <p>
I've attached a simplified version of my program that crashes, presumably after a particular number of exchange_tags calls
                </p>
                <p>
I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16)
                </p>
                <p>
It performs around 524378 iterations and then crashes (error file attached)
                </p>
                <p>
                Can you please take a look at what Scott Parker from ALCF suggests about it: 
                </p>
                <p>
                -------- Original Message --------
                </p>
                <table border="0" cellspacing="0" cellpadding="0">
                        <tbody>
                                <tr>
                                        <th align="right" valign="baseline">Subject: </th>
                                        <td>Re: Job exiting early [ThermHydraX]</td>
                                </tr>
                                <tr>
                                        <th align="right" valign="baseline">Date: </th>
                                        <td>Fri, 9 May 2014 18:48:25 -0500</td>
                                </tr>
                                <tr>
                                        <th align="right" valign="baseline">From: </th>
                                        <td>Scott Parker </td>
                                </tr>
                                <tr>
                                        <th align="right" valign="baseline">To: </th>
                                        <td> </td>
                                </tr>
                        </tbody>
                </table>
                <br />
                <br />
                Anton-<br />
                <br />
                I took a look at the core files and from the stack trace it appears
                that the code is failing in an MPI_Isend call<br />
that is called from moab::ParallelComm::recv_buffer, which is called
                from moab::ParallelComm::exchange_tags<br />
                called from main(). The stack looks like:<br />
                   <br />
                  main<br />
                    moab::ParallelComm::exchange_tags<br />
      moab::ParallelComm::recv_buffer<br />
                        MPI_Isend<br />
                <br />
                I've been able to get the same failure and error message you are
                seeing by having an MPI process call MPI_Isend<br />
                when there are no matching receives. After 2 million Isends the
                program exits with the error you are seeing. So<br />
I'm pretty sure you're ending up with a large number of outstanding
                requests and the program is failing because<br />
                it can't allocate space for new MPI_Request objects.<br />
                <br />
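<p>
Roughly, that reproducer looks like the following (an illustrative sketch, not the exact test code; the count, datatype, and tag just mirror the failing Isend in your error message):
</p>
<pre>
// Run with at least two ranks: rank 0 posts Isends that rank 1 never
// receives, and the request handles are never waited on or freed, so the
// internal MPI_Request objects accumulate until the library cannot
// allocate any more and aborts with "Internal MPI error!".
#include "mpi.h"

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  unsigned char buf[4] = {0};
  if (rank == 0) {
    for (long i = 0; ; i++) {
      MPI_Request req;
      // no matching MPI_Recv is ever posted on rank 1, and req is
      // dropped without MPI_Wait/MPI_Request_free, so requests pile up
      MPI_Isend(buf, 4, MPI_UNSIGNED_CHAR, 1, 6, MPI_COMM_WORLD, &req);
    }
  }

  MPI_Finalize();  // never reached by rank 0
  return 0;
}
</pre>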
                I'd suggest looking at how Moab is calling MPI and how many requests
                might be outstanding at any one time.<br />
                Since the code is running for 5 hours and looks to be executing
                hundreds of thousands of iterations I wonder<br />
                if there is some sort of send-receive mismatch that is letting
                requests accumulate. I think your best bet is to<br />
                talk to the Moab folks and see if they have any ideas about why this
                might be happening. <br />
                <br />
                One possibility is a load imbalance between processes - if you don't
                have any MPI_Barriers or other collectives in<br />
                your code you could try adding a barrier to synchronize the
                processes.<br />
                <br />
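<p>
For instance, something as simple as the following once per outer iteration would keep the ranks from drifting apart (where exactly to place it is of course application specific):
</p>
<pre>
// Illustrative: synchronize all processes at the end of every time step so
// that no rank can run far ahead and flood slower ranks with sends.
MPI_Barrier(MPI_COMM_WORLD);
</pre>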
                If the Moab guys can't help you and adding a barrier doesn't help I
                can work with you to instrument the code to<br />
                collect more information on how MPI is being called and we could
                possibly pin down the source of the problem<br />
                that way.<br />
                <br />
                <br />
                Scott<br />
                <br />
                <br />
                <div class="moz-cite-prefix">
                On 5/2/14, 11:14 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
                wrote:<br />
                </div>
                <blockquote>
                        <p>
                        Hello Scott
                        </p>
                        <p>
The dir is /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
                        </p>
                        <div>
The run that produced the core files is 253035
                        </div>
                        <div>
                         
                        </div>
                        <div>
I took another run with the line
MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
commented out, and it stopped at the very same iteration, #524378,
just getting a few lines further
                        </div>
                        <div>
                          
                        </div>
                        <div>
I use the MOAB library and its function for exchanging data between
processors, so I think I cannot really count MPI requests
                        </div>
                        <div>
                         
                        </div>
                        <div>
                        Anton 
                        </div>
                        <p>
                        <br />
                        On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
                        </p>
                        <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
                                <br />
                                Can you point me to the directory where your binary and core
                                files are?<br />
                                <br />
The stack trace you sent shows a call to MPI_Waitany; do you
                                know how many MPI requests<br />
                                the code generally has outstanding at any time?<br />
                                <br />
                                -Scott<br />
                                <br />
                                <div class="moz-cite-prefix">
                                On 4/28/14, 4:30 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
                                wrote:<br />
                                </div>
                                <blockquote>
                                        <p>
                                        Hello Scott, 
                                        </p>
                                        <p>
I did a rerun with the mentioned keys. The code was freshly
compiled; the makefile is attached just in case.
                                        </p>
                                        <p>
                                        I've got 1024 core files. Two of them are attached. 
                                        </p>
<p>
I ran bgq_stack on core.0 and here's what I got:
</p>
<pre>
[akanaev@miralac1 pinit]$ bgq_stack pinit core.0
------------------------------------------------------------------------
Program   : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
------------------------------------------------------------------------
+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN

00000000018334c0
_ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269

000000000170da28
PAMI_Context_trylock_advancev
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554

000000000155d0dc
PMPI_Waitany
/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239

00000000010e84e4
00000072.long_branch_r2off.H5Dget_space+0
:0

0000000001042be0
00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
:0

00000000019de058
generic_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

00000000019de354
__libc_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

0000000000000000
??
??:0
</pre>
                                        <div>
                                        <br />
                                        </div>
                                        <p>
>   Have these sorts of runs succeeded in the past using
the same code base with no changes and similar input data? 
                                        </p>
                                        <p>
That is the first time I'm trying to run this code for
such a long time 
                                        </p>
                                        <p>
                                         
                                        </p>
                                        <p>
                                        Thanks 
                                        </p>
                                        <p>
                                        Anton  
                                        </p>
                                        <p>
                                        <br />
                                        On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote: 
                                        </p>
                                        <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1109px">
                                                <br />
                                                Anton-<br />
                                                <br />
                                                Thanks, it's aborting because of a runtime error that
                                                appears to be in the mpich layer. <br />
                                                <br />
                                                Can you rerun with  "--env 
                                                BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1" added to your qsub
                                                line - that should<br />
                                                generate some core files on which you can run bgq_stack.<br />
                                                <br />
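For example, based on the qsub line from your earlier message, that would look something like this (illustrative):<br />
<pre>
qsub -n 64 -t 10:00:00 --mode c16 --env BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1 \
     -A ThermHydraX pinit Lid_128x128x1p1024.h5m
</pre>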
                                                The system software (driver) on Mira was updated this week
                                                and I'd like to get a clearer picture of <br />
whether that could be related to your problem, so<br />
                                                <br />
   Has your code been recompiled since Monday? If not, can
you recompile and try running again?<br />
                                                   <br />
   Have these sorts of runs succeeded in the past using the
same code base with no changes and similar input data?<br />
                                                <br />
                                                -Scott<br />
                                                <br />
                                                  <br />
                                                <br />
                                                <div class="moz-cite-prefix">
                                                On 4/24/14, 2:59 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
                                                </div>
                                                <blockquote>
                                                        <p>
Sorry about the attached files, here they are 
                                                        </p>
                                                        <p>
There are no core files after exiting; it looks like it
stopped because the requested time expired, but you can
see from the cobaltlog that about 5 hours passed (10 hours
were requested) before the exit 
                                                        </p>
                                                        <p>
                                                        Anton  
                                                        </p>
                                                        <p>
                                                        On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker
                                                        wrote: 
                                                        </p>
                                                        <blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1029px">
                                                                <br />
                                                                Anton-<br />
                                                                <br />
                                                                Please send these emails to <a href="mailto:support@alcf.anl.gov" class="moz-txt-link-abbreviated">support@alcf.anl.gov</a> as
                                                                I may not always be available to investigate.<br />
                                                                <br />
                                                                I don't see the cobalt or error files attached, so I
                                                                can't really say anything about why your job may be<br />
failing. Do you get core files when the job crashes? If
so, I'd recommend using 'bgq_stack '<br />
                                                                to try and get the file and line number where the
                                                                failure occurred. Knowing the line may be enough<br />
                                                                to let you figure it out, if not you'll need to dump the
                                                                values of the variables at the time of the crash<br />
                                                                to get a clearer picture of what is going on.<br />
                                                                <br />
                                                                Scott<br />
                                                                <br />
                                                                <div class="moz-cite-prefix">
                                                                On 4/24/14, 1:36 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
                                                                </div>
                                                                <blockquote>
                                                                        <p>
                                                                        Hello Scott, 
                                                                        </p>
                                                                        <p>
I've tried twice to run a 10-hour, 1024-core job on
Mira in mode c16 with 
                                                                        </p>
                                                                        <p>
                                                                        qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX
                                                                        pinit Lid_128x128x1p1024.h5m 
                                                                        </p>
                                                                        <p>
Both times the job exited earlier than expected, on the
same iteration, after the same error, while executing
the following section (it's between two couts): 
                                                                        </p>
<pre>
...
//LOOP OVER OWNED CELLS
     double r = 0;
     for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
       EntityHandle ent = *it;
       int cellid = mb->id_from_handle(ent);
       double Vol_;
       double u_;
       double v_;
       double w_;
       double r1,r2,r3;
       double tmp;

       result = mb->tag_get_data(u, &ent, 1, &u_);
       PRINT_LAST_ERROR;
       result = mb->tag_get_data(v, &ent, 1, &v_);
       PRINT_LAST_ERROR;
       result = mb->tag_get_data(w, &ent, 1, &w_);
       PRINT_LAST_ERROR;

       double result;
       SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
       r1 = (sound + fabs(result))/CG[cellid][2].length;

       SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
       r2 = (sound + fabs(result))/CG[cellid][3].length;

       SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
       r3 = (sound + fabs(result))/CG[cellid][5].length;

       tmp = MAX3(r1,r2,r3);
       r = MAX(tmp,r);
     }

     double rmax;
     MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
     tau = CFL/rmax;
     ttime += tau;
...
</pre>
                                                                        <p>
                                                                         
                                                                        </p>
                                                                        <p>
So it may be the Allreduce 
                                                                        </p>
                                                                        <p>
I've attached the cobaltlog and error files of both
runs 
                                                                        </p>
                                                                        <p>
Can you please take a look and suggest further
debugging steps 
                                                                        </p>
                                                                        <p>
                                                                         
                                                                        </p>
                                                                        <p>
                                                                        Thanks 
                                                                        </p>
                                                                        <p>
                                                                        Anton  
                                                                        </p>
                                                                </blockquote>
                                                        </blockquote>
                                                </blockquote>
                                        </blockquote>
                                </blockquote>
                                <br />
                        </blockquote>
                        <p>
                         
                        </p>
                </blockquote>
                <br />
        </blockquote>
        <p>
         
        </p>
</blockquote>
<p>
 
</p>