<p>
The problem is still here
</p>
<p>
I've made a simple program that performs a certain number of exchange_tags calls within a loop
</p>
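<p>
For reference, the loop has roughly the following shape (a minimal sketch, not the attached program itself; the tag name, the mesh-loading options, the partition tag and the exchange_tags overload used here are assumptions based on typical MOAB ParallelComm usage):
</p>
<pre>
// Minimal sketch (assumed MOAB usage): load a partitioned mesh in parallel
// and call exchange_tags repeatedly on the same tag and entities.
#include "moab/Core.hpp"
#include "moab/ParallelComm.hpp"

using namespace moab;

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  Core mb;
  // Assumed parallel read options; PARALLEL_PARTITION must match the mesh file.
  ErrorCode rval = mb.load_file(argv[1], 0,
      "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;"
      "PARALLEL_RESOLVE_SHARED_ENTS");
  if (MB_SUCCESS != rval) return 1;

  ParallelComm *pcomm = ParallelComm::get_pcomm(&mb, 0);

  // A dense double tag exchanged over the 2D cells on every iteration.
  Tag t;
  double def = 0.0;
  rval = mb.tag_get_handle("u", 1, MB_TYPE_DOUBLE, t,
                           MB_TAG_DENSE | MB_TAG_CREAT, &def);
  Range cells;
  rval = mb.get_entities_by_dimension(0, 2, cells);

  const int niter = 1000000;  // large enough to hit the failure
  for (int i = 0; i &lt; niter; i++) {
    rval = pcomm->exchange_tags(t, cells);
    if (MB_SUCCESS != rval) break;  // the crash happens inside the call
  }

  MPI_Finalize();
  return 0;
}
</pre>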
<p>
If you run it on several processors with any mesh file, it will eventually crash with the following message from every core:
</p>
<pre>
Fatal error in PMPI_Isend: Internal MPI error!, error stack:
PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR, dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
(unknown)(): Internal MPI error!
</pre>
<p>
Thanks
</p>
<p>
Anton
</p>
<div>
<br />
</div>
<p>
On Tue, 20 May 2014 04:40:03 -0400, wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
<p>
Please disregard that, the global_id space for Quads was not contiguous in my mesh file
</p>
<p>
Will check back with a correct mesh
</p>
<p>
Anton<br />
<br />
On Mon, 19 May 2014 15:17:14 -0400, wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
<p>
Hello MOAB dev,
</p>
<p>
I've attached a simplified version of my program that crashes, presumably after a particular number of exchange_tags calls
</p>
<p>
I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16)
</p>
<p>
It performs around 524378 iterations and then crashes (error file attached)
</p>
<p>
Can you please take a look at what Scott Parker from ALCF suggests about it:
</p>
<p>
-------- Original Message --------
</p>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<th align="right" valign="baseline">Subject: </th>
<td>Re: Job exiting early [ThermHydraX]</td>
</tr>
<tr>
<th align="right" valign="baseline">Date: </th>
<td>Fri, 9 May 2014 18:48:25 -0500</td>
</tr>
<tr>
<th align="right" valign="baseline">From: </th>
<td>Scott Parker </td>
</tr>
<tr>
<th align="right" valign="baseline">To: </th>
<td> </td>
</tr>
</tbody>
</table>
<br />
<br />
Anton-<br />
<br />
I took a look at the core files and from the stack trace it appears
that the code is failing in an MPI_Isend call<br />
that is called from moab::ParallelComm::recv_buffer, which is called
from moab::ParallelComm::exchange_tags,<br />
which is called from main(). The stack looks like:<br />
<br />
main<br />
moab::ParallelComm::exchange_tags<br />
moab::ParallelComm::recv_buffer<br />
MPI_Isend<br />
<br />
I've been able to get the same failure and error message you are
seeing by having an MPI process call MPI_Isend<br />
when there are no matching receives. After 2 million Isends the
program exits with the error you are seeing. So<br />
I'm pretty sure you're ending up with a large number of outstanding
requests and the program is failing because<br />
it can't allocate space for new MPI_Request objects.<br />
<br />
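For illustration, a minimal sketch of that kind of experiment (plain MPI, no MOAB; the exact number of Isends it takes to fail depends on the MPI implementation and available memory) could look like:<br />
<pre>
// Sketch: rank 0 keeps posting small MPI_Isends that are never matched by a
// receive and never completed with MPI_Wait/MPI_Test, so MPI_Request objects
// accumulate until the library can no longer allocate them and aborts.
#include &lt;mpi.h&gt;
#include &lt;stdio.h&gt;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        unsigned char buf[4] = {0};
        long n = 0;
        for (;;) {
            MPI_Request req;           // handle is intentionally leaked
            MPI_Isend(buf, 4, MPI_UNSIGNED_CHAR, 1, 6, MPI_COMM_WORLD, &req);
            if (++n % 100000 == 0)
                printf("posted %ld Isends\n", n);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);       // other ranks just wait here
    MPI_Finalize();
    return 0;
}
</pre>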
I'd suggest looking at how Moab is calling MPI and how many requests
might be outstanding at any one time.<br />
Since the code is running for 5 hours and looks to be executing
hundreds of thousands of iterations I wonder<br />
if there is some sort of send-receive mismatch that is letting
requests accumulate. I think your best bet is to<br />
talk to the Moab folks and see if they have any ideas about why this
might be happening. <br />
<br />
One possibility is a load imbalance between processes - if you don't
have any MPI_Barriers or other collectives in<br />
your code, you could try adding a barrier to synchronize the
processes.<br />
<br />
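For example (hypothetical placement; pcomm, t, cells and nsteps stand in for whatever the real time loop uses):<br />
<pre>
// Hypothetical: synchronize ranks once per iteration of the time loop so
// outstanding requests cannot pile up across iterations.
for (int step = 0; step &lt; nsteps; step++) {
    // ... local work and tag updates ...
    rval = pcomm->exchange_tags(t, cells);  // MOAB shared/ghost tag exchange
    MPI_Barrier(MPI_COMM_WORLD);            // added barrier
}
</pre>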
If the Moab guys can't help you and adding a barrier doesn't help I
can work with you to instrument the code to<br />
collect more information on how MPI is being called and we could
possibly pin down the source of the problem<br />
that way.<br />
<br />
<br />
Scott<br />
<br />
<br />
<div class="moz-cite-prefix">
On 5/2/14, 11:14 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
wrote:<br />
</div>
<blockquote>
<p>
Hello Scott
</p>
<p>
The dir is /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
</p>
<div>
The run that produced the core files is 253035
</div>
<div>
</div>
<div>
I took another run with the line
MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
commented out, and it stopped at the very same iteration, #524378,
just getting a few lines further
</div>
<div>
</div>
<div>
I use the MOAB library and its routines for exchanging data between
processors, so I don't think I can really count MPI requests myself
</div>
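<p>
In case it helps, one way to count them without touching MOAB itself would be a small PMPI interposition layer compiled into the executable (a sketch only; the prototypes must match the mpi.h on the target system, since older MPICH2 builds omit the const qualifiers; the counter is not thread-safe; and only Isend/Irecv/Waitany are covered here):
</p>
<pre>
// pmpi_count.cpp - sketch of a PMPI wrapper that tracks outstanding requests.
// Every MPI_Isend/MPI_Irecv increments a counter; every request completed
// through MPI_Waitany decrements it.
#include &lt;mpi.h&gt;
#include &lt;cstdio&gt;

static long g_outstanding = 0;   // not thread-safe; sketch only

extern "C" int MPI_Isend(const void *buf, int count, MPI_Datatype type,
                         int dest, int tag, MPI_Comm comm, MPI_Request *req)
{
    if (++g_outstanding % 1000000 == 0)
        fprintf(stderr, "outstanding requests reached %ld\n", g_outstanding);
    return PMPI_Isend(buf, count, type, dest, tag, comm, req);
}

extern "C" int MPI_Irecv(void *buf, int count, MPI_Datatype type,
                         int src, int tag, MPI_Comm comm, MPI_Request *req)
{
    ++g_outstanding;
    return PMPI_Irecv(buf, count, type, src, tag, comm, req);
}

extern "C" int MPI_Waitany(int n, MPI_Request reqs[], int *index, MPI_Status *st)
{
    int rc = PMPI_Waitany(n, reqs, index, st);
    if (rc == MPI_SUCCESS && *index != MPI_UNDEFINED)
        --g_outstanding;
    return rc;
}
</pre>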
<div>
</div>
<div>
Anton
</div>
<p>
<br />
On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 100%">
<br />
Can you point me to the directory where your binary and core
files are?<br />
<br />
The stack trace you sent shows a call to MPI_Waitany; do you
know how many MPI requests<br />
the code generally has outstanding at any time?<br />
<br />
-Scott<br />
<br />
<div class="moz-cite-prefix">
On 4/28/14, 4:30 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a>
wrote:<br />
</div>
<blockquote>
<p>
Hello Scott,
</p>
<p>
I did a rerun with the mentioned keys. The code was freshly
compiled; the makefile is attached just in case.
</p>
<p>
I've got 1024 core files. Two of them are attached.
</p>
<p>
I ran bgq_stack on core.0 and here's what I got:
</p>
<pre>
[akanaev@miralac1 pinit]$ bgq_stack pinit core.0
------------------------------------------------------------------------
Program : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
------------------------------------------------------------------------
+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN

00000000018334c0
_ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269

000000000170da28
PAMI_Context_trylock_advancev
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554

000000000155d0dc
PMPI_Waitany
/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239

00000000010e84e4
00000072.long_branch_r2off.H5Dget_space+0
:0

0000000001042be0
00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
:0

00000000019de058
generic_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

00000000019de354
__libc_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

0000000000000000
??
??:0
</pre>
<div>
<br />
</div>
<p>
> Have these sorts of runs succeeded in the past using
the same code base with no changes and similar input data?
</p>
<p>
This is the first time I have tried to run this code for
that long
</p>
<p>
</p>
<p>
Thanks
</p>
<p>
Anton
</p>
<p>
<br />
On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1109px">
<br />
Anton-<br />
<br />
Thanks, it's aborting because of a runtime error that
appears to be in the mpich layer. <br />
<br />
Can you rerun with "--env
BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1" added to your qsub
line - that should<br />
generate some core files on which you can run bgq_stack.<br />
<br />
The system software (driver) on Mira was updated this week
and I'd like to get a clearer picture of <br />
whether that could be related to your problem, so:<br />
<br />
Has your code been recompiled since Monday? If not, can
you recompile and try running again?<br />
<br />
Have these sorts of runs succeeded in the past using the
same code base with no changes and similar input data?<br />
<br />
-Scott<br />
<br />
<br />
<br />
<div class="moz-cite-prefix">
On 4/24/14, 2:59 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
</div>
<blockquote>
<p>
Sorry about the attached files, here they are
</p>
<p>
There are no core files after exiting; it looks like it
stopped because the requested time expired, but you can see
from the cobaltlog that about 5 hours passed (10 hours were
requested) before the exit
</p>
<p>
Anton
</p>
<p>
On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker
wrote:
</p>
<blockquote style="padding-left: 5px; border-left-color: #1010ff; border-left-width: 2px; border-left-style: solid; margin-left: 5px; width: 1029px">
<br />
Anton-<br />
<br />
Please send these emails to <a href="mailto:support@alcf.anl.gov" class="moz-txt-link-abbreviated">support@alcf.anl.gov</a> as
I may not always be available to investigate.<br />
<br />
I don't see the cobalt or error files attached, so I
can't really say anything about why your job may be<br />
failing. Do you get core files when the job crashes? If
so, I'd recommend using 'bgq_stack '<br />
to try and get the file and line number where the
failure occurred. Knowing the line may be enough<br />
to let you figure it out; if not, you'll need to dump the
values of the variables at the time of the crash<br />
to get a clearer picture of what is going on.<br />
<br />
Scott<br />
<br />
<div class="moz-cite-prefix">
On 4/24/14, 1:36 PM, <a href="mailto:kanaev@ibrae.ac.ru" class="moz-txt-link-abbreviated">kanaev@ibrae.ac.ru</a> wrote:<br />
</div>
<blockquote>
<p>
Hello Scott,
</p>
<p>
I've tried twice to run a 10-hour, 1024-core job on
Mira in mode c16 with
</p>
<p>
qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX
pinit Lid_128x128x1p1024.h5m
</p>
<p>
Both times the job exited earlier than expected, on the
same iteration, after the same error, while executing
the following section (it's between two couts):
</p>
<pre>
...
//LOOP OVER OWNED CELLS
double r = 0;
for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
    EntityHandle ent = *it;
    int cellid = mb->id_from_handle(ent);
    double Vol_;
    double u_;
    double v_;
    double w_;
    double r1,r2,r3;
    double tmp;

    result = mb->tag_get_data(u, &ent, 1, &u_);
    PRINT_LAST_ERROR;
    result = mb->tag_get_data(v, &ent, 1, &v_);
    PRINT_LAST_ERROR;
    result = mb->tag_get_data(w, &ent, 1, &w_);
    PRINT_LAST_ERROR;

    double result;
    SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
    r1 = (sound + fabs(result))/CG[cellid][2].length;

    SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
    r2 = (sound + fabs(result))/CG[cellid][3].length;

    SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
    r3 = (sound + fabs(result))/CG[cellid][5].length;

    tmp = MAX3(r1,r2,r3);
    r = MAX(tmp,r);
}

double rmax;
MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
tau = CFL/rmax;
ttime+=tau;
...
</pre>
<p>
</p>
<p>
So it may be the Allreduce
</p>
<p>
I've attached the cobaltlog and error files of both
runs
</p>
<p>
Can you please take a look and suggest further
debugging steps
</p>
<p>
</p>
<p>
Thanks
</p>
<p>
Anton
</p>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br />
</blockquote>
<p>
</p>
</blockquote>
<br />
</blockquote>
<p>
</p>
</blockquote>
<p>
</p>