[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]

Tim Tautges timothy.tautges at cd-adapco.com
Wed May 21 10:11:24 CDT 2014


IIRC, there's an option to wait at a barrier where appropriate.  All 
sends/receives should have matches, so if it doesn't make it through a 
barrier something is wrong.  There's no algorithm in there that has 
non-matching sends/receives (i.e. no polling for random messages).
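
For illustration only, a minimal pure-MPI sketch of that matched pattern (this
is not MOAB's implementation): every Isend has a matching Irecv, every request
is completed, and the barrier closing each round is always reached. If a rank
hangs at such a barrier, some send or receive was left unmatched.

  #include <mpi.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int send_val = rank, recv_val = -1;
      int right = (rank + 1) % size, left = (rank + size - 1) % size;

      for (int round = 0; round < 1000; ++round) {
          MPI_Request reqs[2];
          // Matched pair: the Irecv posted here is the match for the
          // left neighbor's Isend, and vice versa.
          MPI_Irecv(&recv_val, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[0]);
          MPI_Isend(&send_val, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
          MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); // nothing left outstanding
          MPI_Barrier(MPI_COMM_WORLD);               // always reached if matched
      }

      MPI_Finalize();
      return 0;
  }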

- tim

On 05/21/2014 09:57 AM, Grindeanu, Iulian R. wrote:
> Anton's issue is with calling exchange_tags repeatedly; it seems to be
> mismatched isends/ireceives; as they accumulate, they run out of memory.
> Or something else is happening with memory; maybe we need to do some
> flushing of the buffers.
> Jane's issue is different; assign ids might have problems.
> Both issues deserve tickets.
> Iulian
> ------------------------------------------------------------------------
> *From:* Vijay S. Mahadevan [vijay.m at gmail.com]
> *Sent:* Wednesday, May 21, 2014 9:48 AM
> *To:* Jiangtao Hu
> *Cc:* Grindeanu, Iulian R.; kanaev at ibrae.ac.ru; MOAB dev
> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> Jane,
>
> //              Global Ids
>    //   2.0    4 -------- 3 -------- 6 ------- 10 ------- 14
>    //          |          |          |          |          |
>    //          |          |          |          |          |
>    //          |          |          |          |          |
>    //   1.0    1 -------- 2 -------- 5 -------- 9 ------- 13
>    //         0.0        1.0        2.0        3.0        4.0
>
> This is certainly a bug and should not happen in the first place. If you
> have this test case available, do send it to the list so that we can find
> out the actual reason for the misnumbering. Anton's issue may or may not
> be directly related to this bug, but his reference to GIDs being
> discontinuous shows that there is an outstanding issue here.
>
> Vijay
>
> On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <jiangtao_ma at yahoo.com> wrote:
>
>     In a unit test for my project, I am trying to use
>     ParallelComm::assign_global_id() to get the global ids for the
>     vertices of a simple grid
>        //              Mesh Ids
>        //   2.0    6 -------- 7 -------- 8 -------- 9 ------- 10
>        //          |          |          |          |          |
>        //          |    1     |    2     |    3     |    4     |
>        //          |          |          |          |          |
>        //   1.0    1 -------- 2 -------- 3 -------- 4 -------- 5
>        //         0.0        1.0        2.0        3.0        4.0
>
>     and got global ids as follows:
>     //              Global Ids
>        //   2.0    4 -------- 3 -------- 6 ------- 10 ------- 14
>        //          |          |          |          |          |
>        //          |          |          |          |          |
>        //          |          |          |          |          |
>        //   1.0    1 -------- 2 -------- 5 -------- 9 ------- 13
>        //         0.0        1.0        2.0        3.0        4.0
>
>     I don't know if this is the same thing Anton ran into.
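>
>     For reference, the skeleton of such a test might look roughly like the
>     sketch below. This is not the actual unit test: the mesh construction
>     and the assign_global_ids arguments shown here are assumptions.
>
>       #include "moab/Core.hpp"
>       #include "moab/ParallelComm.hpp"
>       #include <mpi.h>
>       #include <iostream>
>       using namespace moab;
>
>       int main(int argc, char **argv) {
>         MPI_Init(&argc, &argv);
>         Core mb;
>         ParallelComm pcomm(&mb, MPI_COMM_WORLD);
>
>         // Build the 4-quad strip from the diagram above:
>         // vertices 1-5 on y = 1.0, vertices 6-10 on y = 2.0, x = 0..4.
>         EntityHandle verts[10];
>         for (int i = 0; i < 5; ++i) {
>           double lo[3] = {double(i), 1.0, 0.0}, hi[3] = {double(i), 2.0, 0.0};
>           mb.create_vertex(lo, verts[i]);
>           mb.create_vertex(hi, verts[i + 5]);
>         }
>         for (int i = 0; i < 4; ++i) {
>           EntityHandle conn[4] = {verts[i], verts[i+1], verts[i+6], verts[i+5]};
>           EntityHandle quad;
>           mb.create_element(MBQUAD, conn, 4, quad);
>         }
>
>         // Assumed arguments: root set, dimension 2, start id 1.
>         ErrorCode rval = pcomm.assign_global_ids(0, 2, 1);
>         if (MB_SUCCESS != rval) std::cerr << "assign_global_ids failed\n";
>
>         // Read back the GLOBAL_ID tag on the vertices and print it.
>         Tag gid;
>         mb.tag_get_handle("GLOBAL_ID", 1, MB_TYPE_INTEGER, gid);
>         int ids[10];
>         mb.tag_get_data(gid, verts, 10, ids);
>         for (int i = 0; i < 10; ++i)
>           std::cout << "vertex " << i + 1 << " -> global id " << ids[i] << "\n";
>
>         MPI_Finalize();
>         return 0;
>       }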
>
>     Iulian, if you are interested in seeing the test, please let me know
>     and I'll send it to you.
>     Jane
>
>
>     Asst. Researcher
>     Dept. of Engineering Physics
>     UW @ Madison
>
>
>     "And we know that for those who love God, that is, for those who are
>     called according to his purpose, all things are working together for
>     good." (Romans 8:28)
>     On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R."
>     <iulian at mcs.anl.gov> wrote:
>
>
>     hmmm,
>     this is not good :(
>
>     Are you running this on Mira? Do you have a small file for a
>     laptop/workstation?
>     Maybe I can create a similar one.
>
>     Do you see this only on 1024 processes, or can it happen at a lower count?
>
>     What does your model look like?
>     No processor should communicate with more than 64 other processes;
>     maybe after ghosting this number is reached.
>
>     Can you run a debug version of this? Maybe some asserts are not
>     triggered in optimized mode.
>
>     Is your file somewhere on Mira where I can get to it?
>
>     Iulian
>
>     ------------------------------------------------------------------------
>     *From:* moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov]
>     on behalf of kanaev at ibrae.ac.ru [kanaev at ibrae.ac.ru]
>     *Sent:* Tuesday, May 20, 2014 5:05 PM
>     *To:* MOAB dev
>     *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
>     The problem is still here.
>     I've made a simple program that performs a certain number of
>     exchange_tags calls within a loop.
>     If you run it on several processors with any mesh file, it will
>     eventually crash with the following message from every core:
>     Fatal error in PMPI_Isend: Internal MPI error!, error stack:
>     PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR,
>     dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
>     (unknown)(): Internal MPI error!
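>
>     A sketch of that kind of driver is below, in case it is useful. This is
>     not the actual program; the parallel read options and the exchange_tags
>     overload used here are assumptions.
>
>       #include "moab/Core.hpp"
>       #include "moab/ParallelComm.hpp"
>       #include "moab/Range.hpp"
>       #include <mpi.h>
>       #include <iostream>
>       using namespace moab;
>
>       int main(int argc, char **argv) {
>         MPI_Init(&argc, &argv);
>         if (argc < 2) { std::cerr << "usage: repro <mesh.h5m>\n"; return 1; }
>
>         // Read and partition the mesh in parallel (option string assumed).
>         Core mb;
>         ErrorCode rval = mb.load_file(argv[1], 0,
>             "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;"
>             "PARALLEL_RESOLVE_SHARED_ENTS");
>         if (MB_SUCCESS != rval) { std::cerr << "load failed\n"; return 1; }
>         ParallelComm *pcomm = ParallelComm::get_pcomm(&mb, 0);
>
>         // A dense double tag to exchange, and the local 2D cells to do it on.
>         Tag t;
>         double def = 0.0;
>         mb.tag_get_handle("MYTAG", 1, MB_TYPE_DOUBLE, t,
>                           MB_TAG_DENSE | MB_TAG_CREAT, &def);
>         Range ents;
>         mb.get_entities_by_dimension(0, 2, ents);
>
>         // Repeated tag exchange; with enough iterations this eventually
>         // dies inside MPI_Isend with the error quoted above.
>         for (long i = 0; i < 1000000; ++i) {
>           rval = pcomm->exchange_tags(t, ents);  // assumed (tag, range) overload
>           if (MB_SUCCESS != rval) {
>             std::cerr << "exchange_tags failed at iteration " << i << "\n";
>             break;
>           }
>         }
>
>         MPI_Finalize();
>         return 0;
>       }
>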
>     Thanks
>     Anton
>
>     On Tue, 20 May 2014 04:40:03 -0400, wrote:
>
>         Please disregard that; the global_id space for Quads was
>         discontinuous in my mesh file.
>         I will check back with a correct mesh.
>         Anton
>
>         On Mon, 19 May 2014 15:17:14 -0400, wrote:
>
>             Hello MOAB dev,
>             I've attached a simplified version of my program that
>             crashes, presumably after a particular number of calls to
>             exchange_tags.
>             I ran it a couple of times on Mira on 1024 cores (64 nodes in
>             --mode c16).
>             It performs around 524378 iterations and then crashes (error
>             file attached).
>             Can you please take a look at what Scott Parker from ALCF
>             suggests about it:
>             -------- Original Message --------
>             Subject: 	Re: Job exiting early [ThermHydraX]
>             Date: 	Fri, 9 May 2014 18:48:25 -0500
>             From: 	Scott Parker
>             To: 	
>
>
>
>             Anton-
>
>             I took a look at the core files, and from the stack trace it
>             appears that the code is failing in an MPI_Isend call
>             that is called from moab::ParallelComm::recv_buffer, which is
>             called from moab::ParallelComm::exchange_tags,
>             called from main(). The stack looks like:
>
>                main
>                  moab::ParallelComm::exchange_tags
>                    moab::ParallelComm::recv_buffer
>                      MPI_Isend
>
>             I've been able to get the same failure and error message you
>             are seeing by having an MPI process call MPI_Isend
>             when there are no matching receives. After 2 million Isends
>             the program exits with the error you are seeing. So
>             I'm pretty sure you're ending up with a large number of
>             outstanding requests, and the program is failing because
>             it can't allocate space for new MPI_Request objects.
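>
>             For reference, a minimal sketch of that kind of experiment (not
>             the actual test): one rank keeps posting Isends that are never
>             matched or completed, and after enough of them the library can
>             no longer allocate request objects and aborts.
>
>               #include <mpi.h>
>               #include <cstdio>
>               #include <vector>
>
>               int main(int argc, char **argv) {
>                   MPI_Init(&argc, &argv);
>                   int rank;
>                   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>                   if (rank == 0) {
>                       unsigned char buf[4] = {0};
>                       std::vector<MPI_Request> reqs;  // never waited on or freed
>                       for (long i = 0; ; ++i) {
>                           MPI_Request req;
>                           // Rank 1 never posts a matching receive, so the
>                           // requests pile up inside MPI until it fails.
>                           MPI_Isend(buf, 4, MPI_UNSIGNED_CHAR, 1, 6,
>                                     MPI_COMM_WORLD, &req);
>                           reqs.push_back(req);
>                           if (i % 100000 == 0)
>                               std::printf("posted %ld isends\n", i);
>                       }
>                   }
>                   MPI_Finalize();  // rank 0 never reaches this
>                   return 0;
>               }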
>
>             I'd suggest looking at how Moab is calling MPI and how many
>             requests might be outstanding at any one time.
>             Since the code is running for 5 hours and looks to be
>             executing hundreds of thousands of iterations I wonder
>             if there is some sort of send-receive mismatch that is
>             letting requests accumulate. I think your best bet is to
>             talk to the Moab folks and see if they have any ideas about
>             why this might be happening.
>
>             One possibility is a load imbalance between processes - if
>             you don't have any MPI_Barriers or other collectives in
>             your code you could try adding a barrier to synchronize the
>             processes.
>
>             If the Moab guys can't help you and adding a barrier doesn't
>             help I can work with you to instrument the code to
>             collect more information on how MPI is being called and we
>             could possibly pin down the source of the problem
>             that way.
>
>
>             Scott
>
>
>             On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru wrote:
>
>                 Hello Scott
>                 The dir is
>                 /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
>                 The run that produced the core files is 253035.
>                 I took another run with the line
>                   MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
>                 commented out, and it stopped at the very same iteration,
>                 #524378; it just got past a few more lines.
>                 I use the MOAB library and its function for exchanging data
>                 between processors, so I think I cannot really count MPI
>                 requests.
>                 Anton
>
>                 On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
>
>
>                     Can you point me to the directory where your binary
>                     and core files are?
>
>                     The stack trace you sent shows a call to
>                     MPI_Waitany; do you know how many MPI requests
>                     the code generally has outstanding at any time?
>
>                     -Scott
>
>                     On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru wrote:
>
>                         Hello Scott,
>                         I reran with the mentioned keys. The code
>                         was freshly compiled, with the makefile attached just
>                         in case.
>                         I've got 1024 core files. Two of them are attached.
>                         I ran bgq_stack on core.0 and here's what I got:
>                           [akanaev at miralac1 pinit]$ bgq_stack pinit core.0
>                         ------------------------------------------------------------------------
>
>                         Program   :
>                         /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
>
>                         ------------------------------------------------------------------------
>
>                         +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1
>                         State: RUN
>                         00000000018334c0
>                         _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
>
>                         /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
>
>                         000000000170da28
>                         PAMI_Context_trylock_advancev
>                         /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554
>
>                         000000000155d0dc
>                         PMPI_Waitany
>                         /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
>
>                         00000000010e84e4
>                         00000072.long_branch_r2off.H5Dget_space+0
>                         :0
>                         0000000001042be0
>                         00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
>
>                         :0
>                         00000000019de058
>                         generic_start_main
>                         /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>
>                         00000000019de354
>                         __libc_start_main
>                         /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>
>                         0000000000000000
>                         ??
>                         ??:0
>
>                          >   Have these sort of runs succeeded in the
>                         past using the same code base with no changes
>                         and similar input data?
>                         This is the first time I have tried to run this
>                         code for that long.
>                         Thanks
>                         Anton
>
>                         On Thu, 24 Apr 2014 15:45:49 -0500, Scott
>                         Parker wrote:
>
>
>                             Anton-
>
>                             Thanks, it's aborting because of a runtime
>                             error that appears to be in the mpich layer.
>
>                             Can you rerun with  "--env
>                             BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1"
>                             added to your qsub line - that should
>                             generate some core files on which you can
>                             run bgq_stack.
>
>                             The system software (driver) on Mira was
>                             updated this week and I'd like to get a
>                             clearer picture of
>                             whether that could be related to your problem, so:
>
>                                 Has your code been recompiled since
>                             Monday? If not, can you recompile and try
>                             running again?
>
>                                 Have these sort of runs succeeded in the
>                             past using the same code base with no
>                             changes and similar input data?
>
>                             -Scott
>
>
>
>                             On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru wrote:
>
>                                 Sorry about the attached files, here they
>                                 are.
>                                 There are no core files after exiting;
>                                 it looks like it stopped because the
>                                 requested time expired, but you can see
>                                 from the cobaltlog that about 5 hours
>                                 passed (10 hours were requested) before
>                                 the exit.
>                                 Anton
>                                 On Thu, 24 Apr 2014 14:07:07 -0500,
>                                 Scott Parker wrote:
>
>
>                                     Anton-
>
>                                     Please send these emails to
>                                     support at alcf.anl.gov as I
>                                     may not always be available to
>                                     investigate.
>
>                                     I don't see the cobalt or error
>                                     files attached, so I can't really
>                                     say anything about why your job may be
>                                     failing. Do you get core files when
>                                     the job crashes? If so, I'd recommend
>                                     using 'bgq_stack '
>                                     to try and get the file and line
>                                     number where the failure occurred.
>                                     Knowing the line may be enough
>                                     to let you figure it out; if not,
>                                     you'll need to dump the values of
>                                     the variables at the time of the crash
>                                     to get a clearer picture of what is
>                                     going on.
>
>                                     Scott
>
>                                     On 4/24/14, 1:36 PM,
>                                     kanaev at ibrae.ac.ru wrote:
>
>                                         Hello Scott,
>                                         I've tried twice to run a 10-hour,
>                                         1024-core job on Mira in mode
>                                         c16 with
>                                         qsub -n 64 -t 10:00:00 --mode
>                                         c16 -A ThermHydraX pinit
>                                         Lid_128x128x1p1024.h5m
>                                         Both times the job exited earlier
>                                         than expected, on the same
>                                         iteration, after the same error,
>                                         while executing the following
>                                         section (it's between two couts):
>                                           ...
>                                         //LOOP OVER OWNED CELLS
>                                         double r = 0;
>                                         for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
>                                           EntityHandle ent = *it;
>                                           int cellid = mb->id_from_handle(ent);
>                                           double Vol_;
>                                           double u_;
>                                           double v_;
>                                           double w_;
>                                           double r1,r2,r3;
>                                           double tmp;
>                                           result = mb->tag_get_data(u, &ent, 1, &u_);
>                                           PRINT_LAST_ERROR;
>                                           result = mb->tag_get_data(v, &ent, 1, &v_);
>                                           PRINT_LAST_ERROR;
>                                           result = mb->tag_get_data(w, &ent, 1, &w_);
>                                           PRINT_LAST_ERROR;
>                                           double result;
>                                           SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
>                                           r1 = (sound + fabs(result))/CG[cellid][2].length;
>                                           SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
>                                           r2 = (sound + fabs(result))/CG[cellid][3].length;
>                                           SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
>                                           r3 = (sound + fabs(result))/CG[cellid][5].length;
>                                           tmp = MAX3(r1,r2,r3);
>                                           r = MAX(tmp,r);
>                                         }
>                                         double rmax;
>                                         MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
>                                         tau = CFL/rmax;
>                                         ttime+=tau;
>                                         ...
>                                         So it may be the Allreduce.
>                                         I've attached the cobaltlog and
>                                         error files of both runs.
>                                         Can you please take a look and
>                                         suggest further debugging steps?
>                                         Thanks
>                                         Thanks
>                                         Anton
>
>
>
>
>
>

-- 
Timothy J. Tautges
Manager, Directed Meshing, CD-adapco
Phone: 608-354-1459
timothy.tautges at cd-adapco.com

