[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
Tim Tautges
timothy.tautges at cd-adapco.com
Wed May 21 10:11:24 CDT 2014
IIRC, there's an option to wait at a barrier where appropriate. All
sends/receives should have matches, so if it doesn't make it through a
barrier something is wrong. There's no algorithm in there that has
non-matching sends/receives (i.e. no polling for random messages).
- tim
On 05/21/2014 09:57 AM, Grindeanu, Iulian R. wrote:
> Anton's issue is with calling exchange_tags repeatedly; it seems to be
> caused by mismatched isends/irecvs; as they accumulate, they exhaust
> memory. Or something else is happening with memory; maybe we need to
> flush the buffers.
> Jane's issue is different; assign ids might have problems.
> All of these deserve tickets.
> Iulian
> ------------------------------------------------------------------------
> *From:* Vijay S. Mahadevan [vijay.m at gmail.com]
> *Sent:* Wednesday, May 21, 2014 9:48 AM
> *To:* Jiangtao Hu
> *Cc:* Grindeanu, Iulian R.; kanaev at ibrae.ac.ru; MOAB dev
> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> Jane,
>
> // Global Ids
> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
> // | | | | |
> // | | | | |
> // | | | | |
> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
> // 0.0 1.0 2.0 3.0 4.0
>
> This is certainly a bug and should not happen to start with. If you have
> this test case available, do send it to the list so that we can find out
> the actual reason for the misnumbering. Anton's issue may or may not be
> directly related to this bug, but his reference to the GID space being
> discontinuous shows that there is an outstanding issue here.
>
> Vijay
>
> On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <jiangtao_ma at yahoo.com
> <mailto:jiangtao_ma at yahoo.com>> wrote:
>
> In a unit test for my project, I am trying to use
> ParallelComm::assign_global_id() to get the global ids for the vertices
> of a simple grid
> // Mesh Ids
> // 2.0 6 ------- 7 -------- 8 -------- 9 --------- 10
> // | | | | |
> // | 1 | 2 | 3 | 4 |
> // | | | | |
> // 1.0 1 ------- 2 -------- 3 -------- 4 --------- 5
> // 0.0 1.0 2.0 3.0 4.0
>
> and got the following global ids
> // Global Ids
> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
> // | | | | |
> // | | | | |
> // | | | | |
> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
> // 0.0 1.0 2.0 3.0 4.0
>
> I don't know if this is what Anton ran into.
>
> Iulian, if you are interested in seeing the test, please let me know,
> and I'll send it to you.
> Jane
>
>
> Asst. Researcher
> Dept. of Engineering Physics
> UW @ Madison
>
>
> "And we know that for those who love God, that is, for those who are
> called according to his purpose, all things are working together for
> good." (Romans 8:28)
> On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R."
> <iulian at mcs.anl.gov <mailto:iulian at mcs.anl.gov>> wrote:
>
>
> hmmm,
> this is not good :(
>
> Are you running this on mira? Do you have a small file for a
> laptop/workstation?
> Maybe I can create one similar.
>
> Do you see this only on 1024 processes, or also at lower counts?
>
> What does your model look like?
> No processor should communicate with more than 64 other
> processes; maybe this number is reached after ghosting.
>
> Can you run a debug version of this? Maybe some asserts are not
> triggered in optimized mode.
>
> Is your file somewhere on mira I can get to it?
>
> Iulian
>
> ------------------------------------------------------------------------
> *From:* moab-dev-bounces at mcs.anl.gov
> <mailto:moab-dev-bounces at mcs.anl.gov> [moab-dev-bounces at mcs.anl.gov
> <mailto:moab-dev-bounces at mcs.anl.gov>] on behalf of
> kanaev at ibrae.ac.ru <mailto:kanaev at ibrae.ac.ru> [kanaev at ibrae.ac.ru
> <mailto:kanaev at ibrae.ac.ru>]
> *Sent:* Tuesday, May 20, 2014 5:05 PM
> *To:* MOAB dev
> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> The problem is still here.
> I've made a simple program that performs a given number of
> exchange_tags calls in a loop.
> If you run it on several processors with any mesh file, it will
> eventually crash with the following message from every core:
> Fatal error in PMPI_Isend: Internal MPI error!, error stack:
> PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR,
> dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
> (unknown)(): Internal MPI error!
> Thanks
> Anton
>
> On Tue, 20 May 2014 04:40:03 -0400, wrote:
>
> Please disregard that; the global_id space for Quads was
> discontinuous in my mesh file.
> I will check back with a correct mesh.
> Anton
>
> On Mon, 19 May 2014 15:17:14 -0400, wrote:
>
> Hello MOAB dev,
> I've attached a simplified version of my program, which
> crashes, presumably after a particular number of
> exchange_tags calls.
> I ran it a couple of times on Mira on 1024 cores (64 nodes in
> --mode c16).
> It performs around 524378 iterations and then crashes (error
> file attached).
> Can you please take a look at what Scott Parker from ALCF
> suggests about it:
> -------- Original Message --------
> Subject: Re: Job exiting early [ThermHydraX]
> Date: Fri, 9 May 2014 18:48:25 -0500
> From: Scott Parker
> To:
>
>
>
> Anton-
>
> I took a look at the core files and from the stack trace it
> appears that the code is failing in an MPI_Isend call
> that is called from moab.ParallelComm::recv_buffer which is
> called from moab::ParallelComm::exchange_tags
> called from main(). The stack looks like:
>
> main
> moab::ParallelComm::exchange_tags
> moab.ParallelComm::recv_buffer
> MPI_Isend
>
> I've been able to get the same failure and error message you
> are seeing by having an MPI process call MPI_Isend
> when there are no matching receives. After 2 million Isends
> the program exits with the error you are seeing. So
> I'm pretty sure you're ending up with a large number of
> outstanding requests and the program is failing because
> it can't allocate space for new MPI_Request objects.
>
> I'd suggest looking at how Moab is calling MPI and how many
> requests might be outstanding at any one time.
> Since the code is running for 5 hours and looks to be
> executing hundreds of thousands of iterations I wonder
> if there is some sort of send-receive mismatch that is
> letting requests accumulate. I think your best bet is to
> talk to the Moab folks and see if they have any ideas about
> why this might be happening.
>
> One possibility is a load imbalance between processes - if
> you don't have any MPI_Barriers or other collectives in
> your code you could try adding a barrier to synchronize the
> processes.
>
> If the Moab guys can't help you and adding a barrier doesn't
> help I can work with you to instrument the code to
> collect more information on how MPI is being called and we
> could possibly pin down the source of the problem
> that way.
>
>
> Scott
>
>
> On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru
> <mailto:kanaev at ibrae.ac.ru> wrote:
>
> Hello Scott
> The dir is cd
> /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
> The run that produced the core files is 253035.
> I did another run with the line
> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out, and it stopped at the very same time iteration, #524378, having just gotten a few lines further.
> I use the MOAB library and its functions for exchanging data
> between processors, so I think I cannot really count MPI
> requests.
> Anton
>
> On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
>
>
> Can you point me to the directory where your binary
> and core files are?
>
> The stack trace you sent shows a call to
> MPI_Waitany, do you know how many MPI requests
> the code generally has outstanding at any time?
>
> -Scott
>
> On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru
> <mailto:kanaev at ibrae.ac.ru> wrote:
>
> Hello Scott,
> I reran with the options you mentioned. The code
> was freshly compiled; the makefile is attached just
> in case.
> I've got 1024 core files. Two of them are attached.
> I ran bgq_stack on core.0 and here's what I got:
> [akanaev at miralac1 pinit]$bgq_stack pinit core.0
> ------------------------------------------------------------------------
>
> Program :
> /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
>
> ------------------------------------------------------------------------
>
> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1
> State: RUN
> 00000000018334c0
> _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
>
> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
>
> 000000000170da28
> PAMI_Context_trylock_advancev
> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554
>
> 000000000155d0dc
> PMPI_Waitany
> /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
>
> 00000000010e84e4
> 00000072.long_branch_r2off.H5Dget_space+0
> :0
> 0000000001042be0
> 00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
>
> :0
> 00000000019de058
> generic_start_main
> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>
> 00000000019de354
> __libc_start_main
> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>
> 0000000000000000
> ??
> ??:0
>
> > Have these sort of runs succeeded in the
> past using the same code base with no changes
> and similar input data?
> This is the first time I'm trying to run this
> code for that long.
> Thanks
> Anton
>
> On Thu, 24 Apr 2014 15:45:49 -0500, Scott
> Parker wrote:
>
>
> Anton-
>
> Thanks, it's aborting because of a runtime
> error that appears to be in the mpich layer.
>
> Can you rerun with "--env
> BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1"
> added to your qsub line - that should
> generate some core files on which you can
> run bgq_stack.
>
> The system software (driver) on Mira was
> updated this week and I'd like to get a
> clearer picture of
> whether that could be related to your problem, so:
>
> Has your code been recompiled since
> Monday? If not, can you recompile and try
> running again?
>
> Have these sort of runs succeeded in the
> past using the same code base with no
> changes and similar input data?
>
> -Scott
>
>
>
> On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru
> <mailto:kanaev at ibrae.ac.ru> wrote:
>
> Sorry about the attached files; here they
> are.
> There are no core files after exiting. It
> looks like the job stopped because the requested
> time expired, but you can see from the
> cobaltlog that only about 5 hours had passed (10
> hours were requested) before the exit.
> Anton
> On Thu, 24 Apr 2014 14:07:07 -0500,
> Scott Parker wrote:
>
>
> Anton-
>
> Please send these emails to
> support at alcf.anl.gov
> <mailto:support at alcf.anl.gov> as I
> may not always be available to
> investigate.
>
> I don't see the cobalt or error
> files attached, so I can't really
> say anything about why your job may be
> failing. Do you get core files when
> the job crashed? If so I'd recommend
> using 'bgq_stack '
> to try and get the file and line
> number where the failure occurred.
> Knowing the line may be enough
> to let you figure it out, if not
> you'll need to dump the values of
> the variables at the time of the crash
> to get a clearer picture of what is
> going on.
>
> Scott
>
> On 4/24/14, 1:36 PM,
> kanaev at ibrae.ac.ru
> <mailto:kanaev at ibrae.ac.ru> wrote:
>
> Hello Scott,
> I've tried twice to run a 10-hour,
> 1024-core job on Mira in mode
> c16 with
> qsub -n 64 -t 10:00:00 --mode
> c16 -A ThermHydraX pinit
> Lid_128x128x1p1024.h5m
> Both times the job exited earlier
> than expected, on the same
> iteration and with the same error,
> while executing the following
> section (it's between two couts):
> ...
> //LOOP OVER OWNED CELLS
> double r = 0;
> for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
>   EntityHandle ent = *it;
>   int cellid = mb->id_from_handle(ent);
>   double Vol_;
>   double u_;
>   double v_;
>   double w_;
>   double r1,r2,r3;
>   double tmp;
>   result = mb->tag_get_data(u, &ent, 1, &u_);
>   PRINT_LAST_ERROR;
>   result = mb->tag_get_data(v, &ent, 1, &v_);
>   PRINT_LAST_ERROR;
>   result = mb->tag_get_data(w, &ent, 1, &w_);
>   PRINT_LAST_ERROR;
>   double result;
>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
>   r1 = (sound + fabs(result))/CG[cellid][2].length;
>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
>   r2 = (sound + fabs(result))/CG[cellid][3].length;
>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
>   r3 = (sound + fabs(result))/CG[cellid][5].length;
>   tmp = MAX3(r1,r2,r3);
>   r = MAX(tmp,r);
> }
> double rmax;
> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
> tau = CFL/rmax;
> ttime+=tau;
> ...
> So it may be the Allreduce.
> I've attached the cobaltlog and
> error files of both runs.
> Can you please take a look and
> suggest further debugging steps?
> Thanks
> Anton
>
>
>
>
>
>
--
Timothy J. Tautges
Manager, Directed Meshing, CD-adapco
Phone: 608-354-1459
timothy.tautges at cd-adapco.com