[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
Tim Tautges
timothy.tautges at cd-adapco.com
Wed May 21 11:01:23 CDT 2014
Milad reported an issue like that many years ago, but I was never able
to track it down.
- tim
On 05/21/2014 10:52 AM, Grindeanu, Iulian R. wrote:
> valgrind did not find anything in moab. It found something on the video card :(
> I will recompile with minimal dependencies
> ________________________________________
> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Grindeanu, Iulian R. [iulian at mcs.anl.gov]
> Sent: Wednesday, May 21, 2014 10:32 AM
> To: Tim Tautges; moab-dev at mcs.anl.gov
> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> I am looking at this problem;
> it is a memory leak somewhere in exchange_tags. After 200,000 iterations, even a few bytes per iteration is enough to blow up.
> On my laptop the memory usage increases with the number of iterations :(
> Actually, after 300K iterations, memory is almost 1 GB higher per process.
> 1 GB / 300K iterations ~= 3.3 KB per iteration
> I will run valgrind and let you know.
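> (For reference, a typical invocation for this kind of check, with placeholder process count and binary name, would be something like:)
>
> mpiexec -n 4 valgrind --leak-check=full --track-origins=yes ./test_exchange mesh.h5m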
>
> ....
> //BEGIN ITER
> for (int iter = 1; iter <= 1000000; iter++) {
>     if (rank == 0 && iter % 100 == 1) std::cout << "---ITER " << iter << " ---" << std::endl;
>
>     result = pcomm->exchange_tags(celltagsn, celltagsn, ents);
>     PRINT_LAST_ERROR;
>
>     result = pcomm->exchange_tags(celltags, celltags, ents);
>     PRINT_LAST_ERROR;
>
>     result = pcomm->exchange_tags(facetags, facetags, adjs);
>     PRINT_LAST_ERROR;
>
> } //END ITER
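>
> (A rough way to quantify the per-iteration growth, sketched here as a Linux-only helper that reads VmRSS from /proc/self/status; rss_kb() is a made-up name, not part of the test:)
>
> #include <cstdlib>
> #include <fstream>
> #include <string>
>
> // resident set size of this process in KB, or -1 if /proc is unavailable
> long rss_kb() {
>     std::ifstream f("/proc/self/status");
>     std::string line;
>     while (std::getline(f, line))
>         if (line.compare(0, 6, "VmRSS:") == 0)
>             return std::atol(line.c_str() + 6);
>     return -1;
> }
>
> // inside the iteration loop, e.g.:
> // if (rank == 0 && iter % 10000 == 0)
> //     std::cout << "iter " << iter << " RSS " << rss_kb() << " KB" << std::endl;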
> ________________________________________
> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Tim Tautges [timothy.tautges at cd-adapco.com]
> Sent: Wednesday, May 21, 2014 10:11 AM
> To: moab-dev at mcs.anl.gov
> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> IIRC, there's an option to wait at a barrier where appropriate. All
> sends/receives should have matches, so if it doesn't make it through a
> barrier something is wrong. There's no algorithm in there that has
> non-matching sends/receives (i.e. no polling for random messages).
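>
> (A minimal illustration of that check against the loop quoted above, sketch only: a collective barrier after each round of exchanges, so no rank can run ahead and queue up unmatched messages.)
>
> result = pcomm->exchange_tags(facetags, facetags, adjs);
> PRINT_LAST_ERROR;
> MPI_Barrier(MPI_COMM_WORLD);  // every rank finishes the iteration before the next one starts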
>
> - tim
>
> On 05/21/2014 09:57 AM, Grindeanu, Iulian R. wrote:
>> Anton's issue is with doing exchange_tags repeatedly; it seems to be
>> mismatched isends/ireceives; as they accumulate, they exhaust memory.
>> Or something else is happening with memory; maybe we need to flush
>> the buffers.
>> Jane's issue is different; assign ids might have problems.
>> All of these deserve tickets.
>> Iulian
>> ------------------------------------------------------------------------
>> *From:* Vijay S. Mahadevan [vijay.m at gmail.com]
>> *Sent:* Wednesday, May 21, 2014 9:48 AM
>> *To:* Jiangtao Hu
>> *Cc:* Grindeanu, Iulian R.; kanaev at ibrae.ac.ru; MOAB dev
>> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>
>> Jane,
>>
>> // Global Ids
>> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
>> // | | | | |
>> // | | | | |
>> // | | | | |
>> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
>> // 0.0 1.0 2.0 3.0 4.0
>>
>> This is certainly a bug and should not happen to start with. If you have
>> this test case available, do send it to the list so that we can find out
>> the actual reason for the misnumbering. Anton's issue might or might not be
>> directly related to this bug, but his reference to GIDs being
>> discontinuous shows that there is an outstanding issue here.
>>
>> Vijay
>>
>> On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <jiangtao_ma at yahoo.com> wrote:
>>
>> In a unit test for my project, I am trying to use
>> ParallelComm::assign_global_id() to get the global ids for the vertices
>> of a simple grid:
>> // Mesh Ids
>> // 2.0 6 ------- 7 -------- 8 -------- 9 --------- 10
>> // | | | | |
>> // | 1 | 2 | 3 | 4 |
>> // | | | | |
>> // 1.0 1 ------- 2 -------- 3 -------- 4 --------- 5
>> // 0.0 1.0 2.0 3.0 4.0
>>
>> and got global ids as follows:
>> // Global Ids
>> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
>> // | | | | |
>> // | | | | |
>> // | | | | |
>> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
>> // 0.0 1.0 2.0 3.0 4.0
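>>
>> (For context, the call in such a test is roughly of this form; this is a sketch from memory,
>> not the exact test code, and the argument order of assign_global_ids should be double-checked
>> against ParallelComm.hpp:)
>>
>> Range verts;
>> mb->get_entities_by_dimension(0, 0, verts);
>>
>> // assign GLOBAL_ID to the vertices across all processes, starting at 1
>> ErrorCode rval = pcomm->assign_global_ids(0 /*set*/, 0 /*dimension*/, 1 /*start_id*/,
>>                                           false /*largest_dim_only*/, true /*parallel*/);
>>
>> // read the resulting ids back through the GLOBAL_ID tag
>> Tag gid;
>> mb->tag_get_handle(GLOBAL_ID_TAG_NAME, 1, MB_TYPE_INTEGER, gid);
>> std::vector<int> ids(verts.size());
>> mb->tag_get_data(gid, verts, &ids[0]);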
>>
>> I don't know if this is what Anton ran into.
>>
>> Iulian, if you are interested in seeing the test, please let me know,
>> and I'll send it to you.
>> Jane
>>
>>
>> Asst. Researcher
>> Dept. of Engineering Physics
>> UW @ Madison
>>
>>
>> "And we know that for those who love God, that is, for those who are
>> called according to his purpose, all things are working together for
>> good." (Romans 8:28)
>> On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R."
>> <iulian at mcs.anl.gov> wrote:
>>
>>
>> hmmm,
>> this is not good :(
>>
>> Are you running this on Mira? Do you have a small file for a
>> laptop/workstation?
>> Maybe I can create a similar one.
>>
>> Do you see this only on 1024 processes, or can it happen at a lower count?
>>
>> What does your model look like?
>> No processor should communicate with more than 64 other
>> processes; maybe after ghosting this number is reached.
>>
>> Can you run a debug version of this? Maybe some asserts are not
>> triggered in optimized mode.
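>>
>> (If MOAB itself also has to be rebuilt without optimization, something along these lines
>> usually works; the configure switches and the $MPI_DIR / $HDF5_DIR paths here are assumptions
>> to be checked against configure --help:)
>>
>> ./configure CXXFLAGS="-g -O0" CFLAGS="-g -O0" --enable-debug \
>>     --with-mpi=$MPI_DIR --with-hdf5=$HDF5_DIR
>> make -j4 && make install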
>>
>> Is your file somewhere on Mira where I can get to it?
>>
>> Iulian
>>
>> ------------------------------------------------------------------------
>> *From:* moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of
>> kanaev at ibrae.ac.ru [kanaev at ibrae.ac.ru]
>> *Sent:* Tuesday, May 20, 2014 5:05 PM
>> *To:* MOAB dev
>> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>
>> The problem is still here.
>> I've made a simple program performing a certain number of exchange_tags calls within a loop.
>> If you run it on several processors with any mesh file, it will eventually crash with the
>> following message from every core:
>> Fatal error in PMPI_Isend: Internal MPI error!, error stack:
>> PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR,
>> dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
>> (unknown)(): Internal MPI error!
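>>
>> (A self-contained reproducer in the same spirit might look like the sketch below; it is not
>> the actual attachment, and the mesh file name, tag name and read options are placeholders:)
>>
>> #include <mpi.h>
>> #include <vector>
>> #include "moab/Core.hpp"
>> #include "moab/ParallelComm.hpp"
>>
>> using namespace moab;
>>
>> int main(int argc, char** argv) {
>>   MPI_Init(&argc, &argv);
>>   Core mb;
>>   ParallelComm pcomm(&mb, MPI_COMM_WORLD);
>>
>>   // read the mesh in parallel and resolve shared entities
>>   ErrorCode rval = mb.load_file("mesh.h5m", 0,
>>       "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;PARALLEL_RESOLVE_SHARED_ENTS");
>>   if (MB_SUCCESS != rval) MPI_Abort(MPI_COMM_WORLD, 1);
>>
>>   Range cells;
>>   mb.get_entities_by_dimension(0, 3, cells);
>>
>>   // one dense double tag that gets exchanged every iteration
>>   Tag t;
>>   double def = 0.0;
>>   mb.tag_get_handle("DUMMY", 1, MB_TYPE_DOUBLE, t, MB_TAG_DENSE | MB_TAG_CREAT, &def);
>>   std::vector<Tag> tags(1, t);
>>
>>   for (int iter = 1; iter <= 1000000; iter++) {
>>     rval = pcomm.exchange_tags(tags, tags, cells);
>>     if (MB_SUCCESS != rval) break;   // in practice the MPI abort happens first
>>   }
>>
>>   MPI_Finalize();
>>   return 0;
>> }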
>> Thanks
>> Anton
>>
>> On Tue, 20 May 2014 04:40:03 -0400, wrote:
>>
>> Please disregard that; the global_id space for Quads was not contiguous in my mesh file.
>> Will check back with a correct mesh.
>> Anton
>>
>> On Mon, 19 May 2014 15:17:14 -0400, wrote:
>>
>> Hello MOAB dev,
>> I've attached a simplified version of my program that crashes, presumably after a
>> particular number of calls to exchange_tags.
>> I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16).
>> It performs around 524378 iterations and then crashes (error file attached).
>> Can you please take a look at what Scott Parker from ALCF
>> suggests about it:
>> -------- Original Message --------
>> Subject: Re: Job exiting early [ThermHydraX]
>> Date: Fri, 9 May 2014 18:48:25 -0500
>> From: Scott Parker
>> To:
>>
>>
>>
>> Anton-
>>
>> I took a look at the core files and from the stack trace it
>> appears that the code is failing in an MPI_Isend call
>> that is called from moab::ParallelComm::recv_buffer, which is
>> called from moab::ParallelComm::exchange_tags
>> called from main(). The stack looks like:
>>
>> main
>> moab::ParallelComm::exchange_tags
>> moab::ParallelComm::recv_buffer
>> MPI_Isend
>>
>> I've been able to get the same failure and error message you
>> are seeing by having an MPI process call MPI_Isend
>> when there are no matching receives. After 2 million Isends
>> the program exits with the error you are seeing. So
>> I'm pretty sure you're ending up with a large number of
>> outstanding requests and the program is failing because
>> it can't allocate space for new MPI_Request objects.
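>>
>> (For what it's worth, that failure mode can be reproduced outside of MOAB with a deliberately
>> broken toy program like this, run on two or more ranks; buffer size and tag value are arbitrary:)
>>
>> #include <mpi.h>
>> #include <vector>
>>
>> int main(int argc, char** argv) {
>>   MPI_Init(&argc, &argv);
>>   int rank;
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>   char buf[4] = {0};
>>   std::vector<MPI_Request> reqs;
>>
>>   if (rank == 0) {
>>     // sends that are never matched by a receive and never completed:
>>     // the outstanding requests pile up until MPI cannot allocate
>>     // another one and aborts with an internal error
>>     for (long i = 0; i < 3000000; i++) {
>>       MPI_Request r;
>>       MPI_Isend(buf, 4, MPI_UNSIGNED_CHAR, 1, 6, MPI_COMM_WORLD, &r);
>>       reqs.push_back(r);
>>     }
>>   }
>>
>>   MPI_Finalize();   // rank 0 normally dies before reaching this
>>   return 0;
>> }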
>>
>> I'd suggest looking at how Moab is calling MPI and how many
>> requests might be outstanding at any one time.
>> Since the code is running for 5 hours and looks to be
>> executing hundreds of thousands of iterations I wonder
>> if there is some sort of send-receive mismatch that is
>> letting requests accumulate. I think your best bet is to
>> talk to the Moab folks and see if they have any ideas about
>> why this might be happening.
>>
>> One possibility is a load imbalance between processes - if
>> you don't have any MPI_Barriers or other collectives in
>> your code you could try adding a barrier to synchronize the
>> processes.
>>
>> If the Moab guys can't help you and adding a barrier doesn't
>> help I can work with you to instrument the code to
>> collect more information on how MPI is being called and we
>> could possibly pin down the source of the problem
>> that way.
>>
>>
>> Scott
>>
>>
>> On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru wrote:
>>
>> Hello Scott
>> The dir is /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
>> The run that produced the core files is 253035
>> I took another run with the line
>> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out, and it
>> stopped at the very same iteration, #524378, just getting past a few more lines
>> I use the MOAB library and its functions for exchanging data between
>> processors, so I think I cannot really count MPI
>> requests
>> Anton
>>
>> On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
>>
>>
>> Can you point me to the directory where your binary
>> and core files are?
>>
>> The stack trace you sent shows a call to
>> MPI_Waitany; do you know how many MPI requests
>> the code generally has outstanding at any time?
>>
>> -Scott
>>
>> On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru wrote:
>>
>> Hello Scott,
>> I reran with the mentioned keys. The code
>> was freshly compiled, with the makefile attached just
>> in case.
>> I've got 1024 core files. Two of them are attached.
>> I ran bgq_stack on core.0 and here's what I got:
>> [akanaev at miralac1 pinit]$bgq_stack pinit core.0
>> ------------------------------------------------------------------------
>>
>> Program :
>> /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
>>
>> ------------------------------------------------------------------------
>>
>> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1
>> State: RUN
>> 00000000018334c0
>> _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
>>
>> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
>>
>> 000000000170da28
>> PAMI_Context_trylock_advancev
>> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554
>>
>> 000000000155d0dc
>> PMPI_Waitany
>> /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
>>
>> 00000000010e84e4
>> 00000072.long_branch_r2off.H5Dget_space+0
>> :0
>> 0000000001042be0
>> 00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
>>
>> :0
>> 00000000019de058
>> generic_start_main
>> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>>
>> 00000000019de354
>> __libc_start_main
>> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>>
>> 0000000000000000
>> ??
>> ??:0
>>
>> > Have these sorts of runs succeeded in the
>> > past using the same code base with no changes
>> > and similar input data?
>> That is the first time I'm trying to run this
>> code for such a long time.
>> Thanks
>> Anton
>>
>> On Thu, 24 Apr 2014 15:45:49 -0500, Scott
>> Parker wrote:
>>
>>
>> Anton-
>>
>> Thanks, it's aborting because of a runtime
>> error that appears to be in the mpich layer.
>>
>> Can you rerun with "--env
>> BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1"
>> added to your qsub line - that should
>> generate some core files on which you can
>> run bgq_stack.
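>>
>> (Taking the qsub command quoted further down this thread as the starting point, the
>> resubmission would presumably look like:)
>>
>> qsub -n 64 -t 10:00:00 --mode c16 --env BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1 \
>>      -A ThermHydraX pinit Lid_128x128x1p1024.h5m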
>>
>> The system software (driver) on Mira was updated this week and I'd like to get a
>> clearer picture of whether that could be related to your problem, so:
>>
>> Has your code been recompiled since Monday? If not, can you recompile and try
>> running again?
>>
>> Have these sorts of runs succeeded in the
>> past using the same code base with no
>> changes and similar input data?
>>
>> -Scott
>>
>>
>>
>> On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru wrote:
>>
>> Sorry about the attached files, here they are.
>> There are no core files after exiting; it looks like the job stopped because
>> the requested time expired, but you can see from the cobaltlog that only
>> about 5 hours passed (10 hours were requested) before it exited.
>> Anton
>> On Thu, 24 Apr 2014 14:07:07 -0500,
>> Scott Parker wrote:
>>
>>
>> Anton-
>>
>> Please send these emails to support at alcf.anl.gov, as I
>> may not always be available to
>> investigate.
>>
>> I don't see the cobalt or error
>> files attached, so I can't really
>> say anything about why your job may be
>> failing. Do you get core files when
>> the job crashes? If so, I'd recommend
>> using 'bgq_stack '
>> to try and get the file and line
>> number where the failure occurred.
>> Knowing the line may be enough
>> to let you figure it out; if not,
>> you'll need to dump the values of
>> the variables at the time of the crash
>> to get a clearer picture of what is
>> going on.
>>
>> Scott
>>
>> On 4/24/14, 1:36 PM, kanaev at ibrae.ac.ru wrote:
>>
>> Hello Scott,
>> I've tried twice to run a 10-hour, 1024-core job on Mira in mode c16 with
>> qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit Lid_128x128x1p1024.h5m
>> Both times the job exited earlier than expected, on the same iteration, after
>> the same error while executing the following section (it's between two couts):
>> ...
>> //LOOP OVER OWNED CELLS
>> double r = 0;
>> for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
>>     EntityHandle ent = *it;
>>     int cellid = mb->id_from_handle(ent);
>>     double Vol_;
>>     double u_;
>>     double v_;
>>     double w_;
>>     double r1, r2, r3;
>>     double tmp;
>>     result = mb->tag_get_data(u, &ent, 1, &u_);
>>     PRINT_LAST_ERROR;
>>     result = mb->tag_get_data(v, &ent, 1, &v_);
>>     PRINT_LAST_ERROR;
>>     result = mb->tag_get_data(w, &ent, 1, &w_);
>>     PRINT_LAST_ERROR;
>>     double result;
>>
>>     SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
>>     r1 = (sound + fabs(result))/CG[cellid][2].length;
>>
>>     SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
>>     r2 = (sound + fabs(result))/CG[cellid][3].length;
>>
>>     SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
>>     r3 = (sound + fabs(result))/CG[cellid][5].length;
>>     tmp = MAX3(r1,r2,r3);
>>     r = MAX(tmp,r);
>> }
>> double rmax;
>> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
>> tau = CFL/rmax;
>> ttime += tau;
>> ...
>> So it may be the Allreduce.
>> I've attached the cobaltlog and error files of both runs.
>> Can you please take a look and suggest further debugging?
>> Thanks
>> Anton
>>
>>
>>
>>
>>
>>
>
> --
> Timothy J. Tautges
> Manager, Directed Meshing, CD-adapco
> Phone: 608-354-1459
> timothy.tautges at cd-adapco.com
>
--
Timothy J. Tautges
Manager, Directed Meshing, CD-adapco
Phone: 608-354-1459
timothy.tautges at cd-adapco.com