[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
Tim Tautges
timothy.tautges at cd-adapco.com
Thu May 22 10:10:59 CDT 2014
Check out ParallelComm::Buffer::reserve in ParallelComm.hpp, and try
defining DEBUG_BUFFER and see whether valgrind reports anything. realloc is
used to resize the buffer, and that shouldn't generate any memory leaks.
If the problem goes away after defining it, then I guess the problem is
in fact in Buffer::reserve.
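
Roughly the pattern I have in mind, as a sketch only (not the exact MOAB
code; the member names here are approximate):

#include <cstdlib>   // realloc

struct Buffer {
  unsigned char *mem_ptr;   // start of the allocation
  unsigned char *buff_ptr;  // current pack/unpack position
  unsigned int alloc_size;

  void reserve(unsigned int new_size) {
    if (new_size <= alloc_size) return;
    size_t offset = buff_ptr - mem_ptr;
    // realloc frees or reuses the old block itself, so growing the buffer
    // this way should not leak on its own; losing mem_ptr before free()
    // is ever called on it is what would leak
    unsigned char *tmp = (unsigned char *) realloc(mem_ptr, new_size);
    if (!tmp) return;             // old block is still owned by mem_ptr
    mem_ptr = tmp;
    buff_ptr = mem_ptr + offset;  // re-anchor the position pointer
    alloc_size = new_size;
  }
};

Running the exchange loop under valgrind --leak-check=full with
DEBUG_BUFFER defined should show whether the unfreed blocks come from here
or from somewhere further up in exchange_tags.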
- tim
On 05/21/2014 06:02 PM, Grindeanu, Iulian R. wrote:
> Hi Tim,
> For a small mesh file, I did not see the leak; for a slightly larger one, I see a leak after 100K iterations of about 800-1000
> bytes per iteration.
>
> My feeling is that we are not freeing the memory for the first buffer sent; can you see the comments I made here?
> https://bitbucket.org/fathomteam/moab/issue/7/exchange-tags-memory-leak
>
> It is reproducible on a laptop :(
>
> I know that we first send a message of size 1024, then a second message with the rest of the data. The first message carries the size to expect for the second message (the total size is in its first 4 bytes).
>
> Adding more ghost layers in the example should increase the size of the messages, but I do not see an increase in the leak, which is why I suspect it is related only to the size of the first message. The memory measurement is approximate, but I think the leak is about 1024-2000 bytes per exchange.
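>
> To make that pattern concrete, this is just a sketch of what I mean, not the actual ParallelComm code (the function and variable names here are made up):
>
> #include <mpi.h>
> #include <algorithm>
> #include <cstring>
> #include <vector>
>
> const int INITIAL_BUFF_SIZE = 1024;
>
> // send 'packed' to 'dest': a fixed 1024-byte first message whose first
> // 4 bytes hold the total packed size, then a second message carrying
> // whatever did not fit into the first one
> void send_packed(std::vector<unsigned char> &packed, int dest,
>                  MPI_Comm comm, MPI_Request reqs[2], int &nreqs)
> {
>   static std::vector<unsigned char> first(INITIAL_BUFF_SIZE);
>   int total_size = (int) packed.size();
>   std::memcpy(&first[0], &total_size, sizeof(int));
>   int in_first = std::min(total_size, INITIAL_BUFF_SIZE - (int) sizeof(int));
>   if (in_first > 0)
>     std::memcpy(&first[sizeof(int)], &packed[0], in_first);
>   nreqs = 0;
>   MPI_Isend(&first[0], INITIAL_BUFF_SIZE, MPI_UNSIGNED_CHAR, dest, 1, comm, &reqs[nreqs++]);
>   if (total_size > in_first)   // remainder goes in a second message
>     MPI_Isend(&packed[in_first], total_size - in_first, MPI_UNSIGNED_CHAR,
>               dest, 2, comm, &reqs[nreqs++]);
> }
>
> If a first buffer like this were allocated fresh for every exchange and never freed, the leak would be roughly the size of the first message per iteration, which is about what I am measuring.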
>
> Thanks,
> Iulian
>
> ________________________________________
> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Tim Tautges [timothy.tautges at cd-adapco.com]
> Sent: Wednesday, May 21, 2014 11:01 AM
> To: moab-dev at mcs.anl.gov
> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> Milad reported an issue like that many years ago, but I was never able
> to track it down.
>
> - tim
>
> On 05/21/2014 10:52 AM, Grindeanu, Iulian R. wrote:
>> valgrind did not find anything in moab. It found something on the video card :(
>> I will recompile with minimal dependencies
>> ________________________________________
>> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Grindeanu, Iulian R. [iulian at mcs.anl.gov]
>> Sent: Wednesday, May 21, 2014 10:32 AM
>> To: Tim Tautges; moab-dev at mcs.anl.gov
>> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>
>> I am looking at this problem;
>> it is a memory leak somewhere in exchange_tags. After 200,000 iterations, even a few bytes per iteration are enough to blow up.
>> On my laptop the memory usage is clearly increasing with the number of iterations :(
>> Actually, after 300K iterations, memory is almost 1 GB higher per process.
>> 1 GB / 300K iterations ~= 3.3 KB per iteration
>> I will run valgrind and let you know.
>>
>> ....
>> //BEGIN ITER
>> for (int iter = 1; iter <= 1000000; iter++) {
>>   if (rank == 0 && iter%100 == 1) std::cout << "---ITER " << iter << " ---" << std::endl;
>>
>>   result = pcomm->exchange_tags(celltagsn, celltagsn, ents);
>>   PRINT_LAST_ERROR;
>>
>>   result = pcomm->exchange_tags(celltags, celltags, ents);
>>   PRINT_LAST_ERROR;
>>
>>   result = pcomm->exchange_tags(facetags, facetags, adjs);
>>   PRINT_LAST_ERROR;
>> } //END ITER
>> ________________________________________
>> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Tim Tautges [timothy.tautges at cd-adapco.com]
>> Sent: Wednesday, May 21, 2014 10:11 AM
>> To: moab-dev at mcs.anl.gov
>> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>
>> IIRC, there's an option to wait at a barrier where appropriate. All
>> sends/receives should have matches, so if it doesn't make it through a
>> barrier something is wrong. There's no algorithm in there that has
>> non-matching sends/receives (i.e. no polling for random messages).
>>
>> - tim
>>
>> On 05/21/2014 09:57 AM, Grindeanu, Iulian R. wrote:
>>> Anton's issue is with calling exchange_tags repeatedly; it seems to be
>>> mismatched isends/ireceives; as they accumulate, they run out of memory.
>>> Or something else is happening with memory; maybe we need to do some
>>> flushing of the buffers.
>>> Jane's issue is different; assigning global ids might have problems.
>>> All of these deserve tickets.
>>> Iulian
>>> ------------------------------------------------------------------------
>>> *From:* Vijay S. Mahadevan [vijay.m at gmail.com]
>>> *Sent:* Wednesday, May 21, 2014 9:48 AM
>>> *To:* Jiangtao Hu
>>> *Cc:* Grindeanu, Iulian R.; kanaev at ibrae.ac.ru; MOAB dev
>>> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>>
>>> Jane,
>>>
>>> // Global Ids
>>> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
>>> // | | | | |
>>> // | | | | |
>>> // | | | | |
>>> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
>>> // 0.0 1.0 2.0 3.0 4.0
>>>
>>> This is certainly a bug and should not happen to start with. If you have
>>> this test case available, do send it to the list so that we can find out
>>> the actual reason for the misnumbering. Anton's issue might or might not be
>>> directly related to this bug, but his reference to the GIDs being
>>> discontinuous shows that there is an outstanding issue here.
>>>
>>> Vijay
>>>
>>> On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <jiangtao_ma at yahoo.com
>>> <mailto:jiangtao_ma at yahoo.com>> wrote:
>>>
>>> In a unit test for my project, I am trying to use
>>> ParallelComm::assign_global_id() to get the global ids for the vertices
>>> of a simple grid
>>> // Mesh Ids
>>> // 2.0 6 ------- 7 -------- 8 -------- 9 --------- 10
>>> // | | | | |
>>> // | 1 | 2 | 3 | 4 |
>>> // | | | | |
>>> // 1.0 1 ------- 2 -------- 3 -------- 4 --------- 5
>>> // 0.0 1.0 2.0 3.0 4.0
>>>
>>> and got the following global ids:
>>> // Global Ids
>>> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
>>> // | | | | |
>>> // | | | | |
>>> // | | | | |
>>> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
>>> // 0.0 1.0 2.0 3.0 4.0
>>>
>>> I don't know if this is the same thing Anton ran into.
>>>
>>> Iulian, if you are interested in seeing the test, please let me know,
>>> and I'll send it to you.
>>> Jane
>>>
>>>
>>> Asst. Researcher
>>> Dept. of Engineering Physics
>>> UW @ Madison
>>>
>>>
>>> "And we know that for those who love God, that is, for those who are
>>> called according to his purpose, all things are working together for
>>> good." (Romans 8:28)
>>> On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R."
>>> <iulian at mcs.anl.gov <mailto:iulian at mcs.anl.gov>> wrote:
>>>
>>>
>>> hmmm,
>>> this is not good :(
>>>
>>> Are you running this on Mira? Do you have a small file for a
>>> laptop/workstation?
>>> Maybe I can create a similar one.
>>>
>>> Do you see this only on 1024 processes, or can it happen at a lower count?
>>>
>>> What does your model look like?
>>> No processor should communicate with more than 64 other
>>> processes; maybe after ghosting that number is reached.
>>>
>>> Can you run a debug version of this? Maybe some asserts are not
>>> triggered in optimized mode.
>>>
>>> Is your file somewhere on Mira where I can get to it?
>>>
>>> Iulian
>>>
>>> ------------------------------------------------------------------------
>>> *From:* moab-dev-bounces at mcs.anl.gov
>>> <mailto:moab-dev-bounces at mcs.anl.gov> [moab-dev-bounces at mcs.anl.gov
>>> <mailto:moab-dev-bounces at mcs.anl.gov>] on behalf of
>>> kanaev at ibrae.ac.ru <mailto:kanaev at ibrae.ac.ru> [kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru>]
>>> *Sent:* Tuesday, May 20, 2014 5:05 PM
>>> *To:* MOAB dev
>>> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>>
>>> The problem is still here.
>>> I've made a simple program that performs a certain number of
>>> exchange_tags calls within a loop.
>>> If you run it on several processors with any mesh file, it will
>>> eventually crash with the following message from every core:
>>> Fatal error in PMPI_Isend: Internal MPI error!, error stack:
>>> PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR,
>>> dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
>>> (unknown)(): Internal MPI error!
>>> Thanks
>>> Anton
>>>
>>> On Tue, 20 May 2014 04:40:03 -0400, wrote:
>>>
>>> Please disregard that; the global_id space for Quads was
>>> discontinuous in my mesh file.
>>> I will check back with a correct mesh.
>>> Anton
>>>
>>> On Mon, 19 May 2014 15:17:14 -0400, wrote:
>>>
>>> Hello MOAB dev,
>>> I've attached a simplified version of my program that
>>> crashes, presumably after a particular number of calls to
>>> exchange_tags.
>>> I ran it a couple of times on Mira on 1024 cores (64 nodes in
>>> --mode c16).
>>> It performs around 524378 iterations and then crashes (error
>>> file attached).
>>> Can you please take a look at what Scott Parker from ALCF
>>> suggests about it:
>>> -------- Original Message --------
>>> Subject: Re: Job exiting early [ThermHydraX]
>>> Date: Fri, 9 May 2014 18:48:25 -0500
>>> From: Scott Parker
>>> To:
>>>
>>>
>>>
>>> Anton-
>>>
>>> I took a look at the core files, and from the stack trace it
>>> appears that the code is failing in an MPI_Isend call
>>> that is made from moab::ParallelComm::recv_buffer, which is
>>> called from moab::ParallelComm::exchange_tags,
>>> called from main(). The stack looks like:
>>>
>>> main
>>> moab::ParallelComm::exchange_tags
>>> moab::ParallelComm::recv_buffer
>>> MPI_Isend
>>>
>>> I've been able to get the same failure and error message you
>>> are seeing by having an MPI process call MPI_Isend
>>> when there are no matching receives. After 2 million Isends
>>> the program exits with the error you are seeing. So
>>> I'm pretty sure you're ending up with a large number of
>>> outstanding requests and the program is failing because
>>> it can't allocate space for new MPI_Request objects.
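>>>
>>> A rough sketch of that reproducer, from memory (the details here are
>>> arbitrary, not your code):
>>>
>>> #include <mpi.h>
>>> int main(int argc, char** argv) {
>>>   MPI_Init(&argc, &argv);
>>>   int rank;
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>   char buf[4] = {0};
>>>   MPI_Request req;
>>>   // rank 0 posts isends that no rank ever receives; every call leaves
>>>   // another request outstanding, and after roughly 2 million of them
>>>   // MPI aborts with "Internal MPI error!" once it can no longer
>>>   // allocate request objects
>>>   if (rank == 0)
>>>     for (long i = 0; ; i++)
>>>       MPI_Isend(buf, 4, MPI_UNSIGNED_CHAR, 1, 6, MPI_COMM_WORLD, &req);
>>>   MPI_Finalize();
>>>   return 0;
>>> }
>>>
>>> Run it with at least two ranks; rank 1 just sits in MPI_Finalize while
>>> rank 0 exhausts the request pool.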
>>>
>>> I'd suggest looking at how Moab is calling MPI and how many
>>> requests might be outstanding at any one time.
>>> Since the code is running for 5 hours and looks to be
>>> executing hundreds of thousands of iterations I wonder
>>> if there is some sort of send-receive mismatch that is
>>> letting requests accumulate. I think your best bet is to
>>> talk to the Moab folks and see if they have any ideas about
>>> why this might be happening.
>>>
>>> One possibility is a load imbalance between processes - if
>>> you don't have any MPI_Barriers or other collectives in
>>> your code you could try adding a barrier to synchronize the
>>> processes.
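>>>
>>> As a sketch (adapt to wherever the exchange happens in your code; the
>>> names here are just placeholders):
>>>
>>> result = pcomm->exchange_tags(tags, tags, ents);  // existing exchange call
>>> MPI_Barrier(MPI_COMM_WORLD);  // every rank must reach this each iteration;
>>>                               // a hang here points at an unmatched message
>>>
>>> rather than letting the ranks free-run for hundreds of thousands of
>>> iterations.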
>>>
>>> If the Moab guys can't help you and adding a barrier doesn't
>>> help I can work with you to instrument the code to
>>> collect more information on how MPI is being called and we
>>> could possibly pin down the source of the problem
>>> that way.
>>>
>>>
>>> Scott
>>>
>>>
>>> On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru> wrote:
>>>
>>> Hello Scott
>>> The dir is
>>> /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
>>> The run that produced core files is 253035.
>>> I took another run with the line
>>> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out, and it stopped at the very same iteration, #524378, just getting a few lines further.
>>> I use the MOAB library and its function for exchanging data
>>> between processors, so I don't think I can really count MPI
>>> requests.
>>> Anton
>>>
>>> On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
>>>
>>>
>>> Can you point me to the directory where your binary
>>> and core files are?
>>>
>>> The stack trace you sent shows a call to
>>> MPI_Waitany, do you know how many MPI requests
>>> the code generally has outstanding at any time?
>>>
>>> -Scott
>>>
>>> On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru> wrote:
>>>
>>> Hello Scott,
>>> I did a rerun with the mentioned keys. The code
>>> was freshly compiled; the makefile is attached just
>>> in case.
>>> I've got 1024 core files. Two of them are attached.
>>> I ran bgq_stack on core.0 and here's what I got:
>>> [akanaev at miralac1 pinit]$bgq_stack pinit core.0
>>> ------------------------------------------------------------------------
>>>
>>> Program :
>>> /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
>>>
>>> ------------------------------------------------------------------------
>>>
>>> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN
>>>
>>> 00000000018334c0
>>> _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
>>> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
>>>
>>> 000000000170da28
>>> PAMI_Context_trylock_advancev
>>> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554
>>>
>>> 000000000155d0dc
>>> PMPI_Waitany
>>> /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
>>>
>>> 00000000010e84e4
>>> 00000072.long_branch_r2off.H5Dget_space+0
>>> :0
>>>
>>> 0000000001042be0
>>> 00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
>>> :0
>>>
>>> 00000000019de058
>>> generic_start_main
>>> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>>>
>>> 00000000019de354
>>> __libc_start_main
>>> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>>>
>>> 0000000000000000
>>> ??
>>> ??:0
>>>
>>> > Have these sorts of runs succeeded in the past using the same
>>> > code base with no changes and similar input data?
>>> This is the first time I've tried to run this code for that long.
>>> Thanks
>>> Anton
>>>
>>> On Thu, 24 Apr 2014 15:45:49 -0500, Scott
>>> Parker wrote:
>>>
>>>
>>> Anton-
>>>
>>> Thanks, it's aborting because of a runtime
>>> error that appears to be in the mpich layer.
>>>
>>> Can you rerun with "--env
>>> BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1"
>>> added to your qsub line - that should
>>> generate some core files on which you can
>>> run bgq_stack.
>>>
>>> The system software (driver) on Mira was
>>> updated this week and I'd like to get a
>>> clearer picture of
>>> whether that could be related to your problem, so:
>>>
>>> Has your code been recompiled since
>>> Monday? If not, can you recompile and try
>>> running again?
>>>
>>> Have these sorts of runs succeeded in the
>>> past using the same code base with no
>>> changes and similar input data?
>>>
>>> -Scott
>>>
>>>
>>>
>>> On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru> wrote:
>>>
>>> Sorry about the attached files, here they
>>> are.
>>> There are no core files after exiting;
>>> it looks like it stopped because the requested
>>> time expired, but you can see from the
>>> cobaltlog that about 5 hours passed (10
>>> hours were requested) before the exit.
>>> Anton
>>> On Thu, 24 Apr 2014 14:07:07 -0500,
>>> Scott Parker wrote:
>>>
>>>
>>> Anton-
>>>
>>> Please send these emails to
>>> support at alcf.anl.gov
>>> <mailto:support at alcf.anl.gov> as I
>>> may not always be available to
>>> investigate.
>>>
>>> I don't see the cobalt or error
>>> files attached, so I can't really
>>> say anything about why your job may be
>>> failing. Do you get core files when
>>> the job crashes? If so, I'd recommend
>>> using 'bgq_stack '
>>> to try and get the file and line
>>> number where the failure occurred.
>>> Knowing the line may be enough
>>> to let you figure it out; if not,
>>> you'll need to dump the values of
>>> the variables at the time of the crash
>>> to get a clearer picture of what is
>>> going on.
>>>
>>> Scott
>>>
>>> On 4/24/14, 1:36 PM,
>>> kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru> wrote:
>>>
>>> Hello Scott,
>>> I've tried twice to run a 10-hour,
>>> 1024-core job on Mira in mode
>>> c16 with
>>> qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit Lid_128x128x1p1024.h5m
>>> Both times the job exited earlier
>>> than expected, on the same
>>> iteration, after the same error
>>> while executing the following
>>> section (it's between two couts):
>>> ...
>>> //LOOP OVER OWNED CELLS
>>> double r = 0;
>>> for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
>>>   EntityHandle ent = *it;
>>>   int cellid = mb->id_from_handle(ent);
>>>   double Vol_;
>>>   double u_;
>>>   double v_;
>>>   double w_;
>>>   double r1,r2,r3;
>>>   double tmp;
>>>   result = mb->tag_get_data(u, &ent, 1, &u_);
>>>   PRINT_LAST_ERROR;
>>>   result = mb->tag_get_data(v, &ent, 1, &v_);
>>>   PRINT_LAST_ERROR;
>>>   result = mb->tag_get_data(w, &ent, 1, &w_);
>>>   PRINT_LAST_ERROR;
>>>   double result;
>>>
>>>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
>>>   r1 = (sound + fabs(result))/CG[cellid][2].length;
>>>
>>>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
>>>   r2 = (sound + fabs(result))/CG[cellid][3].length;
>>>
>>>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
>>>   r3 = (sound + fabs(result))/CG[cellid][5].length;
>>>
>>>   tmp = MAX3(r1,r2,r3);
>>>   r = MAX(tmp,r);
>>> }
>>> double rmax;
>>> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
>>> tau = CFL/rmax;
>>> ttime+=tau;
>>> ...
>>> So it may be the Allreduce.
>>> I've attached the cobaltlog and
>>> error files of both runs.
>>> Can you please take a look and
>>> suggest further debugging steps?
>>> Thanks
>>> Anton
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Timothy J. Tautges
>> Manager, Directed Meshing, CD-adapco
>> Phone: 608-354-1459
>> timothy.tautges at cd-adapco.com
>>
>
> --
> Timothy J. Tautges
> Manager, Directed Meshing, CD-adapco
> Phone: 608-354-1459
> timothy.tautges at cd-adapco.com
>
--
Timothy J. Tautges
Manager, Directed Meshing, CD-adapco
Phone: 608-354-1459
timothy.tautges at cd-adapco.com