[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
Tim Tautges
timothy.tautges at cd-adapco.com
Thu May 22 10:10:59 CDT 2014
Check out ParallelComm::Buffer::reserve in ParallelComm.hpp, and try
defining DEBUG_BUFFER and see whether valgrind reports anything. realloc is
used to resize the buffer, and that shouldn't generate any memory leaks.
If the problem goes away after defining it, then I guess the problem is
in fact in Buffer::reserve.
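
Roughly the pattern I have in mind, as a sketch only (not the exact MOAB
code; the member names here are approximate):

#include <cstdlib>   // realloc

struct Buffer {
  unsigned char *mem_ptr;   // start of the allocation
  unsigned char *buff_ptr;  // current pack/unpack position
  unsigned int alloc_size;

  void reserve(unsigned int new_size) {
    if (new_size <= alloc_size) return;
    size_t offset = buff_ptr - mem_ptr;
    // realloc frees or reuses the old block itself, so growing the buffer
    // this way should not leak on its own; losing mem_ptr before free()
    // is ever called on it is what would leak
    unsigned char *tmp = (unsigned char *) realloc(mem_ptr, new_size);
    if (!tmp) return;             // old block is still owned by mem_ptr
    mem_ptr = tmp;
    buff_ptr = mem_ptr + offset;  // re-anchor the position pointer
    alloc_size = new_size;
  }
};

Running the exchange loop under valgrind --leak-check=full with
DEBUG_BUFFER defined should show whether the unfreed blocks come from here
or from somewhere further up in exchange_tags.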
- tim
On 05/21/2014 06:02 PM, Grindeanu, Iulian R. wrote:
> Hi Tim,
> For a small mesh file, I did not see the leak; for a slightly larger one, I see a leak after 100K iterations of about 800-1000
> bytes per iteration.
>
> My feeling is that we are not freeing the memory for the first buffer sent; can you see the comments I made here?
> https://bitbucket.org/fathomteam/moab/issue/7/exchange-tags-memory-leak
>
> It is reproducible on a laptop :(
>
> I know that we first send a message of size 1024, then a second message with the rest of the data. The first message carries the size to expect for the second message (the total size is in its first 4 bytes).
>
> Adding more ghost layers in the example should increase the size of the messages, but I do not see an increase in the leak, which is why I suspect it is related only to the size of the first message. The memory measurement is approximate, but I think the leak is about 1024-2000 bytes per exchange.
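>
> To make that pattern concrete, this is just a sketch of what I mean, not the actual ParallelComm code (the function and variable names here are made up):
>
> #include <mpi.h>
> #include <algorithm>
> #include <cstring>
> #include <vector>
>
> const int INITIAL_BUFF_SIZE = 1024;
>
> // send 'packed' to 'dest': a fixed 1024-byte first message whose first
> // 4 bytes hold the total packed size, then a second message carrying
> // whatever did not fit into the first one
> void send_packed(std::vector<unsigned char> &packed, int dest,
>                  MPI_Comm comm, MPI_Request reqs[2], int &nreqs)
> {
>   static std::vector<unsigned char> first(INITIAL_BUFF_SIZE);
>   int total_size = (int) packed.size();
>   std::memcpy(&first[0], &total_size, sizeof(int));
>   int in_first = std::min(total_size, INITIAL_BUFF_SIZE - (int) sizeof(int));
>   if (in_first > 0)
>     std::memcpy(&first[sizeof(int)], &packed[0], in_first);
>   nreqs = 0;
>   MPI_Isend(&first[0], INITIAL_BUFF_SIZE, MPI_UNSIGNED_CHAR, dest, 1, comm, &reqs[nreqs++]);
>   if (total_size > in_first)   // remainder goes in a second message
>     MPI_Isend(&packed[in_first], total_size - in_first, MPI_UNSIGNED_CHAR,
>               dest, 2, comm, &reqs[nreqs++]);
> }
>
> If a first buffer like this were allocated fresh for every exchange and never freed, the leak would be roughly the size of the first message per iteration, which is about what I am measuring.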
>
> Thanks,
> Iulian
>
> ________________________________________
> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Tim Tautges [timothy.tautges at cd-adapco.com]
> Sent: Wednesday, May 21, 2014 11:01 AM
> To: moab-dev at mcs.anl.gov
> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> Milad reported an issue like that many years ago, but I was never able
> to track it down.
>
> - tim
>
> On 05/21/2014 10:52 AM, Grindeanu, Iulian R. wrote:
>> valgrind did not find anything in moab. It found something on the video card :(
>> I will recompile with minimal dependencies
>> ________________________________________
>> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Grindeanu, Iulian R. [iulian at mcs.anl.gov]
>> Sent: Wednesday, May 21, 2014 10:32 AM
>> To: Tim Tautges; moab-dev at mcs.anl.gov
>> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>
>> I am looking at this problem;
>> it is a memory leak somewhere in exchange_tags. After 200,000 iterations, even a few bytes per iteration are enough to blow up.
>> On my laptop the memory usage is clearly increasing with the number of iterations :(
>> Actually, after 300K iterations, memory is almost 1 GB higher per process.
>> 1 GB / 300K iterations ~= 3.3 KB per iteration
>> I will run valgrind and let you know.
>>
>> ....
>> //BEGIN ITER
>> for (int iter = 1; iter <= 1000000; iter++) {
>>   if (rank == 0 && iter%100 == 1) std::cout << "---ITER " << iter << " ---" << std::endl;
>>
>>   result = pcomm->exchange_tags(celltagsn, celltagsn, ents);
>>   PRINT_LAST_ERROR;
>>
>>   result = pcomm->exchange_tags(celltags, celltags, ents);
>>   PRINT_LAST_ERROR;
>>
>>   result = pcomm->exchange_tags(facetags, facetags, adjs);
>>   PRINT_LAST_ERROR;
>> } //END ITER
>> ________________________________________
>> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Tim Tautges [timothy.tautges at cd-adapco.com]
>> Sent: Wednesday, May 21, 2014 10:11 AM
>> To: moab-dev at mcs.anl.gov
>> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>
>> IIRC, there's an option to wait at a barrier where appropriate. All
>> sends/receives should have matches, so if it doesn't make it through a
>> barrier something is wrong. There's no algorithm in there that has
>> non-matching sends/receives (i.e. no polling for random messages).
>>
>> - tim
>>
>> On 05/21/2014 09:57 AM, Grindeanu, Iulian R. wrote:
>>> Anton's issue is with calling exchange_tags repeatedly; it seems to be
>>> mismatched isends/ireceives; as they accumulate, they run out of memory.
>>> Or something else is happening with memory; maybe we need to do some
>>> flushing of the buffers.
>>> Jane's issue is different; assigning global ids might have problems.
>>> All of these deserve tickets.
>>> Iulian
>>> ------------------------------------------------------------------------
>>> *From:* Vijay S. Mahadevan [vijay.m at gmail.com]
>>> *Sent:* Wednesday, May 21, 2014 9:48 AM
>>> *To:* Jiangtao Hu
>>> *Cc:* Grindeanu, Iulian R.; kanaev at ibrae.ac.ru; MOAB dev
>>> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>>
>>> Jane,
>>>
>>> // Global Ids
>>> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
>>> // | | | | |
>>> // | | | | |
>>> // | | | | |
>>> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
>>> // 0.0 1.0 2.0 3.0 4.0
>>>
>>> This is certainly a bug and should not happen to start with. If you have
>>> this test case available, do send it to the list so that we can find out
>>> the actual reason for the misnumbering. Anton's issue might or might not be
>>> directly related to this bug, but his reference to the GIDs being
>>> discontinuous shows that there is an outstanding issue here.
>>>
>>> Vijay
>>>
>>> On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <jiangtao_ma at yahoo.com
>>> <mailto:jiangtao_ma at yahoo.com>> wrote:
>>>
>>> In a unit test for my project, I am trying to use
>>> ParallelComm::assign_global_id() to get the global ids for the vertices
>>> of a simple grid
>>> // Mesh Ids
>>> // 2.0 6 ------- 7 -------- 8 -------- 9 --------- 10
>>> // | | | | |
>>> // | 1 | 2 | 3 | 4 |
>>> // | | | | |
>>> // 1.0 1 ------- 2 -------- 3 -------- 4 --------- 5
>>> // 0.0 1.0 2.0 3.0 4.0
>>>
>>> and got the following global ids:
>>> // Global Ids
>>> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
>>> // | | | | |
>>> // | | | | |
>>> // | | | | |
>>> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
>>> // 0.0 1.0 2.0 3.0 4.0
>>>
>>> I don't know if this is the same thing Anton ran into.
>>>
>>> Iulian, if you are interested in seeing the test, please let me know,
>>> and I'll send it to you.
>>> Jane
>>>
>>>
>>> Asst. Researcher
>>> Dept. of Engineering Physics
>>> UW @ Madison
>>>
>>>
>>> "And we know that for those who love God, that is, for those who are
>>> called according to his purpose, all things are working together for
>>> good." (Romans 8:28)
>>> On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R."
>>> <iulian at mcs.anl.gov <mailto:iulian at mcs.anl.gov>> wrote:
>>>
>>>
>>> hmmm,
>>> this is not good :(
>>>
>>> Are you running this on Mira? Do you have a small file for a
>>> laptop/workstation?
>>> Maybe I can create a similar one.
>>>
>>> Do you see this only on 1024 processes, or can it happen at a lower count?
>>>
>>> What does your model look like?
>>> No processor should communicate with more than 64 other
>>> processes; maybe after ghosting that number is reached.
>>>
>>> Can you run a debug version of this? Maybe some asserts are not
>>> triggered in optimized mode.
>>>
>>> Is your file somewhere on Mira where I can get to it?
>>>
>>> Iulian
>>>
>>> ------------------------------------------------------------------------
>>> *From:* moab-dev-bounces at mcs.anl.gov
>>> <mailto:moab-dev-bounces at mcs.anl.gov> [moab-dev-bounces at mcs.anl.gov
>>> <mailto:moab-dev-bounces at mcs.anl.gov>] on behalf of
>>> kanaev at ibrae.ac.ru <mailto:kanaev at ibrae.ac.ru> [kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru>]
>>> *Sent:* Tuesday, May 20, 2014 5:05 PM
>>> *To:* MOAB dev
>>> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>>
>>> The problem is still here.
>>> I've made a simple program that performs a certain number of
>>> exchange_tags calls within a loop.
>>> If you run it on several processors with any mesh file, it will
>>> eventually crash with the following message from every core:
>>> Fatal error in PMPI_Isend: Internal MPI error!, error stack:
>>> PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR,
>>> dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
>>> (unknown)(): Internal MPI error!
>>> Thanks
>>> Anton
>>>
>>> On Tue, 20 May 2014 04:40:03 -0400, wrote:
>>>
>>> Please disregard that; the global_id space for Quads was
>>> discontinuous in my mesh file.
>>> I will check back with a correct mesh.
>>> Anton
>>>
>>> On Mon, 19 May 2014 15:17:14 -0400, wrote:
>>>
>>> Hello MOAB dev,
>>> I've attached a simplified version of my program that
>>> crashes, presumably after a particular number of calls to
>>> exchange_tags.
>>> I ran it a couple of times on Mira on 1024 cores (64 nodes in
>>> --mode c16).
>>> It performs around 524378 iterations and then crashes (error
>>> file attached).
>>> Can you please take a look at what Scott Parker from ALCF
>>> suggests about it:
>>> -------- Original Message --------
>>> Subject: Re: Job exiting early [ThermHydraX]
>>> Date: Fri, 9 May 2014 18:48:25 -0500
>>> From: Scott Parker
>>> To:
>>>
>>>
>>>
>>> Anton-
>>>
>>> I took a look at the core files, and from the stack trace it
>>> appears that the code is failing in an MPI_Isend call
>>> that is made from moab::ParallelComm::recv_buffer, which is
>>> called from moab::ParallelComm::exchange_tags,
>>> called from main(). The stack looks like:
>>>
>>> main
>>> moab::ParallelComm::exchange_tags
>>> moab::ParallelComm::recv_buffer
>>> MPI_Isend
>>>
>>> I've been able to get the same failure and error message you
>>> are seeing by having an MPI process call MPI_Isend
>>> when there are no matching receives. After 2 million Isends
>>> the program exits with the error you are seeing. So
>>> I'm pretty sure you're ending up with a large number of
>>> outstanding requests and the program is failing because
>>> it can't allocate space for new MPI_Request objects.
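>>>
>>> A rough sketch of that reproducer, from memory (the details here are
>>> arbitrary, not your code):
>>>
>>> #include <mpi.h>
>>> int main(int argc, char** argv) {
>>>   MPI_Init(&argc, &argv);
>>>   int rank;
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>   char buf[4] = {0};
>>>   MPI_Request req;
>>>   // rank 0 posts isends that no rank ever receives; every call leaves
>>>   // another request outstanding, and after roughly 2 million of them
>>>   // MPI aborts with "Internal MPI error!" once it can no longer
>>>   // allocate request objects
>>>   if (rank == 0)
>>>     for (long i = 0; ; i++)
>>>       MPI_Isend(buf, 4, MPI_UNSIGNED_CHAR, 1, 6, MPI_COMM_WORLD, &req);
>>>   MPI_Finalize();
>>>   return 0;
>>> }
>>>
>>> Run it with at least two ranks; rank 1 just sits in MPI_Finalize while
>>> rank 0 exhausts the request pool.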
>>>
>>> I'd suggest looking at how Moab is calling MPI and how many
>>> requests might be outstanding at any one time.
>>> Since the code is running for 5 hours and looks to be
>>> executing hundreds of thousands of iterations I wonder
>>> if there is some sort of send-receive mismatch that is
>>> letting requests accumulate. I think your best bet is to
>>> talk to the Moab folks and see if they have any ideas about
>>> why this might be happening.
>>>
>>> One possibility is a load imbalance between processes - if
>>> you don't have any MPI_Barriers or other collectives in
>>> your code you could try adding a barrier to synchronize the
>>> processes.
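>>>
>>> As a sketch (adapt to wherever the exchange happens in your code; the
>>> names here are just placeholders):
>>>
>>> result = pcomm->exchange_tags(tags, tags, ents);  // existing exchange call
>>> MPI_Barrier(MPI_COMM_WORLD);  // every rank must reach this each iteration;
>>>                               // a hang here points at an unmatched message
>>>
>>> rather than letting the ranks free-run for hundreds of thousands of
>>> iterations.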
>>>
>>> If the Moab guys can't help you and adding a barrier doesn't
>>> help I can work with you to instrument the code to
>>> collect more information on how MPI is being called and we
>>> could possibly pin down the source of the problem
>>> that way.
>>>
>>>
>>> Scott
>>>
>>>
>>> On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru> wrote:
>>>
>>> Hello Scott
>>> The dir is
>>> /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
>>> The run that produced core files is 253035.
>>> I took another run with the line
>>> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out, and it stopped at the very same iteration, #524378, just getting a few lines further.
>>> I use the MOAB library and its function for exchanging data
>>> between processors, so I don't think I can really count MPI
>>> requests.
>>> Anton
>>>
>>> On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
>>>
>>>
>>> Can you point me to the directory where your binary
>>> and core files are?
>>>
>>> The stack trace you sent shows a call to
>>> MPI_Waitany, do you know how many MPI requests
>>> the code generally has outstanding at any time?
>>>
>>> -Scott
>>>
>>> On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru> wrote:
>>>
>>> Hello Scott,
>>> I did a rerun with the mentioned keys. The code
>>> was freshly compiled; the makefile is attached just
>>> in case.
>>> I've got 1024 core files. Two of them are attached.
>>> I ran bgq_stack on core.0 and here's what I got:
>>> [akanaev at miralac1 pinit]$bgq_stack pinit core.0
>>> ------------------------------------------------------------------------
>>>
>>> Program :
>>> /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
>>>
>>> ------------------------------------------------------------------------
>>>
>>> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN
>>>
>>> 00000000018334c0
>>> _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
>>> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
>>>
>>> 000000000170da28
>>> PAMI_Context_trylock_advancev
>>> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554
>>>
>>> 000000000155d0dc
>>> PMPI_Waitany
>>> /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
>>>
>>> 00000000010e84e4
>>> 00000072.long_branch_r2off.H5Dget_space+0
>>> :0
>>>
>>> 0000000001042be0
>>> 00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
>>> :0
>>>
>>> 00000000019de058
>>> generic_start_main
>>> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>>>
>>> 00000000019de354
>>> __libc_start_main
>>> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>>>
>>> 0000000000000000
>>> ??
>>> ??:0
>>>
>>> > Have these sorts of runs succeeded in the past using the same
>>> > code base with no changes and similar input data?
>>> This is the first time I've tried to run this code for that long.
>>> Thanks
>>> Anton
>>>
>>> On Thu, 24 Apr 2014 15:45:49 -0500, Scott
>>> Parker wrote:
>>>
>>>
>>> Anton-
>>>
>>> Thanks, it's aborting because of a runtime
>>> error that appears to be in the mpich layer.
>>>
>>> Can you rerun with "--env
>>> BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1"
>>> added to your qsub line - that should
>>> generate some core files on which you can
>>> run bgq_stack.
>>>
>>> The system software (driver) on Mira was
>>> updated this week and I'd like to get a
>>> clearer picture of
>>> whether that could be related to your problem, so:
>>>
>>> Has your code been recompiled since
>>> Monday? If not, can you recompile and try
>>> running again?
>>>
>>> Have these sorts of runs succeeded in the
>>> past using the same code base with no
>>> changes and similar input data?
>>>
>>> -Scott
>>>
>>>
>>>
>>> On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru> wrote:
>>>
>>> Sorry about the attached files, here they
>>> are.
>>> There are no core files after exiting;
>>> it looks like it stopped because the requested
>>> time expired, but you can see from the
>>> cobaltlog that about 5 hours passed (10
>>> hours were requested) before the exit.
>>> Anton
>>> On Thu, 24 Apr 2014 14:07:07 -0500,
>>> Scott Parker wrote:
>>>
>>>
>>> Anton-
>>>
>>> Please send these emails to
>>> support at alcf.anl.gov
>>> <mailto:support at alcf.anl.gov> as I
>>> may not always be available to
>>> investigate.
>>>
>>> I don't see the cobalt or error
>>> files attached, so I can't really
>>> say anything about why your job may be
>>> failing. Do you get core files when
>>> the job crashes? If so, I'd recommend
>>> using 'bgq_stack '
>>> to try and get the file and line
>>> number where the failure occurred.
>>> Knowing the line may be enough
>>> to let you figure it out; if not,
>>> you'll need to dump the values of
>>> the variables at the time of the crash
>>> to get a clearer picture of what is
>>> going on.
>>>
>>> Scott
>>>
>>> On 4/24/14, 1:36 PM,
>>> kanaev at ibrae.ac.ru
>>> <mailto:kanaev at ibrae.ac.ru> wrote:
>>>
>>> Hello Scott,
>>> I've tried twice to run a 10-hour,
>>> 1024-core job on Mira in mode
>>> c16 with
>>> qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit Lid_128x128x1p1024.h5m
>>> Both times the job exited earlier
>>> than expected, on the same
>>> iteration, after the same error
>>> while executing the following
>>> section (it's between two couts):
>>> ...
>>> //LOOP OVER OWNED CELLS
>>> double r = 0;
>>> for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
>>>   EntityHandle ent = *it;
>>>   int cellid = mb->id_from_handle(ent);
>>>   double Vol_;
>>>   double u_;
>>>   double v_;
>>>   double w_;
>>>   double r1,r2,r3;
>>>   double tmp;
>>>   result = mb->tag_get_data(u, &ent, 1, &u_);
>>>   PRINT_LAST_ERROR;
>>>   result = mb->tag_get_data(v, &ent, 1, &v_);
>>>   PRINT_LAST_ERROR;
>>>   result = mb->tag_get_data(w, &ent, 1, &w_);
>>>   PRINT_LAST_ERROR;
>>>   double result;
>>>
>>>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
>>>   r1 = (sound + fabs(result))/CG[cellid][2].length;
>>>
>>>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
>>>   r2 = (sound + fabs(result))/CG[cellid][3].length;
>>>
>>>   SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
>>>   r3 = (sound + fabs(result))/CG[cellid][5].length;
>>>
>>>   tmp = MAX3(r1,r2,r3);
>>>   r = MAX(tmp,r);
>>> }
>>> double rmax;
>>> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
>>> tau = CFL/rmax;
>>> ttime+=tau;
>>> ...
>>> So it may be the Allreduce.
>>> I've attached the cobaltlog and
>>> error files of both runs.
>>> Can you please take a look and
>>> suggest further debugging steps?
>>> Thanks
>>> Anton
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Timothy J. Tautges
>> Manager, Directed Meshing, CD-adapco
>> Phone: 608-354-1459
>> timothy.tautges at cd-adapco.com
>>
>
> --
> Timothy J. Tautges
> Manager, Directed Meshing, CD-adapco
> Phone: 608-354-1459
> timothy.tautges at cd-adapco.com
>
--
Timothy J. Tautges
Manager, Directed Meshing, CD-adapco
Phone: 608-354-1459
timothy.tautges at cd-adapco.com