[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
Tim Tautges
timothy.tautges at cd-adapco.com
Wed May 21 11:01:23 CDT 2014
Milad reported an issue like that many years ago, but I was never able
to track it down.
- tim
On 05/21/2014 10:52 AM, Grindeanu, Iulian R. wrote:
> valgrind did not find anything in moab. It found something on the video card :(
> I will recompile with minimal dependencies
> ________________________________________
> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Grindeanu, Iulian R. [iulian at mcs.anl.gov]
> Sent: Wednesday, May 21, 2014 10:32 AM
> To: Tim Tautges; moab-dev at mcs.anl.gov
> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> I am looking at this problem;
> it is a memory leak somewhere in exchange_tags. After 200,000 iterations, even a few bytes per iteration is enough to blow up.
> On my laptop the memory usage increases with the number of iterations :(
> Actually, after 300K iterations, memory is almost 1 GB higher per process.
> 1 GB / 300K iterations ~= 3.3 KB per iteration
> I will run valgrind and let you know.
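> (For reference, a typical invocation for this kind of check, with placeholder process count and binary name, would be something like:)
>
> mpiexec -n 4 valgrind --leak-check=full --track-origins=yes ./test_exchange mesh.h5m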
>
> ....
> //BEGIN ITER
> for (int iter = 1; iter <= 1000000; iter++) {
>     if (rank == 0 && iter % 100 == 1) std::cout << "---ITER " << iter << " ---" << std::endl;
>
>     result = pcomm->exchange_tags(celltagsn, celltagsn, ents);
>     PRINT_LAST_ERROR;
>
>     result = pcomm->exchange_tags(celltags, celltags, ents);
>     PRINT_LAST_ERROR;
>
>     result = pcomm->exchange_tags(facetags, facetags, adjs);
>     PRINT_LAST_ERROR;
>
> } //END ITER
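>
> (A rough way to quantify the per-iteration growth, sketched here as a Linux-only helper that reads VmRSS from /proc/self/status; rss_kb() is a made-up name, not part of the test:)
>
> #include <cstdlib>
> #include <fstream>
> #include <string>
>
> // resident set size of this process in KB, or -1 if /proc is unavailable
> long rss_kb() {
>     std::ifstream f("/proc/self/status");
>     std::string line;
>     while (std::getline(f, line))
>         if (line.compare(0, 6, "VmRSS:") == 0)
>             return std::atol(line.c_str() + 6);
>     return -1;
> }
>
> // inside the iteration loop, e.g.:
> // if (rank == 0 && iter % 10000 == 0)
> //     std::cout << "iter " << iter << " RSS " << rss_kb() << " KB" << std::endl;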
> ________________________________________
> From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of Tim Tautges [timothy.tautges at cd-adapco.com]
> Sent: Wednesday, May 21, 2014 10:11 AM
> To: moab-dev at mcs.anl.gov
> Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
> IIRC, there's an option to wait at a barrier where appropriate. All
> sends/receives should have matches, so if it doesn't make it through a
> barrier something is wrong. There's no algorithm in there that has
> non-matching sends/receives (i.e. no polling for random messages).
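>
> (A minimal illustration of that check against the loop quoted above, sketch only: a collective barrier after each round of exchanges, so no rank can run ahead and queue up unmatched messages.)
>
> result = pcomm->exchange_tags(facetags, facetags, adjs);
> PRINT_LAST_ERROR;
> MPI_Barrier(MPI_COMM_WORLD);  // every rank finishes the iteration before the next one starts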
>
> - tim
>
> On 05/21/2014 09:57 AM, Grindeanu, Iulian R. wrote:
>> Anton's issue is with doing exchange_tags repeatedly; it seems to be
>> mismatched isends/ireceives; as they accumulate, they exhaust memory.
>> Or something else is happening with memory; maybe we need to flush
>> the buffers.
>> Jane's issue is different; assign ids might have problems.
>> All of these deserve tickets.
>> Iulian
>> ------------------------------------------------------------------------
>> *From:* Vijay S. Mahadevan [vijay.m at gmail.com]
>> *Sent:* Wednesday, May 21, 2014 9:48 AM
>> *To:* Jiangtao Hu
>> *Cc:* Grindeanu, Iulian R.; kanaev at ibrae.ac.ru; MOAB dev
>> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>
>> Jane,
>>
>> // Global Ids
>> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
>> // | | | | |
>> // | | | | |
>> // | | | | |
>> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
>> // 0.0 1.0 2.0 3.0 4.0
>>
>> This is certainly a bug and should not happen to start with. If you have
>> this test case available, do send it to the list so that we can find out
>> the actual reason for the misnumbering. Anton's issue might or might not be
>> directly related to this bug, but his reference to GIDs being
>> discontinuous shows that there is an outstanding issue here.
>>
>> Vijay
>>
>> On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <jiangtao_ma at yahoo.com> wrote:
>>
>> In a unit test for my project, I am trying to use
>> ParallelComm::assign_global_id() to get the global ids for the vertices
>> of a simple grid:
>> // Mesh Ids
>> // 2.0 6 ------- 7 -------- 8 -------- 9 --------- 10
>> // | | | | |
>> // | 1 | 2 | 3 | 4 |
>> // | | | | |
>> // 1.0 1 ------- 2 -------- 3 -------- 4 --------- 5
>> // 0.0 1.0 2.0 3.0 4.0
>>
>> and got global ids as follows:
>> // Global Ids
>> // 2.0 4 ------- 3 -------- 6 -------- 10 --------- 14
>> // | | | | |
>> // | | | | |
>> // | | | | |
>> // 1.0 1 ------- 2 -------- 5 -------- 9 --------- 13
>> // 0.0 1.0 2.0 3.0 4.0
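>>
>> (For context, the call in such a test is roughly of this form; this is a sketch from memory,
>> not the exact test code, and the argument order of assign_global_ids should be double-checked
>> against ParallelComm.hpp:)
>>
>> Range verts;
>> mb->get_entities_by_dimension(0, 0, verts);
>>
>> // assign GLOBAL_ID to the vertices across all processes, starting at 1
>> ErrorCode rval = pcomm->assign_global_ids(0 /*set*/, 0 /*dimension*/, 1 /*start_id*/,
>>                                           false /*largest_dim_only*/, true /*parallel*/);
>>
>> // read the resulting ids back through the GLOBAL_ID tag
>> Tag gid;
>> mb->tag_get_handle(GLOBAL_ID_TAG_NAME, 1, MB_TYPE_INTEGER, gid);
>> std::vector<int> ids(verts.size());
>> mb->tag_get_data(gid, verts, &ids[0]);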
>>
>> I don't know if this is what Anton ran into.
>>
>> Iulian, if you are interested in seeing the test, please let me know,
>> and I'll send it to you.
>> Jane
>>
>>
>> Asst. Researcher
>> Dept. of Engineering Physics
>> UW @ Madison
>>
>>
>> "And we know that for those who love God, that is, for those who are
>> called according to his purpose, all things are working together for
>> good." (Romans 8:28)
>> On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R."
>> <iulian at mcs.anl.gov> wrote:
>>
>>
>> hmmm,
>> this is not good :(
>>
>> Are you running this on Mira? Do you have a small file for a
>> laptop/workstation?
>> Maybe I can create a similar one.
>>
>> Do you see this only on 1024 processes, or can it happen at a lower count?
>>
>> What does your model look like?
>> No processor should communicate with more than 64 other
>> processes; maybe after ghosting this number is reached.
>>
>> Can you run a debug version of this? Maybe some asserts are not
>> triggered in optimized mode.
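>>
>> (If MOAB itself also has to be rebuilt without optimization, something along these lines
>> usually works; the configure switches and the $MPI_DIR / $HDF5_DIR paths here are assumptions
>> to be checked against configure --help:)
>>
>> ./configure CXXFLAGS="-g -O0" CFLAGS="-g -O0" --enable-debug \
>>     --with-mpi=$MPI_DIR --with-hdf5=$HDF5_DIR
>> make -j4 && make install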
>>
>> Is your file somewhere on Mira where I can get to it?
>>
>> Iulian
>>
>> ------------------------------------------------------------------------
>> *From:* moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of
>> kanaev at ibrae.ac.ru [kanaev at ibrae.ac.ru]
>> *Sent:* Tuesday, May 20, 2014 5:05 PM
>> *To:* MOAB dev
>> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>>
>> The problem is still here.
>> I've made a simple program performing a certain number of exchange_tags calls within a loop.
>> If you run it on several processors with any mesh file, it will eventually crash with the
>> following message from every core:
>> Fatal error in PMPI_Isend: Internal MPI error!, error stack:
>> PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR,
>> dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
>> (unknown)(): Internal MPI error!
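>>
>> (A self-contained reproducer in the same spirit might look like the sketch below; it is not
>> the actual attachment, and the mesh file name, tag name and read options are placeholders:)
>>
>> #include <mpi.h>
>> #include <vector>
>> #include "moab/Core.hpp"
>> #include "moab/ParallelComm.hpp"
>>
>> using namespace moab;
>>
>> int main(int argc, char** argv) {
>>   MPI_Init(&argc, &argv);
>>   Core mb;
>>   ParallelComm pcomm(&mb, MPI_COMM_WORLD);
>>
>>   // read the mesh in parallel and resolve shared entities
>>   ErrorCode rval = mb.load_file("mesh.h5m", 0,
>>       "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;PARALLEL_RESOLVE_SHARED_ENTS");
>>   if (MB_SUCCESS != rval) MPI_Abort(MPI_COMM_WORLD, 1);
>>
>>   Range cells;
>>   mb.get_entities_by_dimension(0, 3, cells);
>>
>>   // one dense double tag that gets exchanged every iteration
>>   Tag t;
>>   double def = 0.0;
>>   mb.tag_get_handle("DUMMY", 1, MB_TYPE_DOUBLE, t, MB_TAG_DENSE | MB_TAG_CREAT, &def);
>>   std::vector<Tag> tags(1, t);
>>
>>   for (int iter = 1; iter <= 1000000; iter++) {
>>     rval = pcomm.exchange_tags(tags, tags, cells);
>>     if (MB_SUCCESS != rval) break;   // in practice the MPI abort happens first
>>   }
>>
>>   MPI_Finalize();
>>   return 0;
>> }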
>> Thanks
>> Anton
>>
>> On Tue, 20 May 2014 04:40:03 -0400, wrote:
>>
>> Please disregard that; the global_id space for Quads was not contiguous in my mesh file.
>> Will check back with a correct mesh.
>> Anton
>>
>> On Mon, 19 May 2014 15:17:14 -0400, wrote:
>>
>> Hello MOAB dev,
>> I've attached a simplified version of my program that crashes, presumably after a
>> particular number of calls to exchange_tags.
>> I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16).
>> It performs around 524378 iterations and then crashes (error file attached).
>> Can you please take a look at what Scott Parker from ALCF
>> suggests about it:
>> -------- Original Message --------
>> Subject: Re: Job exiting early [ThermHydraX]
>> Date: Fri, 9 May 2014 18:48:25 -0500
>> From: Scott Parker
>> To:
>>
>>
>>
>> Anton-
>>
>> I took a look at the core files and from the stack trace it
>> appears that the code is failing in an MPI_Isend call
>> that is called from moab::ParallelComm::recv_buffer, which is
>> called from moab::ParallelComm::exchange_tags
>> called from main(). The stack looks like:
>>
>> main
>> moab::ParallelComm::exchange_tags
>> moab::ParallelComm::recv_buffer
>> MPI_Isend
>>
>> I've been able to get the same failure and error message you
>> are seeing by having an MPI process call MPI_Isend
>> when there are no matching receives. After 2 million Isends
>> the program exits with the error you are seeing. So
>> I'm pretty sure you're ending up with a large number of
>> outstanding requests and the program is failing because
>> it can't allocate space for new MPI_Request objects.
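>>
>> (For what it's worth, that failure mode can be reproduced outside of MOAB with a deliberately
>> broken toy program like this, run on two or more ranks; buffer size and tag value are arbitrary:)
>>
>> #include <mpi.h>
>> #include <vector>
>>
>> int main(int argc, char** argv) {
>>   MPI_Init(&argc, &argv);
>>   int rank;
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>   char buf[4] = {0};
>>   std::vector<MPI_Request> reqs;
>>
>>   if (rank == 0) {
>>     // sends that are never matched by a receive and never completed:
>>     // the outstanding requests pile up until MPI cannot allocate
>>     // another one and aborts with an internal error
>>     for (long i = 0; i < 3000000; i++) {
>>       MPI_Request r;
>>       MPI_Isend(buf, 4, MPI_UNSIGNED_CHAR, 1, 6, MPI_COMM_WORLD, &r);
>>       reqs.push_back(r);
>>     }
>>   }
>>
>>   MPI_Finalize();   // rank 0 normally dies before reaching this
>>   return 0;
>> }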
>>
>> I'd suggest looking at how Moab is calling MPI and how many
>> requests might be outstanding at any one time.
>> Since the code is running for 5 hours and looks to be
>> executing hundreds of thousands of iterations I wonder
>> if there is some sort of send-receive mismatch that is
>> letting requests accumulate. I think your best bet is to
>> talk to the Moab folks and see if they have any ideas about
>> why this might be happening.
>>
>> One possibility is a load imbalance between processes - if
>> you don't have any MPI_Barriers or other collectives in
>> your code you could try adding a barrier to synchronize the
>> processes.
>>
>> If the Moab guys can't help you and adding a barrier doesn't
>> help I can work with you to instrument the code to
>> collect more information on how MPI is being called and we
>> could possibly pin down the source of the problem
>> that way.
>>
>>
>> Scott
>>
>>
>> On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru wrote:
>>
>> Hello Scott
>> The dir is /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
>> The run that produced the core files is 253035
>> I took another run with the line
>> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out, and it
>> stopped at the very same iteration, #524378, just getting past a few more lines
>> I use the MOAB library and its functions for exchanging data between
>> processors, so I think I cannot really count MPI
>> requests
>> Anton
>>
>> On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
>>
>>
>> Can you point me to the directory where your binary
>> and core files are?
>>
>> The stack trace you sent shows a call to
>> MPI_Waitany; do you know how many MPI requests
>> the code generally has outstanding at any time?
>>
>> -Scott
>>
>> On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru wrote:
>>
>> Hello Scott,
>> I reran with the mentioned keys. The code
>> was freshly compiled, with the makefile attached just
>> in case.
>> I've got 1024 core files. Two of them are attached.
>> I ran bgq_stack on core.0 and here's what I got:
>> [akanaev at miralac1 pinit]$bgq_stack pinit core.0
>> ------------------------------------------------------------------------
>>
>> Program :
>> /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
>>
>> ------------------------------------------------------------------------
>>
>> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1
>> State: RUN
>> 00000000018334c0
>> _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
>>
>> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
>>
>> 000000000170da28
>> PAMI_Context_trylock_advancev
>> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554
>>
>> 000000000155d0dc
>> PMPI_Waitany
>> /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
>>
>> 00000000010e84e4
>> 00000072.long_branch_r2off.H5Dget_space+0
>> :0
>> 0000000001042be0
>> 00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
>>
>> :0
>> 00000000019de058
>> generic_start_main
>> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>>
>> 00000000019de354
>> __libc_start_main
>> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>>
>> 0000000000000000
>> ??
>> ??:0
>>
>> > Have these sorts of runs succeeded in the
>> > past using the same code base with no changes
>> > and similar input data?
>> That is the first time I'm trying to run this
>> code for such a long time.
>> Thanks
>> Anton
>>
>> On Thu, 24 Apr 2014 15:45:49 -0500, Scott
>> Parker wrote:
>>
>>
>> Anton-
>>
>> Thanks, it's aborting because of a runtime
>> error that appears to be in the mpich layer.
>>
>> Can you rerun with "--env
>> BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1"
>> added to your qsub line - that should
>> generate some core files on which you can
>> run bgq_stack.
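>>
>> (Taking the qsub command quoted further down this thread as the starting point, the
>> resubmission would presumably look like:)
>>
>> qsub -n 64 -t 10:00:00 --mode c16 --env BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1 \
>>      -A ThermHydraX pinit Lid_128x128x1p1024.h5m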
>>
>> The system software (driver) on Mira was updated this week and I'd like to get a
>> clearer picture of whether that could be related to your problem, so:
>>
>> Has your code been recompiled since Monday? If not, can you recompile and try
>> running again?
>>
>> Have these sorts of runs succeeded in the
>> past using the same code base with no
>> changes and similar input data?
>>
>> -Scott
>>
>>
>>
>> On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru wrote:
>>
>> Sorry about the attached files, here they are.
>> There are no core files after exiting; it looks like the job stopped because
>> the requested time expired, but you can see from the cobaltlog that only
>> about 5 hours passed (10 hours were requested) before it exited.
>> Anton
>> On Thu, 24 Apr 2014 14:07:07 -0500,
>> Scott Parker wrote:
>>
>>
>> Anton-
>>
>> Please send these emails to support at alcf.anl.gov, as I
>> may not always be available to
>> investigate.
>>
>> I don't see the cobalt or error
>> files attached, so I can't really
>> say anything about why your job may be
>> failing. Do you get core files when
>> the job crashes? If so, I'd recommend
>> using 'bgq_stack '
>> to try and get the file and line
>> number where the failure occurred.
>> Knowing the line may be enough
>> to let you figure it out; if not,
>> you'll need to dump the values of
>> the variables at the time of the crash
>> to get a clearer picture of what is
>> going on.
>>
>> Scott
>>
>> On 4/24/14, 1:36 PM, kanaev at ibrae.ac.ru wrote:
>>
>> Hello Scott,
>> I've tried twice to run a 10-hour, 1024-core job on Mira in mode c16 with
>> qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit Lid_128x128x1p1024.h5m
>> Both times the job exited earlier than expected, on the same iteration, after
>> the same error while executing the following section (it's between two couts):
>> ...
>> //LOOP OVER OWNED CELLS
>> double r = 0;
>> for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
>>     EntityHandle ent = *it;
>>     int cellid = mb->id_from_handle(ent);
>>     double Vol_;
>>     double u_;
>>     double v_;
>>     double w_;
>>     double r1, r2, r3;
>>     double tmp;
>>     result = mb->tag_get_data(u, &ent, 1, &u_);
>>     PRINT_LAST_ERROR;
>>     result = mb->tag_get_data(v, &ent, 1, &v_);
>>     PRINT_LAST_ERROR;
>>     result = mb->tag_get_data(w, &ent, 1, &w_);
>>     PRINT_LAST_ERROR;
>>     double result;
>>
>>     SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
>>     r1 = (sound + fabs(result))/CG[cellid][2].length;
>>
>>     SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
>>     r2 = (sound + fabs(result))/CG[cellid][3].length;
>>
>>     SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
>>     r3 = (sound + fabs(result))/CG[cellid][5].length;
>>     tmp = MAX3(r1,r2,r3);
>>     r = MAX(tmp,r);
>> }
>> double rmax;
>> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
>> tau = CFL/rmax;
>> ttime += tau;
>> ...
>> So it may be the Allreduce.
>> I've attached the cobaltlog and error files of both runs.
>> Can you please take a look and suggest further debugging?
>> Thanks
>> Anton
>>
>>
>>
>>
>>
>>
>
> --
> Timothy J. Tautges
> Manager, Directed Meshing, CD-adapco
> Phone: 608-354-1459
> timothy.tautges at cd-adapco.com
>
--
Timothy J. Tautges
Manager, Directed Meshing, CD-adapco
Phone: 608-354-1459
timothy.tautges at cd-adapco.com