[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]

Vijay S. Mahadevan vijay.m at gmail.com
Wed May 21 11:53:20 CDT 2014


Jane/Iulian, use the Bitbucket issues tab to file tickets for each of them.
I will also track down the issue with the small test case Jane sent out.

Vijay


On Wed, May 21, 2014 at 9:25 PM, Jiangtao Hu <jiangtao_ma at yahoo.com> wrote:

> My next step is to use exchange_tags(), because I need the elements on the
> different processors to know the full global ids of the interface vertices.
> So there's a bug there too...
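>
> For reference, the kind of call in question looks roughly like the sketch
> below (a sketch only, with placeholder names; the single-tag exchange_tags
> overload and this tag_get_handle signature are assumed, not taken from the
> actual test):
>
>   #include "moab/Core.hpp"
>   #include "moab/ParallelComm.hpp"
>   using namespace moab;
>
>   // Push the owning processor's GLOBAL_ID values onto the local copies of
>   // the interface vertices on every processor (sketch, placeholder names).
>   ErrorCode exchange_global_ids(Interface *mb, ParallelComm *pcomm,
>                                 Range &interface_verts)
>   {
>     Tag gid_tag;
>     ErrorCode rval = mb->tag_get_handle("GLOBAL_ID", 1, MB_TYPE_INTEGER, gid_tag);
>     if (MB_SUCCESS != rval) return rval;
>     // exchange_tags communicates the owner's tag values to all sharers
>     return pcomm->exchange_tags(gid_tag, interface_verts);
>   }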
>
>
> Jane
>
>
> Asst. Researcher
> Dept. of Engineering Physics
> UW @ Madison
>
>
> "And we know that for those who love God, that is, for those who are
> called according to his purpose, all things are working together for good."
> (Romans 8:28)
>    On Wednesday, May 21, 2014 9:58 AM, "Grindeanu, Iulian R." <
> iulian at mcs.anl.gov> wrote:
>
>
>  Anton's issue is with calling exchange_tags repeatedly; it seems to be
> mismatched isends/ireceives; as they accumulate, they run out of memory. Or
> something else is happening with memory; maybe we need to flush the buffers.
> Jane's issue is different; assign_global_ids might have problems.
> Both of these deserve tickets.
> Iulian
>  ------------------------------
> *From:* Vijay S. Mahadevan [vijay.m at gmail.com]
> *Sent:* Wednesday, May 21, 2014 9:48 AM
> *To:* Jiangtao Hu
> *Cc:* Grindeanu, Iulian R.; kanaev at ibrae.ac.ru; MOAB dev
> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
>   Jane,
>
>   //              Global Ids
>   //   2.0   4 ------- 3 -------- 6 -------- 10 -------- 14
>   //         |         |          |          |           |
>   //         |         |          |          |           |
>   //         |         |          |          |           |
>   //   1.0   1 ------- 2 -------- 5 -------- 9 --------- 13
>   //        0.0       1.0        2.0        3.0         4.0
>
>  This is certainly a bug and should not happen to start with. If you have
> this test case available, do send it to the list so that we can find out
> the actual reason for the misnumbering. Anton's issue may or may not be
> directly related to this bug, but his reference to the GIDs being
> discontinuous shows that there is an outstanding issue here.
>
>  Vijay
>
> On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <jiangtao_ma at yahoo.com>wrote:
>
>   In a unit test for my project, I am trying to use
> ParallelComm::assign_global_ids() to get the global ids for the vertices of
> a simple grid:
>   //              Mesh Ids
>   //   2.0   6 ------- 7 -------- 8 -------- 9 --------- 10
>   //         |         |          |          |           |
>   //         |    1    |    2     |    3     |     4     |
>   //         |         |          |          |           |
>   //   1.0   1 ------- 2 -------- 3 -------- 4 --------- 5
>   //        0.0       1.0        2.0        3.0         4.0
>
>  and got the following global ids:
>   //              Global Ids
>   //   2.0   4 ------- 3 -------- 6 -------- 10 -------- 14
>   //         |         |          |          |           |
>   //         |         |          |          |           |
>   //         |         |          |          |           |
>   //   1.0   1 ------- 2 -------- 5 -------- 9 --------- 13
>   //        0.0       1.0        2.0        3.0         4.0
>
>  I don't know if this is the same thing Anton ran into.
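>
>  For reference, the numbering call in the test is roughly the sketch below
> (the assign_global_ids parameter names and defaults shown here are assumed
> from the ParallelComm interface, not copied from the test):
>
>   #include "moab/Core.hpp"
>   #include "moab/ParallelComm.hpp"
>   using namespace moab;
>
>   // Sketch: number the mesh in the root set, including vertices
>   // (largest_dim_only = false), starting the ids at 1, in parallel.
>   ErrorCode number_mesh(ParallelComm *pcomm)
>   {
>     return pcomm->assign_global_ids(0 /*root set*/, 2 /*dimension*/,
>                                     1 /*start_id*/, false /*largest_dim_only*/,
>                                     true /*parallel*/);
>   }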
>
>  Iulian, if you are interested in seeing the test, please let me know and
> I'll send it to you.
>
> Jane
>
>
>  Asst. Researcher
> Dept. of Engineering Physics
> UW @ Madison
>
>
>  "And we know that for those who love God, that is, for those who are
> called according to his purpose, all things are working together for good."
> (Romans 8:28)
>    On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R." <
> iulian at mcs.anl.gov> wrote:
>
>
>   hmmm,
> this is not good :(
>
> Are you running this on Mira? Do you have a small file for a
> laptop/workstation? Maybe I can create a similar one.
>
> Do you see this only on 1024 processes, or can it happen at a lower count?
>
> What does your model look like?
> No processor should communicate with more than 64 other processes; maybe
> that number is reached after ghosting.
>
> Can you run a debug version of this? Maybe some asserts are not triggered
> in optimized mode.
>
> Is your file somewhere on Mira where I can get to it?
>
> Iulian
>
>  ------------------------------
> *From:* moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on
> behalf of kanaev at ibrae.ac.ru [kanaev at ibrae.ac.ru]
> *Sent:* Tuesday, May 20, 2014 5:05 PM
> *To:* MOAB dev
> *Subject:* Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]
>
>   The problem is still here.
> I've made a simple program that performs a certain number of exchange_tags
> calls within a loop.
> If you run it on several processors with any mesh file, it will eventually
> crash with the following message from every core:
>
> Fatal error in PMPI_Isend: Internal MPI error!, error stack:
> PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR,
> dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
> (unknown)(): Internal MPI error!
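>
> In outline the loop looks like the sketch below (a sketch only: file/tag
> names and the parallel read option string are the usual placeholders, not
> the exact program):
>
>   #include "moab/Core.hpp"
>   #include "moab/ParallelComm.hpp"
>   #include <mpi.h>
>   using namespace moab;
>
>   int main(int argc, char **argv)
>   {
>     MPI_Init(&argc, &argv);
>     Core mb;
>     // Read a partitioned mesh file and resolve the shared entities
>     ErrorCode rval = mb.load_file(argv[1], 0,
>         "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;PARALLEL_RESOLVE_SHARED_ENTS");
>     if (MB_SUCCESS != rval) return 1;
>
>     ParallelComm *pcomm = ParallelComm::get_pcomm(&mb, 0);
>     Tag gid;
>     mb.tag_get_handle("GLOBAL_ID", 1, MB_TYPE_INTEGER, gid);
>     Range verts;
>     mb.get_entities_by_dimension(0, 0, verts);
>
>     // Repeated exchanges: the failure only shows up after very many calls
>     for (long i = 0; i < 1000000; i++) {
>       rval = pcomm->exchange_tags(gid, verts);
>       if (MB_SUCCESS != rval) break;
>     }
>
>     MPI_Finalize();
>     return 0;
>   }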
>
> Thanks
> Anton
>
>  On Tue, 20 May 2014 04:40:03 -0400, wrote:
>
> Please disregard that; the global_id space for the Quads was discontinuous
> in my mesh file.
> I will check back with a correct mesh.
> Anton
>
> On Mon, 19 May 2014 15:17:14 -0400, wrote:
>
> Hello MOAB dev,
> I've attached a simplified version of my program that crashes, presumably
> after a particular number of exchange_tags calls.
> I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16).
> It performs around 524378 iterations and then crashes (error file
> attached).
> Can you please take a look at what Scott Parker from ALCF suggests about
> it:
> -------- Original Message --------
>   Subject: Re: Job exiting early [ThermHydraX]
>   Date: Fri, 9 May 2014 18:48:25 -0500
>   From: Scott Parker
>   To:
>
> Anton-
>
> I took a look at the core files, and from the stack trace it appears that
> the code is failing in an MPI_Isend call made from
> moab::ParallelComm::recv_buffer, which is called from
> moab::ParallelComm::exchange_tags, called from main(). The stack looks like:
>
>   main
>     moab::ParallelComm::exchange_tags
>       moab::ParallelComm::recv_buffer
>         MPI_Isend
>
> I've been able to get the same failure and error message you are seeing by
> having an MPI process call MPI_Isend when there are no matching receives.
> After 2 million Isends the program exits with the error you are seeing. So
> I'm pretty sure you're ending up with a large number of outstanding requests
> and the program is failing because it can't allocate space for new
> MPI_Request objects.
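>
> For what it's worth, a standalone sketch of that pattern (plain MPI, not
> MOAB code; run on at least 2 ranks) would be:
>
>   #include <mpi.h>
>   #include <cstdio>
>   #include <vector>
>
>   int main(int argc, char **argv)
>   {
>     MPI_Init(&argc, &argv);
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     // Return error codes instead of aborting so the failure point is visible
>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>
>     unsigned char buf[4] = {0, 0, 0, 0};
>     std::vector<MPI_Request> reqs;
>     if (rank == 0) {
>       // Isends that rank 1 never matches and that are never waited on:
>       // the requests accumulate until the MPI library runs out of resources.
>       for (long i = 0; ; i++) {
>         MPI_Request req;
>         int err = MPI_Isend(buf, 4, MPI_UNSIGNED_CHAR, 1, 6, MPI_COMM_WORLD, &req);
>         if (err != MPI_SUCCESS) { std::printf("Isend failed at %ld\n", i); break; }
>         reqs.push_back(req);
>       }
>     }
>     MPI_Finalize();
>     return 0;
>   }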
>
> I'd suggest looking at how Moab is calling MPI and how many requests might
> be outstanding at any one time.
> Since the code is running for 5 hours and looks to be executing hundreds
> of thousands of iterations I wonder
> if there is some sort of send-receive mismatch that is letting requests
> accumulate. I think your best bet is to
> talk to the Moab folks and see if they have any ideas about why this might
> be happening.
>
> One possibility is a load imbalance between processes - if you don't have
> any MPI_Barriers or other collectives in your code, you could try adding a
> barrier to synchronize the processes.
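>
> A sketch of that change (where 'pcomm', 'tag' and 'ents' stand in for
> whatever the real iteration loop exchanges) would be:
>
>   #include "moab/ParallelComm.hpp"
>   #include <mpi.h>
>   using namespace moab;
>
>   // Synchronize the ranks before each exchange so that the sends and
>   // receives of different iterations cannot drift apart.
>   ErrorCode synced_exchange(ParallelComm *pcomm, Tag tag, Range &ents)
>   {
>     MPI_Barrier(MPI_COMM_WORLD);
>     return pcomm->exchange_tags(tag, ents);
>   }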
>
> If the Moab guys can't help you and adding a barrier doesn't help I can
> work with you to instrument the code to
> collect more information on how MPI is being called and we could possibly
> pin down the source of the problem
> that way.
>
>
> Scott
>
>
> On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru wrote:
>
> Hello Scott
> The dir is /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
> The run that produced the core files is 253035
>
> I did another run with the line
> MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out,
> and it stopped at the very same iteration, #524378, just got a few more
> lines further.
>
> I use the MOAB library and its functions for exchanging data between
> processors, so I don't think I can really count MPI requests.
>
> Anton
>
> On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:
>
>
> Can you point me to the directory where your binary and core files are?
>
> The stack trace you sent shows a call to MPI_Waitany; do you know how many
> MPI requests the code generally has outstanding at any time?
>
> -Scott
>
> On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru wrote:
>
> Hello Scott,
> I did a rerun with the keys you mentioned. The code was freshly compiled,
> with the makefile attached just in case.
> I've got 1024 core files. Two of them are attached.
> I ran bgq_stack on core.0 and here's what I got:
>  [akanaev at miralac1 pinit]$ bgq_stack pinit core.0
> ------------------------------------------------------------------------
> Program   : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
> ------------------------------------------------------------------------
> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN
>
> 00000000018334c0
> _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269
>
>
> 000000000170da28
> PAMI_Context_trylock_advancev
> /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554
>
> 000000000155d0dc
> PMPI_Waitany
> /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239
>
>
> 00000000010e84e4
> 00000072.long_branch_r2off.H5Dget_space+0
> :0
>
> 0000000001042be0
> 00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
>
> :0
>
> 00000000019de058
> generic_start_main
> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>
>
> 00000000019de354
> __libc_start_main
> /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>
>
> 0000000000000000
> ??
> ??:0
>
>  >   Have these sorts of runs succeeded in the past using the same code
>  >   base with no changes and similar input data?
> This is the first time I have tried to run this code for that long.
>
> Thanks
> Anton
>
> On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote:
>
>
> Anton-
>
> Thanks, it's aborting because of a runtime error that appears to be in the
> mpich layer.
>
> Can you rerun with  "--env  BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1"
> added to your qsub line - that should
> generate some core files on which you can run bgq_stack.
>
> The system software (driver) on Mira was updated this week and I'd like to
> get a clearer picture of whether that could be related to your problem, so:
>
>    Has your code been recompiled since Monday? If not, can you recompile
> and try running again?
>
>    Have these sorts of runs succeeded in the past using the same code base
> with no changes and similar input data?
>
> -Scott
>
>
>
> On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru wrote:
>
> Sorry about the attached files; here they are.
> There are no core files after exiting; it looks like the job stopped because
> the requested time expired, but you can see from the cobaltlog that about 5
> hours passed (10 hours were requested) before the exit.
> Anton
> On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker wrote:
>
>
> Anton-
>
> Please send these emails to support at alcf.anl.gov as I may not always be
> available to investigate.
>
> I don't see the cobalt or error files attached, so I can't really say
> anything about why your job may be failing. Do you get core files when the
> job crashes? If so, I'd recommend using 'bgq_stack <binary> <core file>'
> to try to get the file and line number where the failure occurred.
> Knowing the line may be enough to let you figure it out; if not, you'll need
> to dump the values of the variables at the time of the crash to get a
> clearer picture of what is going on.
>
> Scott
>
> On 4/24/14, 1:36 PM, kanaev at ibrae.ac.ru wrote:
>
> Hello Scott,
> I've tried twice to run a 10-hour, 1024-core job on Mira in mode c16 with
> qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit
> Lid_128x128x1p1024.h5m
> Both times the job exited earlier than expected, on the same iteration,
> after the same error, while executing the following section (it's between
> two couts):
>  ...
>  //LOOP OVER OWNED CELLS
>  double r = 0;
>  for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
>    EntityHandle ent = *it;
>    int cellid = mb->id_from_handle(ent);
>    double Vol_;
>    double u_;
>    double v_;
>    double w_;
>    double r1, r2, r3;
>    double tmp;
>
>    result = mb->tag_get_data(u, &ent, 1, &u_);
>    PRINT_LAST_ERROR;
>    result = mb->tag_get_data(v, &ent, 1, &v_);
>    PRINT_LAST_ERROR;
>    result = mb->tag_get_data(w, &ent, 1, &w_);
>    PRINT_LAST_ERROR;
>
>    double result;  // note: shadows the ErrorCode 'result' assigned above
>
>    SCALAR_PRODUCT(result, u_, v_, w_, CG[cellid][2].lx, CG[cellid][2].ly, CG[cellid][2].lz);
>    r1 = (sound + fabs(result)) / CG[cellid][2].length;
>
>    SCALAR_PRODUCT(result, u_, v_, w_, CG[cellid][3].lx, CG[cellid][3].ly, CG[cellid][3].lz);
>    r2 = (sound + fabs(result)) / CG[cellid][3].length;
>
>    SCALAR_PRODUCT(result, u_, v_, w_, CG[cellid][5].lx, CG[cellid][5].ly, CG[cellid][5].lz);
>    r3 = (sound + fabs(result)) / CG[cellid][5].length;
>
>    tmp = MAX3(r1, r2, r3);
>    r = MAX(tmp, r);
>  }
>
>  double rmax;
>  MPI_Allreduce(&r, &rmax, 1, MPI_REAL8, MPI_MAX, MPI_COMM_WORLD);
>  tau = CFL / rmax;
>  ttime += tau;
> ...
>
> So it may be the Allreduce.
> I've attached the cobaltlog and error files of both runs.
> Can you please take a look and suggest further debugging.
>
> Thanks
> Anton