[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]

Wed May 21 09:57:57 CDT 2014

Anton's issue is with calling exchange_tags repeatedly; it looks like mismatched isends/ireceives that accumulate until they exhaust memory. Or something else is happening with memory; maybe we need to flush the buffers.
Jane's issue is different; assign ids might have problems.
All of these deserve tickets.
Iulian
________________________________
From: Vijay S. Mahadevan [vijay.m at gmail.com]
Sent: Wednesday, May 21, 2014 9:48 AM
To: Jiangtao Hu
Cc: Grindeanu, Iulian R.; kanaev at ibrae.ac.ru; MOAB dev
Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]

Jane,

//              Global Ids
  //   2.0   4 ------- 3 -------- 6 -------- 10 --------- 14
  //          |         |          |          |                 |
  //          |         |          |          |                 |
  //          |         |          |          |                 |
  //   1.0   1 ------- 2 -------- 5 -------- 9 --------- 13
  //        0.0       1.0        2.0        3.0          4.0

This is certainly a bug and should not happen to start with. If you have this test case available, do send it to the list so that we can find out the actual reason for the misnumbering. Anton's issue might or might not be directly related to this bug, but his reference to the GIDs being discontinuous shows that there is an outstanding issue here.

Vijay

On Wed, May 21, 2014 at 8:06 PM, Jiangtao Hu <jiangtao_ma at yahoo.com> wrote:
In a unit test for my project, I am trying to use ParallelComm::assign_global_id() to get the global ids for the vertices of a simple grid
  //              Mesh Ids
  //   2.0   6 ------- 7 -------- 8 -------- 9 --------- 10
  //          |        |          |          |                 |
  //          |    1  |    2    |     3    |      4         |
  //          |        |          |          |                 |
  //   1.0   1 ------- 2 -------- 3 -------- 4 --------- 5
  //        0.0       1.0        2.0        3.0          4.0

and got the following global ids:
//              Global Ids
  //   2.0   4 ------- 3 -------- 6 -------- 10 --------- 14
  //          |         |          |          |                 |
  //          |         |          |          |                 |
  //          |         |          |          |                 |
  //   1.0   1 ------- 2 -------- 5 -------- 9 --------- 13
  //        0.0       1.0        2.0        3.0          4.0

I don't know if this is what Anton ran into.
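
For reference, a minimal sketch of a check that makes the gaps explicit. It assumes the vertex ids are expected to form the contiguous 1-based range 1..N; the helper name missing_ids is hypothetical and is not part of MOAB.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not a MOAB function): returns every id in
// 1..max(ids) that does not appear in `ids`, in ascending order.
std::vector<int> missing_ids(std::vector<int> ids) {
    std::vector<int> missing;
    if (ids.empty()) return missing;
    std::sort(ids.begin(), ids.end());
    assert(ids.front() >= 1);  // assumption: ids are 1-based
    int expected = 1;
    for (int id : ids) {
        // Record every value skipped between the previous id and this one.
        for (; expected < id; ++expected) missing.push_back(expected);
        expected = id + 1;
    }
    return missing;
}
```

Applied to the ten global ids in the diagram above (4, 3, 6, 10, 14, 1, 2, 5, 9, 13), this reports 7, 8, 11 and 12 as missing; applied to the mesh ids 1..10 it reports nothing.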

Iulian, if you are interested in seeing the test, please let me know, and I'll send it to you.

Jane

Asst. Researcher
Dept. of Engineering Physics
UW @ Madison

"And we know that for those who love God, that is, for those who are called according to his purpose, all things are working together for good." (Romans 8:28)
On Tuesday, May 20, 2014 5:36 PM, "Grindeanu, Iulian R." <iulian at mcs.anl.gov> wrote:

hmmm,
this is not good :(

Are you running this on Mira? Do you have a small file for a laptop/workstation?
Maybe I can create a similar one.

Do you see this only on 1024 processes, or can it happen at a lower count?

What does your model look like?
No processor should communicate with more than 64 other processes; maybe this number is reached after ghosting.

Can you run a debug version of this? Maybe some asserts are not triggered in optimized mode.

Is your file somewhere on Mira where I can get to it?

Iulian

________________________________
From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of kanaev at ibrae.ac.ru [kanaev at ibrae.ac.ru]
Sent: Tuesday, May 20, 2014 5:05 PM
To: MOAB dev
Subject: Re: [MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]

The problem is still here.
I've made a simple program that performs a certain number of exchange_tags calls within a loop.
If you run it on several processors with any mesh file, it will eventually crash with the following message from every core:

Fatal error in PMPI_Isend: Internal MPI error!, error stack:
PMPI_Isend(148): MPI_Isend(buf=0xd0f300, count=4, MPI_UNSIGNED_CHAR, dest=1, tag=6, MPI_COMM_WORLD, request=0xcde354) failed
(unknown)(): Internal MPI error!

Thanks
Anton

On Tue, 20 May 2014 04:40:03 -0400, wrote:
Please disregard that; the global_id space for Quads was discontinuous in my mesh file.
I will check back with a correct mesh.
Anton

On Mon, 19 May 2014 15:17:14 -0400, wrote:
Hello MOAB dev,
I've attached a simplified version of my program that crashes, presumably after a particular number of exchange_tags calls.
I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode c16).
It performs around 524378 iterations and then crashes (error file attached).
Can you please take a look at what Scott Parker from ALCF suggests about it:
-------- Original Message --------
Subject:        Re: Job exiting early [ThermHydraX]
Date:   Fri, 9 May 2014 18:48:25 -0500
From:   Scott Parker
To:

Anton-

I took a look at the core files, and from the stack trace it appears that the code is failing in an MPI_Isend call
that is called from moab::ParallelComm::recv_buffer, which is called from moab::ParallelComm::exchange_tags,
called from main(). The stack looks like:

  main
    moab::ParallelComm::exchange_tags
      moab::ParallelComm::recv_buffer
        MPI_Isend

I've been able to get the same failure and error message you are seeing by having an MPI process call MPI_Isend
when there are no matching receives. After 2 million Isends the program exits with the error you are seeing. So
I'm pretty sure you're ending up with a large number of outstanding requests and the program is failing because
it can't allocate space for new MPI_Request objects.

I'd suggest looking at how Moab is calling MPI and how many requests might be outstanding at any one time.
Since the code is running for 5 hours and looks to be executing hundreds of thousands of iterations I wonder
if there is some sort of send-receive mismatch that is letting requests accumulate. I think your best bet is to
talk to the Moab folks and see if they have any ideas about why this might be happening.
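
To illustrate the accumulation described above, here is a toy model. It is neither MPI nor MOAB code: RequestPool, its capacity, and run_exchanges are hypothetical names, and real MPI libraries manage request objects differently. It only shows why a send that is never completed on every iteration must eventually fail, while a send that is completed each iteration never does.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a library's finite pool of request objects
// (hypothetical; not how any particular MPI implementation works).
class RequestPool {
public:
    explicit RequestPool(std::size_t capacity) : capacity_(capacity), in_use_(0) {}

    // Models posting a nonblocking send: takes one request slot.
    // Returns false once the pool is exhausted.
    bool post() {
        if (in_use_ == capacity_) return false;
        ++in_use_;
        return true;
    }

    // Models a matched receive followed by a wait: the slot is released.
    void complete() {
        assert(in_use_ > 0);
        --in_use_;
    }

    std::size_t outstanding() const { return in_use_; }

private:
    std::size_t capacity_;
    std::size_t in_use_;
};

// Runs `iterations` exchange steps and returns how many succeeded
// before the pool ran out (equal to `iterations` if none failed).
std::size_t run_exchanges(std::size_t capacity, std::size_t iterations,
                          bool complete_each_step) {
    RequestPool pool(capacity);
    for (std::size_t i = 0; i < iterations; ++i) {
        if (!pool.post()) return i;               // allocation failure
        if (complete_each_step) pool.complete();  // request is retired
    }
    return iterations;
}
```

With a capacity of 1000 and 5000 iterations, the run that never completes its requests stops after exactly 1000 steps, while the run that completes each request finishes all 5000. A failure at a fixed, reproducible iteration count is consistent with this kind of steady leak.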

One possibility is a load imbalance between processes - if you don't have any MPI_Barriers or other collectives in
your code you could try adding a barrier to synchronize the processes.

If the Moab guys can't help you and adding a barrier doesn't help I can work with you to instrument the code to
collect more information on how MPI is being called and we could possibly pin down the source of the problem
that way.

Scott

On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru wrote:
Hello Scott
The dir is cd /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit
The run that produced the core files is 253035

I did another run with the line  MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD); commented out, and it stopped at the very same time iteration, #524378; it just got a few lines further

I use the MOAB library and its functions for exchanging data between processors, so I think I cannot really count the MPI requests

Anton

On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:

Can you point me to the directory where your binary and core files are?

The stack trace you sent shows a call to MPI_Waitany, do you know how many MPI requests
the code generally has outstanding at any time?

-Scott

On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru wrote:
Hello Scott,
I did a rerun with the flags you mentioned. The code was freshly compiled with the attached makefile, just in case.
I've got 1024 core files. Two of them are attached.
I ran bgq_stack on core.0 and here's what I got:
 [akanaev at miralac1 pinit]$bgq_stack pinit core.0
------------------------------------------------------------------------
Program   : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
------------------------------------------------------------------------
+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN

00000000018334c0
_ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269

000000000170da28
PAMI_Context_trylock_advancev
/bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554

000000000155d0dc
PMPI_Waitany
/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239

00000000010e84e4
00000072.long_branch_r2off.H5Dget_space+0
:0

0000000001042be0
00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
:0

00000000019de058
generic_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

00000000019de354
__libc_start_main
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

0000000000000000
??
??:0

>   Have these sort of runs succeeded in the past using the same code base with no changes and similar input data?
This is the first time I'm trying to run this code for that long

Thanks
Anton

On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote:

Anton-

Thanks, it's aborting because of a runtime error that appears to be in the mpich layer.

Can you rerun with  "--env  BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1" added to your qsub line - that should
generate some core files on which you can run bgq_stack.

The system software (driver) on Mira was updated this week and I'd like to get a clearer picture of
whether that could be related to your problem, so

   Has your code been recompiled since Monday? If not, can you recompile and try running again?

   Have these sort of runs succeeded in the past using the same code base with no changes and similar input data?

-Scott

On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru wrote:
Sorry about the attached files, here they are
There are no core files after exiting. It looks like the job stopped because the requested time expired, but you can see from the cobaltlog that only about 5 hours had passed (10 hours were requested) before it exited
Anton
On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker wrote:

Anton-

Please send these emails to support at alcf.anl.gov as I may not always be available to investigate.

I don't see the cobalt or error files attached, so I can't really say anything about why your job may be
failing. Do you get core files when the job crashed? If so I'd recommend using 'bgq_stack '
to try and get the file and line number where the failure occurred. Knowing the line may be enough
to let you figure it out, if not you'll need to dump the values of the variables at the time of the crash
to get a clearer picture of what is going on.

Scott

On 4/24/14, 1:36 PM, kanaev at ibrae.ac.ru wrote:
Hello Scott,
I've tried twice to run a 10-hour, 1024-core job on Mira in mode c16 with
qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit Lid_128x128x1p1024.h5m
Both times the job exited earlier than expected, on the same iteration and with the same error, while executing the following section (it's between two couts):
 ...
// LOOP OVER OWNED CELLS
 double r = 0;  // largest local rate found so far
 for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); it++) {
      EntityHandle ent = *it;
      int cellid = mb->id_from_handle(ent);
      double Vol_;
      double u_;
      double v_;
      double w_;
      double r1,r2,r3;
      double tmp;

      // Read the velocity components stored as tags on this cell.
      result = mb->tag_get_data(u, &ent, 1, &u_);
      PRINT_LAST_ERROR;
      result = mb->tag_get_data(v, &ent, 1, &v_);
      PRINT_LAST_ERROR;
      result = mb->tag_get_data(w, &ent, 1, &w_);
      PRINT_LAST_ERROR;

      // Note: from here on, this local double shadows the outer `result`
      // that holds the return codes of the MOAB calls above.
      double result;
      SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);

      r1 = (sound + fabs(result))/CG[cellid][2].length;

      SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);

      r2 = (sound + fabs(result))/CG[cellid][3].length;

      SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);

      r3 = (sound + fabs(result))/CG[cellid][5].length;

      tmp = MAX3(r1,r2,r3);

      r = MAX(tmp,r);
 }

 // Global maximum over all ranks, then the time step from the CFL number.
 double rmax;
 MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
 tau = CFL/rmax;
 ttime+=tau;
...

So it may be the Allreduce.
I've attached the cobaltlog and error files of both runs.
Can you please take a look and suggest how to debug this further?
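
For anyone who wants to try the arithmetic of that section without MOAB or MPI, here is a minimal serial sketch. It assumes SCALAR_PRODUCT is a plain 3D dot product, and the Face and Cell structs are hypothetical stand-ins for the CG[cellid][k] entries and the u, v, w tags; in the real code the local maximum is then reduced across ranks with MPI_Allreduce.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

// Hypothetical stand-in for one CG[cellid][k] entry: a direction and a length.
struct Face { double lx, ly, lz, length; };

// Hypothetical stand-in for one cell: its velocity and three face entries.
struct Cell { double u, v, w; std::array<Face, 3> faces; };

// Largest (sound + |velocity . direction|) / length over the cell's faces.
double cell_rate(const Cell& c, double sound) {
    double rate = 0.0;
    for (const Face& f : c.faces) {
        // Assumption: SCALAR_PRODUCT in the excerpt is this dot product.
        const double un = c.u * f.lx + c.v * f.ly + c.w * f.lz;
        rate = std::max(rate, (sound + std::fabs(un)) / f.length);
    }
    return rate;
}

// Local maximum over all owned cells (the value called r in the excerpt).
double local_max_rate(const std::vector<Cell>& cells, double sound) {
    double r = 0.0;
    for (const Cell& c : cells) r = std::max(r, cell_rate(c, sound));
    return r;
}

// Time step from the CFL number and the maximum rate (tau = CFL / rmax).
double time_step(double cfl, double rmax) { return cfl / rmax; }
```

For a single cell with velocity (1, 0, 0), sound speed 3, and axis-aligned faces of length 0.5, 1 and 2, the rates are 8, 3 and 1.5, so the local maximum is 8 and a CFL number of 0.4 gives a time step of 0.05.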

Thanks
Anton
