[MOAB-dev] Fwd: Re: Job exiting early [ThermHydraX]

Mon May 19 14:17:14 CDT 2014

 Hello MOAB dev,  

 I've attached a simplified version of my program that crashes,
presumably after a particular number of calls to exchange_tags.

 I ran it a couple of times on Mira on 1024 cores (64 nodes in --mode
c16).

 It performs around 524378 iterations and then crashes (error file
attached).

 Can you please take a look at what Scott Parker from ALCF suggests
about it:   

 -------- Original Message --------  
 SUBJECT: Re: Job exiting early [ThermHydraX]
 DATE: Fri, 9 May 2014 18:48:25 -0500
 FROM: Scott Parker
 TO:
 Anton-
 I took a look at the core files, and from the stack trace it appears
that the code is failing in an MPI_Isend call that is called from
moab::ParallelComm::recv_buffer, which is called from
moab::ParallelComm::exchange_tags, called from main(). The stack looks
like:
   main
     moab::ParallelComm::exchange_tags
       moab::ParallelComm::recv_buffer
         MPI_Isend
 I've been able to get the same failure and error message you are
seeing by having an MPI process call MPI_Isend when there are no
matching receives. After 2 million Isends the program exits with the
error you are seeing. So I'm pretty sure you're ending up with a large
number of outstanding requests and the program is failing because it
can't allocate space for new MPI_Request objects.
 I'd suggest looking at how Moab is calling MPI and how many requests
might be outstanding at any one time.
 Since the code is running for 5 hours and looks to be executing
hundreds of thousands of iterations I wonder
 if there is some sort of send-receive mismatch that is letting
requests accumulate. I think your best bet is to
 talk to the Moab folks and see if they have any ideas about why this
might be happening. 
 One possibility is a load imbalance between processes - if you don't
have any MPI_Barriers or other collectives in
 your code you could try adding a barrier to synchronize the
processes.
 If the Moab guys can't help you and adding a barrier doesn't help, I
can work with you to instrument the code to collect more information on
how MPI is being called, and we could possibly pin down the source of
the problem that way.
 Scott
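
Scott's suggestion to look at how many requests are outstanding at any one time can be sketched as a small counter. This is only an illustration, not MOAB or MPI code: the names RequestTracker and simulate_leak are hypothetical, and in a real run the two hooks would have to be called from wrappers around the nonblocking MPI calls (for example through the PMPI profiling interface), which is assumed here rather than shown.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of MOAB or MPI): counts nonblocking
// requests that have been posted but not yet completed.
class RequestTracker {
public:
    // Call once per posted request, e.g. right after MPI_Isend or MPI_Irecv.
    void posted() {
        ++outstanding_;
        if (outstanding_ > peak_) peak_ = outstanding_;
    }
    // Call once per completed request, e.g. after MPI_Wait or MPI_Waitany.
    void completed() {
        assert(outstanding_ > 0);  // a completion needs a matching post
        --outstanding_;
    }
    std::size_t outstanding() const { return outstanding_; }
    std::size_t peak() const { return peak_; }

private:
    std::size_t outstanding_ = 0;  // currently pending requests
    std::size_t peak_ = 0;         // high-water mark of pending requests
};

// Models `iterations` exchange steps in which each step posts
// `posted_per_iter` requests but completes only `completed_per_iter`.
// Returns the number of requests still pending at the end.
inline std::size_t simulate_leak(std::size_t iterations,
                                 std::size_t posted_per_iter,
                                 std::size_t completed_per_iter) {
    assert(completed_per_iter <= posted_per_iter);
    RequestTracker tracker;
    for (std::size_t i = 0; i < iterations; ++i) {
        for (std::size_t p = 0; p < posted_per_iter; ++p) tracker.posted();
        for (std::size_t c = 0; c < completed_per_iter; ++c) tracker.completed();
    }
    return tracker.outstanding();
}
```

With balanced posts and completions, simulate_leak(1000, 2, 2) returns 0; if one request per step is never completed, simulate_leak(1000, 2, 1) returns 1000, so the pending count grows linearly with the iteration number. Purely as arithmetic, and not as a diagnosis: four leaked requests per iteration over 524378 iterations would leave 2097512 pending, the same order as the 2 million Isends in Scott's test.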
  On 5/2/14, 11:14 PM, kanaev at ibrae.ac.ru [1] wrote:
 Hello Scott  

 The directory is /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit

 The run that produced core files is 253035.

 I did another run with the line
 MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
commented out, and it stopped at the very same time iteration,
#524378; it just got a few lines further.

 I use the MOAB library and its functions for exchanging data between
processors, so I think I cannot really count MPI requests.

 Anton
 On Mon, 28 Apr 2014 16:45:41 -0500, Scott Parker wrote:    
 Can you point me to the directory where your binary and core files
are?
 The stack trace you sent shows a call to MPI_Waitany, do you know
how many MPI requests
 the code generally has outstanding at any time?
 -Scott
   On 4/28/14, 4:30 PM, kanaev at ibrae.ac.ru [1] wrote:
  Hello Scott,   

  I reran with the mentioned keys. The code was freshly compiled
with the attached makefile, just in case.

  I've got 1024 core files. Two of them are attached.

  I ran bgq_stack on core.0 and here's what I got:

   [akanaev at miralac1 pinit]$bgq_stack pinit core.0   

------------------------------------------------------------------------
  Program   : /gpfs/mira-fs0/projects/ThermHydraX/kanaev/pinit/pinit
------------------------------------------------------------------------
  +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN

  00000000018334c0
  _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm
  /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/components/devices/bgq/mu2/Factory.h:269

  000000000170da28
  PAMI_Context_trylock_advancev
  /bgsys/source/srcV1R2M1.17463/comm/sys/buildtools/pami/api/c/pami.cc:554

  000000000155d0dc
  PMPI_Waitany
  /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_progress.h:239

  00000000010e84e4
  00000072.long_branch_r2off.H5Dget_space+0
  :0

  0000000001042be0
  00000012.long_branch_r2off._ZNSt14basic_ifstreamIcSt11char_traitsIcEEC1EPKcSt13_Ios_Openmode+0
  :0

  00000000019de058
  generic_start_main
  /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

  00000000019de354
  __libc_start_main
  /bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

  0000000000000000
  ??
  ??:0
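
The first frame of the trace above is a mangled C++ symbol. A minimal sketch of demangling such names, assuming a GCC- or Clang-compatible toolchain that provides abi::__cxa_demangle in <cxxabi.h>; the helper name demangle is hypothetical and is not part of bgq_stack:

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI
// mangled symbol, or the input unchanged if it is not a mangled name.
inline std::string demangle(const std::string& symbol) {
    // Mangled C++ names start with "_Z"; leave plain C symbols alone.
    if (symbol.compare(0, 2, "_Z") != 0) return symbol;

    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result
    // with malloc, so it must be released with free.
    char* raw = abi::__cxa_demangle(symbol.c_str(), nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) return symbol;

    std::string readable(raw);
    std::free(raw);
    return readable;
}
```

Passing _ZN4PAMI6Device2MU7Factory12advance_implEPNS1_7ContextEmm to this helper gives a name in which advance_impl is a member function of PAMI::Device::MU::Factory, while plain C symbols such as PMPI_Waitany come back unchanged.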
  >   Have these sort of runs succeeded in the past using the same
code base with no changes and similar input data?   

  This is the first time I'm trying to run this code for that long.
  Thanks   

  Anton    
 On Thu, 24 Apr 2014 15:45:49 -0500, Scott Parker wrote:     
 Anton-
 Thanks, it's aborting because of a runtime error that appears to be
in the mpich layer.
 Can you rerun with "--env BG_COREDUMPONEXIT=1:BG_COREDUMPONERROR=1"
added to your qsub line? That should generate some core files on which
you can run bgq_stack.
 The system software (driver) on Mira was updated this week and I'd
like to get a clearer picture of whether that could be related to your
problem, so:
    Has your code been recompiled since Monday? If not, can you
recompile and try running again?
    Have these sort of runs succeeded in the past using the same code
base with no changes and similar input data?
 -Scott
   On 4/24/14, 2:59 PM, kanaev at ibrae.ac.ru [1] wrote:
  Sorry about the attached files, here they are.

  There are no core files after exiting. It looks like the job stopped
because the requested time expired, but you can see from the cobaltlog
that only about 5 hours had passed (10 hours were requested) before the
exit.

  Anton    

  On Thu, 24 Apr 2014 14:07:07 -0500, Scott Parker wrote:     
 Anton-
 Please send these emails to support at alcf.anl.gov [2] as I may not
always be available to investigate.
 I don't see the cobalt or error files attached, so I can't really
say anything about why your job may be failing. Do you get core files
when the job crashes? If so I'd recommend using 'bgq_stack '
 to try and get the file and line number where the failure occurred.
Knowing the line may be enough to let you figure it out; if not, you'll
need to dump the values of the variables at the time of the crash
 to get a clearer picture of what is going on.
 Scott
   On 4/24/14, 1:36 PM, kanaev at ibrae.ac.ru [1] wrote:
  Hello Scott,   

  I've tried twice to run a 10-hour, 1024-core job on Mira in mode c16
with

  qsub -n 64 -t 10:00:00 --mode c16 -A ThermHydraX pinit Lid_128x128x1p1024.h5m

  Both times the job exited earlier than expected, on the same
iteration and with the same error, while executing the following
section (it's between two couts):

   ...   

   // LOOP OVER OWNED CELLS
   double r = 0;
   for (moab::Range::iterator it = owned_ents.begin(); it != owned_ents.end(); ++it) {
        EntityHandle ent = *it;
        int cellid = mb->id_from_handle(ent);
        double Vol_;
        double u_;
        double v_;
        double w_;
        double r1,r2,r3;
        double tmp;

        // read the u, v, w tag values for this cell
        result = mb->tag_get_data(u, &ent, 1, &u_);
        PRINT_LAST_ERROR;
        result = mb->tag_get_data(v, &ent, 1, &v_);
        PRINT_LAST_ERROR;
        result = mb->tag_get_data(w, &ent, 1, &w_);
        PRINT_LAST_ERROR;

        // note: this declaration shadows the outer `result` assigned above
        double result;
        SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][2].lx,CG[cellid][2].ly,CG[cellid][2].lz);
        r1 = (sound + fabs(result))/CG[cellid][2].length;
        SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][3].lx,CG[cellid][3].ly,CG[cellid][3].lz);
        r2 = (sound + fabs(result))/CG[cellid][3].length;
        SCALAR_PRODUCT(result,u_,v_,w_,CG[cellid][5].lx,CG[cellid][5].ly,CG[cellid][5].lz);
        r3 = (sound + fabs(result))/CG[cellid][5].length;

        // keep the largest rate seen so far on this rank
        tmp = MAX3(r1,r2,r3);
        r = MAX(tmp,r);
   }

   // global maximum over all ranks
   double rmax;
   MPI_Allreduce(&r,&rmax,1,MPI_REAL8,MPI_MAX,MPI_COMM_WORLD);
   tau = CFL/rmax;
   ttime+=tau;

  ...   
  So it may be Allreduce   
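
For reference, the time-step logic of the fragment above can be sketched in serial form. This is an illustration under stated assumptions, not the code from pinit: the Face and Cell structs and the functions local_max_rate and time_step are hypothetical stand-ins for the MOAB tag reads and the CG[cellid][k] data, and the sketch loops over whatever faces it is given instead of the fixed indices 2, 3 and 5.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-ins for the data the original reads from MOAB tags
// (u, v, w) and from CG[cellid][k] (lx, ly, lz, length).
struct Face { double lx, ly, lz, length; };
struct Cell { double u, v, w; std::vector<Face> faces; };

// Largest value of (sound + |(u,v,w) . (lx,ly,lz)|) / length over all
// faces of all cells; this plays the role of `r` in the original loop.
inline double local_max_rate(const std::vector<Cell>& cells, double sound) {
    double r = 0.0;
    for (const Cell& c : cells) {
        for (const Face& f : c.faces) {
            const double proj = c.u * f.lx + c.v * f.ly + c.w * f.lz;
            r = std::max(r, (sound + std::fabs(proj)) / f.length);
        }
    }
    return r;
}

// Time step from the CFL number and the maximum rate. In the parallel
// code, rmax is the result of MPI_Allreduce with MPI_MAX over all ranks.
inline double time_step(double cfl, double rmax) {
    assert(rmax > 0.0);  // avoid dividing by zero on an empty mesh
    return cfl / rmax;
}
```

For a single cell with (u, v, w) = (1, 0, 0), one face with direction (1, 0, 0) and length 0.5, and sound = 1, the rate is (1 + 1) / 0.5 = 4, so a CFL number of 0.5 gives tau = 0.125.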

  I've attached cobaltlog and error files of both runs   

  Can you please take a look and suggest further debugging steps?
  Thanks   

  Anton         

Links:
------
[1] mailto:kanaev at ibrae.ac.ru
[2] mailto:support at alcf.anl.gov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpitest.cpp
Type: text/x-c charset=us-ascii
Size: 8440 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/moab-dev/attachments/20140519/5320183e/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile
Type: text/x-makefile charset=us-ascii
Size: 520 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/moab-dev/attachments/20140519/5320183e/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 264390.error
Type: text/plain charset=us-ascii
Size: 2269 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/moab-dev/attachments/20140519/5320183e/attachment-0005.bin>