[petsc-dev] Error on large problems.

Barry Smith bsmith at mcs.anl.gov
Mon Apr 6 23:56:39 CDT 2015


  Run with -info

  Note that in maint all your verbose printf calls used %d, which gives nonsense for 64 bit integers, so that output is unreliable; for example, N=0.
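
  A minimal sketch of the %d problem in plain C (not PETSc code; the helper names low32_bits and format_global_size are made up for illustration). It assumes exactly 32768 equations per core on the 131072 ranks shown in the log, which would give N = 2^32. The low 32 bits of that value are all zero, which would be consistent with the printed N=0, though that is only a guess about what happened in this run.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: the low 32 bits of a 64-bit value. This is roughly
   what a "%d" conversion tends to show on common 64-bit platforms, although
   passing a 64-bit argument to "%d" is formally undefined behavior in C. */
static uint32_t low32_bits(int64_t n)
{
  return (uint32_t)((uint64_t)n & 0xFFFFFFFFu);
}

/* Hypothetical helper: format a 64-bit global size with a conversion
   specifier that matches the argument type (PRId64 from <inttypes.h>).
   Returns the number of characters snprintf wanted to write. */
static int format_global_size(char *buf, size_t len, int64_t n)
{
  assert(buf != NULL);
  return snprintf(buf, len, "N=%" PRId64, n);
}
```

  With n = (int64_t)131072 * 32768, low32_bits(n) is zero, while format_global_size writes the full 64-bit value into the buffer.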

  I am wondering if it is possible there is overflow in one of the PetscMPIInt variables or arrays.
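
  The kind of guard I have in mind, as a plain C sketch rather than actual PETSc source: PetscMPIInt values end up as MPI arguments, and MPI counts are plain int, so any 64-bit count narrowed to int needs a range check first. The helper name checked_count_to_int is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: narrow a 64-bit count to the int that MPI expects.
   Returns 0 on success; returns 1 and leaves *out untouched when the
   value does not fit in an int. */
static int checked_count_to_int(int64_t count, int *out)
{
  assert(out != NULL);
  if (count < (int64_t)INT_MIN || count > (int64_t)INT_MAX) return 1;
  *out = (int)count;
  return 0;
}
```

  A caller would check the return value before handing *out to an MPI call, instead of relying on an implicit conversion that silently wraps.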

  Barry

> On Apr 6, 2015, at 11:20 PM, Mark Adams <mfadams at lbl.gov> wrote:
> 
> This is with 'maint'.  I will retry with master. 
> 
> On Mon, Apr 6, 2015 at 1:44 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>   Mark,
> 
>    Exactly what git commit is this?  It should be somewhere in the error message; just send the entire error message.
> 
>   Barry
> 
> > On Apr 6, 2015, at 9:05 AM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> > I added squaring the matrix (A^T * A), which uses a matrix-matrix product, and I get the error.  (Recall that this did run without the square graph, which is where it dies here.)  I can run this about every two weeks if you want me to run with some other branch or parameters (-info?).
> > Mark
> >
> >
> >
> >
> >     [0]PCSetFromOptions_GAMG threshold set -5.000000e-03
> >     [0]PCSetUp_GAMG level 0 N=0, n data rows=1, n data cols=1, nnz/row (ave)=26, np=131072
> >     [0]PCGAMGFilterGraph 100% nnz after filtering, with threshold -0.005, 26.9649 nnz ave. (N=0)
> > [0]PCGAMGCoarsen_AGG square graph
> > MPICH2 ERROR [Rank 48173] [job id 11540278] [Sun Apr  5 22:11:13 2015] [c0-1c0s10n3] [nid01579] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_POST had error (SOURCE_SSID:AT_MDD_INV:CPLTN_SRSP)
> > [48173]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1757 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
> > [48173]PETSC ERROR: #2 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 881 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
> > [48173]PETSC ERROR: #3 MatTransposeMatMult() line 8977 in /global/u2/m/madams/petsc_maint/src/mat/interface/matrix.c
> > [48173]PETSC ERROR: #4 PCGAMGCoarsen_AGG() line 991 in /global/u2/m/madams/petsc_maint/src/ksp/pc/impls/gamg/agg.c
> > [48173]PETSC ERROR: #5 PCSetUp_GAMG() line 596 in /global/u2/m/madams/petsc_maint/src/ksp/pc/impls/gamg/gamg.c
> > [48173]PETSC ERROR: #6 PCSetUp() line 902 in /global/u2/m/madams/petsc_maint/src/ksp/pc/interface/precon.c
> > [48173]PETSC ERROR: #7 KSPSetUp() line 306 in /global/u2/m/madams/petsc_maint/src/ksp/ksp/interface/itfunc.c
> > [48
> >
> > On Sat, Mar 28, 2015 at 1:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > It works for a stripped down GAMG solve (no smoothing and no square graph).  I will try adding the square graph back in ...
> >
> > On Wed, Mar 25, 2015 at 7:24 AM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> >
> > On Sat, Mar 7, 2015 at 5:49 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > > On Mar 7, 2015, at 4:27 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > >
> > >
> > >
> > > On Sat, Mar 7, 2015 at 3:11 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > >
> > >   Hmm, my first guess is a mix-up between PetscMPIInt and PetscInt arrays, or MPIU_INT, somewhere. (But compilers catch most of these.)
> > >
> > > I have run with ~9K eq/core on 128K cores many times but have never gotten the 32K/core (4B eq.) run to work.  So it looks like an int overflow issue.
> >
> >   Well, in theory, if we have done everything right with 64 bit indices there should never be an integer overflow (though of course there could be a mistake somewhere, but generally the compiler will detect if we are trying to stick a 64 bit int into a 32 bit slot).
> >
> >    Can you try the same example without GAMG (say with hypre instead); if it goes through ok it might indicate an int issue either in gamg or the code that gamg calls?
> >
> >
> > Good idea.  Jacobi works.  I will try a stripped down GAMG next.
> >
> >
> >
> >
> >   Barry
> >
> > >
> > >
> > >   Another possibility is a bug in the MPI implementation when handling many large messages; any chance you can run the same thing on a very different system? Mira?
> > >
> > > Not soon, but I have Chombo and PETSc built on Mira and it would not be hard to get this code set up and try it.
> > >
> > > This is for SCE15 so I will turn the PETSc test off unless someone has any ideas on something to try.  I am using maint; perhaps I should use master.
> > >
> > > Mark
> > >
> > >
> > >   Barry
> > >
> > > > On Mar 7, 2015, at 1:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > >
> > > > I seem to be getting this error on Edison with 128K cores and ~4 billion equations.  I've seen this error several times.  I've attached a recent output from this.  I wonder if it is an integer overflow.  This was built with 64 bit integers, but I notice that GAMG prints out N and I see N=0 for the finest level.
> > > >
> > > > Mark
> > > > <out.131072.uniform.txt>
> > >
> > >
> 
> 



