[petsc-dev] Error on large problems.

Mark Adams mfadams at lbl.gov
Sun Apr 19 15:54:26 CDT 2015


On Tue, Apr 7, 2015 at 12:56 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>   Run with -info
>

This timed out printing from 128K cores.  I will try again with this run
below.


>   Note that in maint all your verbose printfs used %d, which gives nonsense
> for 64-bit integers, so the output is unreliable. For example, N=0.
>
>
Is there a way to fix this?  %ld?

And I ran a variant that does just a MatTransposeMatMult, and got 128 MB of this:

MPICH2 ERROR [Rank 63070] [job id 11889451] [Sun Apr 19 08:13:54 2015] [c6-1c1s2n2] [nid02762] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_POST had error (SOURCE_SSID:AT_MDD_INV:CPLTN_SRSP)
[63070]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1760 in /global/u2/m/madams/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[63070]PETSC ERROR: #2 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 881 in /global/u2/m/madams/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[63070]PETSC ERROR: #3 MatTransposeMatMult() line 9275 in /global/u2/m/madams/petsc/src/mat/interface/matrix.c
[63070]PETSC ERROR: #4 PCGAMGCoarsen_AGG() line 952 in /global/u2/m/madams/petsc/src/ksp/pc/impls/gamg/agg.c
[63070]PETSC ERROR: #5 PCSetUp_GAMG() line 567 in /global/u2/m/madams/petsc/src/ksp/pc/impls/gamg/gamg.c
[63070]PETSC ERROR: #6 PCSetUp() line 918 in /global/u2/m/madams/petsc/src/ksp/pc/interface/precon.c
[63070]PETSC ERROR: #7 KSPSetUp() line 330 in /global/u2/m/madams/petsc/src/ksp/ksp/interface/itfunc.c
[63070]PETSC ERROR: #8 KSPSolve() line 542 in /global/u2/m/madams/petsc/src/ksp/ksp/interface/itfunc.c
MPICH2 ERROR [Rank 63633] [job id 11889451] [Sun Apr 19 08:13:54 2015] [c6-1c1s8n2] [nid02786] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_POST had error (SOURCE_SSID:AT_MDD_INV:CPLTN_SRSP)
[63633]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1760 in /global/u2/m/madams/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[63633]PETSC ERROR: #2 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 881 in /global/u2/m/madams/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[63633]PETSC ERROR: #3 MatTransposeMatMult() line 9275 in /global/u2/m/madams/petsc/src/mat/interface/matrix.c
[63633]PETSC ERROR: #4 PCGAMGCoarsen_AGG() line 952 in /global/u2/m/madams/petsc/src/ksp/pc/impls/gamg/agg.c
[63633]PETSC ERROR: #5 PCSetUp_GAMG() line 567 in /global/u2/m/madams/petsc/src/ksp/pc/impls/gamg/gamg.c
[63633]PETSC ERROR: #6 PCSetUp() line 918 in /global/u2/m/madams/petsc/src/ksp/pc/interface/precon.c
[63633]PETSC ERROR: #7 KSPSetUp() line 330 in /global/u2/m/madams/petsc/src/ksp/ksp/interface/itfunc.c
[63633]PETSC ERROR: #8 KSPSolve() line 542 in /global/u2/m/madams/petsc/src/ksp/ksp/interface/itfunc.c




>   I am wondering if it is possible there is overflow in one of the
> PetscMPIInt variables or arrays.
>
>   Barry
>
> > On Apr 6, 2015, at 11:20 PM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> > This is with 'maint'.  I will retry with master.
> >
> > On Mon, Apr 6, 2015 at 1:44 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >   Mark,
> >
> >    Exactly what git commit is this?  Should be somewhere in the error
> message, just send the entire error message.
> >
> >   Barry
> >
> > > On Apr 6, 2015, at 9:05 AM, Mark Adams <mfadams at lbl.gov> wrote:
> > >
> > > And I added squaring the matrix (A^T * A), which uses a matrix-matrix
> product, and got the error.  (Recall this did run without the square graph,
> which is where it dies here.)  I can run this about every two weeks if you
> want me to run with some other branch or parameters (-info ?).
> > > Mark
> > >
> > >
> > >
> > >
> > >     [0]PCSetFromOptions_GAMG threshold set -5.000000e-03
> > >     [0]PCSetUp_GAMG level 0 N=0, n data rows=1, n data cols=1, nnz/row (ave)=26, np=131072
> > >     [0]PCGAMGFilterGraph 100% nnz after filtering, with threshold -0.005, 26.9649 nnz ave. (N=0)
> > > [0]PCGAMGCoarsen_AGG square graph
> > > MPICH2 ERROR [Rank 48173] [job id 11540278] [Sun Apr  5 22:11:13 2015] [c0-1c0s10n3] [nid01579] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_POST had error (SOURCE_SSID:AT_MDD_INV:CPLTN_SRSP)
> > > [48173]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1757 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
> > > [48173]PETSC ERROR: #2 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 881 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
> > > [48173]PETSC ERROR: #3 MatTransposeMatMult() line 8977 in /global/u2/m/madams/petsc_maint/src/mat/interface/matrix.c
> > > [48173]PETSC ERROR: #4 PCGAMGCoarsen_AGG() line 991 in /global/u2/m/madams/petsc_maint/src/ksp/pc/impls/gamg/agg.c
> > > [48173]PETSC ERROR: #5 PCSetUp_GAMG() line 596 in /global/u2/m/madams/petsc_maint/src/ksp/pc/impls/gamg/gamg.c
> > > [48173]PETSC ERROR: #6 PCSetUp() line 902 in /global/u2/m/madams/petsc_maint/src/ksp/pc/interface/precon.c
> > > [48173]PETSC ERROR: #7 KSPSetUp() line 306 in /global/u2/m/madams/petsc_maint/src/ksp/ksp/interface/itfunc.c
> > > [48
> > >
> > > On Sat, Mar 28, 2015 at 1:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > It works for a stripped down GAMG solve (no smoothing and no square
> graph).  I will try adding the square graph back in ...
> > >
> > > On Wed, Mar 25, 2015 at 7:24 AM, Mark Adams <mfadams at lbl.gov> wrote:
> > >
> > >
> > > On Sat, Mar 7, 2015 at 5:49 PM, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
> > >
> > > > On Mar 7, 2015, at 4:27 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > >
> > > >
> > > >
> > > > On Sat, Mar 7, 2015 at 3:11 PM, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
> > > >
> > > >   Hmm,  my first guess is a mixup between PetscMPIInt and PetscInt
> arrays  or MPIU_INT somewhere. (But compilers catch most of these)
> > > >
> > > > I have run with ~9K eq/core on 128K cores many times, but have never
> gotten the 32K/core (4B eq.) run to work.  So it looks like an int overflow
> issue.
> > >
> > >   Well, in theory, if we have done everything right with 64-bit indices
> there should never be an integer overflow (though of course there could be a
> mistake somewhere, but generally the compiler will detect if we are trying
> to stick a 64-bit int into a 32-bit slot).
> > >
> > >    Can you try the same example without GAMG (say with hypre instead)?
> If it goes through OK, it might indicate an int issue either in GAMG or the
> code that GAMG calls.
> > >
> > >
> > > Good idea,  Jacobi works.  I will try a stripped down GAMG next.
> > >
> > >
> > >
> > >
> > >   Barry
> > >
> > > >
> > > >
> > > >   Another possibility is a bug in the MPI for many large messages; any
> chance you can run the same thing on a very different system? Mira?
> > > >
> > > > Not soon, but I have Chombo and PETSc built on Mira and it would not
> be hard to get this code setup and try it.
> > > >
> > > > This is for SCE15, so I will turn the PETSc test off unless someone
> has any ideas on something to try.  I am using maint; perhaps I should use
> master.
> > > >
> > > > Mark
> > > >
> > > >
> > > >   Barry
> > > >
> > > > > On Mar 7, 2015, at 1:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > > >
> > > > > I seem to be getting this error on Edison with 128K cores and ~4
> billion equations.  I've seen this error several times.  I've attached a
> recent output from this.  I wonder if it is an integer overflow.  This is
> built with 64-bit integers, but I notice that GAMG prints out N and I see
> N=0 for the finest level.
> > > > >
> > > > > Mark
> > > > > <out.131072.uniform.txt>
> > > >
> > > >
> >
> >
>
>