[petsc-dev] problems with master and MPICH version 1

Satish Balay balay at mcs.anl.gov
Mon Jun 8 17:29:02 CDT 2015


On Mon, 8 Jun 2015, Jed Brown wrote:

> Barry Smith <bsmith at mcs.anl.gov> writes:
> 
> >   We are having some problems with master and MPICH 1 in the nightly tests
> >
> > http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2015/06/08/examples_master_arch-linux-mpich1_steamroller.log
> >
> > I've done a lot of debugging including with valgrind and cannot
> > determine the problem. I'm concluding that it is a problem with how
> > they handle attributes since changing the number of attributes can
> > produce crashes or prevent crashes. It is flaky, like memory
> > corruption problems but valgrind is happy.
> 
> MPICH uses integer tables instead of pointers, hiding a lot of
> information from Valgrind.  When run in a debugger or Valgrind, what is
> the trace when SEGV is raised?  (I don't currently have an MPICH1
> built.)
> 
> It may well be an MPICH1 bug, but there is definitely no support at all
> for MPICH1 and hopefully people have stopped using it long ago.  If we
> do decide to end MPICH1 support, we can merge 'jed/mpi-2' (not
> necessarily for this release).

This can be reproduced on MCS linux boxes. The trigger is the
following commit:

http://bitbucket.org/petsc/petsc/commits/5c25fcd7c4e1ecb15ec7a0829572c7e72f90b2d9

[However, we don't see anything there that's closely relevant. The
current hypothesis is that the order/number of attributes added/deleted
changed, triggering the error.]

A Vec example [with VecView()] triggered this error before. With the
following change, that Vec example is now happy, but SNES examples
[with -snes_monitor_short] are crashing.


./configure  --with-mpi-dir=/homes/petsc/soft/build/mpich-1.2.7p1 --with-cxx=0 --with-fc=0 --with-shared-libraries=0

balay at es^/scratch/balay/petsc/src/snes/examples/tutorials(master=) $ ./ex1
Number of SNES iterations = 6
balay at es^/scratch/balay/petsc/src/snes/examples/tutorials(master=) $ valgrind --tool=memcheck -q ./ex1 -snes_monitor_short
  0 SNES Function norm 6.04152 
  1 SNES Function norm 4.78676 
  2 SNES Function norm 2.98646 
  3 SNES Function norm 0.230624 
  4 SNES Function norm 0.00193631 
  5 SNES Function norm 1.43559e-07 
  6 SNES Function norm < 1.e-11
Number of SNES iterations = 6
==6303== Invalid read of size 2
==6303==    at 0x13A54FC: MPIR_HBT_delete (util_hbt.c:575)
==6303==    by 0x13A756D: PMPI_Attr_delete (attr_delval.c:88)
==6303==    by 0x460BB8: Petsc_DelComm_Outer (pinit.c:362)
==6303==    by 0x13A7527: PMPI_Attr_delete (attr_delval.c:82)
==6303==    by 0x43F16B: PetscCommDestroy (tagm.c:237)
==6303==    by 0xA284CA: PetscHeaderDestroy_Private (inherit.c:121)
==6303==    by 0x4977BC: PetscViewerDestroy (view.c:108)
==6303==    by 0x440BA3: PetscObjectDestroy (destroy.c:73)
==6303==    by 0x4422B5: PetscObjectRegisterDestroyAll (destroy.c:251)
==6303==    by 0x464B62: PetscFinalize (pinit.c:1096)
==6303==    by 0x406562: main (ex1.c:143)
==6303==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==6303== 
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] Petsc_DelComm_Outer line 355 /scratch/balay/petsc/src/sys/objects/pinit.c
[0]PETSC ERROR: [0] PetscCommDestroy line 217 /scratch/balay/petsc/src/sys/objects/tagm.c
[0]PETSC ERROR: [0] PetscHeaderDestroy_Private line 101 /scratch/balay/petsc/src/sys/objects/inherit.c
[0]PETSC ERROR: [0] PetscViewerDestroy line 97 /scratch/balay/petsc/src/sys/classes/viewer/interface/view.c
[0]PETSC ERROR: [0] PetscObjectDestroy line 69 /scratch/balay/petsc/src/sys/objects/destroy.c
[0]PETSC ERROR: [0] PetscObjectRegisterDestroyAll line 249 /scratch/balay/petsc/src/sys/objects/destroy.c
[0]PETSC ERROR: [0] PetscFinalize line 956 /scratch/balay/petsc/src/sys/objects/pinit.c
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.5.4-3311-gf6dae19  GIT Date: 2015-06-08 16:14:54 -0600
[0]PETSC ERROR: ./ex1 on a arch-linux2-c-debug named es by balay Mon Jun  8 17:23:37 2015
[0]PETSC ERROR: Configure options --with-mpi-dir=/homes/petsc/soft/build/mpich-1.2.7p1 --with-cxx=0 --with-fc=0 --with-shared-libraries=0
[0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
[0] MPI Abort by user Aborting program !
[0] Aborting program!
p0_6303:  p4_error: : 59
balay at es^/scratch/balay/petsc/src/snes/examples/tutorials(master=) $ 
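
For anyone reading the trace above without an MPICH1 build at hand: the
valgrind frames show PMPI_Attr_delete re-entered from inside a delete
callback (PetscCommDestroy -> PMPI_Attr_delete -> Petsc_DelComm_Outer ->
PMPI_Attr_delete -> MPIR_HBT_delete). Below is a rough plain-C toy model
of that nesting only. It is not MPICH or PETSc code. Every name in it
(toy_comm, toy_attr_put, toy_attr_delete, KEY_INNER, KEY_OUTER, ...) is
made up for illustration, and the flat array merely stands in for
whatever table MPICH1 really uses.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTRS 8
#define KEY_INNER 1 /* hypothetical key: set on the outer comm, points at the inner comm */
#define KEY_OUTER 2 /* hypothetical key: set on the inner comm, points at the outer comm */

struct toy_comm;

/* Delete callback, loosely shaped like an MPI-1 attribute delete function. */
typedef int (*toy_delete_fn)(struct toy_comm *comm, int keyval, void *attr_val);

struct toy_attr {
  int           used;
  int           keyval;
  void         *val;
  toy_delete_fn del;
};

struct toy_comm {
  struct toy_attr attrs[MAX_ATTRS];
};

/* Return the slot holding keyval on comm, or NULL if it is not set. */
struct toy_attr *toy_find(struct toy_comm *comm, int keyval)
{
  for (int i = 0; i < MAX_ATTRS; i++) {
    if (comm->attrs[i].used && comm->attrs[i].keyval == keyval) return &comm->attrs[i];
  }
  return NULL;
}

/* Set an attribute. Returns 0 on success, 1 if the key is already set or the table is full. */
int toy_attr_put(struct toy_comm *comm, int keyval, void *val, toy_delete_fn del)
{
  if (toy_find(comm, keyval)) return 1;
  for (int i = 0; i < MAX_ATTRS; i++) {
    if (!comm->attrs[i].used) {
      comm->attrs[i].used   = 1;
      comm->attrs[i].keyval = keyval;
      comm->attrs[i].val    = val;
      comm->attrs[i].del    = del;
      return 0;
    }
  }
  return 1;
}

/* Delete an attribute: run its callback first, then release the slot.
   Returns 0 on success, 1 if the key was not set. */
int toy_attr_delete(struct toy_comm *comm, int keyval)
{
  struct toy_attr *a = toy_find(comm, keyval);
  if (!a) return 1;
  if (a->del) {
    int err = a->del(comm, keyval, a->val);
    if (err) return err;
  }
  /* Look the slot up again: the callback may itself have changed a table. */
  a = toy_find(comm, keyval);
  if (a) a->used = 0;
  return 0;
}

/* Callback attached to KEY_INNER on the outer comm. It re-enters
   toy_attr_delete on the inner comm, which is the nesting seen in the trace. */
int toy_del_inner_ref(struct toy_comm *comm, int keyval, void *attr_val)
{
  struct toy_comm *inner = (struct toy_comm *)attr_val;
  (void)comm;
  (void)keyval;
  return toy_attr_delete(inner, KEY_OUTER);
}

/* Build the outer/inner pair, delete the outer attribute, check both are gone.
   Returns 0 if the nested delete behaved as expected. */
int toy_demo(void)
{
  struct toy_comm outer = {0}, inner = {0};
  if (toy_attr_put(&outer, KEY_INNER, &inner, toy_del_inner_ref)) return 1;
  if (toy_attr_put(&inner, KEY_OUTER, &outer, NULL)) return 1;
  if (toy_attr_delete(&outer, KEY_INNER)) return 1;
  return (toy_find(&outer, KEY_INNER) || toy_find(&inner, KEY_OUTER)) ? 1 : 0;
}
```

Calling toy_demo() from a main() should return 0, i.e. one delete on the
outer table removes both entries. This models only the call pattern; it
says nothing about what goes wrong in util_hbt.c, and it does not test
the order/number-of-attributes hypothesis.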