[petsc-users] Error with parallel solve

Balay, Satish balay at mcs.anl.gov
Mon Apr 8 14:38:01 CDT 2019


https://github.com/spack/spack/pull/11132

If this works, please add a comment on the PR.

Satish

On Mon, 8 Apr 2019, Balay, Satish via petsc-users wrote:

> The spack update to use mumps-5.1.2 with this patch is in branch 'balay/mumps-5.1.2'.
> 
> Satish
> 
> On Mon, 8 Apr 2019, Satish Balay wrote:
> 
> > Yes - mumps via spack is unlikely to have this patch - but it can be added.
> > 
> > https://bitbucket.org/petsc/pkg-mumps/commits/5fe5b9e56f78de2b7b1c199688f6c73ff3ff4c2d
> > 
> > Satish
> > 
> > On Mon, 8 Apr 2019, Manav Bhatia wrote:
> > 
> > > This is helpful, Thibaut. Thanks! 
> > > 
> > > For reference: all my Linux installs use Spack, while my Mac install is through a PETSc configure where I let it download and install MUMPS. 
> > > 
> > > Could this be the source of the difference in patch level for MUMPS? 
> > > 
> > > 
> > > > On Apr 8, 2019, at 1:56 PM, Appel, Thibaut <t.appel17 at imperial.ac.uk> wrote:
> > > > 
> > > > Hi Manav,
> > > > This seems to be the bug in MUMPS that I reported to their developers last summer.
> > > > But I thought Satish Balay had issued a patch in the maint branch of PETSc to correct that a few months ago?
> > > > The temporary workaround was to disable the ScaLAPACK root node with ICNTL(13)=1.
> > > > One of the developers said later
> > > >> A workaround consists in modifying the file src/dtype3_root.F near line 808
> > > >> and replace the lines:
> > > >> 
> > > >>       SUBROUTINE DMUMPS_INIT_ROOT_FAC( N, root, FILS, IROOT,
> > > >>      &                                 KEEP, INFO )
> > > >>       IMPLICIT NONE
> > > >>       INCLUDE 'dmumps_root.h'
> > > >> by:
> > > >> 
> > > >>       SUBROUTINE DMUMPS_INIT_ROOT_FAC( N, root, FILS, IROOT,
> > > >>      &                                 KEEP, INFO )
> > > >>       USE DMUMPS_STRUC_DEF
> > > >>       IMPLICIT NONE
> > > >> 
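[Editor's note: a sketch of how the ICNTL(13)=1 workaround can be applied from the PETSc side, without patching MUMPS. The executable name and process count are placeholders; the option names follow PETSc's MUMPS interface.]

```sh
# Use MUMPS as the LU solver and disable the ScaLAPACK root node:
# -mat_mumps_icntl_13 1 maps to MUMPS ICNTL(13)=1.
mpiexec -n 4 ./app -ksp_type preonly -pc_type lu \
    -pc_factor_mat_solver_type mumps -mat_mumps_icntl_13 1
```

The same control can be set in code with MatMumpsSetIcntl(F, 13, 1) on the factored matrix, before the numerical factorization.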
> > > > 
> > > > Weird that you’re getting this now if it has been corrected in PETSc?
> > > > 
> > > > Thibaut
> > > >> 
> > > >> > On Apr 8, 2019, at 1:33 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > >> > 
> > > >> > Are you able to run the exact same job on your Mac? ie, same number of processes, etc.
> > > >> 
> > > >> This is what I am trying to dig into now. 
> > > >> 
> > > >> My Mac has 4 cores. 
> > > >> 
> > > >> I have used several different Linux machines with different number of processors: 4, 12, 10, 20. They all eventually crash. 
> > > >> 
> > > >> I am trying to establish if the point of crash is the same across machines. 
> > > >> 
> > > >> -Manav
> > > > 
> > > > 
> > > >> On 8 Apr 2019, at 20:24, petsc-users-request at mcs.anl.gov wrote:
> > > >> 
> > > >> 
> > > >> Today's Topics:
> > > >> 
> > > >>   1.  Error with parallel solve (Manav Bhatia)
> > > >>   2. Re:  Error with parallel solve (Smith, Barry F.)
> > > >>   3. Re:  Error with parallel solve (Mark Adams)
> > > >>   4. Re:  Error with parallel solve (Manav Bhatia)
> > > >> 
> > > >> 
> > > >> ----------------------------------------------------------------------
> > > >> 
> > > >> Message: 1
> > > >> Date: Mon, 8 Apr 2019 12:12:06 -0500
> > > >> From: Manav Bhatia <bhatiamanav at gmail.com>
> > > >> To: Evan Um via petsc-users <petsc-users at mcs.anl.gov>
> > > >> Subject: [petsc-users] Error with parallel solve
> > > >> Message-ID: <BB21322F-A7C6-4D93-98B4-4B2D0D484724 at gmail.com>
> > > >> Content-Type: text/plain; charset="us-ascii"
> > > >> 
> > > >> 
> > > >> Hi,
> > > >> 
> > > >>    I am running a nonlinear simulation using mesh refinement on libMesh. The code runs without issues on a Mac (it can run for days), but crashes on Linux (CentOS 6). On Linux I am using PETSc 3.11 with Open MPI 3.1.3 and GCC 8.2. 
> > > >> 
> > > >>    I tried the -on_error_attach_debugger option, but it only gave me the message below. Does it suggest anything to more experienced eyes? 
> > > >> 
> > > >>    I am going to try to build a debug version of petsc to figure out what is going wrong. I will get and share more detailed logs in a bit. 
> > > >> 
> > > >> Regards,
> > > >> Manav
> > > >> 
> > > >> [8]PETSC ERROR: ------------------------------------------------------------------------
> > > >> [8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> > > >> [8]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > > >> [8]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > > >> [8]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > > >> [8]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
> > > >> [8]PETSC ERROR: to get more information on the crash.
> > > >> [8]PETSC ERROR: User provided function() line 0 in  unknown file  
> > > >> PETSC: Attaching gdb to /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5 of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > > >> PETSC: Attaching gdb to /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5 of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > > >>           0 :INTERNAL Error: recvd root arrowhead 
> > > >>           0 :not belonging to me. IARR,JARR=       67525       67525
> > > >>           0 :IROW_GRID,JCOL_GRID=           0           4
> > > >>           0 :MYROW, MYCOL=           0           0
> > > >>           0 :IPOSROOT,JPOSROOT=    92264688    92264688
> > > >> --------------------------------------------------------------------------
> > > >> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> > > >> with errorcode -99.
> > > >> 
> > > >> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > > >> You may or may not see output from other processes, depending on
> > > >> exactly when Open MPI kills them.
> > > >> --------------------------------------------------------------------------
> > > >> 
> > > >> 
> > > >> ------------------------------
> > > >> 
> > > >> Message: 2
> > > >> Date: Mon, 8 Apr 2019 17:36:53 +0000
> > > >> From: "Smith, Barry F." <bsmith at mcs.anl.gov>
> > > >> To: Manav Bhatia <bhatiamanav at gmail.com>
> > > >> Cc: Evan Um via petsc-users <petsc-users at mcs.anl.gov>
> > > >> Subject: Re: [petsc-users] Error with parallel solve
> > > >> Message-ID: <B0E28830-AAF4-425B-8C7D-63686AF1B503 at anl.gov>
> > > >> Content-Type: text/plain; charset="us-ascii"
> > > >> 
> > > >>  Difficult to tell what is going on. 
> > > >> 
> > > >>  The message "User provided function() line 0 in unknown file" indicates the crash took place OUTSIDE of PETSc code, and the error message "INTERNAL Error: recvd root arrowhead" is definitely not coming from PETSc. 
> > > >> 
> > > >>   Yes, debug with the debug version and also try valgrind.
> > > >> 
> > > >>   Barry
> > > >> 
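[Editor's note: a minimal sketch of Barry's two suggestions. Paths, process count, and the existing configure options are placeholders.]

```sh
# 1. Rebuild PETSc with debugging symbols, then relink the application.
./configure PETSC_ARCH=arch-debug --with-debugging=yes \
    # ...plus the options used for the original build...

# 2. Run under valgrind through mpiexec to catch memory errors on each rank;
#    --log-file with %p writes one log per process.
mpiexec -n 4 valgrind --track-origins=yes --log-file=vg.%p.log ./app
```

With a debug build, the PETSc error handler prints full stack traces instead of "User provided function() line 0 in unknown file".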
> > > >> 
> > > >>> On Apr 8, 2019, at 12:12 PM, Manav Bhatia via petsc-users <petsc-users at mcs.anl.gov> wrote:
> > > >>> 
> > > >>> 
> > > >>> Hi,
> > > >>> 
> > > >>>    I am running a nonlinear simulation using mesh refinement on libMesh. The code runs without issues on a Mac (it can run for days), but crashes on Linux (CentOS 6). On Linux I am using PETSc 3.11 with Open MPI 3.1.3 and GCC 8.2. 
> > > >>> 
> > > >>>    I tried to use the -on_error_attach_debugger, but it only gave me this message. Does this message imply something to the more experienced eyes? 
> > > >>> 
> > > >>>    I am going to try to build a debug version of petsc to figure out what is going wrong. I will get and share more detailed logs in a bit. 
> > > >>> 
> > > >>> Regards,
> > > >>> Manav
> > > >>> 
> > > >>> [8]PETSC ERROR: ------------------------------------------------------------------------
> > > >>> [8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> > > >>> [8]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > > >>> [8]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > > >>> [8]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > > >>> [8]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
> > > >>> [8]PETSC ERROR: to get more information on the crash.
> > > >>> [8]PETSC ERROR: User provided function() line 0 in  unknown file  
> > > >>> PETSC: Attaching gdb to /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5 of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > > >>> PETSC: Attaching gdb to /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5 of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > > >>>           0 :INTERNAL Error: recvd root arrowhead 
> > > >>>           0 :not belonging to me. IARR,JARR=       67525       67525
> > > >>>           0 :IROW_GRID,JCOL_GRID=           0           4
> > > >>>           0 :MYROW, MYCOL=           0           0
> > > >>>           0 :IPOSROOT,JPOSROOT=    92264688    92264688
> > > >>> --------------------------------------------------------------------------
> > > >>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> > > >>> with errorcode -99.
> > > >>> 
> > > >>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > > >>> You may or may not see output from other processes, depending on
> > > >>> exactly when Open MPI kills them.
> > > >>> --------------------------------------------------------------------------
> > > >>> 
> > > >> 
> > > >> 
> > > >> 
> > > >> ------------------------------
> > > >> 
> > > >> Message: 3
> > > >> Date: Mon, 8 Apr 2019 13:58:55 -0400
> > > >> From: Mark Adams <mfadams at lbl.gov>
> > > >> To: "Smith, Barry F." <bsmith at mcs.anl.gov>
> > > >> Cc: Manav Bhatia <bhatiamanav at gmail.com>, 
> > > >> Evan Um via petsc-users
> > > >> <petsc-users at mcs.anl.gov>
> > > >> Subject: Re: [petsc-users] Error with parallel solve
> > > >> Message-ID:
> > > >> <CADOhEh6EZqtFkwzJYm0iCOMX5_uTfdbuH6t4WwU7_dUeY9KpzA at mail.gmail.com>
> > > >> Content-Type: text/plain; charset="utf-8"
> > > >> 
> > > >> This looks like an error in MUMPS:
> > > >> 
> > > >>        IF ( IROW_GRID .NE. root%MYROW .OR.
> > > >>     &       JCOL_GRID .NE. root%MYCOL ) THEN
> > > >>            WRITE(*,*) MYID,':INTERNAL Error: recvd root arrowhead '
> > > >> 
> > > >> 
> > > >> On Mon, Apr 8, 2019 at 1:37 PM Smith, Barry F. via petsc-users <
> > > >> petsc-users at mcs.anl.gov> wrote:
> > > >> 
> > > >>>  Difficult to tell what is going on.
> > > >>> 
> > > >>>  The message User provided function() line 0 in  unknown file  indicates
> > > >>> the crash took place OUTSIDE of PETSc code and error message INTERNAL
> > > >>> Error: recvd root arrowhead  is definitely not coming from PETSc.
> > > >>> 
> > > >>>   Yes, debug with the debug version and also try valgrind.
> > > >>> 
> > > >>>   Barry
> > > >>> 
> > > >>> 
> > > >>>> On Apr 8, 2019, at 12:12 PM, Manav Bhatia via petsc-users <
> > > >>> petsc-users at mcs.anl.gov> wrote:
> > > >>>> 
> > > >>>> 
> > > >>>> Hi,
> > > >>>> 
> > > >>>>    I am running a nonlinear simulation using mesh-refinement on
> > > >>> libMesh. The code runs without issues on a Mac (can run for days without
> > > >>> issues), but crashes on Linux (Centos 6). I am using version 3.11 on Linux
> > > >>> with openmpi 3.1.3 and gcc8.2.
> > > >>>> 
> > > >>>>    I tried to use the -on_error_attach_debugger, but it only gave me
> > > >>> this message. Does this message imply something to the more experienced
> > > >>> eyes?
> > > >>>> 
> > > >>>>    I am going to try to build a debug version of petsc to figure out
> > > >>> what is going wrong. I will get and share more detailed logs in a bit.
> > > >>>> 
> > > >>>> Regards,
> > > >>>> Manav
> > > >>>> 
> > > >>>> [8]PETSC ERROR:
> > > >>> ------------------------------------------------------------------------
> > > >>>> [8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> > > >>> probably memory access out of range
> > > >>>> [8]PETSC ERROR: Try option -start_in_debugger or
> > > >>> -on_error_attach_debugger
> > > >>>> [8]PETSC ERROR: or see
> > > >>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > > >>>> [8]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
> > > >>> OS X to find memory corruption errors
> > > >>>> [8]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> > > >>> and run
> > > >>>> [8]PETSC ERROR: to get more information on the crash.
> > > >>>> [8]PETSC ERROR: User provided function() line 0 in  unknown file
> > > >>>> PETSC: Attaching gdb to
> > > >>> /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
> > > >>> of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > > >>>> PETSC: Attaching gdb to
> > > >>> /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
> > > >>> of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > > >>>>           0 :INTERNAL Error: recvd root arrowhead
> > > >>>>           0 :not belonging to me. IARR,JARR=       67525       67525
> > > >>>>           0 :IROW_GRID,JCOL_GRID=           0           4
> > > >>>>           0 :MYROW, MYCOL=           0           0
> > > >>>>           0 :IPOSROOT,JPOSROOT=    92264688    92264688
> > > >>>> 
> > > >>> --------------------------------------------------------------------------
> > > >>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> > > >>>> with errorcode -99.
> > > >>>> 
> > > >>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > > >>>> You may or may not see output from other processes, depending on
> > > >>>> exactly when Open MPI kills them.
> > > >>>> 
> > > >>> --------------------------------------------------------------------------
> > > >>>> 
> > > >>> 
> > > >>> 
> > > >> 
> > > >> ------------------------------
> > > >> 
> > > >> Message: 4
> > > >> Date: Mon, 8 Apr 2019 13:23:14 -0500
> > > >> From: Manav Bhatia <bhatiamanav at gmail.com>
> > > >> To: Mark Adams <mfadams at lbl.gov>
> > > >> Cc: "Smith, Barry F." <bsmith at mcs.anl.gov>,
> > > >> Evan Um via petsc-users
> > > >> <petsc-users at mcs.anl.gov>
> > > >> Subject: Re: [petsc-users] Error with parallel solve
> > > >> Message-ID: <E1168C22-17F5-4A5F-870A-88F0D5FBFA31 at gmail.com>
> > > >> Content-Type: text/plain; charset="us-ascii"
> > > >> 
> > > >> Thanks for identifying this, Mark. 
> > > >> 
> > > >> If I compile a debug version of PETSc, will it also build a debug version of MUMPS? 
> > > >> 
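[Editor's note: when MUMPS is installed by PETSc's configure, as on the Mac install above, a debug rebuild would look roughly like the following. The exact download options are an assumption, and whether the debug flags propagate into the downloaded MUMPS is worth verifying.]

```sh
# Debug PETSc build that also downloads and compiles MUMPS and its dependencies.
./configure --with-debugging=yes --download-mumps --download-scalapack \
    --download-metis --download-parmetis
```

A Spack-installed MUMPS, by contrast, is built separately and would need its own debug variant.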
> > > >>> On Apr 8, 2019, at 12:58 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > >>> 
> > > >>> This looks like an error in MUMPS:
> > > >>> 
> > > >>>        IF ( IROW_GRID .NE. root%MYROW .OR.
> > > >>>     &       JCOL_GRID .NE. root%MYCOL ) THEN
> > > >>>            WRITE(*,*) MYID,':INTERNAL Error: recvd root arrowhead '
> > > >>> 
> > > >>> On Mon, Apr 8, 2019 at 1:37 PM Smith, Barry F. via petsc-users <petsc-users at mcs.anl.gov> wrote:
> > > >>>  Difficult to tell what is going on. 
> > > >>> 
> > > >>>  The message User provided function() line 0 in  unknown file  indicates the crash took place OUTSIDE of PETSc code and error message INTERNAL Error: recvd root arrowhead  is definitely not coming from PETSc. 
> > > >>> 
> > > >>>   Yes, debug with the debug version and also try valgrind.
> > > >>> 
> > > >>>   Barry
> > > >>> 
> > > >>> 
> > > >>>> On Apr 8, 2019, at 12:12 PM, Manav Bhatia via petsc-users <petsc-users at mcs.anl.gov> wrote:
> > > >>>> 
> > > >>>> 
> > > >>>> Hi,
> > > >>>> 
> > > >>>>    I am running a nonlinear simulation using mesh refinement on libMesh. The code runs without issues on a Mac (it can run for days), but crashes on Linux (CentOS 6). On Linux I am using PETSc 3.11 with Open MPI 3.1.3 and GCC 8.2. 
> > > >>>> 
> > > >>>>    I tried to use the -on_error_attach_debugger, but it only gave me this message. Does this message imply something to the more experienced eyes? 
> > > >>>> 
> > > >>>>    I am going to try to build a debug version of petsc to figure out what is going wrong. I will get and share more detailed logs in a bit. 
> > > >>>> 
> > > >>>> Regards,
> > > >>>> Manav
> > > >>>> 
> > > >>>> [8]PETSC ERROR: ------------------------------------------------------------------------
> > > >>>> [8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> > > >>>> [8]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > > >>>> [8]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > > >>>> [8]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > > >>>> [8]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
> > > >>>> [8]PETSC ERROR: to get more information on the crash.
> > > >>>> [8]PETSC ERROR: User provided function() line 0 in  unknown file  
> > > >>>> PETSC: Attaching gdb to /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5 of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > > >>>> PETSC: Attaching gdb to /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5 of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > > >>>>           0 :INTERNAL Error: recvd root arrowhead 
> > > >>>>           0 :not belonging to me. IARR,JARR=       67525       67525
> > > >>>>           0 :IROW_GRID,JCOL_GRID=           0           4
> > > >>>>           0 :MYROW, MYCOL=           0           0
> > > >>>>           0 :IPOSROOT,JPOSROOT=    92264688    92264688
> > > >>>> --------------------------------------------------------------------------
> > > >>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> > > >>>> with errorcode -99.
> > > >>>> 
> > > >>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > > >>>> You may or may not see output from other processes, depending on
> > > >>>> exactly when Open MPI kills them.
> > > >>>> --------------------------------------------------------------------------
> > > >>>> 
> > > >>> 
> > > >> 
> > > >> 
> > > >> ------------------------------
> > > >> 
> > > >> 
> > > >> 
> > > >> ------------------------------
> > > >> 
> > > >> End of petsc-users Digest, Vol 124, Issue 31
> > > >> ********************************************
> > > > 
> > > 
> > > 
> > 
> 

