[petsc-users] parallel IO messages
Barry Smith
bsmith at mcs.anl.gov
Fri Nov 27 15:29:14 CST 2015
SIGTRAP is a way a process can interact with itself or another process asynchronously. It is possible that, in all the mess of HDF5/MPI IO/OS code that manages moving the data in parallel from MPI process memory to the hard disk, some of the code uses SIGTRAP. By default, PETSc always traps SIGTRAP, treating it as an indication of an error condition. The "randomness" could come from the fact that, depending on how quickly the data is moving from the MPI processes to the disk, the mess of code may only sometimes actually use a SIGTRAP. I could also be totally wrong, and the SIGTRAP may just be triggered by errors in the IO system. Anyway, give my suggestion a try and see if it helps; there is nothing else you can do.
Barry
> On Nov 27, 2015, at 2:27 PM, Fande Kong <fdkong.jd at gmail.com> wrote:
>
> Thanks, Barry,
>
> I was also wondering why this happens randomly? Any explanations? If this were something in PETSc, shouldn't it happen every time?
>
> Thanks,
>
> Fande Kong,
>
> On Fri, Nov 27, 2015 at 1:20 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> Edit PETSC_ARCH/include/petscconf.h and add
>
> #if !defined(PETSC_MISSING_SIGTRAP)
> #define PETSC_MISSING_SIGTRAP
> #endif
>
> then do
>
> make gnumake
>
> It is possible that the system you are using uses SIGTRAP in managing the IO; by making the change above you are telling PETSc to ignore SIGTRAPs. Let us know how this works out.
>
> Barry
>
>
> > On Nov 27, 2015, at 1:05 PM, Fande Kong <fdkong.jd at gmail.com> wrote:
> >
> > Hi all,
> >
> > I implemented a parallel IO based on the Vec and IS which uses HDF5. I am testing this loader on a supercomputer. I occasionally (not always) encounter the following errors (using 8192 cores):
> >
> > [7689]PETSC ERROR: ------------------------------------------------------------------------
> > [7689]PETSC ERROR: Caught signal number 5 TRAP
> > [7689]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [7689]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [7689]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > [7689]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> > [7689]PETSC ERROR: to get more information on the crash.
> > [7689]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > [7689]PETSC ERROR: Signal received
> > [7689]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > [7689]PETSC ERROR: Petsc Release Version 3.6.2, unknown
> > [7689]PETSC ERROR: ./fsi on a arch-linux2-cxx-opt named ys6103 by fandek Fri Nov 27 11:26:30 2015
> > [7689]PETSC ERROR: Configure options --with-clanguage=cxx --with-shared-libraries=1 --download-fblaslapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-netcdf=1 --download-exodusii=1 --with-hdf5-dir=/glade/apps/opt/hdf5-mpi/1.8.12/intel/12.1.5 --with-debugging=no --with-c2html=0 --with-64-bit-indices=1
> > [7689]PETSC ERROR: #1 User provided function() line 0 in unknown file
> > Abort(59) on node 7689 (rank 7689 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 7689
> > ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 7689
> >
> > Make and configure logs are attached.
> >
> > Thanks,
> >
> > Fande Kong,
> >
>
>