[petsc-dev] "thread safe"

Mark Adams mfadams at lbl.gov
Sun Feb 22 11:21:10 CST 2015


Barry, I get three errors with -ksp_converged_reason using your branch.

Thanks,
Mark

Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
[82]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
[82]PETSC ERROR: Argument out of range
[82]PETSC ERROR: Too many pushes
[82]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
for trouble shooting.
[82]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2014-g463f016  GIT
Date: 2015-02-21 11:26:56 -0600
[82]PETSC ERROR: ../../epsi/XGCa/xgca on a arch-xc30-optts-intel named
nid03897 by madams Sun Feb 22 09:12:35 2015
[82]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo"
--CXXOPTFLAGS="-fast -no-ipo" --FOPTFLAGS="-fast -no-ipo"
--download-parmetis --download-metis --with-ssl=0 --with-threadsafety
--with-log=0 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
--with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn
--with-fortranlib-autodetect=0
--with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/
--with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++
PETSC_ARCH=arch-xc30-optts-intel
PETSC_DIR=/global/homes/m/madams/petsc-barry
[82]PETSC ERROR: #1 PetscViewerPushFormat() line 144 in
/global/u2/m/madams/petsc-barry/src/sys/classes/viewer/interface/viewa.c
[82]PETSC ERROR: #2 KSPReasonViewFromOptionsUnsafe() line 424 in
/global/u2/m/madams/petsc-barry/src/ksp/ksp/interface/itfunc.c
[82]PETSC ERROR: #3 KSPSolve() line 592 in
/global/u2/m/madams/petsc-barry/src/ksp/ksp/interface/itfunc.c
Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1

 [snip]

[13]PETSC ERROR:
------------------------------------------------------------------------
[13]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
probably memory access out of range
[13]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[13]PETSC ERROR: or see
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[13]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
to find memory corruption errors
[13]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and
run
[13]PETSC ERROR: to get more information on the crash.
Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
[13]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
[13]PETSC ERROR: Signal received
[13]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
for trouble shooting.
[13]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2014-g463f016  GIT
Date: 2015-02-21 11:26:56 -0600
[13]PETSC ERROR: ../../epsi/XGCa/xgca on a arch-xc30-optts-intel named
nid00713 by madams Sun Feb 22 09:12:35 2015
[13]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo"
--CXXOPTFLAGS="-fast -no-ipo" --FOPTFLAGS="-fast -no-ipo"
--download-parmetis --download-metis --with-ssl=0 --with-threadsafety
--with-log=0 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
--with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn
--with-fortranlib-autodetect=0
--with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/
--with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++
PETSC_ARCH=arch-xc30-optts-intel
PETSC_DIR=/global/homes/m/madams/petsc-barry
[13]PETSC ERROR: #1 User provided function() line 0 in  unknown file
Rank 13 [Sun Feb 22 09:13:04 2015] [c3-0c2s2n1] application called
MPI_Abort(MPI_COMM_WORLD, 59) - process 13

 [snip]

Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source

xgca               0000000002E10A41  Unknown               Unknown  Unknown
xgca               0000000002E0F197  Unknown               Unknown  Unknown
xgca               0000000002DC5B24  Unknown               Unknown  Unknown
xgca               0000000002DC5936  Unknown               Unknown  Unknown
xgca               0000000002D59C64  Unknown               Unknown  Unknown
xgca               0000000002D60BE1  Unknown               Unknown  Unknown
xgca               00000000015217D0  Unknown               Unknown  Unknown
xgca               000000000152178B  Unknown               Unknown  Unknown
xgca               0000000002E32271  Unknown               Unknown  Unknown
xgca               0000000002BDCE52  Unknown               Unknown  Unknown
xgca               0000000002BACDE3  Unknown               Unknown  Unknown
xgca               0000000000A6DAB9  Unknown               Unknown  Unknown
xgca               0000000000A6D394  Unknown               Unknown  Unknown
xgca               00000000015217D0  Unknown               Unknown  Unknown
xgca               00000000008BEE48  Unknown               Unknown  Unknown
xgca               00000000008BE456  Unknown               Unknown  Unknown
xgca               0000000000F0E10F  Unknown               Unknown  Unknown
xgca               0000000000F0A1F2  Unknown               Unknown  Unknown
xgca               0000000000A373F2  Unknown               Unknown  Unknown
xgca               0000000000581957  petsc_lu_solver_          973
 collisionf2.F90
xgca               000000000057EB05  col_f_picard_step         372
 collisionf2.F90
xgca               0000000000564D6A  col_f_core_s_             945
 collisionf.F90
xgca               000000000056325F  f_collision_singl         254
 collisionf.F90
xgca               0000000000560409  f_collision_singl         350
 collisionf.F90
xgca               0000000002B1EF43  Unknown               UnLinear col_f_
solve converged due to CONVERGED_RTOL iterations 1
Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
known  Unknown


On Sat, Feb 21, 2015 at 12:30 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
> barry/threadsafety-kspconvergedreason-kspmonitor
>
> Note that as always the monitoring and converged reasons for the various
> threads will be printed jumbled up
>
> In your own code make sure that any routines that use the default viewers
> (like stdout) are in an omp critical section
>
>  Barry
>
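A minimal sketch of the advice above, in plain C with OpenMP rather than PETSc calls, so that it compiles on its own. The helper names report_converged and report_all and the critical-section name default_viewer are made up for illustration. Each thread formats its message into a private buffer and only touches the shared stream inside a named critical section, so lines from different threads may still appear in any order but are not torn mid-line.

```c
#include <stdio.h>

/* Hypothetical helper: report one solve's outcome on a shared stream. */
int report_converged(FILE *out, int solve_id, int iterations)
{
  char line[128];
  int  written;

  /* Format into a thread-private buffer; no shared state is touched here. */
  snprintf(line, sizeof(line), "solve %d converged, iterations %d\n",
           solve_id, iterations);

  /* Only one thread at a time may write to the shared stream. */
  #pragma omp critical(default_viewer)
  {
    written = fputs(line, out);
  }
  return written == EOF ? 1 : 0;
}

/* Report nsolves outcomes from an OpenMP loop.
   Returns the number of failed writes (0 on success). */
int report_all(FILE *out, int nsolves)
{
  int failures = 0;

  #pragma omp parallel for reduction(+:failures)
  for (int i = 0; i < nsolves; i++) {
    failures += report_converged(out, i, 1);
  }
  return failures;
}
```

Calling report_all(stdout, 16) from the application prints 16 lines in an unspecified order. Built without OpenMP support the pragmas are ignored and the code runs serially.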
> > On Feb 20, 2015, at 8:22 PM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> > OK, I have a code setup to test it so feel free to make a branch and I can
> test it.
> > Mark
> >
> > On Fri, Feb 20, 2015 at 7:13 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >   Mark,
> >
> >    Yes, after looking at the code it does make sense. The reason is that
> Matt made me "improve" the -xxx_converged_reason to use viewers; but in
> your case there will be multiple threads (each associated with different
> KSP objects) each monkeying with the same (default) viewer thus possibly
> corrupting it.
> >
> >    I'll have to think a little bit about the best way to keep the
> functionality but be thread safe.
> >
> >   Barry
> >
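A toy model of the problem described above, not PETSc's actual viewer code. It assumes the shared viewer keeps a small fixed-depth format stack (the SharedViewer type, MAX_FORMATS and the function names are all hypothetical) and that each solve pushes a format, prints, and pops. If several threads did this on one viewer without protection, the depth counter could be updated concurrently, which is one plausible route to an error like "Too many pushes". Wrapping the whole push/print/pop sequence in one critical section keeps the stack balanced.

```c
#include <stdio.h>

#define MAX_FORMATS 8  /* hypothetical depth limit for the toy format stack */

/* Toy stand-in for a viewer shared by all threads. */
typedef struct {
  int formats[MAX_FORMATS];
  int depth;
} SharedViewer;

/* Push a format; returns nonzero if the stack is full. */
static int viewer_push(SharedViewer *v, int fmt)
{
  if (v->depth >= MAX_FORMATS) return 1;
  v->formats[v->depth++] = fmt;
  return 0;
}

/* Pop a format; returns nonzero if the stack is empty. */
static int viewer_pop(SharedViewer *v)
{
  if (v->depth <= 0) return 1;
  v->depth--;
  return 0;
}

/* Each iteration stands for one thread's solve reporting through the
   shared viewer. Returns the number of push/pop errors seen. */
int view_all_reasons(SharedViewer *v, int nsolves)
{
  int errors = 0;

  #pragma omp parallel for
  for (int i = 0; i < nsolves; i++) {
    /* The whole push/print/pop sequence is one critical section, so the
       stack depth and the error count are only changed by one thread at
       a time. */
    #pragma omp critical(shared_viewer)
    {
      if (viewer_push(v, 1)) {
        errors++;
      } else {
        printf("solve %d converged\n", i);
        if (viewer_pop(v)) errors++;
      }
    }
  }
  return errors;
}
```

With the critical section in place the viewer's depth is back to zero after the loop and no errors are counted. Removing the pragma in an OpenMP build makes the updates to depth a data race.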
> > > On Feb 20, 2015, at 5:57 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > >
> > > Barry,
> > >
> > > We had a problem with the thread safe version and found, by pure luck,
> that apparently if we use -ksp_converged_reason we get segv type failure.
> Does this sound sensible?
> > >
> > > I can give you an executable and environment to run this on Edison if
> that is useful.
> > >
> > > Thanks,
> > > Mark
> > >
> > >
> > > On Tue, Feb 17, 2015 at 9:27 PM, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
> > >
> > >   You need to configure with --with-threadsafety and --with-log=0 and
> --with-debugging=0
> > >
> > >   Eventually we'll support at least the debugging with thread safety.
> > >
> > >   Barry
> > >
> > > Not sure about that strange message from the cray system.
> > >
> > >
> > > > On Feb 17, 2015, at 8:14 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > >
> > > > We have been testing master with a code that calls PETSc serial LU
> solvers from threads.  I have seen system messages with OMP (see way below)
> and Robert (cc'ed) reported this useful stack trace.
> > > >
> > > > I have not modified my (non-thread) build.  Perhaps I need to, or are
> there PETSc runtime options?
> > > >
> > > > This is a Cray XC30 with Intel.
> > > >
> > > > Thanks,
> > > > Mark
> > > >
> > > > [116]PETSC ERROR: Object is in wrong state
> > > > [116]PETSC ERROR: Logging event had unbalanced begin/end pairs
> > > > [116]PETSC ERROR: See
> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > > > [116]PETSC ERROR: Petsc Development GIT revision:
> v3.5.3-1570-gcaf1481  GIT Date: 2015-02-07 17:34:17 -0600
> > > > [116]PETSC ERROR: ./xgca_petsc36_col on a arch-xc30-opt64-intel
> named nid05975 by rhager Tue Feb 17 10:46:32 2015
> > > > [116]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo"
> --CXXOPTFLAGS="-fast -no-ipoi" --FOPTFLAGS="-fast -no-ipo" --download-hypre
> --download-superlu_dist --
> > > > download-parmetis --download-metis --with-ssl=0 --with-cc=cc
> --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0
> --with-debugging=0 --with-fc=ftn --with
> > > > -fortranlib-autodetect=0
> --with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/
> --with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++
> --with-64-b
> > > > it-indices PETSC_ARCH=arch-xc30-opt64-intel
> PETSC_DIR=/global/u2/m/madams/petsc_master
> > > > [116]PETSC ERROR: #1 PetscLogEventEndDefault() line 694 in
> /global/u2/m/madams/petsc_master/src/sys/logging/utils/eventlog.c
> > > > [116]PETSC ERROR: #2 MatLUFactorSymbolic() line 2894 in
> /global/u2/m/madams/petsc_master/src/mat/interface/matrix.c
> > > > [116]PETSC ERROR: #3 PCSetUp_LU() line 127 in
> /global/u2/m/madams/petsc_master/src/ksp/pc/impls/factor/lu/lu.c
> > > > [116]PETSC ERROR: #4 PCSetUp() line 918 in
> /global/u2/m/madams/petsc_master/src/ksp/pc/interface/precon.c
> > > > [116]PETSC ERROR: #5 KSPSetUp() line 306 in
> /global/u2/m/madams/petsc_master/src/ksp/ksp/interface/itfunc.c
> > > > [116]PETSC ERROR: #6 KSPSolve() line 503 in
> /global/u2/m/madams/petsc_master/src/ksp/ksp/interface/itfunc.c
> > > >
> > > >
> > > > Other error message:
> > > >
> > > >
> > > > OMP: Error #13: Assertion failure at kmp_runtime.c(1588).
> > > > OMP: Hint: Please submit a bug report with this message, compile and
> run commands used, and machine configuration info including native compiler
> and operating system versions. Faster response will be obtained by
> including all program sources. For information on submitting this issue,
> please see http://www.intel.com/software/products/support/.
> > > > _pmiu_daemon(SIGCHLD): [NID 05979] [c7-3c0s6n3] [Tue Feb 17 15:14:43
> 2015] PE RANK 23 exit signal Killed
> > > > _pmiu_daemon(SIGCHLD): [NID 05976] [c7-3c0s6n0] [Tue Feb 17 15:14:43
> 2015] PE RANK 10 exit signal Killed
> > > > [NID 05979] 2015-02-17 15:14:43 Apid 10147992: initiated application
> termination
> > > > [NID 05979] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 239]. Please contact admin for details. Killing
> pid 18637(xgca)
> > > > [NID 05976] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 73]. Please contact admin for details. Killing
> pid 15380(xgca)
> > > > [NID 05984] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 34636(xgca)
> > > > [NID 05988] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 59]. Please contact admin for details. Killing
> pid 38496(xgca)
> > > > [NID 06019] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 11132(xgca)
> > > > [NID 05980] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 8320(xgca)
> > > > [NID 05993] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 46182(xgca)
> > > > [NID 06020] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 249]. Please contact admin for details. Killing
> pid 23753(xgca)
> > > > [NID 05987] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 87]. Please contact admin for details. Killing
> pid 11254(xgca)
> > > > [NID 05986] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 41]. Please contact admin for details. Killing
> pid 6630(xgca)
> > > > [NID 05981] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 31]. Please contact admin for details. Killing
> pid 10520(xgca)
> > > > [NID 05999] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 7]. Please contact admin for details. Killing
> pid 1843(xgca)
> > > > [NID 05985] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 26498(xgca)
> > > > [NID 05998] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 209]. Please contact admin for details. Killing
> pid 20387(xgca)
> > > > [NID 05994] 2015-02-17 15:14:53 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 39462(xgca)
> > > > [NID 05983] 2015-02-17 15:14:53 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 18598(xgca)
> > > > [NID 05995] 2015-02-17 15:14:54 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 42322(xgca)
> > > > [NID 05996] 2015-02-17 15:14:54 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 34248(xgca)
> > > > [NID 05978] 2015-02-17 15:14:55 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 9483(xgca)
> > > > [NID 05975] 2015-02-17 15:14:56 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 11470(xgca)
> > > > Application 10147992 exit codes: 137
> > > > Application 10147992 exit signals: Killed
> > > > Application 10147992 resources: utime ~2194s, stime ~199s, Rss
> ~488560, inblocks ~908164, outblocks ~2571652
> > > >
> > >
> > >
> >
> >
>
>