[petsc-dev] "thread safe"

Mark Adams mfadams at lbl.gov
Sun Feb 22 23:50:57 CST 2015


On Sun, Feb 22, 2015 at 1:03 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>   Mark,
>
>     I think it is ignoring the OpenMP pragmas I added. Can you please
> configure with the additional argument --with-openmp?
>

That works.  Thanks,
Mark



>
> Barry
>
> > On Feb 22, 2015, at 11:21 AM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> > Barry, I get three errors with -ksp_converged_reason using your branch.
> >
> > Thanks,
> > Mark
> >
> > Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> > [82]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> > [82]PETSC ERROR: Argument out of range
> > [82]PETSC ERROR: Too many pushes
> > [82]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> > [82]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2014-g463f016
> GIT Date: 2015-02-21 11:26:56 -0600
> > [82]PETSC ERROR: ../../epsi/XGCa/xgca on a arch-xc30-optts-intel named
> nid03897 by madams Sun Feb 22 09:12:35 2015
> > [82]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo"
> --CXXOPTFLAGS="-fast -no-ipo" --FOPTFLAGS="-fast -no-ipo"
> --download-parmetis --download-metis --with-ssl=0 --with-threadsafety
> --with-log=0 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
> --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn
> --with-fortranlib-autodetect=0
> --with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/
> --with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++
> PETSC_ARCH=arch-xc30-optts-intel
> PETSC_DIR=/global/homes/m/madams/petsc-barry
> > [82]PETSC ERROR: #1 PetscViewerPushFormat() line 144 in
> /global/u2/m/madams/petsc-barry/src/sys/classes/viewer/interface/viewa.c
> > [82]PETSC ERROR: #2 KSPReasonViewFromOptionsUnsafe() line 424 in
> /global/u2/m/madams/petsc-barry/src/ksp/ksp/interface/itfunc.c
> > [82]PETSC ERROR: #3 KSPSolve() line 592 in
> /global/u2/m/madams/petsc-barry/src/ksp/ksp/interface/itfunc.c
> > Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> >
> >  [snip]
> >
> > [13]PETSC ERROR:
> ------------------------------------------------------------------------
> > [13]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> > [13]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> > [13]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [13]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
> OS X to find memory corruption errors
> > [13]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> > [13]PETSC ERROR: to get more information on the crash.
> > Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> > Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> > [13]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> > [13]PETSC ERROR: Signal received
> > [13]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> > [13]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2014-g463f016
> GIT Date: 2015-02-21 11:26:56 -0600
> > [13]PETSC ERROR: ../../epsi/XGCa/xgca on a arch-xc30-optts-intel named
> nid00713 by madams Sun Feb 22 09:12:35 2015
> > [13]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo"
> --CXXOPTFLAGS="-fast -no-ipo" --FOPTFLAGS="-fast -no-ipo"
> --download-parmetis --download-metis --with-ssl=0 --with-threadsafety
> --with-log=0 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
> --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn
> --with-fortranlib-autodetect=0
> --with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/
> --with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++
> PETSC_ARCH=arch-xc30-optts-intel
> PETSC_DIR=/global/homes/m/madams/petsc-barry
> > [13]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> > Rank 13 [Sun Feb 22 09:13:04 2015] [c3-0c2s2n1] application called
> MPI_Abort(MPI_COMM_WORLD, 59) - process 13
> >
> >  [snip]
> >
> > Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> > forrtl: error (76): Abort trap signal
> > Image              PC                Routine            Line
> Source
> > xgca               0000000002E10A41  Unknown               Unknown
> Unknown
> > xgca               0000000002E0F197  Unknown               Unknown
> Unknown
> > xgca               0000000002DC5B24  Unknown               Unknown
> Unknown
> > xgca               0000000002DC5936  Unknown               Unknown
> Unknown
> > xgca               0000000002D59C64  Unknown               Unknown
> Unknown
> > xgca               0000000002D60BE1  Unknown               Unknown
> Unknown
> > xgca               00000000015217D0  Unknown               Unknown
> Unknown
> > xgca               000000000152178B  Unknown               Unknown
> Unknown
> > xgca               0000000002E32271  Unknown               Unknown
> Unknown
> > xgca               0000000002BDCE52  Unknown               Unknown
> Unknown
> > xgca               0000000002BACDE3  Unknown               Unknown
> Unknown
> > xgca               0000000000A6DAB9  Unknown               Unknown
> Unknown
> > xgca               0000000000A6D394  Unknown               Unknown
> Unknown
> > xgca               00000000015217D0  Unknown               Unknown
> Unknown
> > xgca               00000000008BEE48  Unknown               Unknown
> Unknown
> > xgca               00000000008BE456  Unknown               Unknown
> Unknown
> > xgca               0000000000F0E10F  Unknown               Unknown
> Unknown
> > xgca               0000000000F0A1F2  Unknown               Unknown
> Unknown
> > xgca               0000000000A373F2  Unknown               Unknown
> Unknown
> > xgca               0000000000581957  petsc_lu_solver_          973
> collisionf2.F90
> > xgca               000000000057EB05  col_f_picard_step         372
> collisionf2.F90
> > xgca               0000000000564D6A  col_f_core_s_             945
> collisionf.F90
> > xgca               000000000056325F  f_collision_singl         254
> collisionf.F90
> > xgca               0000000000560409  f_collision_singl         350
> collisionf.F90
> > xgca               0000000002B1EF43  Unknown               UnLinear
> col_f_ solve converged due to CONVERGED_RTOL iterations 1
> > Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> > Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> > known  Unknown
> >
> >
> > On Sat, Feb 21, 2015 at 12:30 PM, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
> >
> > barry/threadsafety-kspconvergedreason-kspmonitor
> >
> > Note that as always the monitoring and converged reasons for the various
> threads will be printed jumbled up
> >
> > In your own code make sure that any routines that use the default
> viewers (like stdout) are in an omp critical section
> >
> >  Barry
> >
> > > On Feb 20, 2015, at 8:22 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > >
> > > OK, I have a code set up to test it so feel free to make a branch and I
> can test it.
> > > Mark
> > >
> > > On Fri, Feb 20, 2015 at 7:13 PM, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
> > >
> > >   Mark,
> > >
> > >    Yes, after looking at the code it does make sense. The reason is
> that Matt made me "improve" the -xxx_converged_reason to use viewers; but
> in your case there will be multiple threads (each associated with different
> KSP objects) each monkeying with the same (default) viewer thus possibly
> corrupting it.
> > >
> > >    I'll have to think a little bit about the best way to keep the
> functionality but be thread safe.
> > >
> > >   Barry
> > >
> > > > On Feb 20, 2015, at 5:57 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > >
> > > > Barry,
> > > >
> > > > We had a problem with the thread safe version and found, by pure
> luck, that apparently if we use -ksp_converged_reason we get a segv-type
> failure.  Does this sound sensible?
> > > >
> > > > I can give you an executable and environment to run this on Edison
> if that is useful.
> > > >
> > > > Thanks,
> > > > Mark
> > > >
> > > >
> > > > On Tue, Feb 17, 2015 at 9:27 PM, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
> > > >
> > > >   You need to configure with --with-threadsafety and --with-log=0
> and --with-debugging=0
> > > >
> > > >   Eventually we'll support at least the debugging with thread safety.
> > > >
> > > >   Barry
> > > >
> > > > Not sure about that strange message from the cray system.
> > > >
> > > >
> > > > > On Feb 17, 2015, at 8:14 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > > >
> > > > > We have been testing master with a code that calls PETSc serial LU
> solvers from threads.  I have seen system messages with OMP (see way below)
> and Robert (cc'ed) reported this useful stack trace.
> > > > >
> > > > > I have not modified my (non-thread) build.  Perhaps I need to, or
> are there PETSc runtime options?
> > > > >
> > > > > This is a Cray XC30 with Intel.
> > > > >
> > > > > Thanks,
> > > > > Mark
> > > > >
> > > > > [116]PETSC ERROR: Object is in wrong state
> > > > > [116]PETSC ERROR: Logging event had unbalanced begin/end pairs
> > > > > [116]PETSC ERROR: See
> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > > > > [116]PETSC ERROR: Petsc Development GIT revision:
> v3.5.3-1570-gcaf1481  GIT Date: 2015-02-07 17:34:17 -0600
> > > > > [116]PETSC ERROR: ./xgca_petsc36_col on a arch-xc30-opt64-intel
> named nid05975 by rhager Tue Feb 17 10:46:32 2015
> > > > > [116]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo"
> --CXXOPTFLAGS="-fast -no-ipoi" --FOPTFLAGS="-fast -no-ipo" --download-hypre
> --download-superlu_dist --download-parmetis --download-metis --with-ssl=0
> --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
> --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn
> --with-fortranlib-autodetect=0
> --with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/
> --with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++
> --with-64-bit-indices PETSC_ARCH=arch-xc30-opt64-intel
> PETSC_DIR=/global/u2/m/madams/petsc_master
> > > > > [116]PETSC ERROR: #1 PetscLogEventEndDefault() line 694 in
> /global/u2/m/madams/petsc_master/src/sys/logging/utils/eventlog.c
> > > > > [116]PETSC ERROR: #2 MatLUFactorSymbolic() line 2894 in
> /global/u2/m/madams/petsc_master/src/mat/interface/matrix.c
> > > > > [116]PETSC ERROR: #3 PCSetUp_LU() line 127 in
> /global/u2/m/madams/petsc_master/src/ksp/pc/impls/factor/lu/lu.c
> > > > > [116]PETSC ERROR: #4 PCSetUp() line 918 in
> /global/u2/m/madams/petsc_master/src/ksp/pc/interface/precon.c
> > > > > [116]PETSC ERROR: #5 KSPSetUp() line 306 in
> /global/u2/m/madams/petsc_master/src/ksp/ksp/interface/itfunc.c
> > > > > [116]PETSC ERROR: #6 KSPSolve() line 503 in
> /global/u2/m/madams/petsc_master/src/ksp/ksp/interface/itfunc.c
> > > > >
> > > > >
> > > > > Other error message:
> > > > >
> > > > >
> > > > > OMP: Error #13: Assertion failure at kmp_runtime.c(1588).
> > > > > OMP: Hint: Please submit a bug report with this message, compile
> and run commands used, and machine configuration info including native
> compiler and operating system versions. Faster response will be obtained by
> including all program sources. For information on submitting this issue,
> please see http://www.intel.com/software/products/support/.
> > > > > _pmiu_daemon(SIGCHLD): [NID 05979] [c7-3c0s6n3] [Tue Feb 17
> 15:14:43 2015] PE RANK 23 exit signal Killed
> > > > > _pmiu_daemon(SIGCHLD): [NID 05976] [c7-3c0s6n0] [Tue Feb 17
> 15:14:43 2015] PE RANK 10 exit signal Killed
> > > > > [NID 05979] 2015-02-17 15:14:43 Apid 10147992: initiated
> application termination
> > > > > [NID 05979] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 239]. Please contact admin for details. Killing
> pid 18637(xgca)
> > > > > [NID 05976] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 73]. Please contact admin for details. Killing
> pid 15380(xgca)
> > > > > [NID 05984] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 34636(xgca)
> > > > > [NID 05988] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 59]. Please contact admin for details. Killing
> pid 38496(xgca)
> > > > > [NID 06019] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 11132(xgca)
> > > > > [NID 05980] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 8320(xgca)
> > > > > [NID 05993] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 46182(xgca)
> > > > > [NID 06020] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 249]. Please contact admin for details. Killing
> pid 23753(xgca)
> > > > > [NID 05987] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 87]. Please contact admin for details. Killing
> pid 11254(xgca)
> > > > > [NID 05986] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 41]. Please contact admin for details. Killing
> pid 6630(xgca)
> > > > > [NID 05981] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 31]. Please contact admin for details. Killing
> pid 10520(xgca)
> > > > > [NID 05999] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 7]. Please contact admin for details. Killing
> pid 1843(xgca)
> > > > > [NID 05985] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 26498(xgca)
> > > > > [NID 05998] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 209]. Please contact admin for details. Killing
> pid 20387(xgca)
> > > > > [NID 05994] 2015-02-17 15:14:53 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 39462(xgca)
> > > > > [NID 05983] 2015-02-17 15:14:53 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 18598(xgca)
> > > > > [NID 05995] 2015-02-17 15:14:54 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 42322(xgca)
> > > > > [NID 05996] 2015-02-17 15:14:54 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 34248(xgca)
> > > > > [NID 05978] 2015-02-17 15:14:55 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 9483(xgca)
> > > > > [NID 05975] 2015-02-17 15:14:56 Apid 10147992: Cray HSN detected
> critical error 0x4416[ptag 0]. Please contact admin for details. Killing
> pid 11470(xgca)
> > > > > Application 10147992 exit codes: 137
> > > > > Application 10147992 exit signals: Killed
> > > > > Application 10147992 resources: utime ~2194s, stime ~199s, Rss
> ~488560, inblocks ~908164, outblocks ~2571652
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
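
Barry's advice above, to keep any routine that uses a default viewer such as
stdout inside an omp critical section, might look roughly like the following
in C. This is only a sketch under stated assumptions: fake_solve(),
report_solves() and the critical-section name shared_stdout are hypothetical
names made up for illustration, and the per-thread solve is a stand-in rather
than a real KSPSolve() call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for one thread's solve. A real code would call
   KSPSolve() on that thread's own KSP object here. */
static int fake_solve(int id)
{
  return 1 + id % 3; /* pretend iteration count */
}

/* Run n independent "solves" in parallel and report each one on stdout.
   Returns the number of reports that were printed. */
int report_solves(int n)
{
  int printed = 0;

  assert(n >= 0);

  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
    int its = fake_solve(i); /* thread-private work, no shared state */

    /* stdout is shared by all threads, so only one thread at a time
       may write to it (and update the shared counter). */
    #pragma omp critical(shared_stdout)
    {
      printf("solve %d converged in %d iterations\n", i, its);
      printed++;
    }
  }
  return printed;
}
```

Calling report_solves(8) prints eight lines, in an order that can vary from
run to run, and returns 8. If the compiler's OpenMP support is not enabled,
the pragmas are ignored and the loop simply runs serially.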