[mpich-discuss] MPICH2-1.3.1 crash and SIGHUP issues.
Dave Goodell
goodell at mcs.anl.gov
Mon Jan 10 15:35:40 CST 2011
Ignore these failures. The cancel tests are failing because there is a known resource leak inside MPICH2 for a very rarely used piece of functionality (ticket #287). The "large_message" test fails on Linux in some cases due to bugs in Linux: http://trac.mcs.anl.gov/projects/mpich2/ticket/1080
Unless you are sending more than 2GiB of data in a single message, you are probably not hitting the problem described in that ticket.
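
For reference, the failing receive in the quoted log asks for count=270000000 elements of MPI_LONG_LONG_INT. The sketch below is only a rough size check for whether a single message crosses the 2GiB mark. It assumes an 8-byte long long, as on typical x86-64 Linux, and the helper names payload_bytes and exceeds_2gib are hypothetical, not part of MPICH2.

```c
#include <stdint.h>

/* Hypothetical helper: total payload in bytes for `count` elements of
 * `elem_size` bytes each. 64-bit arithmetic avoids overflowing an int. */
uint64_t payload_bytes(uint64_t count, uint64_t elem_size)
{
    return count * elem_size;
}

/* Hypothetical helper: nonzero when the payload is larger than 2GiB,
 * i.e. more than 2^31 = 2147483648 bytes. */
int exceeds_2gib(uint64_t bytes)
{
    return bytes > (UINT64_C(1) << 31);
}
```

Calling payload_bytes(270000000, sizeof(long long)) with an 8-byte long long gives 2160000000 bytes, which is just over 2GiB, so the large_message test does cross that mark. An application can run the same check on its own largest count and element size to judge whether it is likely to be affected.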
-Dave
On Jan 10, 2011, at 3:25 PM CST, Sunil Thomas wrote:
> Hi,
>
> Thanks for the response. I am still not yet sure about the error being due to HYPRE. I will try to
> isolate the behavior in the failing function to a smaller example and report back on this. It might still
> be mpich2-1.3.1.
>
> But just to follow up, I ran the test suite after doing a fresh install (this time I built with
> --with-device=ch3:sock, although the end result with my application is the same as with the default
> communication device. I also included --enable-g and --enable-debuginfo). The test suite result is as
> follows (most of the tests go through fine, but some point-to-point communication tests appear to fail):
>
> Looking in ./testlist
> Processing directory attr
> Looking in ./attr/testlist
> Processing directory coll
> Looking in ./coll/testlist
> Processing directory comm
> Looking in ./comm/testlist
> Processing directory datatype
> Looking in ./datatype/testlist
> Processing directory errhan
> Looking in ./errhan/testlist
> Processing directory group
> Looking in ./group/testlist
> Processing directory info
> Looking in ./info/testlist
> Processing directory init
> Looking in ./init/testlist
> Processing directory pt2pt
> Looking in ./pt2pt/testlist
> Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
> Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
> Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated
> Unexpected output in scancel: [0] 24 at [0x000000000e443158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
> Unexpected output in scancel: [0] 32 at [0x000000000e443088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
> Unexpected output in scancel: [0] 8 at [0x000000000e441a08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]
> Unexpected output in scancel: [0] 8 at [0x000000000e441958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]
> Unexpected output in scancel: [0] 32 at [0x000000000e442268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
> Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
> Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated
> Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
> Unexpected output in pscancel: [0] 24 at [0x000000000bec1158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
> Unexpected output in pscancel: [0] 32 at [0x000000000bec1088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
> Unexpected output in pscancel: [0] 8 at [0x000000000bebfa08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]
> Unexpected output in pscancel: [0] 8 at [0x000000000bebf958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]
> Unexpected output in pscancel: [0] 32 at [0x000000000bec0268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
> Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated
> Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:
> Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed
> Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456):
> Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318
> Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)
> Unexpected output in large_message: [cli_1]: aborting job:
> Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:
> Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed
> Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456):
> Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318
> Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)
> Unexpected output in large_message: APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
> Program large_message exited without No Errors
> Looking in ./rma/testlist
> Processing directory spawn
> Looking in ./spawn/testlist
> Processing directory topo
> Looking in ./topo/testlist
> Processing directory perf
> Looking in ./perf/testlist
> Processing directory io
> Looking in ./io/testlist
> Processing directory f77
> Looking in ./f77/testlist
> Processing directory attr
> Looking in ./f77/attr/testlist
> Processing directory coll
> Looking in ./f77/coll/testlist
> Processing directory datatype
> Looking in ./f77/datatype/testlist
> Processing directory pt2pt
> Looking in ./f77/pt2pt/testlist
> Processing directory info
> Looking in ./f77/info/testlist
> ..
> .. (all remaining tests pass)
>
> Any ideas what may be causing this? This issue clearly seems like it could be related to the one I
> have in my application, since the HYPRE functions use MPI_Recv calls. I would greatly appreciate any
> thoughts on why the pt2pt tests are failing and how to resolve this. Kindly note that I am testing on
> RHEL5 with gcc version 4.1.2.
>
> Thanks!
> --Sunil.
>
>
> On Mon, Jan 10, 2011 at 12:55 PM, Sunil Thomas <sgthomas27 at gmail.com> wrote:
> Thanks for the response. Moving forward, upon further debugging of the example code resulting in the
> "APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)" error (using gdb and by
> attaching to each process), here is what I got so far:
>
> --------------------------
> 0x0000000000401ad0 in main (argc=1, argv=0x7fff12a68028) at ex5.c:57
> 57 while (DebugWait);
> (gdb) r
> The program being debugged has been started already.
> Start it from the beginning? (y or n) n
> Program not restarted.
> (gdb) set DebugWait = 0
> (gdb) s
> 61 n = 33;
> (gdb) n
> 62 solver_id = 0;
> (gdb) c
> Continuing.
> Program received signal SIGSEGV, Segmentation fault.
> 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> (gdb) bt
> #0 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #1 0x00002b7198247d8c in hypre_MatvecCommPkgCreate () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #2 0x00002b7198234361 in hypre_BoomerAMGCreateS () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #3 0x00002b71981f10f5 in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #4 0x0000000000402421 in main (argc=1, argv=0x7fff12a68028) at ex5.c:319
> (gdb) q
> --------------------------
> 0x0000000000401ad0 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:57
> 57 while (DebugWait);
> (gdb) set DebugWait = 0
> (gdb) s
> 61 n = 33;
> (gdb) n
> 62 solver_id = 0;
> (gdb) c
> Continuing.
> Program received signal SIGSEGV, Segmentation fault.
> 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> (gdb) bt
> #0 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #1 0x00002b8a5f72ab51 in hypre_BoomerAMGCoarsenFalgout () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #2 0x00002b8a5f721cdf in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #3 0x0000000000402421 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:319
> (gdb) q
> ------------------------
>
>
> Before I dig any further into the 3rd party library HYPRE, does this give any useful info as to where the problem lies,
> for example by ruling out an error in mpich2-1.3.1? It seems like the problem is in HYPRE (I am
> using version 2.4.0b), but I am not 100% sure.
>
> Thanks again.
> --Sunil.
>
>
>
> On Sun, Jan 9, 2011 at 6:22 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
> Please keep mpich-discuss cc'ed.
>
> ----- Original Message -----
> > Thanks Pavan!
> >
> > No I am not. I was simply searching for the error message I got. The fact
> > that the error is seen (whether using RMA or not) suggests the problem could
> > still be with mpich2-1.3.1.
>
> If the application terminates (for any reason), the process manager will display this error string. These two could be (and most likely are) completely unrelated problems.
>
> -- Pavan
>
>