[mpich-discuss] MPICH2-1.3.1 crash and SIGHUP issues.
Sunil Thomas
sgthomas27 at gmail.com
Mon Jan 10 18:42:44 CST 2011
Hi,
Kindly ignore my last note (it only seemed to run OK on RHEL4 due to some
incorrect 3rd-party library config settings there). In fact, the same error
behavior is also observed on RHEL4, just as on RHEL5. Apologies.
--Sunil.
On Mon, Jan 10, 2011 at 2:35 PM, Sunil Thomas <sgthomas27 at gmail.com> wrote:
> Hi,
>
> Yet another follow-up. When I build and install using the exact same
> configure options I have been using on RHEL4 (gcc version 3.4.6), the
> error is no longer seen. Everything runs fine there. I'd appreciate it if
> anybody could let me know whether this info helps in identifying a
> potential fix... Meanwhile, I am trying to isolate the MPI usage of the
> failing HYPRE function on RHEL5 into a simple example and will report
> back.
>
> Thanks!
> --Sunil.
>
>
> On Mon, Jan 10, 2011 at 1:37 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>
>>
>> It's doubtful that any of these are causing the segfault you're seeing.
>>
>> The first two mean that there are some resources that weren't freed when
>> the test exited.
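>>
>> As a rough sketch of what produces that warning (illustrative only, not
>> the actual test code): in an --enable-g build, a request that is started
>> but never completed or freed gets reported at MPI_Finalize:
>>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int buf;
>>     MPI_Request req;
>>     MPI_Init(&argc, &argv);
>>     /* start a receive and then (erroneously, on purpose) never
>>        complete, cancel, or free it */
>>     MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
>>     MPI_Finalize();   /* debug build reports the leftover REQUEST
>>                          handle ("... handles are still allocated") */
>>     return 0;
>> }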
>>
>> The last one is related to sending large (4-8 GB) messages. I think the
>> Linux kernel has a bug in the TCP stack when sending large messages with
>> iovecs. The bug results in a dropped connection, which terminates the
>> program with an error message, rather than a segfault.
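>>
>> As a minimal repro sketch (hedged; the count and datatype are taken from
>> the error stack quoted below, the rest is illustrative), the failing
>> pattern is a single very large point-to-point transfer:
>>
>> #include <mpi.h>
>> #include <stdlib.h>
>>
>> #define COUNT 270000000   /* x 8 bytes each: a ~2.2 GB message, so
>>                              this needs a 64-bit build and enough RAM */
>>
>> int main(int argc, char **argv)
>> {
>>     int rank;
>>     long long *buf;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     buf = malloc(COUNT * sizeof(long long));   /* contents irrelevant */
>>     if (rank == 0)
>>         MPI_Send(buf, COUNT, MPI_LONG_LONG_INT, 1, 0, MPI_COMM_WORLD);
>>     else if (rank == 1)
>>         MPI_Recv(buf, COUNT, MPI_LONG_LONG_INT, 0, 0, MPI_COMM_WORLD,
>>                  MPI_STATUS_IGNORE);
>>     free(buf);
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> (Run with e.g. "mpiexec -n 2 ./a.out" on a build configured with the
>> ch3:sock device to exercise the TCP path.)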
>>
>> -d
>>
>> On Jan 10, 2011, at 3:25 PM, Sunil Thomas wrote:
>>
>> > Hi,
>> >
>> > Thanks for the response. I am still not sure whether the error is due
>> > to HYPRE. I will try to isolate the behavior of the failing function
>> > in a smaller example and report back on this. It might still be
>> > mpich2-1.3.1.
>> >
>> > But just to follow up, I ran the test suite after doing a fresh
>> > install (this time I built with --with-device=ch3:sock, although the
>> > end result with my application is the same as with the default
>> > communication device; I also included --enable-g and
>> > --enable-debuginfo). The test suite result is as follows (most of the
>> > tests pass, but some point-to-point communication tests appear to
>> > fail):
>> >
>> > Looking in ./testlist
>> > Processing directory attr
>> > Looking in ./attr/testlist
>> > Processing directory coll
>> > Looking in ./coll/testlist
>> > Processing directory comm
>> > Looking in ./comm/testlist
>> > Processing directory datatype
>> > Looking in ./datatype/testlist
>> > Processing directory errhan
>> > Looking in ./errhan/testlist
>> > Processing directory group
>> > Looking in ./group/testlist
>> > Processing directory info
>> > Looking in ./info/testlist
>> > Processing directory init
>> > Looking in ./init/testlist
>> > Processing directory pt2pt
>> > Looking in ./pt2pt/testlist
>> > Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
>> > Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
>> > Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated
>> > Unexpected output in scancel: [0] 24 at [0x000000000e443158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>> > Unexpected output in scancel: [0] 32 at [0x000000000e443088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>> > Unexpected output in scancel: [0] 8 at [0x000000000e441a08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]
>> > Unexpected output in scancel: [0] 8 at [0x000000000e441958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]
>> > Unexpected output in scancel: [0] 32 at [0x000000000e442268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>> > Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
>> > Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated
>> > Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
>> > Unexpected output in pscancel: [0] 24 at [0x000000000bec1158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>> > Unexpected output in pscancel: [0] 32 at [0x000000000bec1088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>> > Unexpected output in pscancel: [0] 8 at [0x000000000bebfa08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]
>> > Unexpected output in pscancel: [0] 8 at [0x000000000bebf958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]
>> > Unexpected output in pscancel: [0] 32 at [0x000000000bec0268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>> > Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated
>> > Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:
>> > Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed
>> > Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>> > Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456):
>> > Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318
>> > Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)
>> > Unexpected output in large_message: [cli_1]: aborting job:
>> > Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:
>> > Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed
>> > Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>> > Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456):
>> > Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318
>> > Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)
>> > Unexpected output in large_message: APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>> > Program large_message exited without No Errors
>> > Looking in ./rma/testlist
>> > Processing directory spawn
>> > Looking in ./spawn/testlist
>> > Processing directory topo
>> > Looking in ./topo/testlist
>> > Processing directory perf
>> > Looking in ./perf/testlist
>> > Processing directory io
>> > Looking in ./io/testlist
>> > Processing directory f77
>> > Looking in ./f77/testlist
>> > Processing directory attr
>> > Looking in ./f77/attr/testlist
>> > Processing directory coll
>> > Looking in ./f77/coll/testlist
>> > Processing directory datatype
>> > Looking in ./f77/datatype/testlist
>> > Processing directory pt2pt
>> > Looking in ./f77/pt2pt/testlist
>> > Processing directory info
>> > Looking in ./f77/info/testlist
>> > ..
>> > .. (all remaining tests pass)
>> >
>> > Any ideas what may be causing this? This issue seems likely to be
>> > related to the one in my application, since the HYPRE functions use
>> > MPI_Recv's... I would greatly appreciate any thoughts on why the pt2pt
>> > tests are failing and how to resolve this. Kindly note that I am
>> > testing this on RHEL5 with gcc version 4.1.2.
>> >
>> > Thanks!
>> > --Sunil.
>> >
>> >
>> > On Mon, Jan 10, 2011 at 12:55 PM, Sunil Thomas <sgthomas27 at gmail.com> wrote:
>> > Thanks for the response. Moving forward, upon further debugging of the
>> > example code that results in the "APPLICATION TERMINATED WITH THE EXIT
>> > STRING: Hangup (signal 1)" error (using gdb and attaching to each
>> > process), here is what I have so far:
>> >
>> > --------------------------
>> > 0x0000000000401ad0 in main (argc=1, argv=0x7fff12a68028) at ex5.c:57
>> > 57 while (DebugWait);
>> > (gdb) r
>> > The program being debugged has been started already.
>> > Start it from the beginning? (y or n) n
>> > Program not restarted.
>> > (gdb) set DebugWait = 0
>> > (gdb) s
>> > 61 n = 33;
>> > (gdb) n
>> > 62 solver_id = 0;
>> > (gdb) c
>> > Continuing.
>> > Program received signal SIGSEGV, Segmentation fault.
>> > 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > (gdb) bt
>> > #0 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > #1 0x00002b7198247d8c in hypre_MatvecCommPkgCreate () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > #2 0x00002b7198234361 in hypre_BoomerAMGCreateS () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > #3 0x00002b71981f10f5 in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > #4 0x0000000000402421 in main (argc=1, argv=0x7fff12a68028) at ex5.c:319
>> > (gdb) q
>> > --------------------------
>> > 0x0000000000401ad0 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:57
>> > 57 while (DebugWait);
>> > (gdb) set DebugWait = 0
>> > (gdb) s
>> > 61 n = 33;
>> > (gdb) n
>> > 62 solver_id = 0;
>> > (gdb) c
>> > Continuing.
>> > Program received signal SIGSEGV, Segmentation fault.
>> > 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > (gdb) bt
>> > #0 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > #1 0x00002b8a5f72ab51 in hypre_BoomerAMGCoarsenFalgout () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > #2 0x00002b8a5f721cdf in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>> > #3 0x0000000000402421 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:319
>> > (gdb) q
>> > ------------------------
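>> >
>> > (For reference, the "while (DebugWait);" at ex5.c:57 above is the
>> > usual attach-a-debugger hook. A minimal self-contained sketch of the
>> > idiom, with only the variable name taken from the trace:)
>> >
>> > #include <mpi.h>
>> > #include <stdio.h>
>> > #include <unistd.h>
>> >
>> > static volatile int DebugWait = 1;   /* gdb: set DebugWait = 0 */
>> >
>> > int main(int argc, char **argv)
>> > {
>> >     int rank;
>> >     MPI_Init(&argc, &argv);
>> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >     printf("rank %d: pid %d waiting for debugger\n", rank,
>> >            (int)getpid());
>> >     fflush(stdout);
>> >     while (DebugWait)
>> >         ;   /* spin until gdb attaches and clears the flag */
>> >     /* ... rest of the program ... */
>> >     MPI_Finalize();
>> >     return 0;
>> > }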
>> >
>> >
>> > Before digging any further into the 3rd-party library HYPRE, does this
>> > give any useful info as to where the problem lies, e.g. toward ruling
>> > out an error in mpich2-1.3.1? It seems like the problem is in HYPRE
>> > itself (I am using version 2.4.0b), but I am not 100% sure.
>> >
>> > Thanks again.
>> > --Sunil.
>> >
>> >
>> >
>> > On Sun, Jan 9, 2011 at 6:22 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>> >
>> > Please keep mpich-discuss cc'ed.
>> >
>> > ----- Original Message -----
>> > > Thanks Pavan!
>> > >
>> > > No, I am not. I was simply searching for the error message I got.
>> > > The fact that the error is seen (whether using RMA or not) suggests
>> > > the problem could still be with mpich2-1.3.1.
>> >
>> > If the application terminates (for any reason), the process manager will
>> display this error string. These two could be (and most likely are)
>> completely unrelated problems.
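>> >
>> > (An illustrative sketch of this point; hedged, not from the thread:
>> > any abnormal termination, e.g. one rank calling abort(), makes the
>> > process manager print the same generic exit string, so the string
>> > itself says nothing about the root cause:)
>> >
>> > #include <mpi.h>
>> > #include <stdlib.h>
>> >
>> > int main(int argc, char **argv)
>> > {
>> >     int rank;
>> >     MPI_Init(&argc, &argv);
>> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >     if (rank == 1)
>> >         abort();   /* any fatal signal here produces the same
>> >                       "APPLICATION TERMINATED ..." message */
>> >     MPI_Finalize();
>> >     return 0;
>> > }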
>> >
>> > -- Pavan
>> >
>> >
>
>