[mpich-discuss] MPICH2-1.3.1 crash and SIGHUP issues.

Sunil Thomas sgthomas27 at gmail.com
Mon Jan 10 15:25:43 CST 2011


Hi,

   Thanks for the response. I am still not sure that the error is due to HYPRE. I
will try to isolate the behavior of the failing function in a smaller example and
report back on this. It might still be mpich2-1.3.1.

   But just to follow up, I ran the test suite after doing a fresh install (this
time I built with --with-device=ch3:sock, although the end result with my
application is the same as with the default communication device; I also included
--enable-g and --enable-debuginfo). The test suite results are as follows (most of
the tests pass, but some point-to-point communication tests appear to fail):

Looking in ./testlist
Processing directory attr
Looking in ./attr/testlist
Processing directory coll
Looking in ./coll/testlist
Processing directory comm
Looking in ./comm/testlist
Processing directory datatype
Looking in ./datatype/testlist
Processing directory errhan
Looking in ./errhan/testlist
Processing directory group
Looking in ./group/testlist
Processing directory info
Looking in ./info/testlist
Processing directory init
Looking in ./init/testlist
Processing directory pt2pt
Looking in ./pt2pt/testlist
Unexpected output in scancel: In direct memory block for handle type
REQUEST, 3 handles are still allocated
Unexpected output in scancel: In direct memory block for handle type
REQUEST, 4 handles are still allocated
Unexpected output in scancel: In direct memory block for handle type COMM, 2
handles are still allocated
Unexpected output in scancel: [0] 24 at [0x000000000e443158],
rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
Unexpected output in scancel: [0] 32 at [0x000000000e443088],
rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
Unexpected output in scancel: [0] 8 at [0x000000000e441a08],
/mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]
Unexpected output in scancel: [0] 8 at [0x000000000e441958],
/mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]
Unexpected output in scancel: [0] 32 at [0x000000000e442268],
rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
Unexpected output in pscancel: In direct memory block for handle type
REQUEST, 4 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type COMM,
2 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type
REQUEST, 3 handles are still allocated
Unexpected output in pscancel: [0] 24 at [0x000000000bec1158],
rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
Unexpected output in pscancel: [0] 32 at [0x000000000bec1088],
rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
Unexpected output in pscancel: [0] 8 at [0x000000000bebfa08],
/mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]
Unexpected output in pscancel: [0] 8 at [0x000000000bebf958],
/mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]
Unexpected output in pscancel: [0] 32 at [0x000000000bec0268],
rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
Unexpected output in cancelrecv: In direct memory block for handle type
REQUEST, 1 handles are still allocated
Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI
error, error stack:
Unexpected output in large_message:
MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010,
count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD,
status=0x7fffa45959a0) failed
Unexpected output in large_message:
MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling
an event returned by MPIDU_Sock_Wait()
Unexpected output in large_message:
MPIDI_CH3I_Progress_handle_sock_event(456):
Unexpected output in large_message:
adjust_iov(828)...........................: ch3|sock|immedread
0x2ae60695bd40 0x4169698 0x4164318
Unexpected output in large_message:
MPIDU_Sock_readv(426).....................: connection closed by peer
(set=0,sock=1)
Unexpected output in large_message: [cli_1]: aborting job:
Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI
error, error stack:
Unexpected output in large_message:
MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010,
count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD,
status=0x7fffa45959a0) failed
Unexpected output in large_message:
MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling
an event returned by MPIDU_Sock_Wait()
Unexpected output in large_message:
MPIDI_CH3I_Progress_handle_sock_event(456):
Unexpected output in large_message:
adjust_iov(828)...........................: ch3|sock|immedread
0x2ae60695bd40 0x4169698 0x4164318
Unexpected output in large_message:
MPIDU_Sock_readv(426).....................: connection closed by peer
(set=0,sock=1)
Unexpected output in large_message: APPLICATION TERMINATED WITH THE EXIT
STRING: Hangup (signal 1)
Program large_message exited without No Errors
Looking in ./rma/testlist
Processing directory spawn
Looking in ./spawn/testlist
Processing directory topo
Looking in ./topo/testlist
Processing directory perf
Looking in ./perf/testlist
Processing directory io
Looking in ./io/testlist
Processing directory f77
Looking in ./f77/testlist
Processing directory attr
Looking in ./f77/attr/testlist
Processing directory coll
Looking in ./f77/coll/testlist
Processing directory datatype
Looking in ./f77/datatype/testlist
Processing directory pt2pt
Looking in ./f77/pt2pt/testlist
Processing directory info
Looking in ./f77/info/testlist
..
.. (all remaining tests pass)

Any ideas what may be causing this? This issue could well be related to the one I
have in my application, since the HYPRE functions use MPI_Recv... I would greatly
appreciate any thoughts on why the pt2pt tests are failing and how to resolve this.
Kindly note that I am testing on RHEL5 with gcc version 4.1.2.
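
In case it helps isolate things, below is roughly the standalone reproducer I plan
to try for the large_message failure. The count and datatype (270000000
MPI_LONG_LONG_INT, a bit over 2^31 bytes of payload) are taken from the error stack
above; the rest is my own sketch, not the actual MPICH test source.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Rank 0 sends ~2.16 GB of long longs to rank 1 in a single message,
 * mirroring the MPI_Recv arguments from the large_message error stack. */
int main(int argc, char **argv)
{
    const int count = 270000000;
    long long *buf;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = malloc((size_t) count * sizeof(long long));
    if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);

    if (rank == 0) {
        buf[0] = buf[count - 1] = 42;
        MPI_Send(buf, count, MPI_LONG_LONG_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, count, MPI_LONG_LONG_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %lld ... %lld\n", buf[0], buf[count - 1]);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

If this fails the same way under ch3:sock, it should at least confirm whether the
problem is independent of HYPRE.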

Thanks!
--Sunil.


On Mon, Jan 10, 2011 at 12:55 PM, Sunil Thomas <sgthomas27 at gmail.com> wrote:

> Thanks for the response. Moving forward, upon further debugging of the
> example code that produces the "APPLICATION TERMINATED WITH THE EXIT STRING:
> Hangup (signal 1)" error (using gdb and attaching to each process), here is
> what I have so far:
>
> --------------------------
> 0x0000000000401ad0 in main (argc=1, argv=0x7fff12a68028) at ex5.c:57
> 57         while (DebugWait);
> (gdb) r
> The program being debugged has been started already.
> Start it from the beginning? (y or n) n
> Program not restarted.
> (gdb) set DebugWait = 0
> (gdb) s
> 61         n = 33;
> (gdb) n
> 62         solver_id = 0;
> (gdb) c
> Continuing.
> Program received signal SIGSEGV, Segmentation fault.
> 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> (gdb) bt
> #0  0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #1  0x00002b7198247d8c in hypre_MatvecCommPkgCreate () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #2  0x00002b7198234361 in hypre_BoomerAMGCreateS () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #3  0x00002b71981f10f5 in hypre_BoomerAMGSetup () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #4  0x0000000000402421 in main (argc=1, argv=0x7fff12a68028) at ex5.c:319
> (gdb) q
> --------------------------
> 0x0000000000401ad0 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:57
> 57         while (DebugWait);
> (gdb) set DebugWait = 0
> (gdb) s
> 61         n = 33;
> (gdb) n
> 62         solver_id = 0;
> (gdb) c
> Continuing.
> Program received signal SIGSEGV, Segmentation fault.
> 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> (gdb) bt
> #0  0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #1  0x00002b8a5f72ab51 in hypre_BoomerAMGCoarsenFalgout () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #2  0x00002b8a5f721cdf in hypre_BoomerAMGSetup () from
> /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
> #3  0x0000000000402421 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:319
> (gdb) q
> ------------------------
>
>
> Before digging any further into the 3rd-party library HYPRE, does this give
> any useful info as to where the problem lies, e.g. in terms of ruling out an
> error in mpich2-1.3.1? It seems like the problem is in HYPRE (I am using
> version 2.4.0b), but I am not 100% sure.
>
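> For reference, the debugger-attach hook I am using in ex5.c (line 57 in the
> gdb sessions above) looks roughly like the sketch below; it is from memory,
> not the exact source I am running:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>
>
> /* Each rank prints its pid and spins until gdb is attached and
>  * "set DebugWait = 0" is issued, as in the sessions above. */
> volatile int DebugWait = 1;
>
> int main(int argc, char **argv)
> {
>     int myid;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &myid);
>
>     printf("rank %d: pid %d waiting for debugger attach\n", myid, (int) getpid());
>     fflush(stdout);
>     while (DebugWait);   /* cleared from gdb before continuing */
>
>     /* ... rest of the example (HYPRE setup and solve) continues here ... */
>     MPI_Finalize();
>     return 0;
> }
>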
> Thanks again.
> --Sunil.
>
>
>
> On Sun, Jan 9, 2011 at 6:22 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
>>
>> Please keep mpich-discuss cc'ed.
>>
>> ----- Original Message -----
>> > Thanks Pavan!
>> >
>> > No, I am not. I was simply searching for the error message I got. The fact
>> > that the error is seen (whether using RMA or not) suggests the problem
>> > could still be with mpich2-1.3.1.
>>
>> If the application terminates (for any reason), the process manager will
>> display this error string. These two could be (and most likely are)
>> completely unrelated problems.
>>
>>  -- Pavan
>>
>
>