[mpich-discuss] MPICH2-1.3.1 crash and SIGHUP issues.

Sunil Thomas sgthomas27 at gmail.com
Tue Jan 11 12:04:37 CST 2011


Hi!

  I've finally resolved the problem. As we suspected (and thanks to everyone
for the input), the problem was indeed with the 3rd-party HYPRE library. I was
using a pre-configured, pre-compiled library that I mistakenly assumed was
also configured appropriately to run in parallel.

  I reconfigured HYPRE to fix some of the erroneous config settings, and now
all the examples run to completion on any number of processors.
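
  For anyone finding this thread later: the gdb sessions quoted below, which
pointed at HYPRE, rely on a wait loop like the `while (DebugWait);` at
ex5.c:57. Here is a minimal sketch of that pattern. It is not the actual
ex5.c code; the function name wait_for_debugger and the environment variable
EX5_DEBUG_WAIT are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Cleared from the debugger with: set var DebugWait = 0
 * volatile keeps the compiler from hoisting the load out of the loop. */
volatile int DebugWait = 1;

/* Hypothetical helper: blocks until DebugWait is cleared, but only when the
 * (made-up) environment variable EX5_DEBUG_WAIT is set.
 * Returns 1 if it waited, 0 if it was skipped. */
int wait_for_debugger(void)
{
    if (getenv("EX5_DEBUG_WAIT") == NULL)
        return 0;

    /* Print the pid so the process can be attached with: gdb -p <pid> */
    fprintf(stderr, "pid %ld waiting for debugger\n", (long)getpid());
    while (DebugWait)
        sleep(1);   /* sleep instead of spinning to keep the CPU idle */
    return 1;
}
```

  Calling wait_for_debugger() early in main() gives each process a chance to
be attached before it reaches the crashing call. Gating it on an environment
variable means normal runs are unaffected.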

Thanks again!
--Sunil.
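
P.S. On the large_message failure quoted below: assuming MPI_LONG_LONG_INT
maps to an 8-byte long long (as on x86-64 Linux), the count=270000000 receive
in the log comes to about 2.16 GB, just past what a signed 32-bit byte count
can hold. Whether that threshold has anything to do with the dropped
connection is not something this thread establishes; the snippet only checks
the arithmetic. The helper names are made up for illustration.

```c
#include <stdint.h>

/* Total payload in bytes for `count` elements of `elem_size` bytes each. */
uint64_t message_bytes(uint64_t count, uint64_t elem_size)
{
    return count * elem_size;
}

/* Returns 1 if the payload does not fit in a signed 32-bit byte count. */
int exceeds_int32(uint64_t bytes)
{
    return bytes > (uint64_t)INT32_MAX;
}
```

With count = 270000000 and elem_size = 8 (the assumed size of long long),
message_bytes() gives 2160000000, and exceeds_int32() reports that this is
above INT32_MAX (2147483647).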

On Mon, Jan 10, 2011 at 4:42 PM, Sunil Thomas <sgthomas27 at gmail.com> wrote:

> Hi,
>
>   Please ignore my last note (it only seemed to run OK on RHEL4 because of
> some incorrect 3rd-party library config settings there). The same error
> behavior is in fact also observed on RHEL4, just as on RHEL5. Apologies.
>
> --Sunil.
>
>
>
> On Mon, Jan 10, 2011 at 2:35 PM, Sunil Thomas <sgthomas27 at gmail.com> wrote:
>
>> Hi,
>>
>>   Yet another follow-up. When I build and install on RHEL4 (gcc version
>> 3.4.6) using the exact same configure options I have been using, the error
>> is no longer seen. Everything runs fine there.
>>   I'd appreciate it if anybody could let me know whether this info helps in
>> identifying a potential fix. Meanwhile, I am trying to isolate the MPI usage
>> of the failing HYPRE function on RHEL5 into a simple example and will
>> report back.
>>
>> Thanks!
>> --Sunil.
>>
>>
>> On Mon, Jan 10, 2011 at 1:37 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>
>>>
>>> It's doubtful that any of these are causing the segfault you're seeing.
>>>
>>> The first two mean that there are some resources that weren't freed when
>>> the test exited.
>>>
>>> The last one is related to sending large (4-8GB) messages.  I think the
>>> Linux kernel has a bug in the TCP stack when sending large messages with
>>> iovecs.  The bug results in a dropped connection, which terminates the
>>> program with an error message, rather than a segfault.
>>>
>>> -d
>>>
>>> On Jan 10, 2011, at 3:25 PM, Sunil Thomas wrote:
>>>
>>> > Hi,
>>> >
>>> >    Thanks for the response. I am still not sure that the error is
>>> > due to HYPRE. I will try to isolate the behavior of the failing function
>>> > in a smaller example and report back on this. It might still be
>>> > mpich2-1.3.1.
>>> >
>>> >    But just to follow up, I ran the test suite after doing a fresh
>>> > install (this time I built with --with-device=ch3:sock, although the end
>>> > result with my application is the same as with the default communication
>>> > device; I also included --enable-g and --enable-debuginfo). The test
>>> > suite result is as follows (most of the tests go through fine, but some
>>> > point-to-point communication tests appear to fail):
>>> >
>>> > Looking in ./testlist
>>> > Processing directory attr
>>> > Looking in ./attr/testlist
>>> > Processing directory coll
>>> > Looking in ./coll/testlist
>>> > Processing directory comm
>>> > Looking in ./comm/testlist
>>> > Processing directory datatype
>>> > Looking in ./datatype/testlist
>>> > Processing directory errhan
>>> > Looking in ./errhan/testlist
>>> > Processing directory group
>>> > Looking in ./group/testlist
>>> > Processing directory info
>>> > Looking in ./info/testlist
>>> > Processing directory init
>>> > Looking in ./init/testlist
>>> > Processing directory pt2pt
>>> > Looking in ./pt2pt/testlist
>>> > Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
>>> > Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
>>> > Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated
>>> > Unexpected output in scancel: [0] 24 at [0x000000000e443158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>>> > Unexpected output in scancel: [0] 32 at [0x000000000e443088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>>> > Unexpected output in scancel: [0] 8 at [0x000000000e441a08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]
>>> > Unexpected output in scancel: [0] 8 at [0x000000000e441958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]
>>> > Unexpected output in scancel: [0] 32 at [0x000000000e442268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>>> > Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
>>> > Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated
>>> > Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
>>> > Unexpected output in pscancel: [0] 24 at [0x000000000bec1158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>>> > Unexpected output in pscancel: [0] 32 at [0x000000000bec1088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>>> > Unexpected output in pscancel: [0] 8 at [0x000000000bebfa08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]
>>> > Unexpected output in pscancel: [0] 8 at [0x000000000bebf958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]
>>> > Unexpected output in pscancel: [0] 32 at [0x000000000bec0268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]
>>> > Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated
>>> > Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:
>>> > Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed
>>> > Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>>> > Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456):
>>> > Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318
>>> > Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)
>>> > Unexpected output in large_message: [cli_1]: aborting job:
>>> > Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:
>>> > Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed
>>> > Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>>> > Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456):
>>> > Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318
>>> > Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)
>>> > Unexpected output in large_message: APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>> > Program large_message exited without No Errors
>>> > Looking in ./rma/testlist
>>> > Processing directory spawn
>>> > Looking in ./spawn/testlist
>>> > Processing directory topo
>>> > Looking in ./topo/testlist
>>> > Processing directory perf
>>> > Looking in ./perf/testlist
>>> > Processing directory io
>>> > Looking in ./io/testlist
>>> > Processing directory f77
>>> > Looking in ./f77/testlist
>>> > Processing directory attr
>>> > Looking in ./f77/attr/testlist
>>> > Processing directory coll
>>> > Looking in ./f77/coll/testlist
>>> > Processing directory datatype
>>> > Looking in ./f77/datatype/testlist
>>> > Processing directory pt2pt
>>> > Looking in ./f77/pt2pt/testlist
>>> > Processing directory info
>>> > Looking in ./f77/info/testlist
>>> > ..
>>> > .. (all remaining tests pass)
>>> >
>>> > Any ideas what may be causing this? It seems this issue could be
>>> > related to the one I have in my application, since the HYPRE functions
>>> > use MPI_Recv calls. I would greatly appreciate any thoughts on why the
>>> > pt2pt tests are failing and how to resolve this. Please note that I am
>>> > testing this on RHEL5 with gcc version 4.1.2.
>>> >
>>> > Thanks!
>>> > --Sunil.
>>> >
>>> >
>>> > On Mon, Jan 10, 2011 at 12:55 PM, Sunil Thomas <sgthomas27 at gmail.com> wrote:
>>> > Thanks for the response. Moving forward, I debugged the example code
>>> > that produces the "APPLICATION TERMINATED WITH THE EXIT STRING: Hangup
>>> > (signal 1)" error further (using gdb and attaching to each process).
>>> > Here is what I have so far:
>>> >
>>> > --------------------------
>>> > 0x0000000000401ad0 in main (argc=1, argv=0x7fff12a68028) at ex5.c:57
>>> > 57         while (DebugWait);
>>> > (gdb) r
>>> > The program being debugged has been started already.
>>> > Start it from the beginning? (y or n) n
>>> > Program not restarted.
>>> > (gdb) set DebugWait = 0
>>> > (gdb) s
>>> > 61         n = 33;
>>> > (gdb) n
>>> > 62         solver_id = 0;
>>> > (gdb) c
>>> > Continuing.
>>> > Program received signal SIGSEGV, Segmentation fault.
>>> > 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > (gdb) bt
>>> > #0  0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > #1  0x00002b7198247d8c in hypre_MatvecCommPkgCreate () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > #2  0x00002b7198234361 in hypre_BoomerAMGCreateS () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > #3  0x00002b71981f10f5 in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > #4  0x0000000000402421 in main (argc=1, argv=0x7fff12a68028) at ex5.c:319
>>> > (gdb) q
>>> > --------------------------
>>> > 0x0000000000401ad0 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:57
>>> > 57         while (DebugWait);
>>> > (gdb) set DebugWait = 0
>>> > (gdb) s
>>> > 61         n = 33;
>>> > (gdb) n
>>> > 62         solver_id = 0;
>>> > (gdb) c
>>> > Continuing.
>>> > Program received signal SIGSEGV, Segmentation fault.
>>> > 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > (gdb) bt
>>> > #0  0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > #1  0x00002b8a5f72ab51 in hypre_BoomerAMGCoarsenFalgout () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > #2  0x00002b8a5f721cdf in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so
>>> > #3  0x0000000000402421 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:319
>>> > (gdb) q
>>> > ------------------------
>>> >
>>> >
>>> > Before I dig any further into the 3rd-party HYPRE library, does this
>>> > give any useful info as to where the problem lies, for example in
>>> > ruling out an error in mpich2-1.3.1? It seems like the problem is in
>>> > HYPRE (I am using version 2.4.0b), but I am not 100% sure.
>>> >
>>> > Thanks again.
>>> > --Sunil.
>>> >
>>> >
>>> >
>>> > On Sun, Jan 9, 2011 at 6:22 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>> >
>>> > Please keep mpich-discuss cc'ed.
>>> >
>>> > ----- Original Message -----
>>> > > Thanks Pavan!
>>> > >
>>> > > No I am not. I was simply searching for the error message I got. The
>>> > > fact that the error is seen (whether using RMA or not) suggests the
>>> > > problem could still be with mpich2-1.3.1.
>>> >
>>> > If the application terminates (for any reason), the process manager
>>> > will display this error string. These two could be (and most likely
>>> > are) completely unrelated problems.
>>> >
>>> >  -- Pavan
>>> >
>>> >
>>> > _______________________________________________
>>> > mpich-discuss mailing list
>>> > mpich-discuss at mcs.anl.gov
>>> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>
>>>
>>
>>
>