<div>Hi,</div>
<div> </div>
<div> Yet another follow-up. When I build and install on RHEL4 (gcc version 3.4.6) with the exact same configure options I have </div>
<div>been using, the error is no longer seen. Everything runs fine there. <br></div>
<div> I would appreciate it if anybody could let me know whether this information helps in identifying a potential fix. Meanwhile, I am </div>
<div>trying to isolate the MPI usage of the failing HYPRE function on RHEL5 into a simple example and will </div>
<div>report back.</div>
<div> </div>
<div>Thanks!</div>
<div>--Sunil.</div>
<div><br> </div>
<div class="gmail_quote">On Mon, Jan 10, 2011 at 1:37 PM, Darius Buntinas <span dir="ltr"><<a href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid"><br>It's doubtful that any of these are causing the segfault you're seeing.<br><br>The first two mean that there are some resources that weren't freed when the test exited.<br>
<br>The second one is related to sending large (4-8GB) messages. I think the Linux kernel has a bug in the TCP stack when sending large messages with iovecs. The bug results in a dropped connection, which terminates the program with an error message, rather than a segfault.<br>
<font color="#888888"><br>-d<br></font>
<div>
<div></div>
<div class="h5"><br>On Jan 10, 2011, at 3:25 PM, Sunil Thomas wrote:<br><br>> Hi,<br>><br>> Thanks for the response. I am still not yet sure about the error being due to HYPRE. I will try to<br>> isolate the behavior in the failing function to a smaller example and report back on this. It might still<br>
> be mpich2-1.3.1.<br>><br>> But just to follow up, I ran the test suite after doing a fresh install (this time I built with<br>> --with-device=ch3:sock, although the end result with my application is the same as with the default communication<br>
> device. I also included --enable-g and --enable-debuginfo). The test suite result is as follows (most of the<br>> tests go through fine, but some point-to-point communication tests appear to fail):<br>><br>
> Looking in ./testlist<br>> Processing directory attr<br>> Looking in ./attr/testlist<br>> Processing directory coll<br>> Looking in ./coll/testlist<br>> Processing directory comm<br>> Looking in ./comm/testlist<br>
> Processing directory datatype<br>> Looking in ./datatype/testlist<br>> Processing directory errhan<br>> Looking in ./errhan/testlist<br>> Processing directory group<br>> Looking in ./group/testlist<br>
> Processing directory info<br>> Looking in ./info/testlist<br>> Processing directory init<br>> Looking in ./init/testlist<br>> Processing directory pt2pt<br>> Looking in ./pt2pt/testlist<br>> Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated<br>
> Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated<br>> Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated<br>
> Unexpected output in scancel: [0] 24 at [0x000000000e443158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>> Unexpected output in scancel: [0] 32 at [0x000000000e443088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>
> Unexpected output in scancel: [0] 8 at [0x000000000e441a08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]<br>> Unexpected output in scancel: [0] 8 at [0x000000000e441958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]<br>
> Unexpected output in scancel: [0] 32 at [0x000000000e442268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>> Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated<br>
> Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated<br>> Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated<br>
> Unexpected output in pscancel: [0] 24 at [0x000000000bec1158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>> Unexpected output in pscancel: [0] 32 at [0x000000000bec1088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>
> Unexpected output in pscancel: [0] 8 at [0x000000000bebfa08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]<br>> Unexpected output in pscancel: [0] 8 at [0x000000000bebf958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]<br>
> Unexpected output in pscancel: [0] 32 at [0x000000000bec0268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>> Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated<br>
> Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:<br>> Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed<br>
> Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()<br>> Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456):<br>
> Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318<br>> Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)<br>
> Unexpected output in large_message: [cli_1]: aborting job:<br>> Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:<br>> Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed<br>
> Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()<br>> Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456):<br>
> Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318<br>> Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)<br>
> Unexpected output in large_message: APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)<br>> Program large_message exited without No Errors<br>> Looking in ./rma/testlist<br>> Processing directory spawn<br>
> Looking in ./spawn/testlist<br>> Processing directory topo<br>> Looking in ./topo/testlist<br>> Processing directory perf<br>> Looking in ./perf/testlist<br>> Processing directory io<br>> Looking in ./io/testlist<br>
> Processing directory f77<br>> Looking in ./f77/testlist<br>> Processing directory attr<br>> Looking in ./f77/attr/testlist<br>> Processing directory coll<br>> Looking in ./f77/coll/testlist<br>> Processing directory datatype<br>
> Looking in ./f77/datatype/testlist<br>> Processing directory pt2pt<br>> Looking in ./f77/pt2pt/testlist<br>> Processing directory info<br>> Looking in ./f77/info/testlist<br>> ..<br>> .. (all remaining tests pass)<br>
><br>> Any ideas what may be causing this? It seems this issue could be related to the one I<br>> have in my application, since the HYPRE functions use MPI_Recv. I would greatly appreciate any<br>
> thoughts on why the pt2pt tests are failing and how to resolve this. Kindly note that I am testing this on<br>> RHEL5 with gcc version 4.1.2.<br>><br>> Thanks!<br>> --Sunil.<br>><br>><br>> On Mon, Jan 10, 2011 at 12:55 PM, Sunil Thomas <<a href="mailto:sgthomas27@gmail.com">sgthomas27@gmail.com</a>> wrote:<br>
> Thanks for the response. Moving forward, upon further debugging of the example code resulting in the<br>> "APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)" error (using gdb and by<br>> attaching to each process), here is what I got so far:<br>
><br>> --------------------------<br>> 0x0000000000401ad0 in main (argc=1, argv=0x7fff12a68028) at ex5.c:57<br>> 57 while (DebugWait);<br>> (gdb) r<br>> The program being debugged has been started already.<br>
> Start it from the beginning? (y or n) n<br>> Program not restarted.<br>> (gdb) set DebugWait = 0<br>> (gdb) s<br>> 61 n = 33;<br>> (gdb) n<br>> 62 solver_id = 0;<br>> (gdb) c<br>> Continuing.<br>
> Program received signal SIGSEGV, Segmentation fault.<br>> 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>> (gdb) bt<br>> #0 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>
> #1 0x00002b7198247d8c in hypre_MatvecCommPkgCreate () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>> #2 0x00002b7198234361 in hypre_BoomerAMGCreateS () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>
> #3 0x00002b71981f10f5 in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>> #4 0x0000000000402421 in main (argc=1, argv=0x7fff12a68028) at ex5.c:319<br>> (gdb) q<br>> --------------------------<br>
> 0x0000000000401ad0 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:57<br>> 57 while (DebugWait);<br>> (gdb) set DebugWait = 0<br>> (gdb) s<br>> 61 n = 33;<br>> (gdb) n<br>> 62 solver_id = 0;<br>
> (gdb) c<br>> Continuing.<br>> Program received signal SIGSEGV, Segmentation fault.<br>> 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>> (gdb) bt<br>
> #0 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>> #1 0x00002b8a5f72ab51 in hypre_BoomerAMGCoarsenFalgout () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>
> #2 0x00002b8a5f721cdf in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>> #3 0x0000000000402421 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:319<br>> (gdb) q<br>> ------------------------<br>
><br>><br>> Before digging any further in the 3rd party library HYPRE, does this give any useful info as to where the problem lies, in<br>> terms of ruling out say error with mpich2-1.3.1, etc? It seems like the problem is in the 3rd party library HYPRE (I am<br>
> using version 2.4.0b), but I am not 100% sure.<br>><br>> Thanks again.<br>> --Sunil.<br>><br>><br>><br>> On Sun, Jan 9, 2011 at 6:22 PM, Pavan Balaji <<a href="mailto:balaji@mcs.anl.gov">balaji@mcs.anl.gov</a>> wrote:<br>
><br>> Please keep mpich-discuss cc'ed.<br>><br>> ----- Original Message -----<br>> > Thanks Pavan!<br>> ><br>> > No I am not. I was simply searching for the error message I got. The<br>> > fact<br>
> > that the error is seen (whether using RMA or not) suggests the problem<br>> > could<br>> > still be with mpich2-1.3.1.<br>><br>> If the application terminates (for any reason), the process manager will display this error string. These two could be (and most likely are) completely unrelated problems.<br>
><br>> -- Pavan<br>><br>><br></div></div>
<div>
<div></div>
<div class="h5">> _______________________________________________<br>> mpich-discuss mailing list<br>> <a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br>> <a href="https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss</a><br>
</div></div></blockquote></div><br>
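<div>For reading the quoted test-suite output, the "Unexpected output in ..." lines can be grouped by test name. What follows is only a minimal Python sketch and not part of the MPICH test suite: the names PATTERN, summarize, fatal_tests and SAMPLE are hypothetical, and SAMPLE just copies a few lines from the log quoted above.</div>
<pre>
```python
import re
from collections import Counter

# Hypothetical helper names; the line format is taken from the quoted log.
PATTERN = re.compile(r"Unexpected output in (\w+): (.*)")


def summarize(log_text):
    """Count the unexpected-output lines reported for each test name."""
    counts = Counter()
    for line in log_text.splitlines():
        found = PATTERN.search(line)
        if found:
            counts[found.group(1)] += 1
    return counts


def fatal_tests(log_text):
    """Return the sorted names of tests whose output mentions 'Fatal error'."""
    names = set()
    for line in log_text.splitlines():
        found = PATTERN.search(line)
        if found and "Fatal error" in found.group(2):
            names.add(found.group(1))
    return sorted(names)


# A few lines copied from the log in this thread, for illustration only.
SAMPLE = """\
Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated
Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:
Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)
Processing directory spawn
"""

if __name__ == "__main__":
    for test, n in summarize(SAMPLE).items():
        print(test, n)
    print("fatal:", fatal_tests(SAMPLE))
```
</pre>
<div>Applied to the full log, this kind of grouping separates the tests that only report handles still allocated (scancel, pscancel, cancelrecv) from large_message, which is the one reporting a fatal MPI_Recv error.</div>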