<div>Hi, </div>
<div> </div>
<div> Thanks for the response. I am still not sure that the error is due to HYPRE. I will try to </div>
<div>isolate the behavior of the failing function in a smaller example and report back. The cause might still </div>
<div>be mpich2-1.3.1.</div>
<div> </div>
<div> But just to follow up, I ran the test suite after doing a fresh install. This time I built with </div>
<div>--with-device=ch3:sock, although the end result with my application is the same as with the default communication </div>
<div>device. I also included --enable-g and --enable-debuginfo. The test suite result is as follows (most of the </div>
<div>tests pass, but some point-to-point communication tests appear to fail):</div>
<div> </div>
<div><em><font color="#ff0000" size="1">Looking in ./testlist<br>Processing directory attr<br>Looking in ./attr/testlist<br>Processing directory coll<br>Looking in ./coll/testlist<br>Processing directory comm<br>Looking in ./comm/testlist<br>
Processing directory datatype<br>Looking in ./datatype/testlist<br>Processing directory errhan<br>Looking in ./errhan/testlist<br>Processing directory group<br>Looking in ./group/testlist<br>Processing directory info<br>Looking in ./info/testlist<br>
Processing directory init<br>Looking in ./init/testlist<br>Processing directory pt2pt</font></em></div>
<div><em><font color="#ff0000" size="1">Looking in ./pt2pt/testlist<br>Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated<br>Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated<br>
Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated<br>Unexpected output in scancel: [0] 24 at [0x000000000e443158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>
Unexpected output in scancel: [0] 32 at [0x000000000e443088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>Unexpected output in scancel: [0] 8 at [0x000000000e441a08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]<br>
Unexpected output in scancel: [0] 8 at [0x000000000e441958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]<br>Unexpected output in scancel: [0] 32 at [0x000000000e442268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated<br>Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated<br>
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated<br>Unexpected output in pscancel: [0] 24 at [0x000000000bec1158], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>
Unexpected output in pscancel: [0] 32 at [0x000000000bec1088], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>Unexpected output in pscancel: [0] 8 at [0x000000000bebfa08], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[91]<br>
Unexpected output in pscancel: [0] 8 at [0x000000000bebf958], /mpi/mpich2-1.3.1/src/util/procmap/local_proc.c[90]<br>Unexpected output in pscancel: [0] 32 at [0x000000000bec0268], rty/mpi/mpich2-1.3.1/src/mpid/ch3/src/mpid_vc.c[79]<br>
Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated<br>Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:<br>Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed<br>
Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()<br>Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456): <br>
Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318<br>Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)<br>
Unexpected output in large_message: [cli_1]: aborting job:<br>Unexpected output in large_message: Fatal error in MPI_Recv: Other MPI error, error stack:<br>Unexpected output in large_message: MPI_Recv(186).............................: MPI_Recv(buf=0x2ae606fa8010, count=270000000, MPI_LONG_LONG_INT, src=0, tag=0, MPI_COMM_WORLD, status=0x7fffa45959a0) failed<br>
Unexpected output in large_message: MPIDI_CH3i_Progress_wait(213).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()<br>Unexpected output in large_message: MPIDI_CH3I_Progress_handle_sock_event(456): <br>
Unexpected output in large_message: adjust_iov(828)...........................: ch3|sock|immedread 0x2ae60695bd40 0x4169698 0x4164318<br>Unexpected output in large_message: MPIDU_Sock_readv(426).....................: connection closed by peer (set=0,sock=1)<br>
Unexpected output in large_message: APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)<br>Program large_message exited without No Errors</font></em></div>
<div><em><font color="#ff0000" size="1">Looking in ./rma/testlist<br>Processing directory spawn<br>Looking in ./spawn/testlist<br>Processing directory topo<br>Looking in ./topo/testlist<br>Processing directory perf<br>Looking in ./perf/testlist<br>
Processing directory io<br>Looking in ./io/testlist<br>Processing directory f77<br>Looking in ./f77/testlist<br>Processing directory attr<br>Looking in ./f77/attr/testlist<br>Processing directory coll<br>Looking in ./f77/coll/testlist<br>
Processing directory datatype<br>Looking in ./f77/datatype/testlist<br>Processing directory pt2pt<br>Looking in ./f77/pt2pt/testlist<br>Processing directory info<br>Looking in ./f77/info/testlist</font></em></div>
<div><em><font color="#ff0000" size="1">..</font></em></div>
<div><em><font color="#ff0000" size="1">.. (all remaining tests pass)</font></em></div>
<div><br>Any ideas what may be causing this? It seems this issue could be related to the one I </div>
<div>have in my application, since the HYPRE functions use MPI_Recv calls. I would greatly appreciate any </div>
<div>thoughts on why the pt2pt tests are failing and how to resolve this. Kindly note that I am testing on </div>
<div>RHEL5 with gcc version 4.1.2.</div>
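<div>One observation on the large_message failure above, offered as a sketch and not as a diagnosis: the failing MPI_Recv uses count=270000000 with MPI_LONG_LONG_INT. Assuming 8 bytes per element (true for long long on x86-64 Linux), the payload is larger than a signed 32-bit int can represent. The helper names below are hypothetical and are not part of MPICH2; whether a 32-bit size limit is actually involved in the ch3:sock failure is an assumption I have not verified.</div>

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper (not an MPICH2 API): total payload size in bytes
 * for `count` elements of `elem_size` bytes each, computed in 64 bits
 * so the multiplication itself cannot wrap for these inputs. */
int64_t message_bytes(int64_t count, int64_t elem_size)
{
    assert(count >= 0 && elem_size > 0);
    return count * elem_size;
}

/* Returns 1 if the payload size does not fit in a signed int, else 0. */
int exceeds_int_max(int64_t count, int64_t elem_size)
{
    return message_bytes(count, elem_size) > (int64_t)INT_MAX;
}
```

<div>For the values in the log, message_bytes(270000000, 8) is 2160000000, which is above the usual INT_MAX of 2147483647, so exceeds_int_max(270000000, 8) returns 1. A smaller count would help tell a size-related failure apart from a general point-to-point problem.</div>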
<div> </div>
<div>Thanks!</div>
<div>--Sunil.</div>
<div><br> </div>
<div class="gmail_quote">On Mon, Jan 10, 2011 at 12:55 PM, Sunil Thomas <span dir="ltr">&lt;<a href="mailto:sgthomas27@gmail.com" target="_blank">sgthomas27@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div>Thanks for the response. Moving forward, I debugged the example code that produces the </div>
<div>"APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)" error by attaching gdb </div>
<div>to each process. Here is what I have so far:</div>
<div> </div>
<div>--------------------------</div>
<div>0x0000000000401ad0 in main (argc=1, argv=0x7fff12a68028) at ex5.c:57<br>57 while (DebugWait);<br>(gdb) r<br>The program being debugged has been started already.<br>Start it from the beginning? (y or n) n<br>
Program not restarted.<br>(gdb) set DebugWait = 0<br>(gdb) s<br>61 n = 33;<br>(gdb) n<br>62 solver_id = 0;<br>(gdb) c<br>Continuing.</div>
<div>Program received signal SIGSEGV, Segmentation fault.<br>0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>(gdb) bt<br>#0 0x00002b71982477e0 in hypre_MatvecCommPkgCreate_core () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>
#1 0x00002b7198247d8c in hypre_MatvecCommPkgCreate () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>#2 0x00002b7198234361 in hypre_BoomerAMGCreateS () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>
#3 0x00002b71981f10f5 in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>#4 0x0000000000402421 in main (argc=1, argv=0x7fff12a68028) at ex5.c:319<br>(gdb) q<br>--------------------------</div>
<div>0x0000000000401ad0 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:57<br>57 while (DebugWait);<br>(gdb) set DebugWait = 0<br>(gdb) s<br>61 n = 33;<br>(gdb) n<br>62 solver_id = 0;<br>(gdb) c<br>
Continuing.</div>
<div>Program received signal SIGSEGV, Segmentation fault.<br>0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>(gdb) bt<br>#0 0x00002b8a5f727f07 in hypre_BoomerAMGCoarsen () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>
#1 0x00002b8a5f72ab51 in hypre_BoomerAMGCoarsenFalgout () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>#2 0x00002b8a5f721cdf in hypre_BoomerAMGSetup () from /data/rpe/sypb/devl/3rdparty/hypre-2.4.0b/lib/libHYPRE.so<br>
#3 0x0000000000402421 in main (argc=1, argv=0x7fff4b539af8) at ex5.c:319<br>(gdb) q<br>------------------------</div>
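<div>For reference, the DebugWait spin loop seen at ex5.c:57 in the sessions above can be sketched roughly as follows. This is a minimal sketch and not the actual ex5.c code: the wait_for_debugger helper and the DEBUG_WAIT environment variable are hypothetical names introduced only for illustration.</div>

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* volatile so the compiler re-reads the flag on every loop iteration;
 * a debugger clears it, e.g. "set DebugWait = 0" as in the sessions above. */
volatile int DebugWait = 1;

/* Hypothetical helper: spin until a debugger clears DebugWait.
 * Only active when the (hypothetical) DEBUG_WAIT environment variable
 * is set. Returns 1 if the wait path was taken, 0 otherwise. */
int wait_for_debugger(void)
{
    if (getenv("DEBUG_WAIT") == NULL)
        return 0;
    /* Print the pid so a debugger can be attached to this process. */
    fprintf(stderr, "pid %ld waiting for debugger\n", (long)getpid());
    while (DebugWait)
        sleep(1);
    return 1;
}
```

<div>Each rank would call wait_for_debugger() early in main, before any HYPRE setup call. Declaring the flag volatile matters when building with optimization, since otherwise the compiler may assume the loop condition never changes.</div>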
<div> </div>
<div> </div>
<div>Before digging any further into the third-party library HYPRE, does this give any useful information about where the problem lies, </div>
<div>for example by ruling out an error in mpich2-1.3.1? The problem seems to be in HYPRE (I am </div>
<div>using version 2.4.0b), but I am not 100% sure. </div>
<div> </div>
<div>Thanks again.</div>
<div>--Sunil.</div>
<div>
<div></div>
<div>
<div><br><br> </div>
<div class="gmail_quote">On Sun, Jan 9, 2011 at 6:22 PM, Pavan Balaji <span dir="ltr">&lt;<a href="mailto:balaji@mcs.anl.gov" target="_blank">balaji@mcs.anl.gov</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid"><br>Please keep mpich-discuss cc'ed.<br>
<div><br>----- Original Message -----<br>> Thanks Pavan!<br>><br>> No I am not. I was simply searching for the error message I got. The<br>> fact<br>> that the error is seen (whether using RMA or not) suggests the problem<br>
> could<br>> still be with mpich2-1.3.1.<br><br></div>If the application terminates (for any reason), the process manager will display this error string. These two could be (and most likely are) completely unrelated problems.<br>
<font color="#888888"><br> -- Pavan<br></font></blockquote></div><br></div></div></blockquote></div><br>