<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.6000.16674" name=GENERATOR></HEAD>
<BODY text=#000000 bgColor=#ffffff>
<DIV dir=ltr align=left><SPAN class=290133418-17072008><FONT face=Arial
color=#0000ff size=2>Do you have a small test program we could use to reproduce
this error?</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=290133418-17072008><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=290133418-17072008><FONT face=Arial
color=#0000ff size=2>Rajeev</FONT></SPAN></DIV><BR>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> owner-mpich-discuss@mcs.anl.gov
[mailto:owner-mpich-discuss@mcs.anl.gov] <B>On Behalf Of </B>Roberto
Fichera<BR><B>Sent:</B> Thursday, July 17, 2008 12:05 PM<BR><B>To:</B>
mpich-discuss@mcs.anl.gov<BR><B>Subject:</B> [mpich-discuss] Deadlock when in
MPI_THREAD_MULTIPLE within the MPI_Comm_disconnect()<BR></FONT><BR></DIV>
<DIV></DIV><PRE wrap="">Hi all,
I think I have found a deadlock in the latest MPICH2, v1.0.7. The scenario is the following:
thread 1 is the main thread of the user's application;
threads 2, 3 and 4 use MPI functions to dynamically spawn a slave on a chosen node,
exchange some data with it, wait for the slave to terminate, and finally call
MPI_Comm_disconnect() to release the master/slave intercommunicator;
thread 5 is the dispatcher of threads 2, 3 and 4, and waits for them to terminate.
Looking at the call trace of thread 2, MPI is waiting for poll(), which was called by MPIDU_Sock_wait(),
to return; this happens inside MPI_Comm_disconnect(). The call traces of threads 3 and 4 are also inside
MPI_Comm_disconnect(), but they are waiting on a condition variable (pthread_cond_wait()) called from
MPIDI_CH3I_Progress(). So basically all three threads are stuck in MPI_Comm_disconnect()!
Does anyone have an idea what's going on here?
Thanks in advance,
Roberto Fichera
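The backtraces below show one thread blocked in poll() and the others blocked in
pthread_cond_wait(), all underneath MPIDI_CH3I_Progress(). To make that picture concrete, here is
a minimal pthreads-only sketch of that kind of hand-off. It is an assumption about the general
pattern, not code taken from the MPICH2 sources: the names wait_for_close, run_disconnect_model,
poller_active, closed and event_fd are made up for the illustration, and a pipe stands in for the
sockets.

```c
/* Hypothetical model, NOT MPICH2 source: one thread at a time blocks in
 * poll() on behalf of everyone, the other threads sleep on a condition
 * variable until the poller reports progress. */
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

#define NWORKERS 3

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t progress_cond = PTHREAD_COND_INITIALIZER;
static int poller_active;      /* non-zero while some thread is inside poll() */
static int closed[NWORKERS];   /* closed[i] is set once "connection" i is closed */
static int event_fd[2];        /* pipe standing in for the sockets */

/* Block until connection `id` is closed (models the wait inside disconnect). */
static void wait_for_close(int id)
{
    pthread_mutex_lock(&lock);
    while (!closed[id]) {
        if (poller_active) {
            /* Another thread is polling: sleep until it reports progress. */
            pthread_cond_wait(&progress_cond, &lock);
        } else {
            unsigned char which = 0;
            struct pollfd pfd = { .fd = event_fd[0], .events = POLLIN };

            poller_active = 1;
            pthread_mutex_unlock(&lock);
            poll(&pfd, 1, -1);                        /* block for the next event */
            ssize_t n = read(event_fd[0], &which, 1); /* which connection closed */
            pthread_mutex_lock(&lock);

            if (n == 1 && which < NWORKERS)
                closed[which] = 1;
            poller_active = 0;
            /* Wake every waiter so that one of them can take over polling. */
            pthread_cond_broadcast(&progress_cond);
        }
    }
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    wait_for_close(*(int *)arg);
    return NULL;
}

/* Runs the model once. Returns the number of workers that were released,
 * or -1 if the pipe could not be set up or written (in the write-failure
 * case the workers are left running, which is acceptable for a sketch). */
int run_disconnect_model(void)
{
    pthread_t tid[NWORKERS];
    int ids[NWORKERS];
    int done = 0;

    if (pipe(event_fd) != 0)
        return -1;
    poller_active = 0;
    for (int i = 0; i < NWORKERS; i++) {
        ids[i] = i;
        closed[i] = 0;
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, &ids[i]);

    /* Deliver the "close" events in reverse order of thread creation. */
    for (int i = NWORKERS - 1; i >= 0; i--) {
        unsigned char which = (unsigned char)i;
        if (write(event_fd[1], &which, 1) != 1)
            return -1;
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NWORKERS; i++)
        done += closed[i];
    close(event_fd[0]);
    close(event_fd[1]);
    return done;
}
```

Calling run_disconnect_model() from main() and checking that it returns NWORKERS exercises the
hand-off. In a scheme like this, if the broadcast after poll() were skipped, or if the polling
thread never received the event it is waiting for, the other threads would stay in
pthread_cond_wait() forever, which resembles what threads 3 and 4 show. Whether that is what
actually happens inside MPICH2 1.0.7 is not established here.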
(gdb) thread 1
[Switching to thread 1 (Thread 46912533127120 (LWP 30857))]#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aaaab0b08a2 in Cond_wait () from /home/simone/.HRI/Proxy/HRI/Libraries/MThreads/1.1/lib/linux-x86_64-gcc-glibc2.3.4/libMThreads.so.1.1
#2 0x00002aaaabb8a787 in MTQueue_popWait (self=0x636b70, userClass=0x0, microsecs=0) at MTQueue.c:177
#3 0x000000000040642b in main (argc=1, argv=0x7fff4e7909f8) at ackley_master.cpp:265
//=============================================================================
(gdb) thread 2
[Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0 0x00000033c94cbd66 in poll () from /lib64/libc.so.6
(gdb) bt
#0 0x00000033c94cbd66 in poll () from /lib64/libc.so.6
#1 0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#2 0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#3 0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#4 0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#5 0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#6 0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819
#7 0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515
#8 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
#9 0x00000033c94d4b0d in clone () from /lib64/libc.so.6
//=============================================================================
(gdb) thread 3
[Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#2 0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#3 0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#4 0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#5 0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819
#6 0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515
#7 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
#8 0x00000033c94d4b0d in clone () from /lib64/libc.so.6
//=============================================================================
(gdb) thread 4
[Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#2 0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#3 0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#4 0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#5 0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819
#6 0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515
#7 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
#8 0x00000033c94d4b0d in clone () from /lib64/libc.so.6
//=============================================================================
(gdb) thread 5
[Switching to thread 5 (Thread 1105209680 (LWP 1276))]#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aaaab0b08a2 in Cond_wait () from /home/roberto/.HRI/Proxy/HRI/Libraries/MThreads/1.1/lib/linux-x86_64-gcc-glibc2.3.4/libMThreads.so.1.1
#2 0x00002aaaabd9e775 in Parallel_threadMain (arg=0x636830) at Parallel.c:645
#3 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
#4 0x00000033c94d4b0d in clone () from /lib64/libc.so.6</PRE></BLOCKQUOTE></BODY></HTML>