<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
  <title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Rajeev Thakur wrote:
<blockquote cite="mid:004301c8e83b$c2888320$860add8c@mcs.anl.gov"
 type="cite">
  <div dir="ltr" align="left"><span class="290133418-17072008"><font
 color="#0000ff" face="Arial" size="2">Do you have a small test program
we could use to reproduce this error?</font></span></div>
</blockquote>
Unfortunately not at this time. In general, the threads that do the
MPI work perform these steps:<br>
<br>
1) MPI_Comm_spawn() with one slave, plus some MPI_Info keys for setting host and
wdir;<br>
2) Some send and receive using MPI_Send() and MPI_Recv();<br>
3) Final handshake for catching the slave termination. The slaves call
MPI_Ssend( NULL, 0, MPI_BYTE, 0, JOB_TERMINATION, parentcomm );<br>
&nbsp;&nbsp;&nbsp; while the threads call MPI_Recv( NULL, 0, MPI_BYTE, 0,
JOB_TERMINATION, self-&gt;childComm, MPI_STATUS_IGNORE );<br>
4) Finally the slaves are "destroyed" using an MPI_Comm_disconnect();<br>
<br>
This loop runs in each thread until all the jobs are
completed. The number of threads is equal to the number <br>
of nodes assigned to a given distributed execution. <br>
<br>
If needed, I could try to prepare a test program when possible. <br>
<br>
Finally, I often see threads blocked on the condition variable inside
MPIDI_CH3I_Progress(), reached from different places in the library. Sometimes<br>
I also see MPI_Comm_spawn() in the backtrace when the application
hangs because of this problem.<br>
<blockquote cite="mid:004301c8e83b$c2888320$860add8c@mcs.anl.gov"
 type="cite">
  <div dir="ltr" align="left"><span class="290133418-17072008"><font
 color="#0000ff" face="Arial" size="2">Rajeev</font></span></div>
  <br>
  <blockquote dir="ltr"
 style="border-left: 2px solid rgb(0, 0, 255); padding-left: 5px; margin-left: 5px; margin-right: 0px;">
    <div class="OutlookMessageHeader" dir="ltr" align="left"
 lang="en-us">
    <hr tabindex="-1"> <font face="Tahoma" size="2"><b>From:</b>
<a class="moz-txt-link-abbreviated" href="mailto:owner-mpich-discuss@mcs.anl.gov">owner-mpich-discuss@mcs.anl.gov</a>
[<a class="moz-txt-link-freetext" href="mailto:owner-mpich-discuss@mcs.anl.gov">mailto:owner-mpich-discuss@mcs.anl.gov</a>] <b>On Behalf Of </b>Roberto
Fichera<br>
    <b>Sent:</b> Thursday, July 17, 2008 12:05 PM<br>
    <b>To:</b> <a class="moz-txt-link-abbreviated" href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br>
    <b>Subject:</b> [mpich-discuss] Deadlock when in
MPI_THREAD_MULTIPLE within the MPI_Comm_disconnect()<br>
    </font><br>
    </div>
    <pre wrap="">Hi All on the list,

I think I have found a deadlock in the latest MPICH2 v1.0.7. The scenario is the following:

thread 1 is the main user's application;
threads 2/3/4 use the MPI functions to dynamically spawn a slave on a chosen node,
        exchange some data, wait for the slave termination and finally call
        MPI_Comm_disconnect() to release the master/slave intercommunicator;
thread 5 is the dispatcher of threads 2/3/4; it waits for their termination;

So, looking at the call trace of thread 2, MPI is waiting for poll(), which was called by MPIDU_Sock_wait(),
to return; here we are within MPI_Comm_disconnect(). The call traces of threads 3/4 are also in
MPI_Comm_disconnect(), but they are waiting on a condition inside MPIDI_CH3I_Progress(). So basically all
three threads are stuck in MPI_Comm_disconnect()!

Does anyone have an idea what's going on here?

Thanks in advance.
Roberto Fichera.

(gdb) thread 1
[Switching to thread 1 (Thread 46912533127120 (LWP 30857))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaab0b08a2 in Cond_wait () from /home/simone/.HRI/Proxy/HRI/Libraries/MThreads/1.1/lib/linux-x86_64-gcc-glibc2.3.4/libMThreads.so.1.1
#2  0x00002aaaabb8a787 in MTQueue_popWait (self=0x636b70, userClass=0x0, microsecs=0) at MTQueue.c:177
#3  0x000000000040642b in main (argc=1, argv=0x7fff4e7909f8) at ackley_master.cpp:265

//=============================================================================

(gdb) thread 2
[Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
#1  0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#2  0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#3  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#4  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#5  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#6  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819
#7  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515
#8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
#9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6

//=============================================================================

(gdb) thread 3
[Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819
#6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515
#7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
#8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6

//=============================================================================

(gdb) thread 4
[Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
#5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819
#6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515
#7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
#8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6

//=============================================================================

(gdb) thread 5
[Switching to thread 5 (Thread 1105209680 (LWP 1276))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaab0b08a2 in Cond_wait () from /home/roberto/.HRI/Proxy/HRI/Libraries/MThreads/1.1/lib/linux-x86_64-gcc-glibc2.3.4/libMThreads.so.1.1
#2  0x00002aaaabd9e775 in Parallel_threadMain (arg=0x636830) at Parallel.c:645
#3  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
#4  0x00000033c94d4b0d in clone () from /lib64/libc.so.6</pre>
  </blockquote>
</blockquote>
<br>
</body>
</html>