[mpich-discuss] Deadlock when in MPI_THREAD_MULTIPLE within the MPI_Comm_disconnect()

Rajeev Thakur thakur at mcs.anl.gov
Thu Jul 17 13:34:33 CDT 2008


Do you have a small test program we could use to reproduce this error?
 
Rajeev


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Roberto Fichera
Sent: Thursday, July 17, 2008 12:05 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] Deadlock when in MPI_THREAD_MULTIPLE within the MPI_Comm_disconnect()


Hi All on the list,



I think I have found a deadlock in the latest MPICH2, v1.0.7. The scenario is
the following:



thread 1 is the main user application;

threads 2/3/4 use the MPI functions to dynamically spawn a slave on a chosen
node, exchange some data, wait for the slave to terminate, and finally call
MPI_Comm_disconnect() to release the master/slave intercommunicator;

thread 5 is the dispatcher of threads 2/3/4; it waits for their termination.



Looking at the call trace of thread 2, MPI is waiting for poll(), which was
called by MPIDU_Sock_wait(), to return; at this point we are inside
MPI_Comm_disconnect(). The call traces of threads 3 and 4 are also inside
MPI_Comm_disconnect(), but those threads are waiting on a condition variable,
called from MPIDI_CH3I_Progress(). So basically all three threads are stuck
in MPI_Comm_disconnect()!
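
The state shown in the traces below (one thread inside poll(), the others
parked in pthread_cond_wait(), all reached through the same progress function)
can be modelled with plain pthreads. What follows is only a toy sketch under
stated assumptions, not MPICH2 code: toy_progress_t, toy_progress_wait() and
toy_run() are hypothetical names, and a pipe stands in for the socket on which
the event everyone is waiting for would arrive.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

/* Toy stand-in for a shared progress engine; NOT MPICH2's real structure. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  completed;   /* broadcast when the poller returns */
    int      fd;                 /* descriptor the poller blocks on */
    int      poller_active;      /* 1 while one thread sits in poll() */
    int      waiters;            /* threads parked on the condition */
    int      pollers_seen;       /* threads that ever took the poll() path */
    unsigned completions;        /* bumped each time poll() returns */
} toy_progress_t;

/* The first caller blocks in poll(); later callers wait on the condition. */
static void toy_progress_wait(toy_progress_t *p)
{
    pthread_mutex_lock(&p->lock);
    if (p->poller_active) {
        unsigned seen = p->completions;
        p->waiters++;
        while (p->completions == seen)
            pthread_cond_wait(&p->completed, &p->lock);
        p->waiters--;
    } else {
        struct pollfd pfd = { .fd = p->fd, .events = POLLIN };
        p->poller_active = 1;
        p->pollers_seen++;
        pthread_mutex_unlock(&p->lock);
        while (poll(&pfd, 1, -1) < 0 && errno == EINTR)
            ;                    /* retry if interrupted by a signal */
        pthread_mutex_lock(&p->lock);
        p->poller_active = 0;
        p->completions++;
        pthread_cond_broadcast(&p->completed);
    }
    pthread_mutex_unlock(&p->lock);
}

static void *toy_worker(void *arg)
{
    toy_progress_wait((toy_progress_t *)arg);
    return NULL;
}

/* Start n workers (1 <= n <= 16), wait until one is polling and n-1 are
 * parked, then deliver the event. Returns how many threads took the poll()
 * path, or -1 on error. */
int toy_run(int n)
{
    pthread_t tid[16];
    int pipefd[2], i, rc, ready = 0;
    toy_progress_t p = { .poller_active = 0 };

    assert(n >= 1 && n <= 16);
    if (pipe(pipefd) != 0)
        return -1;
    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.completed, NULL);
    p.fd = pipefd[0];
    for (i = 0; i < n; i++) {
        rc = pthread_create(&tid[i], NULL, toy_worker, &p);
        assert(rc == 0);
    }
    while (!ready) {             /* reach the state seen in the backtraces */
        pthread_mutex_lock(&p.lock);
        ready = p.poller_active && p.waiters == n - 1;
        pthread_mutex_unlock(&p.lock);
        sched_yield();
    }
    /* Without this write, every worker would stay blocked forever. */
    if (write(pipefd[1], "x", 1) != 1)
        return -1;
    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    close(pipefd[0]);
    close(pipefd[1]);
    pthread_cond_destroy(&p.completed);
    pthread_mutex_destroy(&p.lock);
    return p.pollers_seen;
}
```

In this model toy_run(3) returns 1: exactly one thread took the poll() path
while the other two waited on the condition. If the write to the pipe never
happens, all three workers stay blocked, which is the same picture as
threads 2/3/4.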



Does anyone have an idea what's going on here?



Thanks in advance.

Roberto Fichera.



 

(gdb) thread 1

[Switching to thread 1 (Thread 46912533127120 (LWP 30857))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) bt

#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00002aaaab0b08a2 in Cond_wait () from /home/simone/.HRI/Proxy/HRI/Libraries/MThreads/1.1/lib/linux-x86_64-gcc-glibc2.3.4/libMThreads.so.1.1

#2  0x00002aaaabb8a787 in MTQueue_popWait (self=0x636b70, userClass=0x0, microsecs=0) at MTQueue.c:177

#3  0x000000000040642b in main (argc=1, argv=0x7fff4e7909f8) at ackley_master.cpp:265



//=============================================================================



(gdb) thread 2

[Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6

(gdb) bt

#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6

#1  0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#2  0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#3  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#4  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#5  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#6  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819

#7  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515

#8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0

#9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6



//=============================================================================



(gdb) thread 3

[Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) bt

#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819

#6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515

#7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0

#8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6



//=============================================================================



(gdb) thread 4

[Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) bt

#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1

#5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819

#6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515

#7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0

#8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6



//=============================================================================



(gdb) thread 5

[Switching to thread 5 (Thread 1105209680 (LWP 1276))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) bt

#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00002aaaab0b08a2 in Cond_wait () from /home/roberto/.HRI/Proxy/HRI/Libraries/MThreads/1.1/lib/linux-x86_64-gcc-glibc2.3.4/libMThreads.so.1.1

#2  0x00002aaaabd9e775 in Parallel_threadMain (arg=0x636830) at Parallel.c:645

#3  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0

#4  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
