[mpich-discuss] Deadlock when in MPI_THREAD_MULTIPLE within the MPI_Comm_disconnect()

Roberto Fichera kernel at tekno-soft.it
Fri Jul 18 03:02:25 CDT 2008


Rajeev Thakur wrote:
> Do you have a small test program we could use to reproduce this error?
Unfortunately not at this time. In general, the threads that do the MPI 
work perform the following steps:

1) MPI_Comm_spawn() with one slave, plus some MPI_Info keys for setting host and wdir;
2) Some sends and receives using MPI_Send() and MPI_Recv();
3) A final handshake for catching the slave termination. The slaves use
       MPI_Ssend( NULL, 0, MPI_BYTE, 0, JOB_TERMINATION, parentcomm );
   while the threads use
       MPI_Recv( NULL, 0, MPI_BYTE, 0, JOB_TERMINATION, self->childComm, MPI_STATUS_IGNORE );
4) Finally, the slaves are "destroyed" using MPI_Comm_disconnect().

This kind of loop is executed by multiple threads until all the jobs are 
completed. The number of threads is equal to the number
of nodes assigned to a given distributed execution.
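
As a rough illustration of the loop described above, here is a minimal
sketch, not the real application code. Everything not named in the steps
above is an assumption made up for the example: the JobPool and Worker
types, run_jobs() and run_one_job(), the "./slave" executable, the "/tmp"
wdir, the tag value 99, the int payload, and the use of MPI_COMM_SELF as
the spawning communicator. The MPI calls are only compiled when USE_MPI is
defined, so the threading skeleton can be built and run on its own.

```c
#include <assert.h>
#include <pthread.h>
#ifdef USE_MPI
#include <mpi.h>
#endif

#define MAX_NODES 16
#define JOB_TERMINATION 99   /* assumed tag value; the real one is not given */

typedef struct {
    pthread_mutex_t lock;
    int next_job;            /* index of the next job to hand out */
    int total_jobs;
    int completed;           /* jobs whose slave has been disconnected */
} JobPool;

typedef struct {
    JobPool *pool;
    const char *host;        /* node assigned to this thread */
} Worker;

/* One master/slave cycle: spawn, exchange data, handshake, disconnect. */
static void run_one_job(const Worker *w, int job)
{
#ifdef USE_MPI
    MPI_Comm child;
    MPI_Info info;
    int result;

    /* 1) spawn one slave on the assigned host */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", w->host);
    MPI_Info_set(info, "wdir", "/tmp");              /* assumed directory */
    MPI_Comm_spawn("./slave", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);

    /* 2) exchange some data (assumed payload: a single int each way) */
    MPI_Send(&job, 1, MPI_INT, 0, 0, child);
    MPI_Recv(&result, 1, MPI_INT, 0, 0, child, MPI_STATUS_IGNORE);

    /* 3) final handshake, matching the slave's MPI_Ssend() */
    MPI_Recv(NULL, 0, MPI_BYTE, 0, JOB_TERMINATION, child, MPI_STATUS_IGNORE);

    /* 4) release the master/slave intercommunicator */
    MPI_Comm_disconnect(&child);
#else
    (void)w;                 /* stub: no MPI work in a non-MPI build */
    (void)job;
#endif
}

/* Each worker thread pulls jobs until none are left. */
static void *worker_main(void *arg)
{
    Worker *w = arg;
    for (;;) {
        int job = -1;
        pthread_mutex_lock(&w->pool->lock);
        if (w->pool->next_job < w->pool->total_jobs)
            job = w->pool->next_job++;
        pthread_mutex_unlock(&w->pool->lock);
        if (job < 0)
            break;

        run_one_job(w, job);

        pthread_mutex_lock(&w->pool->lock);
        w->pool->completed++;
        pthread_mutex_unlock(&w->pool->lock);
    }
    return NULL;
}

/* Dispatcher: one thread per node; returns the number of completed jobs. */
int run_jobs(const char *hosts[], int nhosts, int total_jobs)
{
    JobPool pool;
    Worker workers[MAX_NODES];
    pthread_t tids[MAX_NODES];
    int started = 0;

    assert(nhosts > 0 && nhosts <= MAX_NODES);
    pthread_mutex_init(&pool.lock, NULL);
    pool.next_job = 0;
    pool.total_jobs = total_jobs;
    pool.completed = 0;

    for (int i = 0; i < nhosts; i++) {
        workers[i].pool = &pool;
        workers[i].host = hosts[i];
        if (pthread_create(&tids[started], NULL, worker_main, &workers[i]) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&pool.lock);
    return pool.completed;
}
```

In an MPI build (compiled with -DUSE_MPI), the caller would first have to
call MPI_Init_thread() requesting MPI_THREAD_MULTIPLE and check the
provided level before calling run_jobs().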

If needed, I could try to prepare a test program when possible.

Finally, I often see threads blocked on the condition variable inside 
MPIDI_CH3I_Progress(), reached from different places in the library. Sometimes
I also see MPI_Comm_spawn() in the backtrace when the application 
locks up because of this problem.
> Rajeev
>
>     ------------------------------------------------------------------------
>     *From:* owner-mpich-discuss at mcs.anl.gov
>     [mailto:owner-mpich-discuss at mcs.anl.gov] *On Behalf Of *Roberto
>     Fichera
>     *Sent:* Thursday, July 17, 2008 12:05 PM
>     *To:* mpich-discuss at mcs.anl.gov
>     *Subject:* [mpich-discuss] Deadlock when in MPI_THREAD_MULTIPLE
>     within the MPI_Comm_disconnect()
>
>     Hi All on the list,
>
>     I think I have found a deadlock in the latest MPICH2 v1.0.7. The scenario is the following:
>
>     thread 1 is the user's main application thread;
>     threads 2/3/4 use the MPI functions to dynamically spawn a slave on a chosen node,
>             exchange some data, wait for the slave's termination, and finally call 
>             MPI_Comm_disconnect() to release the master/slave intercommunicator;
>     thread 5 is the dispatcher of threads 2/3/4; it waits for their termination.
>
>     So, looking at the call trace of thread 2, MPI is waiting for poll(), which was called by MPIDU_Sock_wait(), 
>     to return; here we are inside MPI_Comm_disconnect(). The call traces of threads 3/4 are also in 
>     MPI_Comm_disconnect(), but they are waiting on a condition variable inside MPIDI_CH3I_Progress(). So basically all
>     three threads are stuck in MPI_Comm_disconnect()! 
>
>     Does anyone have an idea what's going on here?
>
>     Thanks in advance.
>     Roberto Fichera.
>
>      
>     (gdb) thread 1
>     [Switching to thread 1 (Thread 46912533127120 (LWP 30857))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>     (gdb) bt
>     #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>     #1  0x00002aaaab0b08a2 in Cond_wait () from /home/simone/.HRI/Proxy/HRI/Libraries/MThreads/1.1/lib/linux-x86_64-gcc-glibc2.3.4/libMThreads.so.1.1
>     #2  0x00002aaaabb8a787 in MTQueue_popWait (self=0x636b70, userClass=0x0, microsecs=0) at MTQueue.c:177
>     #3  0x000000000040642b in main (argc=1, argv=0x7fff4e7909f8) at ackley_master.cpp:265
>
>     //=============================================================================
>
>     (gdb) thread 2
>     [Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
>     (gdb) bt
>     #0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
>     #1  0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #2  0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #3  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #4  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #5  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #6  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819
>     #7  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515
>     #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>     #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>
>     //=============================================================================
>
>     (gdb) thread 3
>     [Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>     (gdb) bt
>     #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>     #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819
>     #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515
>     #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>     #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>
>     //=============================================================================
>
>     (gdb) thread 4
>     [Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>     (gdb) bt
>     #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>     #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>     #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819
>     #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515
>     #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>     #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>
>     //=============================================================================
>
>     (gdb) thread 5
>     [Switching to thread 5 (Thread 1105209680 (LWP 1276))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>     (gdb) bt
>     #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>     #1  0x00002aaaab0b08a2 in Cond_wait () from /home/roberto/.HRI/Proxy/HRI/Libraries/MThreads/1.1/lib/linux-x86_64-gcc-glibc2.3.4/libMThreads.so.1.1
>     #2  0x00002aaaabd9e775 in Parallel_threadMain (arg=0x636830) at Parallel.c:645
>     #3  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>     #4  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>
