[mpich-discuss] Re: [mvapich-discuss] Races with MPI_THREAD_MULTIPLE

Roberto Fichera kernel at tekno-soft.it
Fri Jul 18 11:49:48 CDT 2008


Dhabaleswar Panda wrote:
> Hi Roberto,
>
> Thanks for your note. You are using the ch3:sock device in MVAPICH2, which
> is the same as in MPICH2. You are also seeing similar failure scenarios (but
> in different forms) with MPICH2 1.0.7. I am cc'ing this message to the mpich2
> mailing list. One of the MPICH2 developers will be able to help with
> this issue faster.
>   
Thanks for that. About the MPICH2 problem, I already sent an email
regarding the related issue. The strange thing is that when linking
against MPICH2 the race does not show up as quickly as it does with
MVAPICH2: in the MPICH2 case I had to wait one or two hours before the
lock-up.
> Thanks,
>
> DK
>
>
> On Fri, 18 Jul 2008, Roberto Fichera wrote:
>
>   
>> Hi All on the list,
>>
>> I'm trying to use mvapich2 v1.2rc1 in a multithreaded application,
>> initialized with MPI_THREAD_MULTIPLE. The master application starts
>> several threads, one per assigned node, and on each node a slave
>> application is spawned with MPI_Comm_spawn(). Before calling
>> MPI_Comm_spawn() I prepare a dedicated MPI_Info struct for each
>> thread, setting the keys (host and wdir) needed to obtain the wanted
>> behaviour. As soon as the master application starts, it hits the race
>> almost immediately when running with 4 nodes, 1 master and 3 slaves.
>> Below you can see the status of the master application at the time of
>> the race (a minimal sketch of this per-thread spawn pattern follows
>> the configure options below). It seems stuck in PMIU_readline(), which
>> never returns, so the global lock is never released. MVAPICH2
>> is compiled with:
>>
>> PKG_PATH=/HRI/External/mvapich2/1.2rc1
>>
>> ./configure --prefix=$PKG_PATH \
>>             --bindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \
>>             --sbindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \
>>             --libdir=$PKG_PATH/lib/linux-x86_64-gcc-glibc2.3.4 \
>>             --enable-sharedlibs=gcc \
>>             --enable-f90 \
>>             --enable-threads=multiple \
>>             --enable-g=-ggdb \
>>             --enable-debuginfo \
>>             --with-device=ch3:sock \
>>             --datadir=$PKG_PATH/data  \
>>             --with-htmldir=$PKG_PATH/doc/html \
>>             --with-docdir=$PKG_PATH/doc \
>>             LDFLAGS='-Wl,-z,noexecstack'
>>
>> so I'm using the ch3:sock device.
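>>
>> To make this concrete, here is a minimal sketch of what each worker
>> thread does (hypothetical names, hosts and paths; not the actual
>> ParallelWorker code):
>>
>> #include <mpi.h>
>> #include <pthread.h>
>> #include <stdio.h>
>>
>> /* one spawn per thread, each with its own MPI_Info (host, wdir keys) */
>> static void *spawn_slave(void *arg)
>> {
>>     char *host = (char *)arg;           /* hypothetical target host */
>>     MPI_Info info;
>>     MPI_Comm slave;
>>
>>     MPI_Info_create(&info);
>>     MPI_Info_set(info, "host", host);
>>     MPI_Info_set(info, "wdir", "/tmp"); /* hypothetical working dir */
>>
>>     /* each thread spawns its own slave executable */
>>     MPI_Comm_spawn("slave", MPI_ARGV_NULL, 1, info, 0,
>>                    MPI_COMM_SELF, &slave, MPI_ERRCODES_IGNORE);
>>
>>     MPI_Info_free(&info);
>>     return NULL;
>> }
>>
>> int main(int argc, char **argv)
>> {
>>     int i, provided;
>>     pthread_t t[3];
>>     char *hosts[3] = { "node1", "node2", "node3" };
>>
>>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>     if (provided < MPI_THREAD_MULTIPLE)
>>         fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
>>
>>     for (i = 0; i < 3; i++)
>>         pthread_create(&t[i], NULL, spawn_slave, hosts[i]);
>>     for (i = 0; i < 3; i++)
>>         pthread_join(t[i], NULL);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> Each thread uses its own MPI_Info and its own intercommunicator, so in
>> principle only the MPI library's internal global lock is shared
>> between them.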
>>
>> -----Thread 2
>> [Switching to thread 2 (Thread 1115699536 (LWP 29479))]#0
>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1  0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0
>> --->>#2  0x00000033ca408390 in pthread_mutex_lock () from
>> /lib64/libpthread.so.0
>> --->>#3  0x00002aaaab382654 in PMPI_Info_set () from
>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #4  0x0000000000417627 in ParallelWorker_setSlaveInfo (self=<value
>> optimized out>, key=0x0, value=0x33ca40ff58 <garbled string data>...)
>> at ParallelWorker.c:664
>> #5  0x0000000000418905 in ParallelWorker_handleParallel (self=0x62ff50)
>> at ParallelWorker.c:719
>> #6  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ff50) at
>> ParallelWorker.c:504
>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>
>> -----Thread 3
>> [Switching to thread 3 (Thread 1105209680 (LWP 29478))]#0
>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1  0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0
>> --->>#2  0x00000033ca408390 in pthread_mutex_lock () from
>> /lib64/libpthread.so.0
>> --->>#3  0x00002aaaab382654 in PMPI_Info_set () from
>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #4  0x0000000000417627 in ParallelWorker_setSlaveInfo (self=<value
>> optimized out>, key=0x0, value=0x33ca40ff58 <garbled string data>...)
>> at ParallelWorker.c:664
>> #5  0x0000000000418905 in ParallelWorker_handleParallel (self=0x62f270)
>> at ParallelWorker.c:719
>> #6  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62f270) at
>> ParallelWorker.c:504
>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>
>> -----Thread 4
>> [Switching to thread 4 (Thread 1094719824 (LWP 29477))]#0
>> 0x00000033ca40d34b in read () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x00000033ca40d34b in read () from /lib64/libpthread.so.0
>> --->>#1  0x00002aaaab3db84a in PMIU_readline () from
>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> --->>#2  0x00002aaaab3d9d37 in PMI_Spawn_multiple () from
>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #3  0x00002aaaab333893 in MPIDI_Comm_spawn_multiple () from
>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #4  0x00002aaaab38bcf6 in MPID_Comm_spawn_multiple () from
>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #5  0x00002aaaab355a10 in PMPI_Comm_spawn () from
>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #6  0x00000000004189d8 in ParallelWorker_handleParallel (self=0x62ad40)
>> at ParallelWorker.c:754
>> #7  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ad40) at
>> ParallelWorker.c:504
>> #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>> #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>
>> I also tried to run against MPICH2 v1.0.7; there I got a similar
>> scenario, which shows up only after 1-2 hours of execution.
>> See below:
>>
>> ----- thread 2
>> [Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
>> #1  0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #2  0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #3  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #4  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #5  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #6  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819
>> #7  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515
>> #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>> #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>
>> ----- thread 3
>> [Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819
>> #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515
>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>
>>
>> ----- thread 4
>> [Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>> #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819
>> #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515
>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>
>> Here thread 2 is blocked in poll() and never returns, so it never
>> signals poll() completion, and all the other waiters on the
>> MPIDI_CH3I_Progress() condition will never wake up.
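>>
>> The pattern described above looks roughly like the following
>> (a simplified illustration of a single-poller progress loop, not the
>> actual MPICH2 MPIDI_CH3I_Progress source):
>>
>> #include <pthread.h>
>> #include <poll.h>
>>
>> static pthread_mutex_t progress_mutex = PTHREAD_MUTEX_INITIALIZER;
>> static pthread_cond_t  progress_cond  = PTHREAD_COND_INITIALIZER;
>> static int progress_count = 0;
>>
>> static void progress_wait(struct pollfd *fds, nfds_t nfds, int i_am_poller)
>> {
>>     pthread_mutex_lock(&progress_mutex);
>>     if (i_am_poller) {
>>         pthread_mutex_unlock(&progress_mutex);
>>         poll(fds, nfds, -1);            /* if this never returns ...      */
>>         pthread_mutex_lock(&progress_mutex);
>>         progress_count++;
>>         pthread_cond_broadcast(&progress_cond); /* ... this never happens */
>>     } else {
>>         int seen = progress_count;
>>         while (progress_count == seen)  /* the other threads sleep here   */
>>             pthread_cond_wait(&progress_cond, &progress_mutex);
>>     }
>>     pthread_mutex_unlock(&progress_mutex);
>> }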
>>
>> Is anyone else having the same problem?
>>
>> Thanks in advance,
>> Roberto Fichera.
>>
>>     
>
>
>   


