[mpich-discuss] Re: [mvapich-discuss] Races with MPI_THREAD_MULTIPLE

Roberto Fichera kernel at tekno-soft.it
Fri Jul 18 13:17:55 CDT 2008


Matthew Koop wrote:
> Hi Roberto,
>
> Are you using the new 'mpirun_rsh' command for launching your job? If so,
> that would explain the hang you see in the PMI calls (and why they happen
> at the spawn).
>
> We currently do not support spawn functionality in this release for
> mpirun_rsh. You will need to use MPD if your application needs spawn
> functionality until we release an updated version of mpirun_rsh.
>   
No! I use a Torque script to run it, or the interactive way with
qsub -I TestParallel1.pbs. Below is the relevant part of the PBS script,
followed by an example of what the node-file transformation produces:

## Compute the number of associated nodes
NODES=`wc -l < $PBS_NODEFILE`

# Arrange the PBS host file for MPI handling
TMPFILE=`mktemp` || exit 1
sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }' > $TMPFILE

## start MPI with the requested nodes
$HGR/External/mvapich2/1.2/bin/$MAKEFILE_PLATFORM/mpdboot -n $NODES -f $TMPFILE

## Run the application
/data/roberto/newBST/Libraries/Parallelization/Parallel/1.0/examples/TestParallel1.sh

# remove the temporary file
rm -f $TMPFILE

## Quit from MPI
$HGR/External/mvapich2/1.2/bin/$MAKEFILE_PLATFORM/mpdallexit
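
For clarity, this is what the sort/uniq/awk pipeline above produces for a
hypothetical $PBS_NODEFILE (the host names are made up); the result is the
host:ncpus format that mpdboot accepts in its -f hosts file:

$ cat $PBS_NODEFILE
node01
node01
node02
node02

$ sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }'
node01:2
node02:2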

> Thanks,
>
> Matt
>
> On Fri, 18 Jul 2008, Roberto Fichera wrote:
>
>   
>> Dhabaleswar Panda wrote:
>>     
>>> Hi Roberto,
>>>
>>> Thanks for your note. You are using the ch3:sock device in MVAPICH2 which
>>> is the same as MPICH2. You are also seeing similar failure scenarios (but
>>> in different forms) with MPICH2 1.0.7. I am cc'ing this message to the mpich2
>>> mailing list. One of the MPICH2 developers will be able to help with this
>>> issue faster.
>>>
>>>       
>> Thanks for that. About the mpich2 problem, I already sent an email
>> describing the related issue. The strange thing is that when linking
>> against mpich2 the race does not show up as quickly as it does with
>> mvapich2: with mpich2 I had to wait 1 or 2 hours before the deadlock.
>>     
>>> Thanks,
>>>
>>> DK
>>>
>>>
>>> On Fri, 18 Jul 2008, Roberto Fichera wrote:
>>>
>>>
>>>       
>>>> Hi All on the list,
>>>>
>>>> I'm trying to use mvapich2 v1.2rc1 in a multithreaded application,
>>>> initialized with MPI_THREAD_MULTIPLE.
>>>> The master application starts several threads, one per assigned node,
>>>> and each thread spawns a slave application on its node with
>>>> MPI_Comm_spawn(). Before calling MPI_Comm_spawn() I prepare a separate
>>>> MPI_Info struct for each thread, setting the keys (host and wdir) that
>>>> select the wanted behaviour. As soon as the master application starts,
>>>> it immediately hits the race with 4 nodes, 1 master and 3 slaves. Below
>>>> you can see the status of the master application at the time of the
>>>> hang. It seems stuck in PMIU_readline(), which never returns, so the
>>>> global lock is never released. MVAPICH2 is compiled with:
>>>>
>>>> PKG_PATH=/HRI/External/mvapich2/1.2rc1
>>>>
>>>> ./configure --prefix=$PKG_PATH \
>>>>             --bindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \
>>>>             --sbindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \
>>>>             --libdir=$PKG_PATH/lib/linux-x86_64-gcc-glibc2.3.4 \
>>>>             --enable-sharedlibs=gcc \
>>>>             --enable-f90 \
>>>>             --enable-threads=multiple \
>>>>             --enable-g=-ggdb \
>>>>             --enable-debuginfo \
>>>>             --with-device=ch3:sock \
>>>>             --datadir=$PKG_PATH/data  \
>>>>             --with-htmldir=$PKG_PATH/doc/html \
>>>>             --with-docdir=$PKG_PATH/doc \
>>>>             LDFLAGS='-Wl,-z,noexecstack'
>>>>
>>>> so I'm using the ch3:sock device.
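>>>>
>>>> To show the pattern more concretely, here is a minimal sketch of the
>>>> per-thread spawn logic described above. This is NOT the actual
>>>> ParallelWorker code: the host names, the working directory and the
>>>> slave binary name are placeholders.
>>>>
>>>> /* one spawning thread per slave node, each with its own MPI_Info */
>>>> #include <mpi.h>
>>>> #include <pthread.h>
>>>> #include <stdio.h>
>>>>
>>>> static void *spawn_slave(void *arg)
>>>> {
>>>>     const char *host = (const char *)arg;  /* placeholder target node */
>>>>     MPI_Info    info;
>>>>     MPI_Comm    intercomm;
>>>>
>>>>     MPI_Info_create(&info);
>>>>     MPI_Info_set(info, "host", (char *)host);
>>>>     MPI_Info_set(info, "wdir", "/tmp");    /* placeholder working dir */
>>>>
>>>>     /* spawn one slave process on the requested host */
>>>>     MPI_Comm_spawn("slave_binary", MPI_ARGV_NULL, 1, info, 0,
>>>>                    MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>>>>
>>>>     MPI_Info_free(&info);
>>>>     MPI_Comm_disconnect(&intercomm);  /* slave disconnects its side too */
>>>>     return NULL;
>>>> }
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     const char *hosts[] = { "node1", "node2", "node3" };
>>>>     pthread_t   tid[3];
>>>>     int         provided, i;
>>>>
>>>>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>>     if (provided < MPI_THREAD_MULTIPLE)
>>>>         fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not provided\n");
>>>>
>>>>     for (i = 0; i < 3; i++)
>>>>         pthread_create(&tid[i], NULL, spawn_slave, (void *)hosts[i]);
>>>>     for (i = 0; i < 3; i++)
>>>>         pthread_join(tid[i], NULL);
>>>>
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }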
>>>>
>>>> -----Thread 2
>>>> [Switching to thread 2 (Thread 1115699536 (LWP 29479))]#0
>>>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>>>> (gdb) bt
>>>> #0  0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>>>> #1  0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0
>>>> --->>#2  0x00000033ca408390 in pthread_mutex_lock () from
>>>> /lib64/libpthread.so.0
>>>> --->>#3  0x00002aaaab382654 in PMPI_Info_set () from
>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #4  0x0000000000417627 in ParallelWorker_setSlaveInfo (self=<value optimized out>,
>>>>     key=0x0, value=0x33ca40ff58 "<unprintable binary data>"...) at ParallelWorker.c:664
>>>> #5  0x0000000000418905 in ParallelWorker_handleParallel (self=0x62ff50)
>>>> at ParallelWorker.c:719
>>>> #6  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ff50) at
>>>> ParallelWorker.c:504
>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>
>>>> -----Thread 3
>>>> [Switching to thread 3 (Thread 1105209680 (LWP 29478))]#0
>>>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>>>> (gdb) bt
>>>> #0  0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>>>> #1  0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0
>>>> --->>#2  0x00000033ca408390 in pthread_mutex_lock () from
>>>> /lib64/libpthread.so.0
>>>> --->>#3  0x00002aaaab382654 in PMPI_Info_set () from
>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #4  0x0000000000417627 in ParallelWorker_setSlaveInfo (self=<value optimized out>,
>>>>     key=0x0, value=0x33ca40ff58 "<unprintable binary data>"...) at ParallelWorker.c:664
>>>> #5  0x0000000000418905 in ParallelWorker_handleParallel (self=0x62f270)
>>>> at ParallelWorker.c:719
>>>> #6  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62f270) at
>>>> ParallelWorker.c:504
>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>
>>>> -----Thread 4
>>>> [Switching to thread 4 (Thread 1094719824 (LWP 29477))]#0
>>>> 0x00000033ca40d34b in read () from /lib64/libpthread.so.0
>>>> (gdb) bt
>>>> #0  0x00000033ca40d34b in read () from /lib64/libpthread.so.0
>>>> --->>#1  0x00002aaaab3db84a in PMIU_readline () from
>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> --->>#2  0x00002aaaab3d9d37 in PMI_Spawn_multiple () from
>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #3  0x00002aaaab333893 in MPIDI_Comm_spawn_multiple () from
>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #4  0x00002aaaab38bcf6 in MPID_Comm_spawn_multiple () from
>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #5  0x00002aaaab355a10 in PMPI_Comm_spawn () from
>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #6  0x00000000004189d8 in ParallelWorker_handleParallel (self=0x62ad40)
>>>> at ParallelWorker.c:754
>>>> #7  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ad40) at
>>>> ParallelWorker.c:504
>>>> #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>> #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>
>>>> I also tried running against MPICH2 v1.0.7; there I get a similar
>>>> scenario, but it only shows up after 1-2 hours of execution,
>>>> see below:
>>>>
>>>> ----- thread 2
>>>> [Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
>>>> (gdb) bt
>>>> #0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
>>>> #1  0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #2  0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #3  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #4  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #5  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #6  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819
>>>> #7  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515
>>>> #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>> #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>
>>>> ----- thread 3
>>>> [Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>> (gdb) bt
>>>> #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>> #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819
>>>> #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515
>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>
>>>>
>>>> ----- thread 4
>>>> [Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>> (gdb) bt
>>>> #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>> #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>> #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819
>>>> #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515
>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>
>>>> Here thread 2 is blocked in poll() and never returns, so it never signals
>>>> the completion of the poll, and all the other threads waiting on the
>>>> MPIDI_CH3I_Progress() condition variable never wake up.
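>>>>
>>>> As an illustration of that deadlock shape (this is NOT MPICH2 source,
>>>> just a sketch of the single-poller/condition-variable pattern the
>>>> backtraces suggest):
>>>>
>>>> #include <poll.h>
>>>> #include <pthread.h>
>>>>
>>>> static pthread_mutex_t progress_mutex = PTHREAD_MUTEX_INITIALIZER;
>>>> static pthread_cond_t  progress_cond  = PTHREAD_COND_INITIALIZER;
>>>> static int             poller_active  = 0;
>>>>
>>>> void progress_wait(struct pollfd *fds, nfds_t nfds)
>>>> {
>>>>     pthread_mutex_lock(&progress_mutex);
>>>>     if (!poller_active) {
>>>>         /* this thread becomes the poller for everybody */
>>>>         poller_active = 1;
>>>>         pthread_mutex_unlock(&progress_mutex);
>>>>
>>>>         poll(fds, nfds, -1);      /* blocks until an event arrives */
>>>>
>>>>         pthread_mutex_lock(&progress_mutex);
>>>>         poller_active = 0;
>>>>         pthread_cond_broadcast(&progress_cond);  /* wake the waiters */
>>>>     } else {
>>>>         /* another thread is already polling: wait to be woken;
>>>>          * if its poll() never returns, this wait never ends */
>>>>         while (poller_active)
>>>>             pthread_cond_wait(&progress_cond, &progress_mutex);
>>>>     }
>>>>     pthread_mutex_unlock(&progress_mutex);
>>>> }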
>>>>
>>>> Is anyone else having the same problem?
>>>>
>>>> Thanks in advance,
>>>> Roberto Fichera.
>>>>
>>>>
>>>>         
>>>
>>>       
>>     
>
>
>   

