[mpich-discuss] Re: [mvapich-discuss] Races with MPI_THREAD_MULTI

Roberto Fichera kernel at tekno-soft.it
Fri Aug 29 06:26:54 CDT 2008


Roberto Fichera wrote:

Just an update on this issue: I managed to run this test application 
using the HP-MPI (http://www.hp.com/go/mpi) implementation,
and it seems to work as expected; after ~24h of execution it 
completes all its jobs without any error.

>  Dhabaleswar Panda wrote:
>> Hi Roberto,
>>
>> We have done several rounds of checks and do not see any difference
>> between MPICH2 1.0.7 and the TCP/IP interface of MVAPICH2 1.2. Both these
>> should perform exactly the same. We are continuing our investigation.
>>
>> We are wondering whether you can send us a sample piece of code that
>> reproduces the problem you are seeing across these two interfaces. This
>> will help us debug the problem faster and help you solve it.
>>   
> I've added some other CCs to this email; maybe other people are interested 
> in having a look.
>
> Attached you'll find the test program I'm working on to reproduce the 
> problem. I'm not completely sure it is fully correct, since I was never able 
> to complete an execution, so please let me know if I did something wrong in 
> the code. The testmaster is quite simple: you must provide the number of 
> jobs to simulate (say 50000) and the node file that the resource manager 
> provides for its schedule. The node that matches the master is excluded 
> from the slave nodes.
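>
> For reference, the node-list handling could look something like this (a 
> hypothetical sketch, not the actual testmaster code; the function name and 
> buffer sizes are made up):
>
>     /* Read the resource manager's node file and keep every node except
>      * the one matching the master's own hostname. */
>     #include <stdio.h>
>     #include <string.h>
>     #include <unistd.h>
>
>     #define MAX_NODES 1024
>
>     static int load_slave_nodes(const char *nodefile, char nodes[][256])
>     {
>         char self[256], line[256];
>         int n = 0;
>         FILE *fp = fopen(nodefile, "r");
>
>         gethostname(self, sizeof(self));
>         while (fp && n < MAX_NODES && fgets(line, sizeof(line), fp)) {
>             line[strcspn(line, "\n")] = '\0';   /* strip the newline      */
>             if (strcmp(line, self) != 0)        /* skip the master's node */
>                 strcpy(nodes[n++], line);
>         }
>         if (fp)
>             fclose(fp);
>         return n;                               /* number of slave nodes  */
>     }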
>
> The testmaster creates a ring of threads from the assigned nodes. Walking 
> the ring, a thread is started for each free node it finds, so you should 
> have as many threads as assigned nodes working in multithreading. To 
> simulate some work, each thread internally generates a random integer, 
> sets some MPI_Info keys (host and wdir), spawns the testslave job, sends 
> it the generated random number, and waits for the testslave to receive and 
> send back that number; the sent and received numbers are compared to 
> verify their coherency. The slave then sends an empty MPI_Send() to signal 
> its termination, the thread calls MPI_Comm_disconnect() to close the slave 
> connection, and finally all the MPI_Info objects are freed. At this point 
> the thread terminates.
> When the requested number of jobs has been correctly "worked out", the 
> application should terminate ... but without cleaning up (too tired, 
> sorry ;-), so it just waits a bit and finalizes MPI.
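>
> In outline, each worker thread performs one job cycle roughly like this (a 
> minimal sketch, not the actual testmaster code; the "testslave" path, the 
> tags and the host/wdir values are placeholders, and error handling is 
> omitted):
>
>     /* One job cycle of a worker thread, assuming MPI was initialized
>      * with MPI_THREAD_MULTIPLE. */
>     #include <mpi.h>
>     #include <stdlib.h>
>
>     static void run_one_job(const char *host, const char *wdir)
>     {
>         MPI_Info info;
>         MPI_Comm slave;
>         int sent, echoed, errcode;
>
>         MPI_Info_create(&info);
>         MPI_Info_set(info, "host", host);    /* target node             */
>         MPI_Info_set(info, "wdir", wdir);    /* working directory there */
>
>         /* Spawn one testslave process on the chosen node. */
>         MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 1, info,
>                        0, MPI_COMM_SELF, &slave, &errcode);
>
>         /* Send a random integer and wait for the slave to echo it back. */
>         sent = rand();
>         MPI_Send(&sent, 1, MPI_INT, 0, 0, slave);
>         MPI_Recv(&echoed, 1, MPI_INT, 0, 0, slave, MPI_STATUS_IGNORE);
>         /* sent and echoed are compared here to verify coherency */
>
>         /* The slave signals its termination with an empty message. */
>         MPI_Recv(NULL, 0, MPI_INT, 0, 1, slave, MPI_STATUS_IGNORE);
>
>         MPI_Comm_disconnect(&slave);
>         MPI_Info_free(&info);
>     }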
>
> So far I haven't been able to complete any execution. Currently the 
> application is still crashing with the backtrace you find below. Only once 
> did I manage to reach 3500 jobs, and then one thread got stuck on a mutex. 
> Looking at the backtrace you can see the same race I'm getting in my 
> applications.
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1087666512 (LWP 18231)]
> 0x00000000006a3902 in MPIDI_PG_Dup_vcr () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> Missing separate debuginfos, use: debuginfo-install glibc.x86_64
> (gdb) info threads
>   29 Thread 1121462608 (LWP 18232)  0x0000003465a0a8f9 in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> * 28 Thread 1087666512 (LWP 18231)  0x00000000006a3902 in 
> MPIDI_PG_Dup_vcr () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>   27 Thread 1142442320 (LWP 18230)  0x0000003464ecbd66 in poll () from 
> /lib64/libc.so.6
>   26 Thread 1098156368 (LWP 18229)  0x0000003464e9ac61 in nanosleep () 
> from /lib64/libc.so.6
>   1 Thread 140135980537584 (LWP 18029)  main (argc=3, 
> argv=0x7ffffb5992d8) at testmaster.c:437
>
> (gdb) bt
> #0  0x00000000006a3902 in MPIDI_PG_Dup_vcr () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #1  0x0000000000668012 in SetupNewIntercomm () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #2  0x00000000006682c8 in MPIDI_Comm_accept () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #3  0x00000000006a6617 in MPID_Comm_accept () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #4  0x000000000065ec5f in MPIDI_Comm_spawn_multiple () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #5  0x00000000006a17e6 in MPID_Comm_spawn_multiple () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #6  0x00000000006783fd in PMPI_Comm_spawn () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #7  0x00000000004017de in NodeThread_threadMain (arg=0x120a790) at 
> testmaster.c:314
> #8  0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0
> #9  0x0000003464ed4b0d in clone () from /lib64/libc.so.6
> (gdb) thread 29
>
> [Switching to thread 29 (Thread 1121462608 (LWP 18232))]#0  
> 0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x000000000065e2e7 in MPIDI_CH3I_Progress () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #2  0x00000000006675ca in FreeNewVC () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #3  0x0000000000668302 in MPIDI_Comm_accept () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #4  0x00000000006a6617 in MPID_Comm_accept () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #5  0x000000000065ec5f in MPIDI_Comm_spawn_multiple () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #6  0x00000000006a17e6 in MPID_Comm_spawn_multiple () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #7  0x00000000006783fd in PMPI_Comm_spawn () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #8  0x00000000004017de in NodeThread_threadMain (arg=0x120d590) at 
> testmaster.c:314
> #9  0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0
> #10 0x0000003464ed4b0d in clone () from /lib64/libc.so.6
> (gdb) thread 27
>
> [Switching to thread 27 (Thread 1142442320 (LWP 18230))]#0  
> 0x0000003464ecbd66 in poll () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003464ecbd66 in poll () from /lib64/libc.so.6
> #1  0x00000000006d63bf in MPIDU_Sock_wait () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #2  0x000000000065e1e7 in MPIDI_CH3I_Progress () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #3  0x00000000006cf87c in PMPI_Send () from 
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #4  0x0000000000401831 in NodeThread_threadMain (arg=0x120a6f0) at 
> testmaster.c:480
> #5  0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0
> #6  0x0000003464ed4b0d in clone () from /lib64/libc.so.6
>
> (gdb) thread 26
> [Switching to thread 26 (Thread 1098156368 (LWP 18229))]#0  
> 0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6
> #1  0x0000003464e9aa84 in sleep () from /lib64/libc.so.6
> #2  0x000000000040197c in NodeThread_threadMain (arg=0x120d630) at 
> testmaster.c:505
> #3  0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0
> #4  0x0000003464ed4b0d in clone () from /lib64/libc.so.6
> (gdb)
>
>> Thanks,
>>
>> DK
>>
>> On Tue, 22 Jul 2008, Roberto Fichera wrote:
>>
>>   
>>> Roberto Fichera wrote:
>>>     
>>>> Dhabaleswar Panda wrote:
>>>>       
>>>>> Hi Roberto,
>>>>>
>>>>> Thanks for your note. You are using the ch3:sock device in MVAPICH2, which
>>>>> is the same as in MPICH2. You are also seeing similar failure scenarios (but
>>>>> in different forms) with MPICH2 1.0.7. I am cc'ing this message to the mpich2
>>>>> mailing list. One of the MPICH2 developers will be able to help with this
>>>>> issue faster.
>>>>>
>>>>>         
>>>> Thanks for that. About the mpich2 problem, I already sent an email
>>>> regarding the related issue. The strange thing is that when linking
>>>> against mpich2 the race does not show up as quickly as it does with
>>>> mvapich2: in the mpich2 case I had to wait 1 or 2 hours before the lock.
>>>>       
>>> Just an update about the problem I reported. After replacing all the
>>> MPI_Send() calls with MPI_Ssend(), everything seems to work well with
>>> mpich2 v1.0.7. My application doesn't race anymore, at least after
>>> dispatching 50,000 jobs across 4 nodes, but when running the same
>>> application against the latest mvapich2 1.2rc1 I'm still getting the
>>> same problem shown below.
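>>>
>>> The change itself is trivial; in essence it is (a sketch with placeholder
>>> names, not the real testmaster code):
>>>
>>>     #include <mpi.h>
>>>
>>>     /* A synchronous send does not complete until the matching receive
>>>      * has started on the slave, which seems to be what avoids the race. */
>>>     static void send_job_number(MPI_Comm slave, int value)
>>>     {
>>>         /* was: MPI_Send(&value, 1, MPI_INT, 0, 0, slave); */
>>>         MPI_Ssend(&value, 1, MPI_INT, 0, 0, slave);
>>>     }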
>>>
>>> I have another question: since this multithreaded application has to run
>>> on a cluster of 1024 nodes equipped with Mellanox IB cards, I would really
>>> like to know whether the OpenFabrics-IB interface supports
>>> MPI_THREAD_MULTIPLE initialization and also the MPI_Comm_spawn()
>>> implementation.
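>>>
>>> Independently of the interconnect, the thread level actually granted can
>>> be checked at startup; a minimal sketch (not the production code):
>>>
>>>     #include <mpi.h>
>>>     #include <stdio.h>
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         int provided;
>>>
>>>         /* Request full thread support and check what the library grants. */
>>>         MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>         if (provided < MPI_THREAD_MULTIPLE)
>>>             fprintf(stderr, "only thread level %d granted\n", provided);
>>>
>>>         /* ... spawn slaves and communicate as usual ... */
>>>
>>>         MPI_Finalize();
>>>         return 0;
>>>     }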
>>>
>>> Thanks a lot for the feedback.
>>>     
>>>>> Thanks,
>>>>>
>>>>> DK
>>>>>
>>>>>
>>>>> On Fri, 18 Jul 2008, Roberto Fichera wrote:
>>>>>
>>>>>
>>>>>         
>>>>>> Hi All on the list,
>>>>>>
>>>>>> I'm trying to use mvapich2 v1.2rc1 in a multithreaded application,
>>>>>> initialized with MPI_THREAD_MULTIPLE.
>>>>>> The master application does the following: it starts several threads
>>>>>> depending on the assigned nodes,
>>>>>> and on each node a slave application is spawned using MPI_Comm_spawn().
>>>>>> Before calling
>>>>>> MPI_Comm_spawn() I prepare the corresponding MPI_Info struct, one for each
>>>>>> thread, setting the keys
>>>>>> (host and wdir) needed to obtain the wanted behaviour. As soon as
>>>>>> the master application starts, it races
>>>>>> immediately with 4 nodes, 1 master and 3 slaves. Below you can see the
>>>>>> status of the master application at race
>>>>>> time. It seems stuck in PMIU_readline(), which never returns, so the
>>>>>> global lock is never released. MVAPICH2
>>>>>> is compiled with:
>>>>>>
>>>>>> PKG_PATH=/HRI/External/mvapich2/1.2rc1
>>>>>>
>>>>>> ./configure --prefix=$PKG_PATH \
>>>>>>             --bindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \
>>>>>>             --sbindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \
>>>>>>             --libdir=$PKG_PATH/lib/linux-x86_64-gcc-glibc2.3.4 \
>>>>>>             --enable-sharedlibs=gcc \
>>>>>>             --enable-f90 \
>>>>>>             --enable-threads=multiple \
>>>>>>             --enable-g=-ggdb \
>>>>>>             --enable-debuginfo \
>>>>>>             --with-device=ch3:sock \
>>>>>>             --datadir=$PKG_PATH/data  \
>>>>>>             --with-htmldir=$PKG_PATH/doc/html \
>>>>>>             --with-docdir=$PKG_PATH/doc \
>>>>>>             LDFLAGS='-Wl,-z,noexecstack'
>>>>>>
>>>>>> so I'm using the ch3:sock device.
>>>>>>
>>>>>> -----Thread 2
>>>>>> [Switching to thread 2 (Thread 1115699536 (LWP 29479))]#0
>>>>>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>>>>>> (gdb) bt
>>>>>> #0  0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>>>>>> #1  0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0
>>>>>> --->>#2  0x00000033ca408390 in pthread_mutex_lock () from
>>>>>> /lib64/libpthread.so.0
>>>>>> --->>#3  0x00002aaaab382654 in PMPI_Info_set () from
>>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #4  0x0000000000417627 in ParallelWorker_setSlaveInfo (self=<value
>>>>>> optimized out>, key=0x0, value=0x33ca40ff58
>>>>>> "!\204��\r\206��\030\204��3\206��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\177\205��\177\205��\177\205��\177\205��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\033\205��\033\205��\033\205��\033\205��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\033\205��\033\205��"...)
>>>>>> at ParallelWorker.c:664
>>>>>> #5  0x0000000000418905 in ParallelWorker_handleParallel (self=0x62ff50)
>>>>>> at ParallelWorker.c:719
>>>>>> #6  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ff50) at
>>>>>> ParallelWorker.c:504
>>>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>>>
>>>>>> -----Thread 3
>>>>>> [Switching to thread 3 (Thread 1105209680 (LWP 29478))]#0
>>>>>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>>>>>> (gdb) bt
>>>>>> #0  0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
>>>>>> #1  0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0
>>>>>> --->>#2  0x00000033ca408390 in pthread_mutex_lock () from
>>>>>> /lib64/libpthread.so.0
>>>>>> --->>#3  0x00002aaaab382654 in PMPI_Info_set () from
>>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #4  0x0000000000417627 in ParallelWorker_setSlaveInfo (self=<value
>>>>>> optimized out>, key=0x0, value=0x33ca40ff58
>>>>>> "!\204��\r\206��\030\204��3\206��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\177\205��\177\205��\177\205��\177\205��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\033\205��\033\205��\033\205��\033\205��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\033\205��\033\205��"...)
>>>>>> at ParallelWorker.c:664
>>>>>> #5  0x0000000000418905 in ParallelWorker_handleParallel (self=0x62f270)
>>>>>> at ParallelWorker.c:719
>>>>>> #6  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62f270) at
>>>>>> ParallelWorker.c:504
>>>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>>>
>>>>>> -----Thread 4
>>>>>> [Switching to thread 4 (Thread 1094719824 (LWP 29477))]#0
>>>>>> 0x00000033ca40d34b in read () from /lib64/libpthread.so.0
>>>>>> (gdb) bt
>>>>>> #0  0x00000033ca40d34b in read () from /lib64/libpthread.so.0
>>>>>> --->>#1  0x00002aaaab3db84a in PMIU_readline () from
>>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> --->>#2  0x00002aaaab3d9d37 in PMI_Spawn_multiple () from
>>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #3  0x00002aaaab333893 in MPIDI_Comm_spawn_multiple () from
>>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #4  0x00002aaaab38bcf6 in MPID_Comm_spawn_multiple () from
>>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #5  0x00002aaaab355a10 in PMPI_Comm_spawn () from
>>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #6  0x00000000004189d8 in ParallelWorker_handleParallel (self=0x62ad40)
>>>>>> at ParallelWorker.c:754
>>>>>> #7  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ad40) at
>>>>>> ParallelWorker.c:504
>>>>>> #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>>>> #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>>>
>>>>>> I also tried to run against MPICH2 v1.0.7, but there I got a similar
>>>>>> scenario which shows up after 1 - 2 hours of execution;
>>>>>> see below:
>>>>>>
>>>>>> ----- thread 2
>>>>>> [Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
>>>>>> (gdb) bt
>>>>>> #0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
>>>>>> #1  0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #2  0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #3  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #4  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #5  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #6  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819
>>>>>> #7  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515
>>>>>> #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>>>> #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>>>
>>>>>> ----- thread 3
>>>>>> [Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>> (gdb) bt
>>>>>> #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>> #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819
>>>>>> #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515
>>>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>>>
>>>>>>
>>>>>> ----- thread 4
>>>>>> [Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>> (gdb) bt
>>>>>> #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>> #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>>>>>> #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819
>>>>>> #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515
>>>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
>>>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
>>>>>>
>>>>>> where thread 2 is blocked in poll() and never returns, so it never
>>>>>> signals the poll() completion, and therefore all the other waiters on
>>>>>> the MPIDI_CH3I_Progress() condition variable will never wake up.
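>>>>>>
>>>>>> Schematically the pattern is the following (a generic illustration of a
>>>>>> single-poller progress loop, not the actual MPICH2 source):
>>>>>>
>>>>>>     #include <pthread.h>
>>>>>>     #include <poll.h>
>>>>>>
>>>>>>     static pthread_mutex_t progress_mutex = PTHREAD_MUTEX_INITIALIZER;
>>>>>>     static pthread_cond_t  progress_cond  = PTHREAD_COND_INITIALIZER;
>>>>>>     static int             poller_active  = 0;
>>>>>>
>>>>>>     static void make_progress(struct pollfd *fds, int nfds)
>>>>>>     {
>>>>>>         pthread_mutex_lock(&progress_mutex);
>>>>>>         if (!poller_active) {
>>>>>>             poller_active = 1;           /* become the polling thread */
>>>>>>             pthread_mutex_unlock(&progress_mutex);
>>>>>>
>>>>>>             poll(fds, nfds, -1);         /* if this never returns ... */
>>>>>>
>>>>>>             pthread_mutex_lock(&progress_mutex);
>>>>>>             poller_active = 0;
>>>>>>             pthread_cond_broadcast(&progress_cond); /* ... no wake-up */
>>>>>>         } else {
>>>>>>             /* other threads sleep until the poller reports progress */
>>>>>>             pthread_cond_wait(&progress_cond, &progress_mutex);
>>>>>>         }
>>>>>>         pthread_mutex_unlock(&progress_mutex);
>>>>>>     }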
>>>>>>
>>>>>> Is anyone else having the same problem?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Roberto Fichera.
>>>>>>
>>>>>>
>>>>>>           
>>>>>         
>>>>       
>>>     
>>
>>
>>   
>
>   


