[mpich-discuss] Re: [mvapich-discuss] Races with MPI_THREAD_MULTI

Dhabaleswar Panda panda at cse.ohio-state.edu
Sat Jul 26 22:36:20 CDT 2008


Roberto - Thanks for sending us the code. We will take a look at it.

Thanks,

DK

On Fri, 25 Jul 2008, Roberto Fichera wrote:

> Dhabaleswar Panda ha scritto:
> > Hi Roberto,
> >
> > We have done several rounds of checks and do not see any difference
> > between MPICH2 1.0.7 and the TCP/IP interface of MVAPICH2 1.2. Both these
> > should perform exactly the same. We are continuing our investigation.
> >
> > We are wondering whether you can send us a sample code piece to reproduce
> > the problem you are indicating across these two interfaces.  This will
> > help us to debug this problem faster and help you to solve your problem.
> >
> I've added other CCs to this email; other people may be interested in
> having a look.
>
> Attached is the test program I'm working on to reproduce the problem.
> I'm not completely sure it works perfectly, since I was never able to
> complete an execution, but please let me know if I did something wrong
> in the code. The testmaster is quite simple: you must provide the number
> of jobs to simulate (say 50000) and the node file that the resource
> manager provides for its schedule. The node that matches the master is
> excluded from the slave nodes.
>
> The testmaster creates a ring of threads from the assigned nodes.
> Walking the ring, a thread is started for each free node it finds, so
> you end up with as many threads as assigned nodes, all working
> concurrently. To simulate some work, each thread internally generates a
> random integer, sets some MPI_Info keys (host and wdir), spawns the
> testslave job, sends it the generated random number, and waits for the
> testslave to receive the number and send it back; the sent and received
> numbers are compared in order to verify their coherency. The slave then
> issues an empty MPI_Send() to signal its termination, the thread calls
> MPI_Comm_disconnect() to close the slave connection, and finally all the
> MPI_Info objects are freed. At that point the thread terminates.
> When the requested number of jobs has been worked out, the application
> should terminate ... but without cleaning up (too tired, sorry ;-), so
> it just waits a bit and finalizes MPI.
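> For clarity, here is a minimal sketch of what one thread does for one
> job (this is not the exact code from testmaster.c; names, looping and
> error handling differ):
>
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     /* one job: spawn a testslave on the given node and echo a number */
>     static void run_one_job(const char *host, const char *wdir)
>     {
>         MPI_Info info;
>         MPI_Comm slave;
>         int sent = rand(), recvd = -1, errcode;
>
>         MPI_Info_create(&info);
>         MPI_Info_set(info, "host", host);   /* target node       */
>         MPI_Info_set(info, "wdir", wdir);   /* working directory */
>
>         MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 1, info, 0,
>                        MPI_COMM_SELF, &slave, &errcode);
>
>         MPI_Send(&sent, 1, MPI_INT, 0, 0, slave);  /* send random number   */
>         MPI_Recv(&recvd, 1, MPI_INT, 0, 0, slave,
>                  MPI_STATUS_IGNORE);               /* slave echoes it back */
>         if (recvd != sent)
>             fprintf(stderr, "coherency check failed\n");
>
>         MPI_Recv(NULL, 0, MPI_INT, 0, 1, slave,
>                  MPI_STATUS_IGNORE);               /* empty termination msg */
>
>         MPI_Comm_disconnect(&slave);
>         MPI_Info_free(&info);
>     }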
>
> So far, I haven't been able to complete any execution. The application
> still crashes with the backtrace you find below. Only once was I able to
> reach 3500 jobs, but one thread was stuck on a mutex.
> Looking at the backtrace you can see the same race I'm getting in my
> applications.
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1087666512 (LWP 18231)]
> 0x00000000006a3902 in MPIDI_PG_Dup_vcr () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> Missing separate debuginfos, use: debuginfo-install glibc.x86_64
> (gdb) info threads
>   29 Thread 1121462608 (LWP 18232)  0x0000003465a0a8f9 in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> * 28 Thread 1087666512 (LWP 18231)  0x00000000006a3902 in
> MPIDI_PG_Dup_vcr () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
>   27 Thread 1142442320 (LWP 18230)  0x0000003464ecbd66 in poll () from
> /lib64/libc.so.6
>   26 Thread 1098156368 (LWP 18229)  0x0000003464e9ac61 in nanosleep ()
> from /lib64/libc.so.6
>   1 Thread 140135980537584 (LWP 18029)  main (argc=3,
> argv=0x7ffffb5992d8) at testmaster.c:437
>
> (gdb) bt
> #0  0x00000000006a3902 in MPIDI_PG_Dup_vcr () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #1  0x0000000000668012 in SetupNewIntercomm () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #2  0x00000000006682c8 in MPIDI_Comm_accept () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #3  0x00000000006a6617 in MPID_Comm_accept () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #4  0x000000000065ec5f in MPIDI_Comm_spawn_multiple () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #5  0x00000000006a17e6 in MPID_Comm_spawn_multiple () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #6  0x00000000006783fd in PMPI_Comm_spawn () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #7  0x00000000004017de in NodeThread_threadMain (arg=0x120a790) at
> testmaster.c:314
> #8  0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0
> #9  0x0000003464ed4b0d in clone () from /lib64/libc.so.6
> (gdb) thread 29
>
> [Switching to thread 29 (Thread 1121462608 (LWP 18232))]#0
> 0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000000065e2e7 in MPIDI_CH3I_Progress () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #2  0x00000000006675ca in FreeNewVC () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #3  0x0000000000668302 in MPIDI_Comm_accept () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #4  0x00000000006a6617 in MPID_Comm_accept () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #5  0x000000000065ec5f in MPIDI_Comm_spawn_multiple () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #6  0x00000000006a17e6 in MPID_Comm_spawn_multiple () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #7  0x00000000006783fd in PMPI_Comm_spawn () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #8  0x00000000004017de in NodeThread_threadMain (arg=0x120d590) at
> testmaster.c:314
> #9  0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0
> #10 0x0000003464ed4b0d in clone () from /lib64/libc.so.6
> (gdb) thread 27
>
> [Switching to thread 27 (Thread 1142442320 (LWP 18230))]#0
> 0x0000003464ecbd66 in poll () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003464ecbd66 in poll () from /lib64/libc.so.6
> #1  0x00000000006d63bf in MPIDU_Sock_wait () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #2  0x000000000065e1e7 in MPIDI_CH3I_Progress () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #3  0x00000000006cf87c in PMPI_Send () from
> /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> #4  0x0000000000401831 in NodeThread_threadMain (arg=0x120a6f0) at
> testmaster.c:480
> #5  0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0
> #6  0x0000003464ed4b0d in clone () from /lib64/libc.so.6
>
> (gdb) thread 26
> [Switching to thread 26 (Thread 1098156368 (LWP 18229))]#0
> 0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6
> #1  0x0000003464e9aa84 in sleep () from /lib64/libc.so.6
> #2  0x000000000040197c in NodeThread_threadMain (arg=0x120d630) at
> testmaster.c:505
> #3  0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0
> #4  0x0000003464ed4b0d in clone () from /lib64/libc.so.6
> (gdb)
>
> > Thanks,
> >
> > DK
> >
> > On Tue, 22 Jul 2008, Roberto Fichera wrote:
> >
> >
> >> Roberto Fichera ha scritto:
> >>
> >>> Dhabaleswar Panda ha scritto:
> >>>
> >>>> Hi Roberto,
> >>>>
> >>>> Thanks for your note. You are using the ch3:sock device in MVAPICH2 which
> >>>> is the same as MPICH2. You are also seeing similar failure scenarios (but
> >>>> in different forms) with MPICH2 1.0.7. I am cc'ing this message to mpich2
> >>>> mailing list. One of the MPICH2 developers will be able to extend help on
> >>>> this issue faster.
> >>>>
> >>>>
> >>> Thanks for that. About the MPICH2 problem, I already sent an email
> >>> regarding the related issue.
> >>> The strange thing is that when linking against MPICH2 the race does
> >>> not show up nearly as fast as with
> >>> MVAPICH2; in the MPICH2 case I had to wait 1 or 2 hours before the lock-up.
> >>>
> >> Just an update on the problem. After replacing all the MPI_Send()
> >> calls with MPI_Ssend(), everything
> >> seems to work well with MPICH2 v1.0.7. My application no longer races,
> >> at least after dispatching
> >> 50,000 jobs across 4 nodes, but when running the same application
> >> against the latest MVAPICH2 1.2rc1
> >> I'm still getting the same problem, as shown below.
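> >> In other words, the change was switching to synchronous-mode sends,
> >> roughly like this (a sketch; the real calls in my code differ):
> >>
> >>     /* before: may return as soon as the message is buffered locally */
> >>     MPI_Send(&number, 1, MPI_INT, 0, 0, slave_comm);
> >>
> >>     /* after: does not complete until the matching receive is posted */
> >>     MPI_Ssend(&number, 1, MPI_INT, 0, 0, slave_comm);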
> >>
> >> I have another question: since this multithreaded application has to
> >> run on a cluster of 1024 nodes equipped
> >> with Mellanox IB cards, I'd really like to know whether the OpenFabrics-IB
> >> interface supports MPI_THREAD_MULTIPLE
> >> initialization and also implements MPI_Comm_spawn().
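> >> (As a side note, an easy way to check at runtime what level a given
> >> interface actually provides is to look at the value returned by
> >> MPI_Init_thread(); a minimal sketch:)
> >>
> >>     int provided;
> >>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> >>     if (provided < MPI_THREAD_MULTIPLE) {
> >>         fprintf(stderr, "no MPI_THREAD_MULTIPLE, got level %d\n", provided);
> >>         MPI_Abort(MPI_COMM_WORLD, 1);
> >>     }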
> >>
> >> Thanks a lot for the feedback.
> >>
> >>>> Thanks,
> >>>>
> >>>> DK
> >>>>
> >>>>
> >>>> On Fri, 18 Jul 2008, Roberto Fichera wrote:
> >>>>
> >>>>
> >>>>
> >>>>> Hi All on the list,
> >>>>>
> >>>>> I'm trying to use MVAPICH2 v1.2rc1 in a multithreaded application,
> >>>>> initialized with MPI_THREAD_MULTIPLE.
> >>>>> The master application does the following: it starts several
> >>>>> threads depending on the assigned nodes,
> >>>>> and on each node a slave application is spawned using MPI_Comm_spawn().
> >>>>> Before calling
> >>>>> MPI_Comm_spawn() I prepare an MPI_Info struct, one for each
> >>>>> thread, setting the keys
> >>>>> (host and wdir) needed to obtain the wanted behaviour. As soon as
> >>>>> the master application starts, it races
> >>>>> immediately with 4 nodes, 1 master and 3 slaves. Below you can see the
> >>>>> status of the master application at race
> >>>>> time. It seems stuck in PMIU_readline(), which never returns, so the
> >>>>> global lock is never released. MVAPICH2
> >>>>> is compiled with:
> >>>>>
> >>>>> PKG_PATH=/HRI/External/mvapich2/1.2rc1
> >>>>>
> >>>>> ./configure --prefix=$PKG_PATH \
> >>>>>             --bindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \
> >>>>>             --sbindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \
> >>>>>             --libdir=$PKG_PATH/lib/linux-x86_64-gcc-glibc2.3.4 \
> >>>>>             --enable-sharedlibs=gcc \
> >>>>>             --enable-f90 \
> >>>>>             --enable-threads=multiple \
> >>>>>             --enable-g=-ggdb \
> >>>>>             --enable-debuginfo \
> >>>>>             --with-device=ch3:sock \
> >>>>>             --datadir=$PKG_PATH/data  \
> >>>>>             --with-htmldir=$PKG_PATH/doc/html \
> >>>>>             --with-docdir=$PKG_PATH/doc \
> >>>>>             LDFLAGS='-Wl,-z,noexecstack'
> >>>>>
> >>>>> so I'm using the ch3:sock device.
> >>>>>
> >>>>> -----Thread 2
> >>>>> [Switching to thread 2 (Thread 1115699536 (LWP 29479))]#0
> >>>>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
> >>>>> (gdb) bt
> >>>>> #0  0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
> >>>>> #1  0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0
> >>>>> --->>#2  0x00000033ca408390 in pthread_mutex_lock () from
> >>>>> /lib64/libpthread.so.0
> >>>>> --->>#3  0x00002aaaab382654 in PMPI_Info_set () from
> >>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #4  0x0000000000417627 in ParallelWorker_setSlaveInfo (self=<value
> >>>>> optimized out>, key=0x0, value=0x33ca40ff58
> >>>>> "!\204��\r\206��\030\204��3\206��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\177\205��\177\205��\177\205��\177\205��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\033\205��\033\205��\033\205��\033\205��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\033\205��\033\205��"...)
> >>>>> at ParallelWorker.c:664
> >>>>> #5  0x0000000000418905 in ParallelWorker_handleParallel (self=0x62ff50)
> >>>>> at ParallelWorker.c:719
> >>>>> #6  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ff50) at
> >>>>> ParallelWorker.c:504
> >>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
> >>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
> >>>>>
> >>>>> -----Thread 3
> >>>>> [Switching to thread 3 (Thread 1105209680 (LWP 29478))]#0
> >>>>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
> >>>>> (gdb) bt
> >>>>> #0  0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0
> >>>>> #1  0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0
> >>>>> --->>#2  0x00000033ca408390 in pthread_mutex_lock () from
> >>>>> /lib64/libpthread.so.0
> >>>>> --->>#3  0x00002aaaab382654 in PMPI_Info_set () from
> >>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #4  0x0000000000417627 in ParallelWorker_setSlaveInfo (self=<value
> >>>>> optimized out>, key=0x0, value=0x33ca40ff58
> >>>>> "!\204��\r\206��\030\204��3\206��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\177\205��\177\205��\177\205��\177\205��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\033\205��\033\205��\033\205��\033\205��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\n\204��\033\205��\033\205��"...)
> >>>>> at ParallelWorker.c:664
> >>>>> #5  0x0000000000418905 in ParallelWorker_handleParallel (self=0x62f270)
> >>>>> at ParallelWorker.c:719
> >>>>> #6  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62f270) at
> >>>>> ParallelWorker.c:504
> >>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
> >>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
> >>>>>
> >>>>> -----Thread 4
> >>>>> [Switching to thread 4 (Thread 1094719824 (LWP 29477))]#0
> >>>>> 0x00000033ca40d34b in read () from /lib64/libpthread.so.0
> >>>>> (gdb) bt
> >>>>> #0  0x00000033ca40d34b in read () from /lib64/libpthread.so.0
> >>>>> --->>#1  0x00002aaaab3db84a in PMIU_readline () from
> >>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> --->>#2  0x00002aaaab3d9d37 in PMI_Spawn_multiple () from
> >>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #3  0x00002aaaab333893 in MPIDI_Comm_spawn_multiple () from
> >>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #4  0x00002aaaab38bcf6 in MPID_Comm_spawn_multiple () from
> >>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #5  0x00002aaaab355a10 in PMPI_Comm_spawn () from
> >>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #6  0x00000000004189d8 in ParallelWorker_handleParallel (self=0x62ad40)
> >>>>> at ParallelWorker.c:754
> >>>>> #7  0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ad40) at
> >>>>> ParallelWorker.c:504
> >>>>> #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
> >>>>> #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
> >>>>>
> >>>>> I also tried to run against MPICH2 v1.0.7, but there I got a similar
> >>>>> scenario, which shows up after 1 - 2 hours of execution;
> >>>>> see below:
> >>>>>
> >>>>> ----- thread 2
> >>>>> [Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
> >>>>> (gdb) bt
> >>>>> #0  0x00000033c94cbd66 in poll () from /lib64/libc.so.6
> >>>>> #1  0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #2  0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #3  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #4  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #5  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #6  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819
> >>>>> #7  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515
> >>>>> #8  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
> >>>>> #9  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
> >>>>>
> >>>>> ----- thread 3
> >>>>> [Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> >>>>> (gdb) bt
> >>>>> #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> >>>>> #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819
> >>>>> #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515
> >>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
> >>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
> >>>>>
> >>>>>
> >>>>> ----- thread 4
> >>>>> [Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> >>>>> (gdb) bt
> >>>>> #0  0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> >>>>> #1  0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #2  0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #3  0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #4  0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1
> >>>>> #5  0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819
> >>>>> #6  0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515
> >>>>> #7  0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0
> >>>>> #8  0x00000033c94d4b0d in clone () from /lib64/libc.so.6
> >>>>>
> >>>>> where thread 2 is poll()ing and never returns, so it never signals
> >>>>> the poll() completion, and then all the other
> >>>>> waiters on the MPIDI_CH3I_Progress() condition will never wake up.
> >>>>>
> >>>>> Is anyone else having the same problem?
> >>>>>
> >>>>> Thanks in advance,
> >>>>> Roberto Fichera.
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
> >
> >
>
>



