[mpich-discuss] crash in MPICH2 1.3.1

Dave Goodell goodell at mcs.anl.gov
Thu Jan 6 15:06:05 CST 2011


Are both of these stack traces from the same stopped point in gdb?  That is, can I assume that both threads are in these call stacks simultaneously?

If so, then we have a bug in MPICH2 somewhere (missing mutex lock?) because two threads in the same process should never be in MPIDI_CH3I_Progress at the same time (for nemesis, at least).  Indeed, a quick check of the MPI_Request_get_status implementation shows no MPIU_THREAD_CS_ENTER/EXIT macros in use...
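
For reference, other top-level MPI functions in that tree bracket their bodies with these macros before they can reach the progress engine.  Roughly what the missing guard would look like in request_get_status.c is sketched below; the "ALLFUNC" granularity and the body are my paraphrase, not the actual source:

    /* Sketch only: the MPIU_THREAD_CS_ENTER/EXIT pattern that appears to be
     * missing from MPI_Request_get_status.  The granularity name and the body
     * are assumptions based on how other top-level functions are written. */
    int MPI_Request_get_status(MPI_Request request, int *flag, MPI_Status *status)
    {
        int mpi_errno = MPI_SUCCESS;

        MPIU_THREAD_CS_ENTER(ALLFUNC,);   /* serialize with other MPI calls */

        /* ... convert the handle, poke the progress engine, test completion ... */

        MPIU_THREAD_CS_EXIT(ALLFUNC,);    /* matching release before returning */
        return mpi_errno;
    }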

-Dave

On Jan 6, 2011, at 2:57 PM CST, Blair, David wrote:

> Can anyone help me understand the following crash in MPICH2 1.3.1, running on Ubuntu Linux and using the Nemesis channel?  The application has a couple of threads and initializes MPI with MPI_THREAD_MULTIPLE.  Most of the application is just doing MPI_Isend/MPI_Irecv.  There is a progress thread that essentially sits in an MPI_Waitsome loop.  A third thread periodically wakes up the progress thread by completing a generalized request.
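>
> The wake-up mechanism boils down to roughly the following (a trimmed-down
> sketch with made-up names, not my actual code):
>
>     /* Wake-up sketch: one thread blocks in MPI_Waitsome on an array whose
>      * last entry is a generalized request; another thread completes that
>      * request to wake it up. */
>     #include <mpi.h>
>     #include <pthread.h>
>
>     static int query_fn(void *extra_state, MPI_Status *status)
>     {
>         /* The request carries no data; report an empty, uncancelled status. */
>         MPI_Status_set_elements(status, MPI_BYTE, 0);
>         MPI_Status_set_cancelled(status, 0);
>         status->MPI_SOURCE = MPI_UNDEFINED;
>         status->MPI_TAG = MPI_UNDEFINED;
>         return MPI_SUCCESS;
>     }
>     static int free_fn(void *extra_state) { return MPI_SUCCESS; }
>     static int cancel_fn(void *extra_state, int complete) { return MPI_SUCCESS; }
>
>     static MPI_Request wakeup_req;
>
>     /* "Timer" thread: completes the generalized request so the progress
>      * thread's MPI_Waitsome returns. */
>     static void *wakeup_thread(void *arg)
>     {
>         MPI_Grequest_complete(wakeup_req);
>         return NULL;
>     }
>
>     int main(int argc, char **argv)
>     {
>         int provided;
>         MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>
>         MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &wakeup_req);
>
>         pthread_t t;
>         pthread_create(&t, NULL, wakeup_thread, NULL);
>
>         /* Progress thread: in the real application the array also holds
>          * MPI_Isend/MPI_Irecv requests; the generalized request is last. */
>         MPI_Request reqs[1] = { wakeup_req };
>         int outcount, indices[1];
>         MPI_Status statuses[1];
>         MPI_Waitsome(1, reqs, &outcount, indices, statuses);
>
>         pthread_join(t, NULL);
>         MPI_Finalize();
>         return 0;
>     }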
> 
> In case it matters, I built 64-bit using --disable-fc --enable-g=dbg and --disable-fast.
> 
> Here is the thread that suffers the SIGSEGV.  As you can see this
> thread is in MPI_Waitsome waiting on an array of requests.  The last
> request in the array is a generalized request that the other thread is
> interested in (see below).
> 
> (gdb) display/i $pc
> 1: x/i $pc
> => 0xdc7ec1 <poll_active_fboxes+212>:	mov    0x10(%rax),%rax
> (gdb) print $rax
> $6 = 0
> (gdb) bt
> #0  0x0000000000dc7ec1 in poll_active_fboxes (cell=0x7f7f38e0d1e0)
>     at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_fbox.h:51
> #1  0x0000000000dc7f7e in MPID_nem_mpich2_test_recv (cell=0x7f7f38e0d1e0, 
>     in_fbox=0x7f7f38e0d210, in_blocking_progress=1)
>     at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:741
> #2  0x0000000000dc75a5 in MPIDI_CH3I_Progress (progress_state=0x7f7f38e0d340, 
>     is_blocking=1) at ch3_progress.c:333
> #3  0x0000000000dbcb74 in PMPI_Waitsome (incount=5, 
>     array_of_requests=0x2f87400, outcount=0x2bb4ca8, 
>     array_of_indices=0x2f87440, array_of_statuses=0x2f8b930) at waitsome.c:255
> #4  0x0000000000d21a3d in RuntimeMessageDemuxOperator::onEvent (
>     this=0x2bb4c30, port=0x2c84cc0) at dataflow/MessagePassing.cc:173
> #5  0x0000000000d275d9 in DataflowScheduler::runOperator (this=0x2c4a000, 
>     port=...) at dataflow/DataflowRuntime.cc:91
> #6  0x0000000000d2799a in DataflowScheduler::run (this=0x2c4a000)
>     at dataflow/DataflowRuntime.cc:232
> #7  0x0000000000c6f55a in RuntimeProcess::run (s=...)
>     at dataflow/RuntimeProcess.cc:424
> #8  0x0000000000c84d9d in boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> >::operator()<void (*)(DataflowScheduler&), boost::_bi::list0> (
>     this=0x2bfbc78, f=@0x2bfbc70, a=...)
>     at boost_1_42_0/boost/bind/bind.hpp:253
> #9  0x0000000000c84dda in boost::_bi::bind_t<void, void (*)(DataflowScheduler&), boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> > >::operator()
>     (this=0x2bfbc70) at boost_1_42_0/boost/bind/bind_template.hpp:20
> #10 0x0000000000c84df8 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(DataflowScheduler&), boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> > > >::run (this=0x2bfbb40)
>     at boost_1_42_0/boost/thread/detail/thread.hpp:56
> #11 0x0000000000c6487f in thread_proxy (param=0x2bfbb40)
>     at boost_1_42_0/libs/thread/src/pthread/thread.cpp:120
> #12 0x00007f7f3aa199ca in start_thread () from /lib/libpthread.so.0
> #13 0x00007f7f399a370d in clone () from /lib/libc.so.6
> #14 0x0000000000000000 in ?? ()
> 
> This thread is calling MPI_Request_get_status to check whether
> a generalized request has completed (if not, it will call MPI_Grequest_complete).
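>
> Continuing the sketch from above, that check-then-complete step is just
> (illustrative names, not the real code):
>
>     int flag;
>     MPI_Status st;
>     MPI_Request_get_status(wakeup_req, &flag, &st);
>     if (!flag) {
>         /* Not yet complete: complete it so the MPI_Waitsome in the
>          * progress thread returns. */
>         MPI_Grequest_complete(wakeup_req);
>     }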
> 
> (gdb) bt
> #0  poll_active_fboxes (cell=0x7f7f37e0b7f0)
>     at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_fbox.h:43
> #1  0x0000000000dc7f7e in MPID_nem_mpich2_test_recv (cell=0x7f7f37e0b7f0, 
>     in_fbox=0x7f7f37e0b820, in_blocking_progress=0)
>     at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:741
> #2  0x0000000000dc75a5 in MPIDI_CH3I_Progress (progress_state=0x0, 
>     is_blocking=0) at ch3_progress.c:333
> #3  0x0000000000dba6c2 in PMPI_Request_get_status (request=-1409286144, 
>     flag=0x7f7f37e0b928, status=0x7f7f37e0b8f0) at request_get_status.c:110
> #4  0x0000000000d2265a in RuntimeMessageDemuxOperator::wakeupTimer (
>     this=0x2bb4c30) at dataflow/MessagePassing.cc:88
> #5  0x0000000000d2327c in boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>::operator() (this=0x2c2f810, p=0x2bb4c30)
>     at boost_1_42_0/boost/bind/mem_fn_template.hpp:49
> #6  0x0000000000d232e9 in boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> >::operator()<boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list0> (this=0x2c2f820, f=..., a=...)
>     at boost_1_42_0/boost/bind/bind.hpp:253
> #7  0x0000000000d23326 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> > >::operator() (this=0x2c2f810)
>     at boost_1_42_0/boost/bind/bind_template.hpp:20
> #8  0x0000000000d23344 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> > > >::run (this=0x2c2f6e0)
>     at boost_1_42_0/boost/thread/detail/thread.hpp:56
> #9  0x0000000000c6487f in thread_proxy (param=0x2c2f6e0)
>     at boost_1_42_0/libs/thread/src/pthread/thread.cpp:120
> #10 0x00007f7f3aa199ca in start_thread () from /lib/libpthread.so.0
> #11 0x00007f7f399a370d in clone () from /lib/libc.so.6
> #12 0x0000000000000000 in ?? ()
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


