[mpich-discuss] crash in MPICH2 1.3.1

Blair, David dblair at akamai.com
Thu Jan 6 16:01:50 CST 2011


Indeed, these are from the same stopped point.  I thought something looked
a bit fishy, but I was wondering if it was further down in the Nemesis
stack.  It sounds like taking the lock is the responsibility of the caller
of the progress engine, so the right fix is to add
MPIU_THREAD_CS_ENTER/EXIT(ALLFUNC,) to MPI_Request_get_status?
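
Something along those lines, copying the ENTER/EXIT pattern from waitsome.c?
Sketch only -- the body below is paraphrased, not the actual contents of
request_get_status.c:

    int MPI_Request_get_status(MPI_Request request, int *flag, MPI_Status *status)
    {
        int mpi_errno = MPI_SUCCESS;
        MPIU_THREAD_CS_ENTER(ALLFUNC,);   /* take the per-process critical section */

        /* ... existing progress poke and completion check ... */

        MPIU_THREAD_CS_EXIT(ALLFUNC,);    /* release before returning */
        return mpi_errno;
    }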

Thanks for the quick response!

On 1/6/11 4:06 PM, "Dave Goodell" <goodell at mcs.anl.gov> wrote:

>Are both of these stack traces from the same stopped point in gdb?  That
>is, can I assume that both threads are in these call stacks
>simultaneously?
>
>If so, then we have a bug in MPICH2 somewhere (missing mutex lock?)
>because two threads in the same process should never be in
>MPIDI_CH3I_Progress at the same time (for nemesis, at least).  Indeed, a
>quick check of MPI_Request_get_status shows no
>MPIU_THREAD_CS_ENTER/EXIT macros in use...
>
>-Dave
>
>On Jan 6, 2011, at 2:57 PM CST, Blair, David wrote:
>
>> Can anyone help me understand the following crash in MPICH2 1.3.1,
>> running on Ubuntu Linux and using the Nemesis channel?  The application
>> has a couple of threads and initializes MPI with MPI_THREAD_MULTIPLE.
>> Most of the application is just doing MPI_Isend/MPI_Irecv.  There is a
>> progress thread that essentially sits in an MPI_Waitsome loop, and a
>> third thread periodically wakes up the progress thread by completing a
>> generalized request.
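>> 
>> In case it helps, the wakeup pattern is roughly this (heavily simplified,
>> and the names here are made up rather than taken from our actual code):
>> 
>>   #include <mpi.h>
>> 
>>   /* trivial generalized-request callbacks */
>>   static int query_fn(void *es, MPI_Status *st) { return MPI_SUCCESS; }
>>   static int free_fn(void *es)                  { return MPI_SUCCESS; }
>>   static int cancel_fn(void *es, int complete)  { return MPI_SUCCESS; }
>> 
>>   static MPI_Request wake_req;  /* the generalized "wakeup" request */
>> 
>>   /* progress thread: waits on the real Isend/Irecv requests plus wake_req
>>      (assumes n <= 16 just to keep the sketch short) */
>>   static void progress_loop(MPI_Request reqs[], int n)
>>   {
>>       int outcount, indices[16];
>>       MPI_Status statuses[16];
>>       MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &wake_req);
>>       reqs[n-1] = wake_req;
>>       MPI_Waitsome(n, reqs, &outcount, indices, statuses);
>>   }
>> 
>>   /* timer thread: if wake_req is still pending, complete it to wake the
>>      waiter; MPI_Request_get_status is where the second trace below sits */
>>   static void wakeup_timer(void)
>>   {
>>       int flag;
>>       MPI_Status st;
>>       MPI_Request_get_status(wake_req, &flag, &st);
>>       if (!flag)
>>           MPI_Grequest_complete(wake_req);
>>   }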
>> 
>> In case it matters, I built 64-bit using --disable-fc --enable-g=dbg and
>> --disable-fast.
>> 
>> Here is the thread that suffers the SIGSEGV.  As you can see, this
>> thread is in MPI_Waitsome waiting on an array of requests.  The last
>> request in the array is a generalized request that the other thread is
>> interested in (see below).  The faulting instruction loads from
>> 0x10(%rax) with %rax equal to 0, i.e. a NULL pointer dereference.
>> 
>> (gdb) display/i $pc
>> 1: x/i $pc
>> => 0xdc7ec1 <poll_active_fboxes+212>:    mov    0x10(%rax),%rax
>> (gdb) print $rax
>> $6 = 0
>> (gdb) bt
>> #0  0x0000000000dc7ec1 in poll_active_fboxes (cell=0x7f7f38e0d1e0)
>>     at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_fbox.h:51
>> #1  0x0000000000dc7f7e in MPID_nem_mpich2_test_recv
>>(cell=0x7f7f38e0d1e0,
>>     in_fbox=0x7f7f38e0d210, in_blocking_progress=1)
>>     at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:741
>> #2  0x0000000000dc75a5 in MPIDI_CH3I_Progress
>>(progress_state=0x7f7f38e0d340,
>>     is_blocking=1) at ch3_progress.c:333
>> #3  0x0000000000dbcb74 in PMPI_Waitsome (incount=5,
>>     array_of_requests=0x2f87400, outcount=0x2bb4ca8,
>>     array_of_indices=0x2f87440, array_of_statuses=0x2f8b930) at
>>waitsome.c:255
>> #4  0x0000000000d21a3d in RuntimeMessageDemuxOperator::onEvent (
>>     this=0x2bb4c30, port=0x2c84cc0) at dataflow/MessagePassing.cc:173
>> #5  0x0000000000d275d9 in DataflowScheduler::runOperator
>>(this=0x2c4a000, 
>>     port=...) at dataflow/DataflowRuntime.cc:91
>> #6  0x0000000000d2799a in DataflowScheduler::run (this=0x2c4a000)
>>     at dataflow/DataflowRuntime.cc:232
>> #7  0x0000000000c6f55a in RuntimeProcess::run (s=...)
>>     at dataflow/RuntimeProcess.cc:424
>> #8  0x0000000000c84d9d in
>>boost::_bi::list1<boost::reference_wrapper<DataflowScheduler>
>>>::operator()<void (*)(DataflowScheduler&), boost::_bi::list0> (
>>     this=0x2bfbc78, f=@0x2bfbc70, a=...)
>>     at boost_1_42_0/boost/bind/bind.hpp:253
>> #9  0x0000000000c84dda in boost::_bi::bind_t<void, void
>>(*)(DataflowScheduler&),
>>boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> >
>>>::operator()
>>     (this=0x2bfbc70) at boost_1_42_0/boost/bind/bind_template.hpp:20
>> #10 0x0000000000c84df8 in
>>boost::detail::thread_data<boost::_bi::bind_t<void, void
>>(*)(DataflowScheduler&),
>>boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> > > >::run
>>(this=0x2bfbb40)
>>     at boost_1_42_0/boost/thread/detail/thread.hpp:56
>> #11 0x0000000000c6487f in thread_proxy (param=0x2bfbb40)
>>     at boost_1_42_0/libs/thread/src/pthread/thread.cpp:120
>> #12 0x00007f7f3aa199ca in start_thread () from /lib/libpthread.so.0
>> #13 0x00007f7f399a370d in clone () from /lib/libc.so.6
>> #14 0x0000000000000000 in ?? ()
>> 
>> This thread is calling MPI_Request_get_status to check whether the
>> generalized request has completed (if not, it will then call
>> MPI_Grequest_complete).
>> 
>> (gdb) bt
>> #0  poll_active_fboxes (cell=0x7f7f37e0b7f0)
>>     at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_fbox.h:43
>> #1  0x0000000000dc7f7e in MPID_nem_mpich2_test_recv
>>(cell=0x7f7f37e0b7f0,
>>     in_fbox=0x7f7f37e0b820, in_blocking_progress=0)
>>     at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:741
>> #2  0x0000000000dc75a5 in MPIDI_CH3I_Progress (progress_state=0x0,
>>     is_blocking=0) at ch3_progress.c:333
>> #3  0x0000000000dba6c2 in PMPI_Request_get_status (request=-1409286144,
>>     flag=0x7f7f37e0b928, status=0x7f7f37e0b8f0) at
>>request_get_status.c:110
>> #4  0x0000000000d2265a in RuntimeMessageDemuxOperator::wakeupTimer (
>>     this=0x2bb4c30) at dataflow/MessagePassing.cc:88
>> #5  0x0000000000d2327c in boost::_mfi::mf0<void,
>>RuntimeMessageDemuxOperator>::operator() (this=0x2c2f810, p=0x2bb4c30)
>>     at boost_1_42_0/boost/bind/mem_fn_template.hpp:49
>> #6  0x0000000000d232e9 in
>>boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*>
>>>::operator()<boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>,
>>>boost::_bi::list0> (this=0x2c2f820, f=..., a=...)
>>     at boost_1_42_0/boost/bind/bind.hpp:253
>> #7  0x0000000000d23326 in boost::_bi::bind_t<void,
>>boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>,
>>boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> >
>>>::operator() (this=0x2c2f810)
>>     at boost_1_42_0/boost/bind/bind_template.hpp:20
>> #8  0x0000000000d23344 in
>>boost::detail::thread_data<boost::_bi::bind_t<void,
>>boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>,
>>boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> > >
>>>::run (this=0x2c2f6e0)
>>     at boost_1_42_0/boost/thread/detail/thread.hpp:56
>> #9  0x0000000000c6487f in thread_proxy (param=0x2c2f6e0)
>>     at boost_1_42_0/libs/thread/src/pthread/thread.cpp:120
>> #10 0x00007f7f3aa199ca in start_thread () from /lib/libpthread.so.0
>> #11 0x00007f7f399a370d in clone () from /lib/libc.so.6
>> #12 0x0000000000000000 in ?? ()
>> 
>
>_______________________________________________
>mpich-discuss mailing list
>mpich-discuss at mcs.anl.gov
>https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
