[mpich-discuss] crash in MPICH2 1.3.1

Blair, David dblair at akamai.com
Thu Jan 6 14:57:48 CST 2011


Can anyone help me understand the following crash in MPICH2 1.3.1, running on Ubuntu Linux with the Nemesis channel?  The application has a few threads and initializes MPI with MPI_THREAD_MULTIPLE.  Most of the application just does MPI_Isend/MPI_Irecv.  There is a progress thread that essentially sits in an MPI_Waitsome loop, and a third thread that periodically wakes the progress thread by completing a generalized request.
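
For context, the MPI initialization looks roughly like this (simplified, error handling omitted):

    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    /* The app requires full thread support and aborts otherwise. */
    if (provided != MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);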

In case it matters, I built it 64-bit using --disable-fc --enable-g=dbg and --disable-fast.

Here is the thread that suffers the SIGSEGV.  As you can see, this
thread is in MPI_Waitsome, waiting on an array of requests.  The last
request in the array is the generalized request that the other thread
is interested in (see the second backtrace below).
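
For reference, the progress loop has roughly this shape (simplified; query_fn/free_fn/cancel_fn stand in for our trivial grequest callbacks, and the request count is illustrative):

    #include <mpi.h>

    #define NREQ 5
    MPI_Request reqs[NREQ];   /* reqs[0..NREQ-2] are MPI_Isend/MPI_Irecv requests */
    int indices[NREQ], outcount;
    MPI_Status statuses[NREQ];

    /* The last slot is the generalized request the wakeup thread completes. */
    MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &reqs[NREQ - 1]);

    for (;;) {
        MPI_Waitsome(NREQ, reqs, &outcount, indices, statuses);
        /* Dispatch completed sends/recvs.  If the grequest fired,
           MPI_Waitsome has freed it, so start a fresh one before looping. */
    }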

(gdb) display/i $pc
1: x/i $pc
=> 0xdc7ec1 <poll_active_fboxes+212>: mov    0x10(%rax),%rax
(gdb) print $rax
$6 = 0
(gdb) bt
#0  0x0000000000dc7ec1 in poll_active_fboxes (cell=0x7f7f38e0d1e0)
    at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_fbox.h:51
#1  0x0000000000dc7f7e in MPID_nem_mpich2_test_recv (cell=0x7f7f38e0d1e0,
    in_fbox=0x7f7f38e0d210, in_blocking_progress=1)
    at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:741
#2  0x0000000000dc75a5 in MPIDI_CH3I_Progress (progress_state=0x7f7f38e0d340,
    is_blocking=1) at ch3_progress.c:333
#3  0x0000000000dbcb74 in PMPI_Waitsome (incount=5,
    array_of_requests=0x2f87400, outcount=0x2bb4ca8,
    array_of_indices=0x2f87440, array_of_statuses=0x2f8b930) at waitsome.c:255
#4  0x0000000000d21a3d in RuntimeMessageDemuxOperator::onEvent (
    this=0x2bb4c30, port=0x2c84cc0) at dataflow/MessagePassing.cc:173
#5  0x0000000000d275d9 in DataflowScheduler::runOperator (this=0x2c4a000,
    port=...) at dataflow/DataflowRuntime.cc:91
#6  0x0000000000d2799a in DataflowScheduler::run (this=0x2c4a000)
    at dataflow/DataflowRuntime.cc:232
#7  0x0000000000c6f55a in RuntimeProcess::run (s=...)
    at dataflow/RuntimeProcess.cc:424
#8  0x0000000000c84d9d in boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> >::operator()<void (*)(DataflowScheduler&), boost::_bi::list0> (
    this=0x2bfbc78, f=@0x2bfbc70, a=...)
    at boost_1_42_0/boost/bind/bind.hpp:253
#9  0x0000000000c84dda in boost::_bi::bind_t<void, void (*)(DataflowScheduler&), boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> > >::operator()
    (this=0x2bfbc70) at boost_1_42_0/boost/bind/bind_template.hpp:20
#10 0x0000000000c84df8 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(DataflowScheduler&), boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> > > >::run (this=0x2bfbb40)
    at boost_1_42_0/boost/thread/detail/thread.hpp:56
#11 0x0000000000c6487f in thread_proxy (param=0x2bfbb40)
    at boost_1_42_0/libs/thread/src/pthread/thread.cpp:120
#12 0x00007f7f3aa199ca in start_thread () from /lib/libpthread.so.0
#13 0x00007f7f399a370d in clone () from /lib/libc.so.6
#14 0x0000000000000000 in ?? ()

This thread is calling MPI_Request_get_status to check whether the
generalized request has completed (if not, it will call MPI_Grequest_complete).
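
The wakeup path boils down to this (again simplified; wakeup_req is the same generalized request that sits at the end of the progress thread's array):

    int flag;
    MPI_Status status;

    /* Non-blocking check: has the generalized request already completed? */
    MPI_Request_get_status(wakeup_req, &flag, &status);
    if (!flag)
        /* Mark it complete, which pops the other thread out of MPI_Waitsome. */
        MPI_Grequest_complete(wakeup_req);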

(gdb) bt
#0  poll_active_fboxes (cell=0x7f7f37e0b7f0)
    at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_fbox.h:43
#1  0x0000000000dc7f7e in MPID_nem_mpich2_test_recv (cell=0x7f7f37e0b7f0,
    in_fbox=0x7f7f37e0b820, in_blocking_progress=0)
    at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:741
#2  0x0000000000dc75a5 in MPIDI_CH3I_Progress (progress_state=0x0,
    is_blocking=0) at ch3_progress.c:333
#3  0x0000000000dba6c2 in PMPI_Request_get_status (request=-1409286144,
    flag=0x7f7f37e0b928, status=0x7f7f37e0b8f0) at request_get_status.c:110
#4  0x0000000000d2265a in RuntimeMessageDemuxOperator::wakeupTimer (
    this=0x2bb4c30) at dataflow/MessagePassing.cc:88
#5  0x0000000000d2327c in boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>::operator() (this=0x2c2f810, p=0x2bb4c30)
    at boost_1_42_0/boost/bind/mem_fn_template.hpp:49
#6  0x0000000000d232e9 in boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> >::operator()<boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list0> (this=0x2c2f820, f=..., a=...)
    at boost_1_42_0/boost/bind/bind.hpp:253
#7  0x0000000000d23326 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> > >::operator() (this=0x2c2f810)
    at boost_1_42_0/boost/bind/bind_template.hpp:20
#8  0x0000000000d23344 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> > > >::run (this=0x2c2f6e0)
    at boost_1_42_0/boost/thread/detail/thread.hpp:56
#9  0x0000000000c6487f in thread_proxy (param=0x2c2f6e0)
    at boost_1_42_0/libs/thread/src/pthread/thread.cpp:120
#10 0x00007f7f3aa199ca in start_thread () from /lib/libpthread.so.0
#11 0x00007f7f399a370d in clone () from /lib/libc.so.6
#12 0x0000000000000000 in ?? ()
