[mpich-discuss] crash in MPICH2 1.3.1
Dave Goodell
goodell at mcs.anl.gov
Thu Jan 6 15:06:05 CST 2011
Are both of these stack traces from the same stopped point in gdb? That is, can I assume that both threads are in these call stacks simultaneously?
If so, then we have a bug in MPICH2 somewhere (a missing mutex lock?), because two threads in the same process should never be in MPIDI_CH3I_Progress at the same time (for nemesis, at least). Indeed, a quick check of the MPI_Request_get_status code shows no MPIU_THREAD_CS_ENTER/EXIT macros in use...
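To make the race concrete: every MPI entry point that can drive the progress engine is supposed to hold the same global critical section before poking it. Conceptually the difference looks like this (an illustration with made-up names, not MPICH2's actual source):

    #include <pthread.h>

    static pthread_mutex_t allfunc_cs = PTHREAD_MUTEX_INITIALIZER;

    /* stand-in for MPIDI_CH3I_Progress: touches shared state and is not
       internally thread-safe */
    static void progress_engine(void) { }

    static void waitsome_like_path(void)      /* guarded, as MPI_Waitsome is */
    {
        pthread_mutex_lock(&allfunc_cs);      /* MPIU_THREAD_CS_ENTER */
        progress_engine();
        pthread_mutex_unlock(&allfunc_cs);    /* MPIU_THREAD_CS_EXIT */
    }

    static void get_status_like_path(void)    /* unguarded: the suspected bug */
    {
        progress_engine();                    /* no critical section held */
    }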
-Dave
On Jan 6, 2011, at 2:57 PM CST, Blair, David wrote:
> Can anyone help me understand the following crash in MPICH2 1.3.1, running on Ubuntu Linux and using the Nemesis channel? The application has a couple of threads and initializes MPI with MPI_THREAD_MULTIPLE. Most of the application just does MPI_Isend/MPI_Irecv. There is a progress thread that essentially sits in an MPI_Waitsome loop, and a third thread periodically wakes up the progress thread by means of a generalized request.
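>
> In case it is useful, the wakeup machinery is essentially the textbook
> generalized-request pattern. A minimal sketch (placeholder names, not the
> actual application code):
>
>     #include <stddef.h>
>     #include <mpi.h>
>
>     /* Callbacks for the wakeup request: report an empty status; there is
>        nothing real to free or cancel. */
>     static int wake_query(void *extra_state, MPI_Status *status)
>     {
>         MPI_Status_set_elements(status, MPI_BYTE, 0);
>         MPI_Status_set_cancelled(status, 0);
>         status->MPI_SOURCE = MPI_UNDEFINED;
>         status->MPI_TAG    = MPI_UNDEFINED;
>         return MPI_SUCCESS;
>     }
>     static int wake_free(void *extra_state)                 { return MPI_SUCCESS; }
>     static int wake_cancel(void *extra_state, int complete) { return MPI_SUCCESS; }
>
>     static MPI_Request wakeup_req;   /* shared by the progress and waker threads */
>
>     static void init_wakeup(int *argc, char ***argv)
>     {
>         int provided;
>         MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided);
>         MPI_Grequest_start(wake_query, wake_free, wake_cancel, NULL, &wakeup_req);
>     }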
>
> In case it matters, I built 64-bit using --disable-fc --enable-g=dbg and --disable-fast.
>
> Here is the thread that suffers the SIGSEGV. As you can see, this
> thread is in MPI_Waitsome, waiting on an array of requests. The last
> request in the array is a generalized request that the other thread is
> interested in (see below).
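>
> In other words, the progress loop looks roughly like this (continuing the
> sketch above; the real loop also reposts MPI_Isend/MPI_Irecv requests):
>
>     static void progress_loop(void)
>     {
>         MPI_Request reqs[5] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL,
>                                 MPI_REQUEST_NULL, MPI_REQUEST_NULL,
>                                 MPI_REQUEST_NULL };
>         int         outcount, indices[5];
>         MPI_Status  statuses[5];
>
>         for (;;) {
>             /* reqs[0..3] hold outstanding Isend/Irecv requests; the wakeup
>                request always rides in the last slot */
>             reqs[4] = wakeup_req;
>             MPI_Waitsome(5, reqs, &outcount, indices, statuses);
>             /* if the wakeup request completed, MPI_Waitsome freed it, so
>                it has to be re-armed with another MPI_Grequest_start */
>         }
>     }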
>
> (gdb) display/i $pc
> 1: x/i $pc
> => 0xdc7ec1 <poll_active_fboxes+212>: mov 0x10(%rax),%rax
> (gdb) print $rax
> $6 = 0
> (gdb) bt
> #0 0x0000000000dc7ec1 in poll_active_fboxes (cell=0x7f7f38e0d1e0)
> at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_fbox.h:51
> #1 0x0000000000dc7f7e in MPID_nem_mpich2_test_recv (cell=0x7f7f38e0d1e0,
> in_fbox=0x7f7f38e0d210, in_blocking_progress=1)
> at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:741
> #2 0x0000000000dc75a5 in MPIDI_CH3I_Progress (progress_state=0x7f7f38e0d340,
> is_blocking=1) at ch3_progress.c:333
> #3 0x0000000000dbcb74 in PMPI_Waitsome (incount=5,
> array_of_requests=0x2f87400, outcount=0x2bb4ca8,
> array_of_indices=0x2f87440, array_of_statuses=0x2f8b930) at waitsome.c:255
> #4 0x0000000000d21a3d in RuntimeMessageDemuxOperator::onEvent (
> this=0x2bb4c30, port=0x2c84cc0) at dataflow/MessagePassing.cc:173
> #5 0x0000000000d275d9 in DataflowScheduler::runOperator (this=0x2c4a000,
> port=...) at dataflow/DataflowRuntime.cc:91
> #6 0x0000000000d2799a in DataflowScheduler::run (this=0x2c4a000)
> at dataflow/DataflowRuntime.cc:232
> #7 0x0000000000c6f55a in RuntimeProcess::run (s=...)
> at dataflow/RuntimeProcess.cc:424
> #8 0x0000000000c84d9d in boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> >::operator()<void (*)(DataflowScheduler&), boost::_bi::list0> (
> this=0x2bfbc78, f=@0x2bfbc70, a=...)
> at boost_1_42_0/boost/bind/bind.hpp:253
> #9 0x0000000000c84dda in boost::_bi::bind_t<void, void (*)(DataflowScheduler&), boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> > >::operator()
> (this=0x2bfbc70) at boost_1_42_0/boost/bind/bind_template.hpp:20
> #10 0x0000000000c84df8 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(DataflowScheduler&), boost::_bi::list1<boost::reference_wrapper<DataflowScheduler> > > >::run (this=0x2bfbb40)
> at boost_1_42_0/boost/thread/detail/thread.hpp:56
> #11 0x0000000000c6487f in thread_proxy (param=0x2bfbb40)
> at boost_1_42_0/libs/thread/src/pthread/thread.cpp:120
> #12 0x00007f7f3aa199ca in start_thread () from /lib/libpthread.so.0
> #13 0x00007f7f399a370d in clone () from /lib/libc.so.6
> #14 0x0000000000000000 in ?? ()
>
> This thread is calling MPI_Request_get_status to check whether the
> generalized request has completed (if not, it will call MPI_Grequest_complete).
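>
> i.e., something along these lines (sketch again):
>
>     /* Waker thread: runs periodically; completing the generalized request
>        knocks the progress thread out of MPI_Waitsome. */
>     static void wake_progress_thread(void)
>     {
>         int flag = 0;
>         MPI_Status status;
>
>         MPI_Request_get_status(wakeup_req, &flag, &status);
>         if (!flag)
>             MPI_Grequest_complete(wakeup_req);
>     }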
>
> (gdb) bt
> #0 poll_active_fboxes (cell=0x7f7f37e0b7f0)
> at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_fbox.h:43
> #1 0x0000000000dc7f7e in MPID_nem_mpich2_test_recv (cell=0x7f7f37e0b7f0,
> in_fbox=0x7f7f37e0b820, in_blocking_progress=0)
> at /home/dblair/insight/mpich2-1.3.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:741
> #2 0x0000000000dc75a5 in MPIDI_CH3I_Progress (progress_state=0x0,
> is_blocking=0) at ch3_progress.c:333
> #3 0x0000000000dba6c2 in PMPI_Request_get_status (request=-1409286144,
> flag=0x7f7f37e0b928, status=0x7f7f37e0b8f0) at request_get_status.c:110
> #4 0x0000000000d2265a in RuntimeMessageDemuxOperator::wakeupTimer (
> this=0x2bb4c30) at dataflow/MessagePassing.cc:88
> #5 0x0000000000d2327c in boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>::operator() (this=0x2c2f810, p=0x2bb4c30)
> at boost_1_42_0/boost/bind/mem_fn_template.hpp:49
> #6 0x0000000000d232e9 in boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> >::operator()<boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list0> (this=0x2c2f820, f=..., a=...)
> at boost_1_42_0/boost/bind/bind.hpp:253
> #7 0x0000000000d23326 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> > >::operator() (this=0x2c2f810)
> at boost_1_42_0/boost/bind/bind_template.hpp:20
> #8 0x0000000000d23344 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, RuntimeMessageDemuxOperator>, boost::_bi::list1<boost::_bi::value<RuntimeMessageDemuxOperator*> > > >::run (this=0x2c2f6e0)
> at boost_1_42_0/boost/thread/detail/thread.hpp:56
> #9 0x0000000000c6487f in thread_proxy (param=0x2c2f6e0)
> at boost_1_42_0/libs/thread/src/pthread/thread.cpp:120
> #10 0x00007f7f3aa199ca in start_thread () from /lib/libpthread.so.0
> #11 0x00007f7f399a370d in clone () from /lib/libc.so.6
> #12 0x0000000000000000 in ?? ()
>