[mpich2-dev] [mvapich-discuss] Need a hint in debugging a problem that only affects a few machines in our cluster.
Mike Heinz
michael.heinz at qlogic.com
Wed Jul 15 14:02:03 CDT 2009
Krishna, thanks for the suggestion - but setting MV2_USE_SHMEM_COLL to zero did not seem to change the stack trace much:
Node 0:
0x00002aaaaab5d8b7 in MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x7fffcb46d698,
vc_req=0x0, receiving=0, is_blocking=1) at ibv_channel_manager.c:529
529 for (; i < rdma_num_hcas; ++i) {
(gdb) where
#0 0x00002aaaaab5d8b7 in MPIDI_CH3I_MRAILI_Cq_poll (
vbuf_handle=0x7fffcb46d698, vc_req=0x0, receiving=0, is_blocking=1)
at ibv_channel_manager.c:529
#1 0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fffcb46d6a0,
v_ptr=0x7fffcb46d698, is_blocking=1) at ch3_read_progress.c:143
#2 0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
state=<value optimized out>) at ch3_progress.c:202
#3 0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
at helper_fns.c:269
#4 0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x10993a80, sendcount=2,
sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x10993a88, recvcount=2,
recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
status=0x7fffcb46d820) at helper_fns.c:125
#5 0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x10993a80,
recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
at allgather.c:192
#6 0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
sendcount=2, sendtype=1275069445, recvbuf=0x10993a80, recvcount=2,
recvtype=1275069445, comm=1140850688) at allgather.c:866
#7 0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
newcomm=0x2aaaaae1c2f4) at comm_split.c:196
#8 0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
my_rank=<value optimized out>) at create_2level_comm.c:142
#9 0x00002aaaaab6877d in PMPI_Init (argc=0x7fffcb46db3c, argv=0x7fffcb46db30)
at init.c:146
#10 0x0000000000400b2f in main (argc=3, argv=0x7fffcb46dc78) at bw.c:27
Node 1:
MPIDI_CH3I_read_progress (vc_pptr=0x7fff0b10bb50, v_ptr=0x7fff0b10bb48,
is_blocking=1) at ch3_read_progress.c:143
143 type = MPIDI_CH3I_MRAILI_Cq_poll(v_ptr, NULL, 0, is_blocking);
(gdb) where
#0 MPIDI_CH3I_read_progress (vc_pptr=0x7fff0b10bb50, v_ptr=0x7fff0b10bb48,
is_blocking=1) at ch3_read_progress.c:143
#1 0x00002afc9fb21f44 in MPIDI_CH3I_Progress (is_blocking=1,
state=<value optimized out>) at ch3_progress.c:202
#2 0x00002afc9fb6660e in MPIC_Wait (request_ptr=0x2afc9fd242a0)
at helper_fns.c:269
#3 0x00002afc9fb66a03 in MPIC_Sendrecv (sendbuf=0xf77028, sendcount=2,
sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf77020, recvcount=4,
recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
status=0x7fff0b10bcd0) at helper_fns.c:125
#4 0x00002afc9fb08ddb in MPIR_Allgather (sendbuf=<value optimized out>,
sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf77020,
recvcount=2, recvtype=1275069445, comm_ptr=0x2afc9fd26c80)
at allgather.c:192
#5 0x00002afc9fb09a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
sendcount=2, sendtype=1275069445, recvbuf=0xf77020, recvcount=2,
recvtype=1275069445, comm=1140850688) at allgather.c:866
#6 0x00002afc9fb4591b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
newcomm=0x2afc9fd26d94) at comm_split.c:196
#7 0x00002afc9fb478f4 in create_2level_comm (comm=1140850688, size=2,
my_rank=<value optimized out>) at create_2level_comm.c:142
#8 0x00002afc9fb730a5 in PMPI_Init (argc=0x7fff0b10bfec, argv=0x7fff0b10bfe0)
at init.c:146
#9 0x0000000000400bcf in main (argc=3, argv=0x7fff0b10c128) at bw.c:27
Any suggestions would be appreciated.
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania
From: kris.c1986 at gmail.com [mailto:kris.c1986 at gmail.com] On Behalf Of Krishna Chaitanya
Sent: Tuesday, July 14, 2009 6:39 PM
To: Mike Heinz
Cc: Todd Rimmer; mvapich-discuss at cse.ohio-state.edu; mpich2-dev at mcs.anl.gov
Subject: Re: [mvapich-discuss] [mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.
Mike,
The hang seems to be occurring while the MPI library is creating the 2-level communicator during the init phase. Can you try running the test with MV2_USE_SHMEM_COLL=0 (see http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-16000011.74)? This will ensure that a flat communicator is used for the subsequent MPI calls, which might help us isolate the problem.
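For a quick test, the variable can be passed at launch time. A rough sketch (the host names and the ./bw binary are placeholders for the actual job; exact launcher syntax may differ by MVAPICH2 version):

```shell
# Disable shared-memory collectives for this run only,
# passing the variable on the mpirun_rsh command line:
mpirun_rsh -np 2 hostA hostB MV2_USE_SHMEM_COLL=0 ./bw

# Or, when launching through mpd with mpiexec:
mpiexec -np 2 -env MV2_USE_SHMEM_COLL 0 ./bw
```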
Thanks,
Krishna
On Tue, Jul 14, 2009 at 5:04 PM, Mike Heinz <michael.heinz at qlogic.com> wrote:
We're having a very odd problem with our fabric: out of the entire cluster, machine "A" can't run mvapich2 programs with machine "B", and machine "C" can't run programs with machine "D", even though "A" can run with "D" and "B" can run with "C", and the rest of the fabric works fine.
1) There are no IB errors anywhere on the fabric that I can find, and the machines in question all work correctly with mvapich1 and low-level IB tests.
2) The problem occurs whether using mpd or rsh.
3) If I attach to the running processes, both machines appear to be waiting for a read operation to complete. (See below)
Can anyone make a suggestion on how to debug this?
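For reference, the traces below were captured by attaching gdb to the running processes, roughly along these lines (replace <pid> with the process id of the bw process on each node):

```shell
# Attach to a hung rank and dump its stack non-interactively.
gdb -batch -p <pid> -ex "where"
```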
Stack trace for node 0:
#0 0x000000361160abb5 in pthread_spin_lock () from /lib64/libpthread.so.0
#1 0x00002aaaab08fb6c in mthca_poll_cq (ibcq=0x2060980, ne=1,
wc=0x7fff9d835900) at src/cq.c:468
#2 0x00002aaaaab5d8d8 in MPIDI_CH3I_MRAILI_Cq_poll (
vbuf_handle=0x7fff9d8359d8, vc_req=0x0, receiving=0, is_blocking=1)
at /usr/include/infiniband/verbs.h:934
#3 0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fff9d8359e0,
v_ptr=0x7fff9d8359d8, is_blocking=1) at ch3_read_progress.c:143
#4 0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
state=<value optimized out>) at ch3_progress.c:202
#5 0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
at helper_fns.c:269
#6 0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x217fc50, sendcount=2,
sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x217fc58, recvcount=2,
recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
status=0x7fff9d835b60) at helper_fns.c:125
#7 0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x217fc50,
recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
at allgather.c:192
#8 0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
sendcount=2, sendtype=1275069445, recvbuf=0x217fc50, recvcount=2,
recvtype=1275069445, comm=1140850688) at allgather.c:866
#9 0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
newcomm=0x2aaaaae1c2f4) at comm_split.c:196
#10 0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
my_rank=<value optimized out>) at create_2level_comm.c:142
#11 0x00002aaaaab6877d in PMPI_Init (argc=0x7fff9d835e7c, argv=0x7fff9d835e70)
at init.c:146
#12 0x0000000000400b2f in main (argc=3, argv=0x7fff9d835fb8) at bw.c:27
Stack trace for node 1:
#0 0x00002ac3cbdac2d2 in MPIDI_CH3I_read_progress (vc_pptr=0x7fffdee81020,
v_ptr=0x7fffdee81018, is_blocking=1) at ch3_read_progress.c:143
#1 0x00002ac3cbdabf44 in MPIDI_CH3I_Progress (is_blocking=1,
state=<value optimized out>) at ch3_progress.c:202
#2 0x00002ac3cbdf060e in MPIC_Wait (request_ptr=0x2ac3cbfae2a0)
at helper_fns.c:269
#3 0x00002ac3cbdf0a03 in MPIC_Sendrecv (sendbuf=0xf79028, sendcount=2,
sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf79020, recvcount=4,
recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
status=0x7fffdee811a0) at helper_fns.c:125
#4 0x00002ac3cbd92ddb in MPIR_Allgather (sendbuf=<value optimized out>,
sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf79020,
recvcount=2, recvtype=1275069445, comm_ptr=0x2ac3cbfb0c80)
at allgather.c:192
#5 0x00002ac3cbd93a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
sendcount=2, sendtype=1275069445, recvbuf=0xf79020, recvcount=2,
recvtype=1275069445, comm=1140850688) at allgather.c:866
#6 0x00002ac3cbdcf91b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
newcomm=0x2ac3cbfb0d94) at comm_split.c:196
#7 0x00002ac3cbdd18f4 in create_2level_comm (comm=1140850688, size=2,
my_rank=<value optimized out>) at create_2level_comm.c:142
#8 0x00002ac3cbdfd0a5 in PMPI_Init (argc=0x7fffdee814bc, argv=0x7fffdee814b0)
at init.c:146
#9 0x0000000000400bcf in main (argc=3, argv=0x7fffdee815f8) at bw.c:27
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania
--
In the middle of difficulty, lies opportunity