[mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.

Mike Heinz michael.heinz at qlogic.com
Tue Jul 14 16:04:29 CDT 2009


We're having a very odd problem with our fabric: out of the entire cluster, machine "A" can't run mvapich2 programs with machine "B", and machine "C" can't run programs with machine "D" - even though "A" can run with "D", "B" can run with "C", and the rest of the fabric works fine.


1)      I can't find any IB errors anywhere on the fabric, and the machines in question all work correctly with mvapich1 and with low-level IB tests.

2)      The problem occurs whether using mpd or rsh.

3)      If I attach gdb to the running processes, both machines appear to be waiting for a read operation to complete. (See the stack traces below.)

Can anyone suggest how to debug this?
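
For reference, the stack traces below show the hang is inside MPI_Init itself, so even a trivial program never gets going on a bad pair. Here's a minimal sketch (not our actual bw.c) of the kind of two-rank test involved:

/* Minimal two-rank test (a sketch, not our actual bw.c). Per the
 * stack traces below, the hang is inside MPI_Init itself, so on a
 * bad host pair this presumably never reaches the printf. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);   /* both ranks block in here on a bad pair */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d: init completed\n", rank, size);
    MPI_Finalize();
    return 0;
}

Launched with one rank on each machine of a suspect pair (e.g. mpiexec -n 2 with a host file naming just "A" and "B"), this should show the same symptom; on a good pair it should complete immediately.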

Stack trace for node 0:

#0  0x000000361160abb5 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00002aaaab08fb6c in mthca_poll_cq (ibcq=0x2060980, ne=1,
    wc=0x7fff9d835900) at src/cq.c:468
#2  0x00002aaaaab5d8d8 in MPIDI_CH3I_MRAILI_Cq_poll (
    vbuf_handle=0x7fff9d8359d8, vc_req=0x0, receiving=0, is_blocking=1)
    at /usr/include/infiniband/verbs.h:934
#3  0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fff9d8359e0,
    v_ptr=0x7fff9d8359d8, is_blocking=1) at ch3_read_progress.c:143
#4  0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
    state=<value optimized out>) at ch3_progress.c:202
#5  0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
    at helper_fns.c:269
#6  0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x217fc50, sendcount=2,
    sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x217fc58, recvcount=2,
    recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
    status=0x7fff9d835b60) at helper_fns.c:125
#7  0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
    sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x217fc50,
    recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
    at allgather.c:192
#8  0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
    sendcount=2, sendtype=1275069445, recvbuf=0x217fc50, recvcount=2,
    recvtype=1275069445, comm=1140850688) at allgather.c:866
#9  0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
    newcomm=0x2aaaaae1c2f4) at comm_split.c:196
#10 0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
    my_rank=<value optimized out>) at create_2level_comm.c:142
#11 0x00002aaaaab6877d in PMPI_Init (argc=0x7fff9d835e7c, argv=0x7fff9d835e70)
    at init.c:146
#12 0x0000000000400b2f in main (argc=3, argv=0x7fff9d835fb8) at bw.c:27

Stack trace for node 1:

#0  0x00002ac3cbdac2d2 in MPIDI_CH3I_read_progress (vc_pptr=0x7fffdee81020,
    v_ptr=0x7fffdee81018, is_blocking=1) at ch3_read_progress.c:143
#1  0x00002ac3cbdabf44 in MPIDI_CH3I_Progress (is_blocking=1,
    state=<value optimized out>) at ch3_progress.c:202
#2  0x00002ac3cbdf060e in MPIC_Wait (request_ptr=0x2ac3cbfae2a0)
    at helper_fns.c:269
#3  0x00002ac3cbdf0a03 in MPIC_Sendrecv (sendbuf=0xf79028, sendcount=2,
    sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf79020, recvcount=4,
    recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
    status=0x7fffdee811a0) at helper_fns.c:125
#4  0x00002ac3cbd92ddb in MPIR_Allgather (sendbuf=<value optimized out>,
    sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf79020,
    recvcount=2, recvtype=1275069445, comm_ptr=0x2ac3cbfb0c80)
    at allgather.c:192
#5  0x00002ac3cbd93a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
    sendcount=2, sendtype=1275069445, recvbuf=0xf79020, recvcount=2,
    recvtype=1275069445, comm=1140850688) at allgather.c:866
#6  0x00002ac3cbdcf91b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
    newcomm=0x2ac3cbfb0d94) at comm_split.c:196
#7  0x00002ac3cbdd18f4 in create_2level_comm (comm=1140850688, size=2,
    my_rank=<value optimized out>) at create_2level_comm.c:142
#8  0x00002ac3cbdfd0a5 in PMPI_Init (argc=0x7fffdee814bc, argv=0x7fffdee814b0)
    at init.c:146
#9  0x0000000000400bcf in main (argc=3, argv=0x7fffdee815f8) at bw.c:27
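
Reading the two traces together: each rank is blocked in the MPI_Allgather that MPI_Comm_split performs internally (sendcount=2 of what appears to be MPI_INT - presumably the color/key pair), called from create_2level_comm during MPI_Init. On pairs that do work, the same collective path can be exercised directly with a sketch like this (colors 0/1, as in the traces above):

/* Sketch exercising the call pattern the traces show:
 * MPI_Comm_split does an internal Allgather over the parent
 * communicator to exchange each rank's (color, key). This only
 * runs where MPI_Init completes, i.e. on healthy pairs. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm split;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Colors 0 and 1, matching the color= values in the traces. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, 0, &split);

    printf("rank %d: MPI_Comm_split completed\n", rank);
    MPI_Comm_free(&split);
    MPI_Finalize();
    return 0;
}

If the split itself ever stalled between a bad pair outside of MPI_Init, that would point at the underlying channel rather than the 2-level-comm setup - but so far the hang only shows up during init.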
--
Michael Heinz
Principal Engineer, QLogic Corporation
King of Prussia, Pennsylvania