[mpich2-dev] Need a hint in debugging a problem that only affects a few machines in our cluster.

Rajeev Thakur thakur at mcs.anl.gov
Tue Jul 14 16:49:47 CDT 2009


Mike,

Since this is related to MVAPICH2, I have forwarded your note to the
mvapich-discuss at cse.ohio-state.edu mailing list.
 
Rajeev


  _____  

From: mpich2-dev-bounces at mcs.anl.gov
[mailto:mpich2-dev-bounces at mcs.anl.gov] On Behalf Of Mike Heinz
Sent: Tuesday, July 14, 2009 4:04 PM
To: mpich2-dev at mcs.anl.gov
Cc: Todd Rimmer
Subject: [mpich2-dev] Need a hint in debugging a problem that only
affects a few machines in our cluster.



We're having a very odd problem with our fabric: out of the entire
cluster, machine "A" can't run mvapich2 programs with machine "B", and
machine "C" can't run them with machine "D", even though "A" can run
with "D" and "B" can run with "C". The rest of the fabric works fine.

 

1)      There are no IB errors anywhere on the fabric that I can find,
and the machines in question all work correctly with mvapich1 and
low-level IB tests.

2)      The problem occurs whether using mpd or rsh.

3)      If I attach a debugger to the running processes, both machines
appear to be waiting for a read operation to complete (see the stack
traces below).

 

Can anyone make a suggestion on how to debug this? 

 

Stack trace for node 0:

 

#0  0x000000361160abb5 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00002aaaab08fb6c in mthca_poll_cq (ibcq=0x2060980, ne=1,
    wc=0x7fff9d835900) at src/cq.c:468
#2  0x00002aaaaab5d8d8 in MPIDI_CH3I_MRAILI_Cq_poll (
    vbuf_handle=0x7fff9d8359d8, vc_req=0x0, receiving=0, is_blocking=1)
    at /usr/include/infiniband/verbs.h:934
#3  0x00002aaaaab177fa in MPIDI_CH3I_read_progress (vc_pptr=0x7fff9d8359e0,
    v_ptr=0x7fff9d8359d8, is_blocking=1) at ch3_read_progress.c:143
#4  0x00002aaaaab17464 in MPIDI_CH3I_Progress (is_blocking=1,
    state=<value optimized out>) at ch3_progress.c:202
#5  0x00002aaaaab5bc4e in MPIC_Wait (request_ptr=0x2aaaaae19800)
    at helper_fns.c:269
#6  0x00002aaaaab5c043 in MPIC_Sendrecv (sendbuf=0x217fc50, sendcount=2,
    sendtype=1275069445, dest=1, sendtag=7, recvbuf=0x217fc58, recvcount=2,
    recvtype=1275069445, source=1, recvtag=7, comm=1140850688,
    status=0x7fff9d835b60) at helper_fns.c:125
#7  0x00002aaaaaafe387 in MPIR_Allgather (sendbuf=<value optimized out>,
    sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0x217fc50,
    recvcount=2, recvtype=1275069445, comm_ptr=0x2aaaaae1c1e0)
    at allgather.c:192
#8  0x00002aaaaaafeff9 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
    sendcount=2, sendtype=1275069445, recvbuf=0x217fc50, recvcount=2,
    recvtype=1275069445, comm=1140850688) at allgather.c:866
#9  0x00002aaaaab3b00b in PMPI_Comm_split (comm=1140850688, color=0, key=0,
    newcomm=0x2aaaaae1c2f4) at comm_split.c:196
#10 0x00002aaaaab3cd84 in create_2level_comm (comm=1140850688, size=2,
    my_rank=<value optimized out>) at create_2level_comm.c:142
#11 0x00002aaaaab6877d in PMPI_Init (argc=0x7fff9d835e7c, argv=0x7fff9d835e70)
    at init.c:146
#12 0x0000000000400b2f in main (argc=3, argv=0x7fff9d835fb8) at bw.c:27

 

Stack trace for node 1:

 

#0  0x00002ac3cbdac2d2 in MPIDI_CH3I_read_progress (vc_pptr=0x7fffdee81020,
    v_ptr=0x7fffdee81018, is_blocking=1) at ch3_read_progress.c:143
#1  0x00002ac3cbdabf44 in MPIDI_CH3I_Progress (is_blocking=1,
    state=<value optimized out>) at ch3_progress.c:202
#2  0x00002ac3cbdf060e in MPIC_Wait (request_ptr=0x2ac3cbfae2a0)
    at helper_fns.c:269
#3  0x00002ac3cbdf0a03 in MPIC_Sendrecv (sendbuf=0xf79028, sendcount=2,
    sendtype=1275069445, dest=0, sendtag=7, recvbuf=0xf79020, recvcount=4,
    recvtype=1275069445, source=0, recvtag=7, comm=1140850688,
    status=0x7fffdee811a0) at helper_fns.c:125
#4  0x00002ac3cbd92ddb in MPIR_Allgather (sendbuf=<value optimized out>,
    sendcount=<value optimized out>, sendtype=1275069445, recvbuf=0xf79020,
    recvcount=2, recvtype=1275069445, comm_ptr=0x2ac3cbfb0c80)
    at allgather.c:192
#5  0x00002ac3cbd93a45 in PMPI_Allgather (sendbuf=0xffffffffffffffff,
    sendcount=2, sendtype=1275069445, recvbuf=0xf79020, recvcount=2,
    recvtype=1275069445, comm=1140850688) at allgather.c:866
#6  0x00002ac3cbdcf91b in PMPI_Comm_split (comm=1140850688, color=1, key=0,
    newcomm=0x2ac3cbfb0d94) at comm_split.c:196
#7  0x00002ac3cbdd18f4 in create_2level_comm (comm=1140850688, size=2,
    my_rank=<value optimized out>) at create_2level_comm.c:142
#8  0x00002ac3cbdfd0a5 in PMPI_Init (argc=0x7fffdee814bc, argv=0x7fffdee814b0)
    at init.c:146
#9  0x0000000000400bcf in main (argc=3, argv=0x7fffdee815f8) at bw.c:27

--

Michael Heinz

Principal Engineer, Qlogic Corporation

King of Prussia, Pennsylvania
