[MPICH] RE: ***LOOPING MAIL*** mpich 1.2.6 appears to hang during Isend
Rajeev Thakur
thakur at mcs.anl.gov
Thu Feb 23 11:08:44 CST 2006
Can you try the latest version, 1.2.7p1, or better still, MPICH2 1.0.3? If
it still hangs, send us a small test program.
Rajeev
> -----Original Message-----
> From: Benjamin Rutt [mailto:ruttREMOVE_THIS_SPAM_TAG at bmi.osu.edu]
> Sent: Thursday, February 23, 2006 9:26 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: ***LOOPING MAIL*** mpich 1.2.6 appears to hang during Isend
>
> I have a 15-node job using mpich 1.2.6 on an amd64 system, with sockets
> for communication. Rank 2 appears to be hung trying to send a small
> (160-byte) packet to rank 13. It seems to use the eager mechanism for
> this tiny packet. What is mpich trying to do in the following stack
> trace (it looks like 'accept' from the socket API)? It seems that it
> shouldn't block forever on an Isend, right? Or is this some hardware
> issue? (This is a new cluster.)
>
> Thanks!
>
> (gdb) down
> #9 0x000000000041f046 in deliver_elsewhere (p=0x888550,
> outreq=@0x7fffffa6df80, outreq_pending_state=@0x7fffffa6df60)
> at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:665
> 665 checkrc(MPI_Isend(p->internal_hdr,
> (gdb) list
> 660 // send via MPI, convert to MPI rank system
> 661 outrank = p->address.to_rank - cluster_offset;
> 662 }
> 663 MPI_Request * reqhdr = new MPI_Request;
> 664 p->to_bytearray();
> 665 checkrc(MPI_Isend(p->internal_hdr,
> 666 DCMPI_PACKET_HEADER_SIZE, MPI_CHAR, outrank,
> 667 DCMPI_MPI_TAG, MPI_COMM_WORLD, reqhdr));
> 668 if (p->body == NULL) {
> 669 outreq.push_back(reqhdr);
> (gdb) print outrank
> $1 = 13
> (gdb) where
> #0 0x000000399410b2ef in __accept_nocancel () from
> /lib64/tls/libpthread.so.0
> #1 0x000000000047ed06 in net_accept ()
> #2 0x0000000000481894 in request_connection ()
> #3 0x000000000048153e in establish_connection ()
> #4 0x0000000000491b49 in send_message ()
> #5 0x00000000004985b7 in MPID_CH_Eagerb_isend_short ()
> #6 0x0000000000485a81 in MPID_IsendContig ()
> #7 0x0000000000486faf in MPID_IsendDatatype ()
> #8 0x0000000000447b30 in PMPI_Isend ()
> #9 0x000000000041f046 in deliver_elsewhere (p=0x888550,
> outreq=@0x7fffffa6df80, outreq_pending_state=@0x7fffffa6df60)
> at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:665
> #10 0x000000000042113a in dcmpi_mainloop ()
> at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:1072
> #11 0x00000000004249dc in execute ()
> at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:1631
> #12 0x0000000000425883 in main (argc=19, argv=0x6e2720)
> at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:1800
> (gdb)
> --
> Benjamin
>
>
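Frames #0 through #4 of the quoted trace show send_message() calling
establish_connection(), request_connection(), and net_accept(), which is
sitting in accept(). One plausible reading is that the first message to
rank 13 triggered connection setup and the sender is waiting for the peer
to connect back. The sketch below is not MPICH code. It uses plain POSIX
sockets and two hypothetical helpers (open_loopback_listener and
accept_with_timeout, both invented for illustration) to show the general
point: a bare accept() waits for as long as no peer connects, while polling
the listening socket first puts a bound on the wait.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>

// Hypothetical helper: open a TCP listening socket on 127.0.0.1 with a
// kernel-assigned port. Returns the descriptor, or -1 on failure.
int open_loopback_listener() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port

    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: wait at most timeout_ms for a peer to connect.
// Returns the accepted descriptor, -1 on timeout, or -2 on error.
// Calling accept() directly on a blocking socket has no such bound.
int accept_with_timeout(int listen_fd, int timeout_ms) {
    pollfd pfd;
    pfd.fd = listen_fd;
    pfd.events = POLLIN;  // a listening socket becomes readable when a
    pfd.revents = 0;      // connection is waiting to be accepted

    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0) return -1;  // nobody connected in time
    if (ready < 0) return -2;   // poll itself failed

    int fd = accept(listen_fd, nullptr, nullptr);
    return fd < 0 ? -2 : fd;
}
```

With no peer connecting, accept_with_timeout(fd, 50) returns -1 after
roughly 50 ms, whereas accept(fd, nullptr, nullptr) on the same socket
would not return. If rank 13 never connects back (for example because of a
firewall or name-resolution problem on a new cluster, which is only a
guess here), an unbounded accept inside the send path would look like the
hang in the trace.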