[MPICH] RE: ***LOOPING MAIL*** mpich 1.2.6 appears to hang during Isend

Rajeev Thakur thakur at mcs.anl.gov
Thu Feb 23 11:08:44 CST 2006


Can you try the latest version, 1.2.7p1, or better still, MPICH2 1.0.3? If
it still hangs, send us a small test program.
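
A small test program along these lines (a hypothetical sketch, not the poster's actual code; names and sizes are illustrative) would exercise the same path: one rank posts an MPI_Isend of a 160-byte packet and waits for completion, while the peer posts a matching receive. Compile with mpicc and run with mpirun -np 2; if the connection-establishment path is broken, the hang would show up in MPI_Isend or MPI_Wait.

```c
/* Minimal MPI_Isend reproducer sketch (hypothetical example).
 * Build: mpicc isend_test.c -o isend_test
 * Run:   mpirun -np 2 ./isend_test */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char buf[160];                      /* same size as the packet in the report */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Request req;
        memset(buf, 'x', sizeof(buf));
        /* Small message: should go out via the eager protocol. */
        MPI_Isend(buf, (int)sizeof(buf), MPI_CHAR, 1, 99, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* a hang would surface here or in Isend */
        printf("rank 0: send complete\n");
    } else if (rank == 1) {
        MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1: receive complete\n");
    }

    MPI_Finalize();
    return 0;
}
```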

Rajeev
 

> -----Original Message-----
> From: Benjamin Rutt [mailto:ruttREMOVE_THIS_SPAM_TAG at bmi.osu.edu] 
> Sent: Thursday, February 23, 2006 9:26 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: ***LOOPING MAIL*** mpich 1.2.6 appears to hang during Isend
> 
> I have a 15-node job, mpich 1.2.6, on an amd64 system, using sockets for
> comm.  Rank 2 appears to be hung trying to send a small (160-byte)
> packet to rank 13.  It seems to use the eager mechanism for this tiny
> packet.  What is mpich trying to do in the following stack trace
> (it looks like 'accept' from the socket API)?  It shouldn't be
> blocking forever on an Isend, right?  Or is this a hardware issue?
> (This is a new cluster.)
> 
> Thanks!
> 
> (gdb) down
> #9  0x000000000041f046 in deliver_elsewhere (p=0x888550, 
>     outreq=@0x7fffffa6df80, outreq_pending_state=@0x7fffffa6df60)
>     at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:665
> 665             checkrc(MPI_Isend(p->internal_hdr,
> (gdb) list
> 660                 // send via MPI, convert to MPI rank system
> 661                 outrank = p->address.to_rank - cluster_offset;
> 662             }
> 663             MPI_Request * reqhdr = new MPI_Request;
> 664             p->to_bytearray();
> 665             checkrc(MPI_Isend(p->internal_hdr,
> 666                               DCMPI_PACKET_HEADER_SIZE, MPI_CHAR, outrank,
> 667                               DCMPI_MPI_TAG, MPI_COMM_WORLD, reqhdr));
> 668             if (p->body == NULL) {
> 669                 outreq.push_back(reqhdr);
> (gdb) print outrank
> $1 = 13
> (gdb) where
> #0  0x000000399410b2ef in __accept_nocancel () from /lib64/tls/libpthread.so.0
> #1  0x000000000047ed06 in net_accept ()
> #2  0x0000000000481894 in request_connection ()
> #3  0x000000000048153e in establish_connection ()
> #4  0x0000000000491b49 in send_message ()
> #5  0x00000000004985b7 in MPID_CH_Eagerb_isend_short ()
> #6  0x0000000000485a81 in MPID_IsendContig ()
> #7  0x0000000000486faf in MPID_IsendDatatype ()
> #8  0x0000000000447b30 in PMPI_Isend ()
> #9  0x000000000041f046 in deliver_elsewhere (p=0x888550, 
>     outreq=@0x7fffffa6df80, outreq_pending_state=@0x7fffffa6df60)
>     at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:665
> #10 0x000000000042113a in dcmpi_mainloop ()
>     at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:1072
> #11 0x00000000004249dc in execute ()
>     at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:1631
> #12 0x0000000000425883 in main (argc=19, argv=0x6e2720)
>     at /home/rutt/dev/dcmpi/dcmpi-cvs/src/dcmpiruntime.cpp:1800
> (gdb) 
> -- 
> Benjamin
> 
> 
More information about the mpich-discuss mailing list