[MPICH] MPI_Sendrecv error: ssh related?

Michele Trenti trenti at stsci.edu
Mon Oct 2 10:35:30 CDT 2006


Hi,

I encountered an MPI_Sendrecv error (see log below) within a well-tested
application (Gadget2, a publicly available cosmological simulation code).
The error appears only when the job becomes so CPU intensive that several
minutes pass without data exchange among the nodes. When message passing is
more frequent everything runs fine, so I consider it unlikely that the
problem is caused by a bug in the application.
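For what it is worth, the failure pattern is easy to describe in isolation.
Below is a minimal sketch (not Gadget2 code; the pairing of ranks, the
15-minute sleep, and the 1 KB message size are just my guesses at the idle
pattern) that I would use to check whether a long gap with no MPI traffic is
enough on its own to trigger the problem:

/* idle_sendrecv.c -- hypothetical reproducer, not part of Gadget2.
 * Exchange a message, stay silent for a long compute-like period,
 * then exchange again.  Compile with mpicc, run with at least 2 ranks. */
#include <mpi.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, peer;
    char sbuf[1024] = {0}, rbuf[1024] = {0};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    peer = (rank % 2 == 0) ? rank + 1 : rank - 1;   /* pair ranks 0-1, 2-3, ... */
    if (peer >= size) { MPI_Finalize(); return 0; } /* odd rank count: last rank drops out */

    /* first exchange: this one always works in my runs */
    MPI_Sendrecv(sbuf, sizeof sbuf, MPI_BYTE, peer, 18,
                 rbuf, sizeof rbuf, MPI_BYTE, peer, 18,
                 MPI_COMM_WORLD, &status);

    /* mimic a long CPU-bound phase with no MPI traffic (duration is a guess) */
    sleep(15 * 60);

    /* second exchange: the kind of call that fails for me after a long silence */
    MPI_Sendrecv(sbuf, sizeof sbuf, MPI_BYTE, peer, 18,
                 rbuf, sizeof rbuf, MPI_BYTE, peer, 18,
                 MPI_COMM_WORLD, &status);

    if (rank == 0) printf("both exchanges completed\n");
    MPI_Finalize();
    return 0;
}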

My guess is that the problem is related to my network settings, which
perhaps close inactive connections (although I have never been dropped from
an open ssh terminal on the nodes for being idle too long).
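In case it helps with the diagnosis, this is the kind of check I was
planning to run on the nodes to look at the kernel TCP keepalive settings
(the /proc paths assume a standard Linux layout; of course the actual
culprit could just as well be a firewall or switch idle timeout that this
does not show):

/* show_keepalive.c -- hypothetical helper, assumes a Linux /proc filesystem.
 * Prints the kernel TCP keepalive parameters. */
#include <stdio.h>

static void show(const char *path)
{
    FILE *f = fopen(path, "r");
    char buf[64];

    if (f == NULL) {
        printf("%s: (not readable)\n", path);
        return;
    }
    if (fgets(buf, sizeof buf, f) != NULL)
        printf("%s: %s", path, buf);   /* value from /proc already ends in '\n' */
    fclose(f);
}

int main(void)
{
    /* idle seconds before probes start, probe interval, number of probes */
    show("/proc/sys/net/ipv4/tcp_keepalive_time");
    show("/proc/sys/net/ipv4/tcp_keepalive_intvl");
    show("/proc/sys/net/ipv4/tcp_keepalive_probes");
    return 0;
}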

Has anyone encountered this problem before, or can anyone offer insight
toward a solution?

I use mpich2-1.0.4 on a Sun Opteron Linux cluster, compiled with gcc 
version 3.4.6 20060404 (Red Hat 3.4.6-3).

Thanks,

Michele

------- Error Log: -------------- 
[cli_10]: aborting job:
Fatal error in MPI_Sendrecv: Other MPI error, error stack:
MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x2a97a984b8, scount=1669696, MPI_BYTE, dest=7, stag=18, rbuf=0x2a988e6b78, rcount=516960, MPI_BYTE, src=7, rtag=18, MPI_COMM_WORLD, status=0x7fbfffee00) failed
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(608):
MPIDU_Socki_handle_pollhup(439)...........: connection closed by peer (set=0,sock=10)
rank 10 in job 1  udf2.stsci.edu_47530   caused collective abort of all ranks
exit status of rank 10: killed by signal 9
-------------------------------------


Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive                       Phone: +1 410 338 4987
Baltimore MD 21218 U.S.                       Fax: +1 410 338 4767


" We shall not cease from exploration
   And the end of all our exploring
   Will be to arrive where we started
   And know the place for the first time. "

                                      T. S. Eliot




