[MPICH] MPI_Sendrecv error: ssh related?
Michele Trenti
trenti at stsci.edu
Mon Oct 2 10:35:30 CDT 2006
Hi,
I encountered an MPI_Sendrecv error (see log below) within a well-tested
application (Gadget2, a publicly available cosmological simulation code).
It appears only when the job becomes so CPU-intensive that several minutes
pass without any data exchange among the nodes. When message passing is
more frequent everything runs fine, so a bug in the application seems
unlikely.
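To make the failure pattern concrete, here is a minimal sketch of the
communication pattern as I understand it (illustration only, not Gadget2
code; the long sleep() stands in for the CPU-intensive phase, and the rank
pairing is arbitrary):

/* Minimal sketch (not Gadget2 code): paired ranks exchange data, go
 * silent for a long "compute" phase simulated by sleep(), then try to
 * exchange again. If idle connections are being closed, I would expect
 * the second MPI_Sendrecv to fail the way my run does.
 * Assumes an even number of ranks. */
#include <mpi.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, peer;
    char sbuf[1024], rbuf[1024];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = rank ^ 1;   /* pair ranks 0<->1, 2<->3, ... */

    /* first exchange: always succeeds for me */
    MPI_Sendrecv(sbuf, sizeof sbuf, MPI_BYTE, peer, 18,
                 rbuf, sizeof rbuf, MPI_BYTE, peer, 18,
                 MPI_COMM_WORLD, &status);

    sleep(900);        /* ~15 minutes with no traffic between the nodes */

    /* second exchange: where I would expect "connection closed by peer"
     * if an idle-connection timeout is the culprit */
    MPI_Sendrecv(sbuf, sizeof sbuf, MPI_BYTE, peer, 18,
                 rbuf, sizeof rbuf, MPI_BYTE, peer, 18,
                 MPI_COMM_WORLD, &status);

    if (rank == 0) printf("both exchanges completed\n");
    MPI_Finalize();
    return 0;
}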
My guess is that the problem is related to my network settings, which
perhaps close inactive connections (although I have never been kicked off
a node for being idle too long in an open ssh terminal).
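If idle-connection timeouts are indeed the cause, one thing I am
considering is enabling TCP keepalive so the kernel sends probes during
the quiet phases. Below is a sketch of what I mean, using the standard
Linux socket options (SO_KEEPALIVE plus the per-socket
TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT overrides); where or whether this
can be hooked into MPICH2's sock channel I don't know, so treat it purely
as an illustration:

/* Illustration only: enable TCP keepalive on a connected socket fd so
 * the kernel probes the peer during long idle stretches instead of
 * letting intermediate equipment silently drop the connection.
 * The numbers below are arbitrary examples. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int enable_keepalive(int fd)
{
    int on = 1;
    int idle = 120;      /* seconds of idleness before the first probe */
    int interval = 30;   /* seconds between probes */
    int count = 4;       /* missed probes before declaring the peer dead */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    /* Linux-specific per-socket overrides of the sysctl defaults */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof interval) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
    return 0;
}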
Has anyone encountered this problem before, and/or can anyone offer
insight toward a solution?
I use mpich2-1.0.4, compiled with gcc version 3.4.6 20060404 (Red Hat
3.4.6-3), on a Sun Opteron Linux cluster.
Thanks,
Michele
------- Error Log: --------------
[cli_10]: aborting job:
Fatal error in MPI_Sendrecv: Other MPI error, error stack:
MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x2a97a984b8, scount=1669696, MPI_BYTE, dest=7, stag=18, rbuf=0x2a988e6b78, rcount=516960, MPI_BYTE, src=7, rtag=18, MPI_COMM_WORLD, status=0x7fbfffee00) failed
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(608):
MPIDU_Socki_handle_pollhup(439)...........: connection closed by peer (set=0,sock=10)
rank 10 in job 1 udf2.stsci.edu_47530 caused collective abort of all ranks
exit status of rank 10: killed by signal 9
-------------------------------------
Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive Phone: +1 410 338 4987
Baltimore MD 21218 U.S. Fax: +1 410 338 4767
" We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time. "
T. S. Eliot