[MPICH] MPICH crashes with large number of messages

Anh Vo vtqanh at gmail.com
Thu Oct 4 16:43:06 CDT 2007


I was trying to run a test program on a small cluster with 4 nodes, in
which each node sends messages of random size to the root. The
program calls for 100K messages, but every time it fails at around
20K-30K messages. It seems the root becomes unable to receive any more
messages from the other 3 nodes, which causes all the MPI_Send calls
from those nodes to fail.

Here's the output from MPICH:
[cli_3]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x80d4008,
count=3986, MPI_INT, dest=0, tag=0, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(660):
MPIDU_Socki_handle_pollhup(402)...........: connection closed by peer
(set=0,sock=1)

[cli_1]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x80d4008,
count=9369, MPI_INT, dest=0, tag=0, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(660):
MPIDU_Socki_handle_pollhup(402)...........: connection closed by peer
(set=0,sock=1)

[cli_2]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x80d4008,
count=9754, MPI_INT, dest=0, tag=0, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(660):
MPIDU_Socki_handle_pollhup(402)...........: connection closed by peer
(set=0,sock=1)



