[MPICH] Re: MPICH crashes with large number of messages

Anh Vo vtqanh at gmail.com
Thu Oct 4 23:25:12 CDT 2007


The buffer used for receiving the message has the same type and size
as the one used for sending, so that should not be the problem.

It looks like the root process (rank 0) was terminated by the system
(killed by signal 9). What could cause the system to terminate an MPI
process?

rank 3 in job 5  raven30_39541   caused collective abort of all ranks
  exit status of rank 3: return code 1
rank 2 in job 5  raven30_39541   caused collective abort of all ranks
  exit status of rank 2: return code 1
rank 1 in job 5  raven30_39541   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 5  raven30_39541   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9
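
For reference, signal 9 is SIGKILL on Linux and other POSIX systems. A
process cannot catch or ignore it, so it comes from outside the program:
a user or batch system running kill -9, or the kernel itself (for
example the Linux out-of-memory killer). The sketch below is not part of
the test program. It is a minimal illustration, assuming a POSIX system,
of how a parent process can tell "killed by signal 9" from "return code
1" by looking at the wait status. The helper names spawn_and_sigkill and
describe_exit are made up for this example.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that only sleeps, send it SIGKILL, and return its wait
 * status, or -1 if fork/kill/waitpid failed. The child stands in for a
 * process that is killed from outside. */
int spawn_and_sigkill(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();            /* child: block until a signal arrives */
    }

    int status = 0;
    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}

/* Format a wait status using the same wording as the launcher output
 * above. Returns the value returned by snprintf. */
int describe_exit(int status, char *out, size_t len)
{
    assert(out != NULL && len > 0);
    if (WIFSIGNALED(status))
        return snprintf(out, len, "killed by signal %d", WTERMSIG(status));
    if (WIFEXITED(status))
        return snprintf(out, len, "return code %d", WEXITSTATUS(status));
    return snprintf(out, len, "unknown status %d", status);
}
```

Passing the result of spawn_and_sigkill() to describe_exit() produces
the same kind of line as the one reported for rank 0, while a child that
calls exit(1) would be reported as "return code 1" like ranks 1 to 3.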

Thanks
--Anh

On 10/4/07, Anh Vo <vtqanh at gmail.com> wrote:
> I was trying to run a test program on a small cluster with 4 nodes, in
> which each node sends a message of random size to the root. The
> program calls for 100K messages, but every time it fails at around
> 20K-30K messages. It seems the root becomes unable to receive any more
> messages from the other 3, which causes all the MPI_Send calls from
> those to fail.
>
> Here's the output of MPICH:
> [cli_3]: aborting job:
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173).............................: MPI_Send(buf=0x80d4008,
> count=3986, MPI_INT, dest=0, tag=0, MPI_COMM_WORLD) failed
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(660):
> MPIDU_Socki_handle_pollhup(402)...........: connection closed by peer
> (set=0,sock=1)
>
> [cli_1]: aborting job:
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173).............................: MPI_Send(buf=0x80d4008,
> count=9369, MPI_INT, dest=0, tag=0, MPI_COMM_WORLD) failed
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(660):
> MPIDU_Socki_handle_pollhup(402)...........: connection closed by peer
> (set=0,sock=1)
>
> [cli_2]: aborting job:
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173).............................: MPI_Send(buf=0x80d4008,
> count=9754, MPI_INT, dest=0, tag=0, MPI_COMM_WORLD) failed
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(660):
> MPIDU_Socki_handle_pollhup(402)...........: connection closed by peer
> (set=0,sock=1)
>
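
One hypothesis worth ruling out (an assumption, not something the output
above confirms): if the three senders run ahead of the root, messages
that arrive before a matching receive is posted may be buffered in the
root's memory, and the process could then be killed for using too much
of it. A rough upper-bound estimate follows, assuming 4-byte MPI_INT
elements, the average of the three counts in the error stacks (3986,
9369 and 9754, which is 7703), and 25,000 queued messages as the
midpoint of the 20K-30K failure range. buffered_bytes is a hypothetical
helper written for this estimate.

```c
/* Hypothetical helper: bytes held at the receiver if `queued` messages
 * of `avg_count` elements each, `elem_size` bytes per element, were all
 * buffered before any matching receive ran. */
unsigned long long buffered_bytes(unsigned long long queued,
                                  unsigned long long avg_count,
                                  unsigned long long elem_size)
{
    return queued * avg_count * elem_size;
}
```

With these assumed inputs, buffered_bytes(25000, 7703, 4) is 770300000
bytes, or about 735 MiB. The real figure depends on how far the root
actually lags and on how the MPI implementation buffers messages, so
this only shows the order of magnitude that is possible.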



