[mpich-discuss] strange timeouts

Philippe Bourdin Bourdin at KIS.Uni-Freiburg.de
Mon Oct 13 08:07:34 CDT 2008


	Hello,

I run an MPI application based on the very well tested MHD code 
"Pencil" on 8 nodes with 8 CPUs, connected by "simple" Ethernet.

What I see now is that the simulation breaks because there seems to be a 
timeout in the MPI communication, sometimes in MPI_Recv, but it has also 
happened in MPI_Finalize or MPI_Wait. Normally the simulation runs fine 
for a couple of hours before this happens:

stderr:
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............................: MPI_Recv(buf=0xc936700, count=1024, MPI_REAL, src=61, tag=101, MPI_COMM_WORLD, status=0xce168e0) failed
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(420):
> MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=8,errno=110:Connection timed out)[cli_62]: aborting job:
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............................: MPI_Recv(buf=0xc936700, count=1024, MPI_REAL, src=61, tag=101, MPI_COMM_WORLD, status=0xce168e0) failed
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(420):
> MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=8,errno=110:Connection timed out)
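
For what it's worth, the errno=110 in the trace above is the Linux ETIMEDOUT error, i.e. the TCP socket between two ranks timed out at the kernel level rather than inside MPICH itself. A minimal check (assuming a Linux system, where ETIMEDOUT is numbered 110):

```python
import errno
import os

# errno 110 from the MPICH stack trace is ETIMEDOUT on Linux:
# the kernel gave up on the TCP connection between two ranks.
print(errno.ETIMEDOUT)                # 110 on Linux
print(os.strerror(errno.ETIMEDOUT))   # "Connection timed out"
```

So the failure happens at the socket layer, which is why it can surface in any blocking call (MPI_Recv, MPI_Wait, MPI_Finalize) that waits on the connection.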

stdout (cut the first 974 timesteps...):
>    975   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
>    976   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
>    977   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
> rank 62 in job 10  node003.bfg.uni-freiburg.de_38675   caused collective abort of all ranks
>   exit status of rank 62: killed by signal 9

Btw: I am pretty sure that this kill signal was not issued by me, by the 
Torque queue system, or by an admin...
Does anyone have any idea, hint, or clue about this?

Thanks and best regards,

	Philippe Bourdin.
