[mpich-discuss] strange timeouts

Philippe Bourdin Bourdin at KIS.Uni-Freiburg.de
Tue Oct 21 08:08:44 CDT 2008


	Hello,

Well, is no one able to comment on this?

Here is another event that occurred; does this sound familiar to anyone?
> [cli_37]: aborting job:
> Fatal error in MPI_Isend: Internal MPI error!, error stack:
> MPI_Isend(145)........: MPI_Isend(buf=0xca15a00, count=226368, MPI_REAL, dest=38, tag=4, MPI_COMM_WORLD, request=0xce16ee0) failed
> MPIDI_CH3_RndvSend(70): failure occurred while attempting to send RTS packet
> (unknown)(): Internal MPI error!
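For context, my own reading of that stack: a count of 226368 MPI_REAL elements is far too large for MPICH's eager protocol, so the library switches to the rendezvous protocol and first sends an RTS (request-to-send) packet, which is the step that fails above. A rough size estimate, assuming the common 4-byte default REAL:

```python
# Size of the failing MPI_Isend message from the error stack:
# 226368 elements of MPI_REAL (assumed 4 bytes each, the usual
# Fortran default) -- well above typical eager-send thresholds,
# so MPICH uses the rendezvous (RTS/CTS) protocol for it.
count = 226368
bytes_per_real = 4  # assumption: default 4-byte REAL
size = count * bytes_per_real
print(size, "bytes, about", round(size / 2**20, 2), "MiB per message")
```

So each of these transfers is close to a megabyte over plain Ethernet, which makes a socket-level failure during the rendezvous handshake at least plausible under load.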

My original post was:
> I run an MPI application based on the very well tested MHD code 
> "Pencil" on 8 nodes with 8 CPUs, connected by "simple" Ethernet.
> 
> What I see now is that the simulation breaks because there seems to be 
> a timeout in the MPI communication, sometimes on MPI_Recv, but it has 
> also happened at MPI_Finalize or MPI_Wait. Normally the simulation runs 
> fine for a couple of hours before this happens:
> 
> stderr:
>> [cli_62]: aborting job:
>> Fatal error in MPI_Recv: Other MPI error, error stack:
>> MPI_Recv(186).............................: MPI_Recv(buf=0xc936700, 
>> count=1024, MPI_REAL, src=61, tag=101, MPI_COMM_WORLD, 
>> status=0xce168e0) failed
>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while 
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(420):
>> MPIDU_Socki_handle_read(633)..............: connection failure 
>> (set=0,sock=8,errno=110:Connection timed out)
> 
> stdout (cut the first 974 timesteps...):
>>    975   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
>>    976   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
>>    977   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
>> rank 62 in job 10  node003.bfg.uni-freiburg.de_38675   caused 
>> collective abort of all ranks
>>   exit status of rank 62: killed by signal 9
> 
> Btw: I am pretty sure that this kill signal was not issued by me, by 
> the Torque queue system, or by an admin...
> Does anyone have an idea, a hint, or any clue about this?
> 
> Thanks and best regards,
> 
>     Philippe Bourdin.
> 
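One more observation on the stderr above: the errno=110 reported by MPIDU_Socki_handle_read is the kernel's ETIMEDOUT, i.e. the TCP connection between the nodes timed out at the OS level before MPICH saw any data. A quick check of what that errno means (plain Python, Linux assumed):

```python
import errno
import os

# The MPICH stack trace reports "errno=110: Connection timed out".
# On Linux, errno 110 is ETIMEDOUT: the kernel abandoned the TCP
# connection after exhausting its retransmission attempts, and
# MPICH's sock channel surfaced that as an MPI error.
print(errno.ETIMEDOUT, "=", errno.errorcode[errno.ETIMEDOUT],
      ":", os.strerror(errno.ETIMEDOUT))
```

If that reading is right, the underlying cause may be on the network or kernel side rather than in the application; on Linux, how long the kernel retries an established connection before declaring ETIMEDOUT is governed by the net.ipv4.tcp_retries2 sysctl.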
