[mpich-discuss] strange timeouts
Philippe Bourdin
Bourdin at KIS.Uni-Freiburg.de
Tue Oct 21 08:08:44 CDT 2008
Hello,
well, no one able to comment on this?
Here is another event that occurred; does this sound familiar to anyone?
> [cli_37]: aborting job:
> Fatal error in MPI_Isend: Internal MPI error!, error stack:
> MPI_Isend(145)........: MPI_Isend(buf=0xca15a00, count=226368, MPI_REAL, dest=38, tag=4, MPI_COMM_WORLD, request=0xce16ee0) failed
> MPIDI_CH3_RndvSend(70): failure occurred while attempting to send RTS packet
> (unknown)(): Internal MPI error!
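For what it's worth, when the library aborts from inside MPI_Isend like this, one way to get more information is to switch MPI_COMM_WORLD's error handler to MPI_ERRORS_RETURN, so the failing call returns an error code that can be decoded with MPI_Error_string and logged before shutting down. A minimal sketch of that pattern (the ring exchange and the neighbour choice are just illustrative, not taken from Pencil; only the message size mirrors the error stack above):

```c
/* Sketch: large non-blocking exchange like the failing MPI_Isend
 * (count=226368, MPI_REAL corresponds to MPI_FLOAT in C), with
 * MPI_ERRORS_RETURN set so errors are reported instead of aborting
 * inside the library. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void check(int err, const char *what, int rank)
{
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "rank %d: %s failed: %s\n", rank, what, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* By default the job is aborted on any MPI error; with
     * MPI_ERRORS_RETURN the error code comes back to the caller. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 226368;              /* message size from the error stack */
    float *sendbuf = calloc(count, sizeof *sendbuf);
    float *recvbuf = calloc(count, sizeof *recvbuf);

    int dest = (rank + 1) % size;          /* hypothetical ring exchange */
    int src  = (rank + size - 1) % size;
    MPI_Request reqs[2];

    check(MPI_Irecv(recvbuf, count, MPI_FLOAT, src, 4, MPI_COMM_WORLD, &reqs[0]),
          "MPI_Irecv", rank);
    check(MPI_Isend(sendbuf, count, MPI_FLOAT, dest, 4, MPI_COMM_WORLD, &reqs[1]),
          "MPI_Isend", rank);
    check(MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE), "MPI_Waitall", rank);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

With the error handler changed, a socket-level failure like the one above should at least produce a per-rank log line identifying which call and which peer were involved, instead of only the collective abort.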
my original post was:
> I run an MPI application that is based on the very well tested MHD code
> "Pencil" on 8 nodes with 8 CPUs, connected by "simple" ethernet.
>
> What I see now is that the simulation breaks: there seems to be a
> timeout in the MPI communication, sometimes on MPI_Recv, but it has also
> happened at MPI_Finalize or MPI_Wait. Normally the simulation runs fine
> for a couple of hours before this happens:
>
> stderr:
>> [cli_62]: aborting job:
>> Fatal error in MPI_Recv: Other MPI error, error stack:
>> MPI_Recv(186).............................: MPI_Recv(buf=0xc936700,
>> count=1024, MPI_REAL, src=61, tag=101, MPI_COMM_WORLD,
>> status=0xce168e0) failed
>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(420):
>> MPIDU_Socki_handle_read(633)..............: connection failure
>> (set=0,sock=8,errno=110:Connection timed out)
>
> stdout (cut the first 974 timesteps...):
>> 975 4.28E+01 8.08E-03 0.03 0.02 0.00 0.03 0.98 0.03
>> 976 4.28E+01 8.08E-03 0.03 0.02 0.00 0.03 0.98 0.03
>> 977 4.28E+01 8.08E-03 0.03 0.02 0.00 0.03 0.98 0.03
>> rank 62 in job 10 node003.bfg.uni-freiburg.de_38675 caused
>> collective abort of all ranks
>> exit status of rank 62: killed by signal 9
>
> Btw: I am pretty sure that this kill signal was not issued by me, by the
> Torque queue system, or by an admin...
> Does anyone have any idea, hint, or clue about this?
>
> Thanks and best regards,
>
> Philippe Bourdin.
>