[mpich-discuss] strange timeouts
Rajeev Thakur
thakur at mcs.anl.gov
Tue Oct 21 09:43:14 CDT 2008
No idea. Something strange seems to be going on in the network.
Try the latest 1.0.8rc1, which was just released.
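
If it helps to narrow this down, below is a rough sketch of a standalone
soak test (not code from the Pencil run; the message size mirrors the
failing Isend, while the tag, iteration count, and progress printout are
arbitrary). It exercises the same MPI_Isend / MPI_Recv / MPI_Wait pattern
between neighbouring ranks, so it can be left running over the same
Ethernet fabric for a few hours to see whether the connection timeouts
appear without the rest of the application:

/* Minimal point-to-point soak test (sketch only).
 * Each rank posts a large nonblocking send to its right neighbour,
 * receives the same amount from its left neighbour, then waits on
 * the send, and repeats. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 226368   /* same element count as the failing Isend */
#define ITERS 100000   /* arbitrary; long enough to run for hours  */

int main(int argc, char **argv)
{
    int rank, size, i;
    float *sendbuf, *recvbuf;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(COUNT * sizeof(float));
    recvbuf = malloc(COUNT * sizeof(float));
    for (i = 0; i < COUNT; i++)
        sendbuf[i] = (float) rank;

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    for (i = 0; i < ITERS; i++) {
        /* nonblocking send to the right, blocking receive from the left */
        MPI_Isend(sendbuf, COUNT, MPI_FLOAT, right, 4, MPI_COMM_WORLD, &req);
        MPI_Recv(recvbuf, COUNT, MPI_FLOAT, left, 4, MPI_COMM_WORLD, &status);
        MPI_Wait(&req, &status);
        if (rank == 0 && i % 1000 == 0)
            printf("iteration %d ok\n", i);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

If this test also dies with errno 110, that would point at the network or
the nodes rather than at the application itself.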
Rajeev
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Philippe Bourdin
> Sent: Tuesday, October 21, 2008 8:09 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] strange timeouts
>
>
> Hello,
>
> well, is no one able to comment on this?
>
> Here is another event that occurred; does this sound familiar to anyone?
> > [cli_37]: aborting job:
> > Fatal error in MPI_Isend: Internal MPI error!, error stack:
> > MPI_Isend(145)........: MPI_Isend(buf=0xca15a00, count=226368,
> >   MPI_REAL, dest=38, tag=4, MPI_COMM_WORLD, request=0xce16ee0) failed
> > MPIDI_CH3_RndvSend(70): failure occurred while attempting to send RTS packet
> > (unknown)(): Internal MPI error!
>
> my original post was:
> > I run an MPI application based on the very well-tested MHD code
> > "Pencil" on 8 nodes with 8 CPUs, connected by "simple" Ethernet.
> >
> > What I see now is that the simulation breaks because there seems to
> > be a timeout in the MPI communication, sometimes in MPI_Recv, but it
> > has also happened in MPI_Finalize or MPI_Wait. Normally the
> > simulation runs fine for a couple of hours before this happens:
> >
> > stderr:
> >> [cli_62]: aborting job:
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............................: MPI_Recv(buf=0xc936700,
> >>   count=1024, MPI_REAL, src=61, tag=101, MPI_COMM_WORLD,
> >>   status=0xce168e0) failed
> >> MPIDI_CH3i_Progress_wait(215).............: an error occurred while
> >>   handling an event returned by MPIDU_Sock_Wait()
> >> MPIDI_CH3I_Progress_handle_sock_event(420):
> >> MPIDU_Socki_handle_read(633)..............: connection failure
> >>   (set=0,sock=8,errno=110:Connection timed out)
> >
> > stdout (cut the first 974 timesteps...):
> >> 975 4.28E+01 8.08E-03 0.03 0.02 0.00 0.03 0.98 0.03
> >> 976 4.28E+01 8.08E-03 0.03 0.02 0.00 0.03 0.98 0.03
> >> 977 4.28E+01 8.08E-03 0.03 0.02 0.00 0.03 0.98 0.03
> >> rank 62 in job 10 node003.bfg.uni-freiburg.de_38675 caused collective abort of all ranks
> >> exit status of rank 62: killed by signal 9
> >
> > Btw: I am pretty sure that this kill signal was not issued by me, by
> > the Torque queue system, or by an admin...
> > Does anyone have an idea, a hint, or a clue about this...?
> >
> > Thanks and best regards,
> >
> > Philippe Bourdin.
> >
>
>