[mpich-discuss] strange timeouts

Rajeev Thakur thakur at mcs.anl.gov
Tue Oct 21 09:43:14 CDT 2008


No idea. Something weird is going on in the network, I think.

Try using the latest 1.0.8rc1 that was just released.

Rajeev

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Philippe Bourdin
> Sent: Tuesday, October 21, 2008 8:09 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] strange timeouts
> 
> 
> 	Hello,
> 
> Well, is no one able to comment on this?
> 
> Here is another event that occurred; does this sound familiar to anyone?
> > [cli_37]: aborting job:
> > Fatal error in MPI_Isend: Internal MPI error!, error stack:
> > MPI_Isend(145)........: MPI_Isend(buf=0xca15a00, count=226368, MPI_REAL, dest=38, tag=4, MPI_COMM_WORLD, request=0xce16ee0) failed
> > MPIDI_CH3_RndvSend(70): failure occurred while attempting to send RTS packet
> > (unknown)(): Internal MPI error!
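> 
> For reference, the call in that stack is an ordinary non-blocking send
> that is completed later with MPI_Wait. Below is a minimal C sketch of the
> sending side (buffer, count, destination and tag are illustrative only;
> the Pencil code itself is Fortran, hence MPI_REAL in the log):
> 
>   #include <mpi.h>
> 
>   /* Post a non-blocking send of 'count' reals and complete it with
>      MPI_Wait.  A message of ~226368 reals is apparently large enough
>      that the ch3:sock channel uses its rendezvous protocol, whose RTS
>      packet shows up in the error stack above. */
>   static void send_slab(float *buf, int count, int dest, int tag)
>   {
>       MPI_Request req;
>       MPI_Isend(buf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &req);
>       /* ... work that overlaps with the transfer ... */
>       MPI_Wait(&req, MPI_STATUS_IGNORE);
>   }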
> 
> my original post was:
> > I run an MPI application, based on the very well-tested MHD code
> > "Pencil", on 8 nodes with 8 CPUs, connected by "simple" Ethernet.
> > 
> > What I see now is that the simulation breaks because there seems to be
> > a timeout in the MPI communication, sometimes in MPI_Recv, but it has
> > also happened in MPI_Finalize or MPI_Wait. Normally the simulation runs
> > fine for a couple of hours before this happens (a minimal sketch of the
> > communication pattern follows the log excerpts below):
> > 
> > stderr:
> >> [cli_62]: aborting job:
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............................: MPI_Recv(buf=0xc936700, count=1024, MPI_REAL, src=61, tag=101, MPI_COMM_WORLD, status=0xce168e0) failed
> >> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> >> MPIDI_CH3I_Progress_handle_sock_event(420):
> >> MPIDU_Socki_handle_read(633)..............: connection failure 
> >> (set=0,sock=8,errno=110:Connection timed out)
> > 
> > stdout (cut the first 974 timesteps...):
> >>    975   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
> >>    976   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
> >>    977   4.28E+01  8.08E-03   0.03   0.02   0.00   0.03   0.98   0.03
> >> rank 62 in job 10  node003.bfg.uni-freiburg.de_38675   caused collective abort of all ranks
> >>   exit status of rank 62: killed by signal 9
> > 
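> > To make the failing pattern concrete: the aborting call above is a plain
> > blocking receive, and the run ends with MPI_Finalize, where timeouts have
> > also appeared. Here is a minimal C sketch of such an exchange (ranks are
> > illustrative; the log shows count=1024, src=61, tag=101):
> > 
> >   #include <mpi.h>
> > 
> >   int main(int argc, char **argv)
> >   {
> >       float buf[1024] = {0};
> >       int rank;
> >       MPI_Init(&argc, &argv);
> >       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >       if (rank == 0) {
> >           /* The same kind of call aborted with "Connection timed out". */
> >           MPI_Recv(buf, 1024, MPI_FLOAT, 1, 101, MPI_COMM_WORLD,
> >                    MPI_STATUS_IGNORE);
> >       } else if (rank == 1) {
> >           MPI_Send(buf, 1024, MPI_FLOAT, 0, 101, MPI_COMM_WORLD);
> >       }
> >       MPI_Finalize();   /* timeouts have occasionally shown up here too */
> >       return 0;
> >   }
> > 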
> > Btw: I am pretty sure that this kill signal was not issued by me, by
> > the Torque queue system, or by an admin...
> > Does anyone have any idea, hint, or clue about this?
> > 
> > Thanks and best regards,
> > 
> >     Philippe Bourdin.
> > 
> 
> 



