[mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed

Rajeev Thakur thakur at mcs.anl.gov
Tue Apr 21 11:37:43 CDT 2009


Thanks. Yes, it is worth trying with Nemesis. (Configure with
--with-device=ch3:nemesis).
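
As a rough sketch of that rebuild, assuming the install prefix and compilers
from Xiao's configure line quoted below (the make and make install steps are
the usual build sequence, not something stated in this thread), something like
the following could be run from the top-level mpich2 source directory:

```shell
#!/bin/sh
# Assumed values: prefix and compilers are copied from the configure line
# quoted later in this thread; adjust them for your own system.
PREFIX=/hpc/xlu012
CONFIGURE_ARGS="--prefix=$PREFIX --with-device=ch3:nemesis CC=gcc F90=gfortran"

# Show the command that will be used.
echo "./configure $CONFIGURE_ARGS"

# Only build when run from a directory that actually contains configure.
# $CONFIGURE_ARGS is left unquoted on purpose so it splits into separate options.
if [ -x ./configure ]; then
    ./configure $CONFIGURE_ARGS && make && make install
fi
```

After a rebuild like this, the application and its libraries would need to be
recompiled with the newly installed mpif90 before the barrier is tested again.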

Rajeev
 

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
> Sent: Tuesday, April 21, 2009 11:11 AM
> To: Mpich Discuss
> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
> 
> Hi Xiao Bo, Rajeev, list
> 
> I read reports of MPICH2 1.0.8 with the ch3:sock channel
> failing with socket errors in newer (maybe not so new now)
> Linux kernels.
> 
> The person that reported the problem
> had trouble before with p4 errors and MPICH1.
> Back then somebody else pointed out the same
> kind of problems with the newer kernels and MPICH1.
> Changing to MPICH2 1.0.8 with ch3:sock didn't fix the problem either.
> 
> The fix I suggested was to switch to the nemesis channel
> (i.e. "configure --with-device=ch3:nemesis [other parameters]"),
> which AFAIK is not the default in MPICH2 1.0.8.
> 
> Please, see this thread:
> http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
> 
> Xiao Bo apparently was using AIX before, and this may explain
> why his code began failing after he migrated it to Linux.
> His MPICH2 1.0.8 seems to use sockets, as the error messages suggest.
> The problem may not be the same, and the fix may not be the same;
> however, it may be worth trying a nemesis build, just in case.
> 
> Trust my guess or not, here are my two cents anyway. :)
> Good luck!
> 
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
> 
> 
> Rajeev Thakur wrote:
> > If the MPICH2 test suite runs (run make testing in the top-level mpich2
> > directory), then I don't know what the problem might be.
> > 
> > Rajeev
> >  
> > 
> >> -----Original Message-----
> >> From: Xiao Bo Lu [mailto:xiao.lu at auckland.ac.nz] 
> >> Sent: Monday, April 20, 2009 6:27 PM
> >> To: Rajeev Thakur; mpich-discuss at mcs.anl.gov
> >> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
> >>
> >> Hi,
> >>
> >> Yes. It is a Fortran 90 code. I did re-compile all the source
> >> code and libraries with the new MPICH2. The configure
> >> options I used were:
> >>
> >> ./configure -prefix=/hpc/xlu012 CC=gcc F90=gfortran
> >>
> >> and I compiled all the files with mpif90. I also ran
> >> a few simple MPI tests with mpiexec, and they seem to work
> >> fine. I'm starting to wonder whether it has anything to do with
> >> memory allocation or some other communication variables
> >> that block some of the messages from a large array(??).
> >>
> >> Regards
> >> Xiao
> >>
> >> Rajeev Thakur wrote:
> >>> If it's Fortran code, just make sure no mpif.h files are left around
> >>> from the old implementation. Also make sure that the entire code (all
> >>> files) has been recompiled with MPICH2.
> >>>
> >>> Rajeev
> >>>
> >>>> -----Original Message-----
> >>>> From: mpich-discuss-bounces at mcs.anl.gov
> >>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
> >>>> Sent: Monday, April 20, 2009 5:57 PM
> >>>> To: mpich-discuss at mcs.anl.gov
> >>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
> >>>>
> >>>> Hi Rajeev,
> >>>>
> >>>> Yes. The code was working, but on a different platform (IBM AIX system
> >>>> with POE). I had to move the code to the new system because the lease
> >>>> on the old one just expired.
> >>>>
> >>>> Regards
> >>>> Xiao
> >>>>
> >>>> Rajeev Thakur wrote:
> >>>>> Was this code that worked earlier?
> >>>>>
> >>>>> Rajeev
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: mpich-discuss-bounces at mcs.anl.gov
> >>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
> >>>>>> Sent: Monday, April 20, 2009 12:51 AM
> >>>>>> To: mpich-discuss at mcs.anl.gov
> >>>>>> Subject: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I've recently installed MPICH2-1.0.8 on my local machine (x86_64 Linux,
> >>>>>> gfortran 4.1.2) and I am now experiencing errors with my MPI code.
> >>>>>> The error messages are:
> >>>>>>
> >>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
> >>>>>> MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
> >>>>>> MPIR_Barrier(77)..........................:
> >>>>>> MPIC_Sendrecv(126)........................:
> >>>>>> MPIC_Wait(270)............................:
> >>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> >>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
> >>>>>> MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
> >>>>>> [cli_0]: aborting job:
> >>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
> >>>>>> MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
> >>>>>> MPIR_Barrier(77)..........................:
> >>>>>> MPIC_Sendrecv(126)........................:
> >>>>>> MPIC_Wait(270)............................:
> >>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> >>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
> >>>>>> MPIDU_Socki_handle_read size of processor is:                     4
> >>>>>>
> >>>>>> I searched the net and found quite a few links about this error, but
> >>>>>> none of the posts gave a definitive fix. Does anyone know
> >>>>>> what could cause this error (e.g. an incorrect installation or
> >>>>>> environment settings) and how to fix it?
> >>>>>>
> >>>>>> Regards
> >>>>>> Xiao
> >>>>>>


