[mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed

Gus Correa gus at ldeo.columbia.edu
Tue Apr 21 11:10:59 CDT 2009


Hi Xiao Bo, Rajeev, list

I have read reports of MPICH2 1.0.8 with the ch3:sock channel
failing with socket errors on newer (maybe not so new now)
Linux kernels.

The person who reported the problem
had earlier had trouble with p4 errors in MPICH1.
Back then somebody else pointed out the same
kind of problem with the newer kernels and MPICH1.
Changing to MPICH2 1.0.8 with ch3:sock didn't fix the problem either.

The fix I suggested was to switch to the nemesis channel
(i.e. "configure --with-device=ch3:nemesis [other parameters]"),
which AFAIK is not the default in MPICH2 1.0.8.
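
A minimal sketch of such a rebuild follows. It only prints the steps so
they can be reviewed first. The helper name is made up for illustration,
and the prefix and compiler settings are simply copied from Xiao Bo's
configure line quoted further down; only the --with-device option is the
actual suggestion.

```shell
#!/bin/sh
# Sketch only: prints the rebuild steps instead of running them.
# nemesis_rebuild_cmds is a hypothetical helper name; the prefix and the
# CC/F90 settings are assumptions taken from the quoted configure line.
nemesis_rebuild_cmds() {
    prefix="$1"
    echo "./configure --prefix=$prefix --with-device=ch3:nemesis CC=gcc F90=gfortran"
    echo "make"
    echo "make install"
}

nemesis_rebuild_cmds /hpc/xlu012
```

Run it from the top-level MPICH2 source directory and pipe the output to
sh once the commands look right. The application and its libraries would
then need to be recompiled with the newly installed mpif90.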

Please see this thread:
http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2

Xiao Bo apparently was using AIX before, and this may explain
why his code began failing after he migrated it to Linux.
His MPICH2 1.0.8 seems to use sockets, as the error messages suggest.
The problem may not be the same, and the fix may not be the same;
however, it may be worth trying a nemesis build, just in case.
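
As a rough way to check that guess against a saved error log, one can
look for the MPIDU_Sock* routine names that appear in the stack trace
quoted below. This is only a heuristic sketch: the function name and the
job.err file are hypothetical, and a match suggests, but does not prove,
a ch3:sock build.

```shell
#!/bin/sh
# Heuristic sketch: a stack trace naming MPIDU_Sock* routines points at
# the socket-based channel. Function name and log path are hypothetical.
log_suggests_sock_channel() {
    grep -q 'MPIDU_Sock' "$1"
}

# job.err is assumed to hold the saved stderr of the failed run.
if [ -f job.err ] && log_suggests_sock_channel job.err; then
    echo "error stack mentions MPIDU_Sock*: likely a ch3:sock build"
fi
```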

Trust my guess or not, here are my two cents anyway. :)
Good luck!

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Rajeev Thakur wrote:
> If the MPICH2 test suite runs (run make testing in top-level mpich2
> directory), then I don't know what the problem might be.
> 
> Rajeev
>  
> 
>> -----Original Message-----
>> From: Xiao Bo Lu [mailto:xiao.lu at auckland.ac.nz] 
>> Sent: Monday, April 20, 2009 6:27 PM
>> To: Rajeev Thakur; mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>
>> Hi,
>>
>> Yes. It is a Fortran 90 code. I did re-compile all the source
>> code and libraries with the new MPICH2. The configuration
>> option I used is:
>>
>> ./configure -prefix=/hpc/xlu012 CC=gcc F90=gfortran
>>
>> and I compiled all the files with mpif90. I also did
>> a few simple MPI tests with mpiexec and they seem to work
>> fine. I'm starting to wonder if it has anything to do with
>> the memory allocation or some other communication variables
>> that block some of the messages from a large array(??).
>>
>> Regards
>> Xiao
>>
>> Rajeev Thakur wrote:
>>> If it's Fortran code, just make sure no mpif.h files are left around
>>> from the old implementation. Also make sure that the entire code (all
>>> files) has been recompiled with MPICH2.
>>>
>>> Rajeev
>>>
>>>
>>>
>>>   
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
>>>> Sent: Monday, April 20, 2009 5:57 PM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>
>>>> Hi Rajeev,
>>>>
>>>> Yes. The code was working but on a different platform (IBM-aix system
>>>> with POE). I have to move the code to the new system since the lease
>>>> on the old one just expired.
>>>>
>>>> Regards
>>>> Xiao
>>>>
>>>> Rajeev Thakur wrote:
>>>>     
>>>>> Was this code that worked earlier?
>>>>>
>>>>> Rajeev
>>>>>
>>>>>   
>>>>>       
>>>>>> -----Original Message-----
>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
>>>>>> Sent: Monday, April 20, 2009 12:51 AM
>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>> Subject: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've recently installed MPICH2-1.0.8 on my local machine
>>>>>> (x86_64 Linux, gfortran 4.1.2) and I am now experiencing errors
>>>>>> with my MPI code. The error messages are:
>>>>>>
>>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
>>>>>> MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>> MPIR_Barrier(77)..........................:
>>>>>> MPIC_Sendrecv(126)........................:
>>>>>> MPIC_Wait(270)............................:
>>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
>>>>>> MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
>>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
>>>>>> MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>> MPIR_Barrier(77)..........................:
>>>>>> MPIC_Sendrecv(126)........................:
>>>>>> MPIC_Wait(270)............................:
>>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
>>>>>> MPIDU_Socki_handle_read size of processor is:                     4
>>>>>> I searched the net and found quite a few links about such errors, but
>>>>>> none of the posts could give a definitive fix. Do some of you know
>>>>>> what could cause this error (e.g. incorrect installation,
>>>>>> environment settings...) and how to fix it?
>>>>>>
>>>>>> Regards
>>>>>> Xiao
>>>>>>


