[mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed

Xiao Bo Lu xiao.lu at auckland.ac.nz
Tue Apr 21 20:48:15 CDT 2009


Thanks Rajeev and Gus,

Yes, it turns out my new system was using the ch3:sock channel. I added the
--with-device=ch3:nemesis flag to the configuration and recompiled all my
code and libraries. The "MPI_Barrier(MPI_COMM_WORLD) failed........."
errors are now all gone, but the code is still broken, with new error
messages:

rank 1 in job 1061  hpc2_15464   caused collective abort of all ranks
  exit status of rank 1: killed by signal 11
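
For reference, the rebuild went roughly like this (the prefix is just my
local install path; "mycode" and the source layout are placeholders for my
actual application):

./configure --prefix=/hpc/xlu012 --with-device=ch3:nemesis CC=gcc F90=gfortran
make && make install
mpif90 -o mycode *.f90      # recompile every source file with the new mpif90
mpiexec -n 4 ./mycode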

I think I might have more than one problem with the code, possibly in
non-MPI routines. Thanks anyway.
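
In case it helps anyone searching later, the kind of "simple MPI test" that
still runs fine for me is a minimal barrier program along these lines (just
a sketch; it assumes the Fortran 90 "mpi" module was built, otherwise the
include 'mpif.h' form works the same way):

program barrier_test
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  ! if the barrier itself were still broken, the old sock-style failures
  ! would show up here, without any of my application code involved
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  print *, 'rank', rank, 'of', nprocs, 'passed the barrier'
  call MPI_Finalize(ierr)
end program barrier_test

Built with mpif90 and run with "mpiexec -n 4", this sort of test works,
which is why I suspect the remaining signal 11 is in my own (non-MPI) code.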

Regards
Xiao


Rajeev Thakur wrote:
> Thanks. Yes, it is worth trying with Nemesis. (Configure with
> --with-device=ch3:nemesis).
>
> Rajeev
>  
>
>   
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
>> Sent: Tuesday, April 21, 2009 11:11 AM
>> To: Mpich Discuss
>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>
>> Hi Xiao Bo, Rajeev, list
>>
>> I read reports of MPICH2 1.0.8 with the ch3:sock channel
>> failing with socket errors in newer (maybe not so new now)
>> Linux kernels.
>>
>> The person who reported the problem
>> had trouble before with p4 errors and MPICH1.
>> Back then somebody else pointed out the same
>> kind of problems with the newer kernels and MPICH1.
>> Changing to MPICH2 1.0.8 with ch3:sock didn't fix the problem either.
>>
>> The fix I suggested was to change to the nemesis channel
>> (i.e. "configure --with-device=ch3:nemesis [other parameters]"),
>> which AFAIK is not the default in MPICH2 1.0.8.
>>
>> Please, see this thread:
>> http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
>>
>> Xiao Bo apparently was using AIX before, and this may explain
>> why his code began failing after he migrated it to Linux.
>> His MPICH2 1.0.8 seems to use sockets, as the error messages suggest.
>> The problem may not be the same, and the fix may not be the same;
>> however, it may be worth trying a nemesis build, just in case.
>>
>> Trust my guess or not, here are my two cents anyway. :)
>> Good luck!
>>
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>> Rajeev Thakur wrote:
>>     
>>> If the MPICH2 test suite runs (run "make testing" in the top-level
>>> mpich2 directory), then I don't know what the problem might be.
>>>
>>> Rajeev
>>>  
>>>
>>>       
>>>> -----Original Message-----
>>>> From: Xiao Bo Lu [mailto:xiao.lu at auckland.ac.nz] 
>>>> Sent: Monday, April 20, 2009 6:27 PM
>>>> To: Rajeev Thakur; mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>
>>>> Hi,
>>>>
>>>> Yes. It is a Fortran 90 code. I did re-compile all the source
>>>> code and libraries with the new MPICH2. The configuration
>>>> options I used were:
>>>>
>>>> ./configure -prefix=/hpc/xlu012 CC=gcc F90=gfortran
>>>>
>>>> and I compiled all the files with mpif90. I also did
>>>> a few simple MPI tests with mpiexec and they seem to work
>>>> fine. I'm starting to wonder if it has anything to do with
>>>> memory allocation or some other communication variables
>>>> that block some of the messages from a large array (??).
>>>>
>>>> Regards
>>>> Xiao
>>>>
>>>> Rajeev Thakur wrote:
>>>>> If it's Fortran code, just make sure no mpif.h files are left around
>>>>> from the old implementation. Also make sure that the entire code (all
>>>>> files) have been recompiled with MPICH2.
>>>>>
>>>>> Rajeev
>>>>>
>>>>>
>>>>>
>>>>>   
>>>>>           
>>>>>> -----Original Message-----
>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
>>>>>> Sent: Monday, April 20, 2009 5:57 PM
>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>
>>>>>> Hi Rajeev,
>>>>>>
>>>>>> Yes. The code was working but on a different platform (IBM-aix system
>>>>>> with POE). I have to move the code to the new system since the lease
>>>>>> on the old one just expired.
>>>>>>
>>>>>> Regards
>>>>>> Xiao
>>>>>>
>>>>>> Rajeev Thakur wrote:
>>>>>>     
>>>>>>             
>>>>>>> Was this code that worked earlier?
>>>>>>>
>>>>>>> Rajeev
>>>>>>>
>>>>>>>   
>>>>>>>       
>>>>>>>               
>>>>>>>> -----Original Message-----
>>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
>>>>>>>> Sent: Monday, April 20, 2009 12:51 AM
>>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>>> Subject: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I've recently installed MPICH2-1.0.8 on my local machine (x86_64
>>>>>>>> Linux, gfortran 4.1.2) and I am now experiencing errors with my
>>>>>>>> mpi code.
>>>>>>>> The error messages are:
>>>>>>>>
>>>>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
>>>>>>>> MPI_Barrier(406)..........................: 
>>>>>>>> MPI_Barrier(MPI_COMM_WORLD)
>>>>>>>> failed
>>>>>>>> MPIR_Barrier(77)..........................:
>>>>>>>> MPIC_Sendrecv(126)........................:
>>>>>>>> MPIC_Wait(270)............................:
>>>>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while
>>>>>>>> handling an event returned by MPIDU_Sock_Wait()
>>>>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
>>>>>>>> MPIDU_Socki_handle_read(637)..............: connection failure 
>>>>>>>> (set=0,sock=1,errno=104:Connection reset by peer)[cli_0]:
>>>>>>>> aborting job:
>>>>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
>>>>>>>> MPI_Barrier(406)..........................: 
>>>>>>>> MPI_Barrier(MPI_COMM_WORLD)
>>>>>>>> failed
>>>>>>>> MPIR_Barrier(77)..........................:
>>>>>>>> MPIC_Sendrecv(126)........................:
>>>>>>>> MPIC_Wait(270)............................:
>>>>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while
>>>>>>>> handling an event returned by MPIDU_Sock_Wait()
>>>>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
>>>>>>>> MPIDU_Socki_handle_read size of processor is:            4
>>>>>>>>
>>>>>>>> I searched the net and found quite a few links about such errors,
>>>>>>>> but none of the posts could give a definitive fix. Do some of you
>>>>>>>> know what could cause this error (e.g. incorrect installation,
>>>>>>>> environment settings...) and how to fix it?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Xiao
>>>>>>>>


