[mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed

Gus Correa gus at ldeo.columbia.edu
Tue Apr 21 21:10:22 CDT 2009


Hi Xiao Bo

Signal 11 (segmentation fault) is raised by the OS,
so it is not likely to be an MPI error anymore.
Some part of the program is trying to use memory that
was not allocated to it.
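
For illustration only, a contrived sketch of the kind of bug that
raises signal 11 (hypothetical code, of course, not from your program):

   program oops
     implicit none
     real    :: a(10)
     integer :: i
     do i = 1, 1000000    ! runs far past the declared size of 10
       a(i) = 1.0         ! eventually writes into memory the program
     end do               ! doesn't own, and the OS kills it
     print *, a(1)
   end program oops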

It may be a bug in the code, but bugs that cause segfaults are
sometimes hard to find.
It may be loose indexing that runs beyond an array's size,
or a typo when passing an argument to a subroutine
- even to one of the MPI subroutines.
IIRC, Fortran 90
checks subprogram arguments only if you provide explicit interfaces,
and Fortran 77 doesn't check anything.
Or it may be something else.
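
A minimal sketch of the interface point (hypothetical names; putting
the subroutine in a module gives every caller an explicit interface,
so the compiler can check the arguments):

   module work_mod
   contains
     subroutine fill(a, n)
       integer, intent(in)  :: n
       real,    intent(out) :: a(n)
       a = 0.0
     end subroutine fill
   end module work_mod

   program demo
     use work_mod
     implicit none
     real :: x(10)
     call fill(x, 10)     ! OK: types and order match the interface
     ! call fill(10, x)   ! rejected at compile time; an F77-style
     !                    ! external call would compile silently and
     !                    ! could segfault at run time
   end program demo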

One possibility is to compile with -Mbounds, -fbounds-check, or a
similar flag, to check at run time whether array bounds are violated.
Another line of action is to monitor memory use (say, with "top"
on the nodes) to see if it grows beyond the available memory
while the program is running.
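
For instance (the program name "myprog" and node name "node01" are
made up; check your compiler's man page for the exact flag):

   mpif90 -g -fbounds-check myprog.f90 -o myprog   # gfortran
   mpif90 -g -Mbounds myprog.f90 -o myprog         # PGI pgf90
   mpif90 -g -check bounds myprog.f90 -o myprog    # Intel ifort

   ssh node01 top -b -d 5 -n 3 | grep myprog       # watch memory use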

If you have a serial version and the bug is not MPI-related,
it may be easier to find it there.

I hope this helps.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Xiao Bo Lu wrote:
> Thanks Rajeev and Gus,
> 
> Yes, it turns out my new system does use the ch3:sock channel. I added 
> the --with-device=ch3:nemesis flag to the configuration and recompiled 
> all my code and libraries. Now the "MPI_Barrier(MPI_COMM_WORLD) 
> failed........." errors are all gone, but the code still breaks with 
> new error messages:
> 
> rank 1 in job 1061  hpc2_15464   caused collective abort of all ranks
>  exit status of rank 1: killed by signal 11
> 
> I think I might have more than one problem with the code, possibly not 
> in the MPI routines. Thanks anyway.
> 
> Regards
> Xiao
> 
> 
> Rajeev Thakur wrote:
>> Thanks. Yes, it is worth trying with Nemesis. (Configure with
>> --with-device=ch3:nemesis).
>>
>> Rajeev
>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
>>> Sent: Tuesday, April 21, 2009 11:11 AM
>>> To: Mpich Discuss
>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>
>>> Hi Xiao Bo, Rajeev, list
>>>
>>> I read reports of MPICH2 1.0.8 with the ch3:sock channel
>>> failing with socket errors in newer (maybe not so new now)
>>> Linux kernels.
>>>
>>> The person who reported the problem
>>> had trouble before with p4 errors and MPICH1.
>>> Back then somebody else pointed out the same
>>> kind of problem with the newer kernels and MPICH1.
>>> Changing to MPICH2 1.0.8 with ch3:sock didn't fix the problem either.
>>>
>>> The fix I suggested consisted of changing to the nemesis channel
>>> (i.e. "configure --with-device=ch3:nemesis [other parameters]"),
>>> which AFAIK is not the default in MPICH2 1.0.8.
>>>
>>> Please, see this thread:
>>> http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
>>>
>>> Xiao Bo apparently was using AIX before, and this may explain
>>> why his code began failing after he migrated it to Linux.
>>> His MPICH2 1.0.8 seems to use sockets, as the error messages suggest.
>>> The problem may not be the same, and the fix may not be the same;
>>> however, it may be worth trying a nemesis build, just in case.
>>>
>>> Trust my guess or not, here are my two cents anyway. :)
>>> Good luck!
>>>
>>> Gus Correa
>>> ---------------------------------------------------------------------
>>> Gustavo Correa
>>> Lamont-Doherty Earth Observatory - Columbia University
>>> Palisades, NY, 10964-8000 - USA
>>> ---------------------------------------------------------------------
>>>
>>>
>>> Rajeev Thakur wrote:
>>>    
>>>> If the MPICH2 test suite runs (run "make testing" in the top-level 
>>>> mpich2 directory), then I don't know what the problem might be.
>>>>
>>>> Rajeev
>>>>
>>>>> -----Original Message-----
>>>>> From: Xiao Bo Lu [mailto:xiao.lu at auckland.ac.nz]
>>>>> Sent: Monday, April 20, 2009 6:27 PM
>>>>> To: Rajeev Thakur; mpich-discuss at mcs.anl.gov
>>>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>
>>>>> Hi,
>>>>>
>>>>> Yes. It is a Fortran 90 code. I did re-compile all the source code 
>>>>> and libraries with the new MPICH2. The configuration options I 
>>>>> used are:
>>>>>
>>>>> ./configure -prefix=/hpc/xlu012 CC=gcc F90=gfortran
>>>>>
>>>>> and I compiled all the files with mpif90. I also ran a few simple 
>>>>> MPI tests with mpiexec and they seem to work fine. I'm starting to 
>>>>> wonder whether it has anything to do with memory allocation, or 
>>>>> some other communication variable that blocks some of the messages 
>>>>> from a large array (??).
>>>>>
>>>>> Regards
>>>>> Xiao
>>>>>
>>>>> Rajeev Thakur wrote:
>>>>>        
>>>>>> If it's Fortran code, just make sure no mpif.h files are left 
>>>>>> around from the old implementation. Also make sure that the 
>>>>>> entire code (all files) has been recompiled with MPICH2.
>>>>>>
>>>>>> Rajeev
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
>>>>>>> Sent: Monday, April 20, 2009 5:57 PM
>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>>
>>>>>>> Hi Rajeev,
>>>>>>>
>>>>>>> Yes. The code was working, but on a different platform (an IBM 
>>>>>>> AIX system with POE). I have to move the code to the new system 
>>>>>>> since the lease on the old one just expired.
>>>>>>>
>>>>>>> Regards
>>>>>>> Xiao
>>>>>>>
>>>>>>> Rajeev Thakur wrote:
>>>>>>>                
>>>>>>>> Was this code that worked earlier?
>>>>>>>>
>>>>>>>> Rajeev
>>>>>>>>
>>>>>>>>                      
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
>>>>>>>>> Sent: Monday, April 20, 2009 12:51 AM
>>>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>>>> Subject: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I've recently installed MPICH2-1.0.8 on my local machine 
>>>>>>>>> (x86_64 Linux, gfortran 4.1.2) and I am now experiencing 
>>>>>>>>> errors with my MPI code.
>>>>>>>>> The error messages are:
>>>>>>>>>
>>>>>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
>>>>>>>>> MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>>>> MPIR_Barrier(77)..........................:
>>>>>>>>> MPIC_Sendrecv(126)........................:
>>>>>>>>> MPIC_Wait(270)............................:
>>>>>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred 
>>>>>>>>> while handling an event returned by MPIDU_Sock_Wait()
>>>>>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
>>>>>>>>> MPIDU_Socki_handle_read(637)..............: connection failure 
>>>>>>>>> (set=0,sock=1,errno=104:Connection reset by peer)[cli_0]:
>>>>>>>>> aborting job:
>>>>>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
>>>>>>>>> MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>>>> MPIR_Barrier(77)..........................:
>>>>>>>>> MPIC_Sendrecv(126)........................:
>>>>>>>>> MPIC_Wait(270)............................:
>>>>>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred 
>>>>>>>>> while handling an event returned by MPIDU_Sock_Wait()
>>>>>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
>>>>>>>>> MPIDU_Socki_handle_read
>>>>>>>>>
>>>>>>>>> size of processor is: 4
>>>>>>>>>
>>>>>>>>> I searched the net and found quite a few links about such 
>>>>>>>>> errors, but none of the posts could give a definitive fix. Do 
>>>>>>>>> some of you know what could cause this error (e.g. incorrect 
>>>>>>>>> installation; environment settings..) and how to fix it?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Xiao
>>>>>>>>>


