[mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
Xiao Bo Lu
xiao.lu at auckland.ac.nz
Tue Apr 21 20:48:15 CDT 2009
Thanks Rajeev and Gus,
Yes, it turns out my new system does use the ch3:sock channel. I added the
--with-device=ch3:nemesis flag to the configuration and recompiled all my
code and libraries. Now the "MPI_Barrier(MPI_COMM_WORLD) failed" errors
are all gone, but the code is still broken, with new error messages:
rank 1 in job 1061 hpc2_15464 caused collective abort of all ranks
exit status of rank 1: killed by signal 11
I think I might have more than one problem with the code, possibly in
non-MPI routines. Thanks anyway.
Regards
Xiao
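
Note: the error stack quoted further down already hints at the channel in
use, since the failing routines are named MPIDU_Sock_Wait and
MPIDU_Socki_handle_read. As a rough sketch only (the function name
summarize_error_stack and the regular expression are illustrative
assumptions, not part of MPICH2), a saved error dump could be checked for
those names and for the errno field like this:

```python
import re

# Assumed pattern for the errno field as it appears in the quoted stack,
# e.g. "errno=104:Connection reset by peer)".
ERRNO_RE = re.compile(r"errno=(\d+):([^)\n]+)")


def summarize_error_stack(text):
    """Extract two hints from an error stack dump (hypothetical helper).

    uses_sock: True if sock-channel routine names (MPIDU_Sock...) appear.
    errno:     (number, message) from the first errno field, or None.
    """
    uses_sock = "MPIDU_Sock" in text
    match = ERRNO_RE.search(text)
    errno_info = (int(match.group(1)), match.group(2).strip()) if match else None
    return {"uses_sock": uses_sock, "errno": errno_info}


# Lines taken from the error stack in the original post, with wrapped lines joined.
SAMPLE = """\
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)
"""

if __name__ == "__main__":
    print(summarize_error_stack(SAMPLE))
```

A hit on the MPIDU_Sock names only suggests a ch3:sock build; it says
nothing about why the peer reset the connection.
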
Rajeev Thakur wrote:
> Thanks. Yes, it is worth trying with Nemesis. (Configure with
> --with-device=ch3:nemesis).
>
> Rajeev
>
>
>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
>> Sent: Tuesday, April 21, 2009 11:11 AM
>> To: Mpich Discuss
>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>
>> Hi Xiao Bo, Rajeev, list
>>
>> I read reports of MPICH2 1.0.8 with the ch3:sock channel
>> failing with socket errors in newer (maybe not so new now)
>> Linux kernels.
>>
>> The person that reported the problem
>> had trouble before with p4 errors and MPICH1.
>> Back then somebody else pointed out the same
>> kind of problems with the newer kernels and MPICH1.
>> Changing to MPICH2 1.0.8 with ch3:sock didn't fix the problem either.
>>
>> The fix I suggested consisted in changing to the nemesis channel
>> (i.e. "configure --with-device=ch3:nemesis [other parameters]"),
>> which AFAIK is not the default in MPICH2 1.0.8.
>>
>> Please, see this thread:
>> http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
>>
>> Xiao Bo apparently was using AIX before, and this may explain
>> why his code began failing after he migrated it to Linux.
>> His MPICH2 1.0.8 seems to use sockets, as the error messages suggest.
>> The problem may not be the same, and the fix may not be the same,
>> however it may be worth trying a nemesis build, just in case.
>>
>> Trust my guess or not, here are my two cents anyway. :)
>> Good luck!
>>
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>> Rajeev Thakur wrote:
>>
>>> If the MPICH2 test suite runs (run make testing in top-level mpich2
>>> directory), then I don't know what the problem might be.
>>>
>>> Rajeev
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Xiao Bo Lu [mailto:xiao.lu at auckland.ac.nz]
>>>> Sent: Monday, April 20, 2009 6:27 PM
>>>> To: Rajeev Thakur; mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>
>>>> Hi,
>>>>
>>>> Yes. It is a Fortran 90 code. I did re-compile all the source
>>>> code and libraries with the new MPICH2. The configuration
>>>> options I used were:
>>>>
>>>> ./configure -prefix=/hpc/xlu012 CC=gcc F90=gfortran
>>>>
>>>> and I compiled all the files with mpif90. I also ran
>>>> a few simple MPI tests with mpiexec and they seem to work
>>>> fine. I'm starting to wonder if it has anything to do with
>>>> memory allocation or some other communication variables
>>>> that block some of the messages from a large array(??).
>>>>
>>>> Regards
>>>> Xiao
>>>>
>>>> Rajeev Thakur wrote:
>>>>
>>>>> If it's Fortran code, just make sure no mpif.h files are left around
>>>>> from the old implementation. Also make sure that the entire code (all
>>>>> files) have been recompiled with MPICH2.
>>>>>
>>>>> Rajeev
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
>>>>>> Sent: Monday, April 20, 2009 5:57 PM
>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>> Subject: Re: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>
>>>>>> Hi Rajeev,
>>>>>>
>>>>>> Yes. The code was working but on a different platform (IBM-aix system
>>>>>> with POE). I have to move the code to the new system since the lease
>>>>>> on the old one just expired.
>>>>>>
>>>>>> Regards
>>>>>> Xiao
>>>>>>
>>>>>> Rajeev Thakur wrote:
>>>>>>
>>>>>>
>>>>>>> Was this code that worked earlier?
>>>>>>>
>>>>>>> Rajeev
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Xiao Bo Lu
>>>>>>>> Sent: Monday, April 20, 2009 12:51 AM
>>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>>> Subject: [mpich-discuss] MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I've recently installed MPICH2-1.0.8 on my local machine
>>>>>>>> (x86_64 Linux, gfortran 4.1.2) and I am now experiencing errors
>>>>>>>> with my mpi code. The error messages are:
>>>>>>>>
>>>>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
>>>>>>>> MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>>> MPIR_Barrier(77)..........................:
>>>>>>>> MPIC_Sendrecv(126)........................:
>>>>>>>> MPIC_Wait(270)............................:
>>>>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>>>>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
>>>>>>>> MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
>>>>>>>> [cli_0]: aborting job:
>>>>>>>> Fatal error in MPI_Barrier: Other MPI error, error stack:
>>>>>>>> MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
>>>>>>>> MPIR_Barrier(77)..........................:
>>>>>>>> MPIC_Sendrecv(126)........................:
>>>>>>>> MPIC_Wait(270)............................:
>>>>>>>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>>>>>>>> MPIDI_CH3I_Progress_handle_sock_event(420):
>>>>>>>> MPIDU_Socki_handle_read size of processor is: 4
>>>>>>>>
>>>>>>>> I searched the net and found quite a few links about this error, but
>>>>>>>> none of the posts gave a definitive fix. Does anyone know
>>>>>>>> what could cause this error (e.g. incorrect installation,
>>>>>>>> environment settings) and how to fix it?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Xiao
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>
>
>