[mpich-discuss] Fatal error in MPI_Barrier

Antonio José Gallardo Díaz ajcampa at hotmail.com
Tue Feb 3 03:49:29 CST 2009


First of all, thanks for your help.

These are the contents of my files:

File /etc/mpd.conf:
/*************************************************************************************/
#! /bin/sh
#
# This file contains configuration information for mpicc.  This is
# essentially just the variable-initialization part of mpicc.
# --------------------------------------------------------------------------
# Set the default values of all variables.
#
# Directory locations: Fixed for any MPI implementation.
# Set from the directory arguments to configure (e.g., --prefix=/usr/local)
prefix=/usr/local
exec_prefix=${prefix}
sysconfdir=${prefix}/etc
includedir=${prefix}/include
libdir=${exec_prefix}/lib
#
# Default settings for compiler, flags, and libraries.
# Determined by a combination of environment variables and tests within
# configure (e.g., determining whether -lsocket is needed)
CC="gcc"
WRAPPER_CFLAGS=""
WRAPPER_LDFLAGS="  "
MPILIBNAME="mpich"
PMPILIBNAME="pmpich"
MPI_OTHERLIBS="-lpthread      -lrt   "
NEEDSPLIB="no"
# MPIVERSION is the version of the MPICH2 library that mpicc is intended for
MPIVERSION="1.0.8"
/*************************************************************************************/


File /etc/mpd.hosts:
/*************************************************************************************/
master ifhn=192.168.1.1
slave ifhn=192.168.1.2

/*************************************************************************************/



File .mpd.conf:

/*************************************************************************************/

MPD_SECRETWORD=hola

/*************************************************************************************/



File .mpd.hosts:


/*************************************************************************************/
master ifhn=192.168.1.1
slave ifhn=192.168.1.2

/*************************************************************************************/
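
As an illustration of the host-file entries listed above, here is a minimal Python sketch that parses lines of the form "hostname ifhn=address" and checks that each ifhn value is a well-formed IP address that appears only once. The function names parse_mpd_hosts and find_problems, the comment-skipping rule, and the checks themselves are assumptions made for this example; they are not part of MPICH2 or mpd, and mpd may accept entries (such as a hostname after ifhn=) that this sketch would flag.

```python
import ipaddress


def parse_mpd_hosts(text):
    """Parse mpd.hosts-style lines into (hostname, ifhn) pairs.

    Hypothetical helper for illustration; mpd does its own parsing.
    Blank lines and lines starting with '#' are skipped by this sketch.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        host = fields[0]
        ifhn = None
        for field in fields[1:]:
            if field.startswith("ifhn="):
                ifhn = field[len("ifhn="):]
        entries.append((host, ifhn))
    return entries


def find_problems(entries):
    """Return a list of human-readable problems found in the entries."""
    problems = []
    seen = set()
    for host, ifhn in entries:
        if ifhn is None:
            continue
        try:
            ipaddress.ip_address(ifhn)
        except ValueError:
            problems.append(f"{host}: ifhn={ifhn} is not a valid IP address")
            continue
        if ifhn in seen:
            problems.append(f"{host}: ifhn={ifhn} is used more than once")
        seen.add(ifhn)
    return problems


# The two entries shown in the files above.
hosts_text = """\
master ifhn=192.168.1.1
slave ifhn=192.168.1.2
"""
entries = parse_mpd_hosts(hosts_text)
print(entries)
print(find_problems(entries))
```

For the two entries in this message the sketch reports no problems, so the file format itself does not look like the cause of the error.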


I use the following command to start the cluster:

mpdboot --totalnum=2 --ifhn=192.168.1.1 -f .mpd.hosts


and to run my job I use:

mpiexec -recvtimeout 30 -n 2 ./Proyecto/debug/src/proyecto2




Do you need anything else? Thanks.

From: thakur at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 17:30:36 -0600
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier










What parameters did you pass to "configure" when you built MPICH2?


  
  
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Antonio José Gallardo Díaz
Sent: Monday, February 02, 2009 5:29 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier


I only have two nodes.

Node 1 --> name: master --> hostname: wireless
Node 2 --> name: slave --> hostname: wireless2

To start the cluster I use the command "mpdboot".

For example, I can see the IDs of the two nodes. In my job, when I call
MPI_Comm_rank(...) I get the rank of each node; however, if I call
MPI_Send(...) or MPI_Recv(...), my job exits and shows me an error.
If I run "mpiexec -l -n 2 hostname", I get:
0: wireless
1: wireless2

I don't know whether this answers your question.

Thanks.



  
  
From: thakur at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 15:52:52 -0600
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier



  

The error message "unable to find the process group structure with id <>"
is odd. How exactly did you configure MPICH2? Were you able to set up an
MPD ring on the two nodes successfully?

Rajeev

  
    
    
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Antonio José Gallardo Díaz
Sent: Monday, February 02, 2009 12:39 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier


Hello. I tried running the command:

mpiexec -recvtimeout 30 -n 2 /home/mpi/mpich2-1.0.8/examples/cpi

and this is the result:

/*************************************************************************************/
Process 0 of 2 is on wireless
Process 1 of 2 is on wireless2
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x7ffff732586c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(230)...........................:
MPIC_Send(39).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x7ffff732586c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(230)...........................:
MPIC_Send(39).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPFatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................: MPI_Bcast(buf=0xbf82bec8, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <>[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................: MPI_Bcast(buf=0xbf82bec8, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event rIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
eturned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <>
rank 1 in job 21  wireless_47695   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 21  wireless_47695   caused collective abort of all ranks
  exit status of rank 0: return code 1
/*************************************************************************************/

mpdcheck reported a problem with the first IP address, but that is now solved.
I tested:

mpdcheck -s on one node and mpdcheck -c "name" "number" on the other node --> OK.
mpiexec -n 1 /bin/hostname --> OK.
mpiexec -l -n 4 /bin/hostname --> OK.
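
The mpdcheck -s / mpdcheck -c pair above tests whether one node can open a connection to the other. The following Python sketch shows the same idea in miniature, without assuming anything about how mpdcheck is implemented: a server listens on a port, and a client connects and exchanges one message. The function names run_server_once and check_connection are hypothetical. Both ends run on the loopback interface here so that the example is self-contained; a real check would run the two halves on different nodes.

```python
import socket
import threading


def run_server_once(server_sock):
    """Accept one connection and echo back whatever the client sends."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def check_connection(host, port, timeout=5.0):
    """Return True if a TCP round trip to (host, port) succeeds."""
    message = b"hello"
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(message)
            return sock.recv(1024) == message
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


# Demo on the loopback interface; port 0 lets the OS choose a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

thread = threading.Thread(target=run_server_once, args=(server,))
thread.start()
ok = check_connection("127.0.0.1", port)
thread.join()
server.close()
print("connection ok:", ok)
```

A successful round trip only shows that this one connection works. It does not by itself show that both nodes resolve each other's hostnames consistently.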

I should add that I have to pass the option -recvtimeout 30 to every
command; otherwise I have problems. Without this option, I get:

mpiexec_wireless (mpiexec 392): no msg recvd from mpd when expecting ack of request


What can I do? Please help, and sorry for my poor English.




    
From: ajcampa at hotmail.com
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 18:17:39 +0100
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier


    
Well, thanks for your answer. Actually, "Wireless" is just the name of my
PC, and the other PC is named "Wireless2". I use the same user, "mpi", on
both PCs.

I will try the mpdcheck utility and then write back.

Thanks for everything.

Greetings from Spain.


    
From: thakur at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 10:55:03 -0600
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier


    

Are you really trying to use the wireless network? Looks like that's what
is getting used.

You can use the mpdcheck utility to diagnose network configuration
problems. See Appendix A.2 of the installation guide.

Rajeev
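
One kind of network configuration problem that can be checked by hand is how each node's hostname resolves. The following Python sketch is only an illustration and is not necessarily what mpdcheck does internally: it flags hostnames that resolve to a loopback address. The helper names is_loopback and loopback_hosts are hypothetical, and the resolver is passed in as a parameter so that the example runs without touching the network; on a real node socket.gethostbyname (the default) would be used. The sample mapping of wireless to 127.0.1.1 is an assumption for the demo, not something reported in this thread.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the address string is a loopback address (e.g. 127.0.0.0/8)."""
    return ipaddress.ip_address(address).is_loopback


def loopback_hosts(hostnames, resolve=socket.gethostbyname):
    """Return (name, address) pairs for names that resolve to loopback.

    Names whose lookup raises OSError are reported too, with address None.
    """
    flagged = []
    for name in hostnames:
        try:
            address = resolve(name)
        except OSError:
            flagged.append((name, None))
            continue
        if is_loopback(address):
            flagged.append((name, address))
    return flagged


# Demo with a made-up resolver table (assumed values, not taken from the thread).
fake_table = {"wireless": "127.0.1.1", "wireless2": "192.168.1.2"}
print(loopback_hosts(["wireless", "wireless2"], resolve=fake_table.__getitem__))
```

If a node's own hostname resolved to a loopback address, other nodes could be handed an address they cannot connect back to, which is one possible (unconfirmed) explanation for a job that starts on both nodes but fails on the first message.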

    
      
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Antonio José Gallardo Díaz
Sent: Monday, February 02, 2009 9:49 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] Fatal error in MPI_Barrier


Hello, I get this error when I run my jobs that use MPI.


Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406).............................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77).............................:
MPIC_Sendrecv(123)...........................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <��oz�>[cli_1]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406).............................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77).............................:
MPIC_Sendrecv(123)...........................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <��oz�>
rank 1 in job 15  wireless_43226   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

I have two PCs running Linux (Kubuntu 8.10), and I built a cluster from
these machines. When I run, for example, the command
"mpiexec -l -n 2 hostname", everything looks fine, but when I try to send
or receive anything I always get the same error. I don't know why.
Please, I need a hand. Thanks for everything.
      


      