[mpich-discuss] Fatal error in MPI_Barrier

Rajeev Thakur thakur at mcs.anl.gov
Mon Feb 2 17:30:36 CST 2009


What parameters did you pass to "configure" when you built MPICH2?
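(For reference, a typical MPD-based MPICH2 build looks like the sketch below; the install prefix is illustrative:

./configure --prefix=/home/mpi/mpich2-install --with-pm=mpd
make
make install
)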


  _____  

From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Antonio José Gallardo Díaz
Sent: Monday, February 02, 2009 5:29 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier


I only have two nodes.

Node 1 --> name: master --> hostname: wireless
Node 2 --> name: slave  --> hostname: wireless2

To bring the cluster up I use the "mpdboot" command.
 
For example, I can see the IDs of the two nodes: in my job I call MPI_Comm_rank(...) and it returns the rank numbers correctly. However, if I use MPI_Send(...) or MPI_Recv(...), my job exits the application and shows an error.
If I run "mpiexec -l -n 2 hostname", I get:
0: wireless
1: wireless2
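
A minimal sketch of the kind of two-process send/receive test that fails this way (the tag and payload value below are only illustrative):

/* pingtest.c -- compile with "mpicc pingtest.c -o pingtest",
   run with "mpiexec -n 2 ./pingtest" */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;  /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("rank 0 sent %d\n", value);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}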
 
I am not sure whether that answers your question.
 
Thanks.



  _____  


From: thakur at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 15:52:52 -0600
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier



The error message "unable to find the process group structure with id <>" is odd. How exactly did you configure MPICH2? Were you able to set up an MPD ring on the two nodes successfully?
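
As a sanity check, the ring can be verified with the standard MPD tools (the host file name below is illustrative):

mpdboot -n 2 -f mpd.hosts    # mpd.hosts lists "wireless" and "wireless2", one per line
mpdtrace -l                  # should list both machines in the ring
mpdringtest 10               # times a message going 10 loops around the ring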
 
Rajeev


  _____  

From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Antonio José Gallardo Díaz
Sent: Monday, February 02, 2009 12:39 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier


Hello. I have tried running the command:

mpiexec -recvtimeout 30 -n 2 /home/mpi/mpich2-1.0.8/examples/cpi

and this is the result.

/********************************************************************************************************************************************************/                                                   
Process 0 of 2 is on wireless                                                                                                         
Process 1 of 2 is on wireless2                                                                                                        
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x7ffff732586c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(230)...........................:
MPIC_Send(39).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
[cli_0]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x7ffff732586c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(230)...........................:
MPIC_Send(39).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................: MPI_Bcast(buf=0xbf82bec8, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <>
[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................: MPI_Bcast(buf=0xbf82bec8, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <>
rank 1 in job 21  wireless_47695   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 21  wireless_47695   caused collective abort of all ranks
  exit status of rank 0: return code 1

/********************************************************************************************************************************************************/

mpdcheck reported a problem with the first IP, but that is solved now.
I tested:

mpdcheck -s   (and on the other node)   mpdcheck -c "name" "number"   -->  OK
mpiexec -n 1 /bin/hostname    -->  OK
mpiexec -l -n 4 /bin/hostname -->  OK
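
For reference, the mpdcheck handshake looks roughly like this (the hostname and port are illustrative, and the exact wording of the messages may differ):

wireless$  mpdcheck -s
server listening at INADDR_ANY on: wireless 52917

wireless2$ mpdcheck -c wireless 52917
client successfully recvd ack from server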

I should add that with every command I have to pass the option -recvtimeout 30, because otherwise I have problems. Without this option it tells me:

mpiexec_wireless (mpiexec 392): no msg recvd from mpd when expecting ack of request


What can I do? Please help, and sorry for my poor English.
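
(A common cause of this symptom on Kubuntu, worth checking given that mpdcheck already complained about an IP: the installer maps the machine's hostname to a loopback address in /etc/hosts, so each mpd advertises an address the other node cannot reach. Each hostname should resolve to its real LAN address; the addresses below are illustrative:

127.0.0.1      localhost
192.168.1.10   wireless
192.168.1.11   wireless2
)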




  _____  

From: ajcampa at hotmail.com
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 18:17:39 +0100
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier

Well, thanks for your answer. Actually, the name of my PC is "wireless" and the other PC is "wireless2"; I use the same user, "mpi", on both PCs.

I will try the mpdcheck utility and then write back.

Thanks for everything.

Greetings from Spain.


  _____  

From: thakur at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 10:55:03 -0600
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier


Are you really trying to use the wireless network? Looks like that's what is getting used.
 
You can use the mpdcheck utility to diagnose network configuration problems. See Appendix A.2 of the installation guide.
 
Rajeev


  _____  

From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Antonio José Gallardo Díaz
Sent: Monday, February 02, 2009 9:49 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] Fatal error in MPI_Barrier


Hello, I get this error when I run my jobs that use MPI.


Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406).............................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77).............................:
MPIC_Sendrecv(123)...........................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <��oz�>
[cli_1]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406).............................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77).............................:
MPIC_Sendrecv(123)...........................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <��oz�>
rank 1 in job 15  wireless_43226   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

I have two PCs running Linux (Kubuntu 8.10), and I have made a cluster out of these machines. When I use, for example, the command "mpiexec -l -n 2 hostname", I can see that everything is all right, but when I try to send or receive something I get the same error. I don't know why. Please lend me a hand. Thanks for everything.

