[mpich-discuss] Fatal error in MPI_Barrier
Antonio José Gallardo Díaz
ajcampa at hotmail.com
Mon Feb 2 12:39:05 CST 2009
Hello. I have tried running the command:
mpiexec -recvtimeout 30 -n 2 /home/mpi/mpich2-1.0.8/examples/cpi
and this is the result.
/********************************************************************************************************************************************************/
Process 0 of 2 is on wireless
Process 1 of 2 is on wireless2
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x7ffff732586c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(230)...........................:
MPIC_Send(39).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x7ffff732586c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(230)...........................:
MPIC_Send(39).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................: MPI_Bcast(buf=0xbf82bec8, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <>[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................: MPI_Bcast(buf=0xbf82bec8, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <>
rank 1 in job 21 wireless_47695 caused collective abort of all ranks
exit status of rank 1: return code 1
rank 0 in job 21 wireless_47695 caused collective abort of all ranks
exit status of rank 0: return code 1
/********************************************************************************************************************************************************/
mpdcheck reported a problem with the first IP, but that has been solved.
I tested:
mpdcheck -s (and on the other node: mpdcheck -c "name" "number") -> OK.
mpiexec -n 1 /bin/hostname -> OK.
mpiexec -l -n 4 /bin/hostname -> OK.
I should mention that with every command I have to pass the option -recvtimeout 30, because otherwise I have problems. Without this option, it tells me:
mpiexec_wireless (mpiexec 392): no msg recvd from mpd when expecting ack of request
What can I do? Please help, and sorry for my poor English.
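For what it's worth, the "unable to find the process group structure" and "Connection reset by peer" failures often trace back to hostname resolution: on Kubuntu/Ubuntu systems, /etc/hosts by default maps the machine's own hostname to the loopback alias 127.0.1.1, so each MPI process advertises an address the other node cannot connect to. Below is a sketch of a corrected /etc/hosts for each node, assuming hypothetical LAN addresses 192.168.1.10 and 192.168.1.11 (substitute the real ones):

```
127.0.0.1     localhost
# 127.0.1.1   wireless       <- remove or comment out the loopback alias
192.168.1.10  wireless
192.168.1.11  wireless2
```

Appendix A of the MPICH2 installation guide (the mpdcheck troubleshooting section) discusses this class of configuration problem.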
From: ajcampa at hotmail.com
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 18:17:39 +0100
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier
Well, thanks for your answer. The name of my PC really is "wireless", and the other PC is "wireless2"; I use the same user, "mpi", on both machines.
I will try the mpdcheck utility and then write back.
Thanks for everything.
Greetings from Spain.
From: thakur at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
Date: Mon, 2 Feb 2009 10:55:03 -0600
Subject: Re: [mpich-discuss] Fatal error in MPI_Barrier
Are you really trying to use the wireless network? Looks like
that's what is getting used.
You can use the mpdcheck utility to diagnose
network configuration problems. See Appendix A.2 of the installation
guide.
Rajeev
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Antonio José Gallardo Díaz
Sent: Monday, February 02, 2009 9:49 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] Fatal error in MPI_Barrier
Hello, this error appears when I run my jobs that use MPI.
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406).............................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77).............................:
MPIC_Sendrecv(123)...........................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <��oz�>[cli_1]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406).............................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77).............................:
MPIC_Sendrecv(123)...........................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the process group structure with id <��oz�>
rank 1 in job 15 wireless_43226 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
I have two PCs running Linux (Kubuntu 8.10), and I have built a cluster from these machines. When I use, for example, the command "mpiexec -l -n 2 hostname", everything looks fine, but whenever I try to send or receive anything I get the error above. I don't know why. Please lend me a hand. Thanks for everything.
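One thing worth noting: "mpiexec -l -n 2 hostname" never calls into MPI at all, so it can succeed even when MPI communication between the nodes is broken. A minimal sketch of a two-rank test that does open MPI connections (essentially what the bundled cpi example exercises), assuming an MPICH2 mpicc is on the path:

```c
/* barrier_test.c -- build with: mpicc barrier_test.c -o barrier_test
 * run with:       mpiexec -n 2 ./barrier_test
 * Unlike running plain hostname under mpiexec, this actually opens MPI
 * connections between the nodes, so it reproduces the failure when the
 * network configuration is broken. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);
    printf("Process %d of %d is on %s\n", rank, size, name);

    MPI_Barrier(MPI_COMM_WORLD);   /* the call that aborts in the log above */
    if (rank == 0)
        printf("Barrier completed on %d ranks\n", size);

    MPI_Finalize();
    return 0;
}
```

If this aborts in MPI_Barrier while the hostname runs succeed, the problem is in the network/MPD configuration rather than the application code.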