[mpich-discuss] Problem with MPI_Bcast
Hisham Adel
hosham2004 at yahoo.com
Tue Dec 14 07:53:40 CST 2010
Hi All,
I have installed the new MPICH2 version "1.3.1" with this configuration:
./configure --without-mpe --disable-f77 --disable-fc
After the installation, I started running some old MPI programs I had written
before.
All of these programs hang when the number of cores is > 20.
They hang at an MPI_Bcast call.
So I took the "Hello_world" example and executed it, and it works well. I then
modified it by adding a simple MPI_Bcast call, and the program starts to hang
when the number of cores is > 20.
I have also tried the new installation with the "cpi" example included in the
package, and it also hangs when the number of nodes is > 20.
Do you have any ideas about that?
Here is the "Hello World" example:
#include <stdio.h>
#include "mpi.h"
#include <string.h>

int main(int argc, char **argv)
{
    int my_rank;
    int source;
    int dest;
    int p, len;
    int tag = 50;
    char message[100];
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* The added broadcast: rank 0 sets x and broadcasts it to all ranks. */
    int x = 0;
    if (my_rank == 0)
    {
        x = 923;
    }
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("\nI %d got %d from node 0\n", my_rank, x);

    if (my_rank != 0) {
        /* Every non-root rank sends a greeting to rank 0. */
        MPI_Get_processor_name(name, &len);
        sprintf(message, "Greetings from process %d, I am %s !",
                my_rank, name);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag,
                 MPI_COMM_WORLD);
    } else {
        /* Rank 0 receives and prints the greeting from every other rank. */
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }

    MPI_Finalize();
    return 0;
}
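The example is compiled with the MPICH2 compiler wrapper from the same 1.3.1 installation (the source file name below is just illustrative):
# build with the MPICH2 wrapper so the 1.3.1 headers and libraries are used
mpicc -o mpi-Hello.exe mpi-Hello.c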
Here is the error message I got when I ran the "Hello World" example:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff463d2ad4, count=1,
MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................:
MPIR_Bcast_intra(990).................:
MPIR_Bcast_scatter_ring_allgather(840):
MPIR_Bcast_binomial(187)..............:
MPIC_Send(66).........................:
MPIC_Wait(528)........................:
MPIDI_CH3I_Progress(335)..............:
MPID_nem_mpich2_blocking_recv(906)....:
MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 20:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff8c374d84, count=1,
MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................:
MPIR_Bcast_intra(990).................:
MPIR_Bcast_scatter_ring_allgather(840):
MPIR_Bcast_binomial(187)..............:
MPIC_Send(66).........................:
MPIC_Wait(528)........................:
MPIDI_CH3I_Progress(335)..............:
MPID_nem_mpich2_blocking_recv(906)....:
MPID_nem_tcp_connpoll(1843)...........:
state_commrdy_handler(1674)...........:
MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 16
MPID_nem_tcp_recv_handler(1554).......: socket closed
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
Here is the error message I got when I ran the "cpi" example:
Process 1 of 22 is on node00
Process 0 of 22 is on node00
Process 4 of 22 is on node02
Process 5 of 22 is on node02
Process 6 of 22 is on node03
Process 7 of 22 is on node03
Process 20 of 22 is on node10
Process 21 of 22 is on node10
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff44bcfd3c, count=1,
MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................:
MPIR_Bcast_intra(990).................:
MPIR_Bcast_scatter_ring_allgather(840):
MPIR_Bcast_binomial(187)..............:
MPIC_Send(66).........................:
MPIC_Wait(528)........................:
MPIDI_CH3I_Progress(335)..............:
MPID_nem_mpich2_blocking_recv(906)....:
MPID_nem_tcp_connpoll(1843)...........:
state_commrdy_handler(1674)...........:
MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 16
MPID_nem_tcp_recv_handler(1554).......: socket closed
Process 2 of 22 is on node01
Process 3 of 22 is on node01
[proxy:0:2 at node02] HYDT_dmxu_poll_wait_for_event
(/home/k/mpich2-1.3.1/src/pm/hydra/tools/demux/demux_poll.c:70): assert
(!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:2 at node02] main
(/home/k/mpich2-1.3.1/src/pm/hydra/pm/pmiserv/pmip.c:225): demux engine error
waiting for event
Process 8 of 22 is on node04
Process 9 of 22 is on node04
Process 18 of 22 is on node09
Process 19 of 22 is on node09
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7ffff9d75dec, count=1,
MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................:
MPIR_Bcast_intra(990).................:
MPIR_Bcast_scatter_ring_allgather(840):
MPIR_Bcast_binomial(157)..............:
MPIC_Recv(108)........................:
MPIC_Wait(528)........................:
MPIDI_CH3I_Progress(335)..............:
MPID_nem_mpich2_blocking_recv(906)....:
MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 0:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff9645255c, count=1,
MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................:
MPIR_Bcast_intra(990).................:
MPIR_Bcast_scatter_ring_allgather(840):
MPIR_Bcast_binomial(187)..............:
MPIC_Send(66).........................:
MPIC_Wait(528)........................:
MPIDI_CH3I_Progress(335)..............:
MPID_nem_mpich2_blocking_recv(906)....:
MPID_nem_tcp_connpoll(1843)...........:
state_commrdy_handler(1674)...........:
MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 0
MPID_nem_tcp_recv_handler(1554).......: socket closed
Process 16 of 22 is on node08
Process 17 of 22 is on node08
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff02102e6c, count=1,
MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................:
MPIR_Bcast_intra(990).................:
MPIR_Bcast_scatter_ring_allgather(840):
MPIR_Bcast_binomial(187)..............:
MPIC_Send(66).........................:
MPIC_Wait(528)........................:
MPIDI_CH3I_Progress(335)..............:
MPID_nem_mpich2_blocking_recv(906)....:
MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 20:
Process 12 of 22 is on node06
Process 13 of 22 is on node06
Process 14 of 22 is on node07
Process 15 of 22 is on node07
[mpiexec at node00] HYDT_bscu_wait_for_completion
(/home/k/mpich2-1.3.1/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:99): one of
the processes terminated badly; aborting
[mpiexec at node00] HYDT_bsci_wait_for_completion
(/home/k/mpich2-1.3.1/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:18):
bootstrap device returned error waiting for completion
[mpiexec at node00] HYD_pmci_wait_for_completion
(/home/k/mpich2-1.3.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:352): bootstrap
server returned error waiting for completion
[mpiexec at node00] main
(/home/k/mpich2-1.3.1/src/pm/hydra/ui/mpich/mpiexec.c:302): process manager
error waiting for completion
Here are the commands I used to run it:
> mpiexec -f hosts -n 22 ./mpi-Hello.exe
> mpiexec.hydra -f hosts -n 22 ./mpi-Hello.exe
When the number of cores is 20, the program runs fine.
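For reference, the working 20-core launch is the same command with only the -n value changed:
> mpiexec -f hosts -n 20 ./mpi-Hello.exe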
Here is also the "hosts" file:
node00:2
node01:2
node02:2
node03:2
node04:2
node05:2
node06:2
node07:2
node08:2
node09:2
node10:2