<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:arial, helvetica, sans-serif;font-size:10pt"><div style="color: black; font-family: arial, helvetica, sans-serif; font-size: 10pt; "></div><div><br><font class="Apple-style-span" face="arial, helvetica, sans-serif" size="2"> Thanks for your fast reply. The program runs fine after I removed "node10" from the hostfile and increased the number of processes.</font></div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif" size="2">Now I don't know what the problem with "node10" is. It runs the same Linux version, has the same configuration, and is on the same network. </font></div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif" size="2"><br></font></div><div><font class="Apple-style-span" face="arial, helvetica, sans-serif" size="2">Do you have any ideas?</font></div><div style="color: black; font-family: arial,
helvetica, sans-serif; font-size: 10pt; "><br></div><div style="font-family: arial, helvetica, sans-serif; font-size: 10pt; color: black; "><br><div style="font-family:arial, helvetica, sans-serif;font-size:13px"><font size="2" face="Tahoma"><hr size="1"><b><span style="font-weight: bold;">From:</span></b> Pavan Balaji <balaji@mcs.anl.gov><br><b><span style="font-weight: bold;">To:</span></b> Hisham Adel <hosham2004@yahoo.com><br><b><span style="font-weight: bold;">Cc:</span></b> MPI <mpich2-dev@mcs.anl.gov>; MPI_questions <mpich-discuss@mcs.anl.gov><br><b><span style="font-weight: bold;">Sent:</span></b> Tue, December 14, 2010 2:57:35 PM<br><b><span style="font-weight: bold;">Subject:</span></b> Re: Problem with MPI_Bcast<br></font><br><br>My guess is that there is something wrong with node10. Can you try <br>removing node10 from your hostfile and running your test program with more <br>than 20 processes?<br><br> --
Pavan<br><br>On 12/14/2010 07:53 AM, Hisham Adel wrote:<br>> Hi All,<br>><br>> I have installed the new MPICH2 version "1.3.1" with this configuration:<br>><br>> *./configure --without-mpe --disable-f77 --disable-fc*<br>><br>> After the installation, I started running some old MPI programs I had<br>> written before.<br>> All of them hang when the number of cores is greater than 20, at the point<br>> where they call *MPI_Bcast*.<br>><br>> So I took the "Hello World" example and ran it, and it works well.<br>> I then modified it to add a simple *MPI_Bcast* call, and the program<br>> started to hang when the number of cores is greater than 20.<br>><br>> I also tried the new installation with the "*cpi*" example included<br>> in the package, and it hangs when the number of cores is greater than 20.<br>><br>> Do you have any ideas about that?<br>><br>><br>>
Here is the "Hello World" example:<br>><br>> #include <stdio.h><br>> #include "mpi.h"<br>> #include <string.h><br>><br>> int main(int argc, char **argv)<br>> {<br>> int my_rank;<br>> int source;<br>> int dest;<br>> int p,len;<br>> int tag = 50;<br>> char message[100];<br>> char name[MPI_MAX_PROCESSOR_NAME];<br>> MPI_Status status;<br>><br>> MPI_Init(&argc, &argv);<br>> MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);<br>> MPI_Comm_size(MPI_COMM_WORLD, &p);<br>> int x=0;<br>> if(my_rank==0)<br>> {<br>> x=923;<br>> }<br>> MPI_Bcast(&x,1,MPI_INT,0,MPI_COMM_WORLD);<br>> printf("\nI %d got %d from node 0\n",my_rank,x);<br>> if (my_rank != 0) {<br>> MPI_Get_processor_name(name, &len);<br>> snprintf(message, sizeof message, "Greetings from process %d, I am %s !", my_rank, name);<br>> dest = 0;<br>> MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag,<br>> MPI_COMM_WORLD);<br>> } else {<br>> for (source = 1; source < p; source++) {<br>> MPI_Recv(message, 100, MPI_CHAR, source, tag,<br>> MPI_COMM_WORLD, &status);<br>> printf("%s\n", message);<br>> }<br>> }<br>> MPI_Finalize();<br>> return 0;<br>> }<br>><br>> Here is the error message I got when I ran the "Hello World" example:<br>><br>> Fatal error in PMPI_Bcast: Other MPI error, error stack:<br>> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff463d2ad4,<br>> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed<br>> MPIR_Bcast_impl(1150).................:<br>> MPIR_Bcast_intra(990).................:<br>> MPIR_Bcast_scatter_ring_allgather(840):<br>> MPIR_Bcast_binomial(187)..............:<br>> MPIC_Send(66).........................:<br>> MPIC_Wait(528)........................:<br>>
MPIDI_CH3I_Progress(335)..............:<br>> MPID_nem_mpich2_blocking_recv(906)....:<br>> MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 20:<br>> Fatal error in PMPI_Bcast: Other MPI error, error stack:<br>> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff8c374d84,<br>> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed<br>> MPIR_Bcast_impl(1150).................:<br>> MPIR_Bcast_intra(990).................:<br>> MPIR_Bcast_scatter_ring_allgather(840):<br>> MPIR_Bcast_binomial(187)..............:<br>> MPIC_Send(66).........................:<br>> MPIC_Wait(528)........................:<br>> MPIDI_CH3I_Progress(335)..............:<br>> MPID_nem_mpich2_blocking_recv(906)....:<br>> MPID_nem_tcp_connpoll(1843)...........:<br>> state_commrdy_handler(1674)...........:<br>> MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 16<br>>
MPID_nem_tcp_recv_handler(1554).......: socket closed<br>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)<br>><br>> Here is the error message I got when I ran the "cpi" example:<br>><br>> Process 1 of 22 is on node00<br>> Process 0 of 22 is on node00<br>> Process 4 of 22 is on node02<br>> Process 5 of 22 is on node02<br>> Process 6 of 22 is on node03<br>> Process 7 of 22 is on node03<br>> Process 20 of 22 is on node10<br>> Process 21 of 22 is on node10<br>> Fatal error in PMPI_Bcast: Other MPI error, error stack:<br>> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff44bcfd3c,<br>> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed<br>> MPIR_Bcast_impl(1150).................:<br>> MPIR_Bcast_intra(990).................:<br>> MPIR_Bcast_scatter_ring_allgather(840):<br>> MPIR_Bcast_binomial(187)..............:<br>>
MPIC_Send(66).........................:<br>> MPIC_Wait(528)........................:<br>> MPIDI_CH3I_Progress(335)..............:<br>> MPID_nem_mpich2_blocking_recv(906)....:<br>> MPID_nem_tcp_connpoll(1843)...........:<br>> state_commrdy_handler(1674)...........:<br>> MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 16<br>> MPID_nem_tcp_recv_handler(1554).......: socket closed<br>> Process 2 of 22 is on node01<br>> Process 3 of 22 is on node01<br>> [proxy:0:2@node02] HYDT_dmxu_poll_wait_for_event<br>> (/home/k/mpich2-1.3.1/src/pm/hydra/tools/demux/demux_poll.c:70): assert<br>> (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed<br>> [proxy:0:2@node02] main<br>> (/home/k/mpich2-1.3.1/src/pm/hydra/pm/pmiserv/pmip.c:225): demux engine<br>> error waiting for event<br>> Process 8 of 22 is on node04<br>> Process 9 of 22 is on node04<br>> Process 18 of 22 is on
node09<br>> Process 19 of 22 is on node09<br>> Fatal error in PMPI_Bcast: Other MPI error, error stack:<br>> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7ffff9d75dec,<br>> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed<br>> MPIR_Bcast_impl(1150).................:<br>> MPIR_Bcast_intra(990).................:<br>> MPIR_Bcast_scatter_ring_allgather(840):<br>> MPIR_Bcast_binomial(157)..............:<br>> MPIC_Recv(108)........................:<br>> MPIC_Wait(528)........................:<br>> MPIDI_CH3I_Progress(335)..............:<br>> MPID_nem_mpich2_blocking_recv(906)....:<br>> MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 0:<br>> Fatal error in PMPI_Bcast: Other MPI error, error stack:<br>> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff9645255c,<br>> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed<br>> MPIR_Bcast_impl(1150).................:<br>>
MPIR_Bcast_intra(990).................:<br>> MPIR_Bcast_scatter_ring_allgather(840):<br>> MPIR_Bcast_binomial(187)..............:<br>> MPIC_Send(66).........................:<br>> MPIC_Wait(528)........................:<br>> MPIDI_CH3I_Progress(335)..............:<br>> MPID_nem_mpich2_blocking_recv(906)....:<br>> MPID_nem_tcp_connpoll(1843)...........:<br>> state_commrdy_handler(1674)...........:<br>> MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 0<br>> MPID_nem_tcp_recv_handler(1554).......: socket closed<br>> Process 16 of 22 is on node08<br>> Process 17 of 22 is on node08<br>> Fatal error in PMPI_Bcast: Other MPI error, error stack:<br>> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff02102e6c,<br>> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed<br>> MPIR_Bcast_impl(1150).................:<br>> MPIR_Bcast_intra(990).................:<br>>
MPIR_Bcast_scatter_ring_allgather(840):<br>> MPIR_Bcast_binomial(187)..............:<br>> MPIC_Send(66).........................:<br>> MPIC_Wait(528)........................:<br>> MPIDI_CH3I_Progress(335)..............:<br>> MPID_nem_mpich2_blocking_recv(906)....:<br>> MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 20:<br>> Process 12 of 22 is on node06<br>> Process 13 of 22 is on node06<br>> Process 14 of 22 is on node07<br>> Process 15 of 22 is on node07<br>> [mpiexec@node00] HYDT_bscu_wait_for_completion<br>> (/home/k/mpich2-1.3.1/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:99): one<br>> of the processes terminated badly; aborting<br>> [mpiexec@node00] HYDT_bsci_wait_for_completion<br>> (/home/k/mpich2-1.3.1/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:18):<br>> bootstrap device returned error waiting for completion<br>> [mpiexec@node00] HYD_pmci_wait_for_completion<br>>
(/home/k/mpich2-1.3.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:352):<br>> bootstrap server returned error waiting for completion<br>> [mpiexec@node00] main<br>> (/home/k/mpich2-1.3.1/src/pm/hydra/ui/mpich/mpiexec.c:302): process<br>> manager error waiting for completion<br>><br>> Here are the commands I ran:<br>><br>> > mpiexec -f hosts -n 22 ./mpi-Hello.exe<br>> > mpiexec.hydra -f hosts -n 22 ./mpi-Hello.exe<br>><br>> When the number of cores is 20, the program runs well.<br>><br>> Here is the "hosts" file:<br>> node00:2<br>> node01:2<br>> node02:2<br>> node03:2<br>> node04:2<br>> node05:2<br>> node06:2<br>> node07:2<br>> node08:2<br>> node09:2<br>> node10:2<br><br>-- <br>Pavan Balaji<br><a href="http://www.mcs.anl.gov/~balaji"
target="_blank">http://www.mcs.anl.gov/~balaji</a><br></div></div><div style="position: fixed; color: black; font-family: arial, helvetica, sans-serif; font-size: 10pt; "></div>
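<div>Why the hang needs more than 20 processes: the "cpi" output in the quoted message shows ranks placed in hostfile order, two per node, so ranks 0-19 fill node00 through node09 and rank 20 (the rank named in "Communication error with rank 20") is the first one placed on node10. The sketch below reproduces that placement in plain C. The names <code>struct host_entry</code>, <code>host_for_rank</code>, <code>ranks_on_host</code>, and <code>thread_hosts</code> are hypothetical helpers for illustration, and wrapping back to node00 once every slot is used is an assumption about Hydra's placement, not something shown in this thread.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One hostfile line such as "node10:2": host name and slot count. */
struct host_entry {
    const char *name;
    int slots;
};

/* Hypothetical helper: return the host that `rank` is placed on, assuming
 * ranks fill each host's slots in hostfile order (as in the cpi output) and
 * wrap around to the first host once every slot is used (an assumption). */
static const char *host_for_rank(const struct host_entry *hosts, size_t nhosts,
                                 int rank)
{
    int total = 0;
    assert(nhosts > 0 && rank >= 0);
    for (size_t i = 0; i < nhosts; i++) {
        assert(hosts[i].slots > 0);
        total += hosts[i].slots;
    }

    int r = rank % total; /* position within one pass over the hostfile */
    for (size_t i = 0; i < nhosts; i++) {
        if (r < hosts[i].slots)
            return hosts[i].name;
        r -= hosts[i].slots;
    }
    return NULL; /* not reached: r is always less than total */
}

/* Hypothetical helper: count how many of ranks 0..nprocs-1 land on `name`. */
static int ranks_on_host(const struct host_entry *hosts, size_t nhosts,
                         int nprocs, const char *name)
{
    int count = 0;
    for (int rank = 0; rank < nprocs; rank++) {
        if (strcmp(host_for_rank(hosts, nhosts, rank), name) == 0)
            count++;
    }
    return count;
}

/* The "hosts" file from the thread: node00..node10, two slots each. */
static const struct host_entry thread_hosts[] = {
    {"node00", 2}, {"node01", 2}, {"node02", 2}, {"node03", 2},
    {"node04", 2}, {"node05", 2}, {"node06", 2}, {"node07", 2},
    {"node08", 2}, {"node09", 2}, {"node10", 2},
};
#define THREAD_NHOSTS (sizeof thread_hosts / sizeof thread_hosts[0])
```

<div>With this model, <code>host_for_rank(thread_hosts, THREAD_NHOSTS, 20)</code> returns "node10", and <code>ranks_on_host(thread_hosts, THREAD_NHOSTS, 20, "node10")</code> is 0 while the same call with 22 processes gives 2. Passing only the first ten entries models the suggested run without node10. This only explains why the failure appears once node10 is used; it does not identify what is wrong on node10 itself.</div>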
</div><br>
</body></html>