[mpich2-dev] Problem with MPI_Bcast

Hisham Adel hosham2004 at yahoo.com
Tue Dec 14 08:04:42 CST 2010


Thanks for your fast reply. The program runs well now that I have removed
"node10" and increased the number of processes.
However, I don't know what the problem with "node10" is. It has the same Linux
version and the same configuration, and it is on the same network.

Do you have any ideas?




________________________________
From: Pavan Balaji <balaji at mcs.anl.gov>
To: Hisham Adel <hosham2004 at yahoo.com>
Cc: MPI <mpich2-dev at mcs.anl.gov>; MPI_questions <mpich-discuss at mcs.anl.gov>
Sent: Tue, December 14, 2010 2:57:35 PM
Subject: Re: Problem with MPI_Bcast


My guess is that there is something wrong with node10. Can you try 
removing node10 in your hostfile and running your test program with more 
than 20 processes?

  -- Pavan
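
For readers following along, here is a minimal sketch of why running more than 20 processes is what reaches node10. It assumes the launcher fills the hostfile entries in listed order with two ranks each (this matches the "Process N of 22 is on nodeXX" lines in the cpi output quoted below) and that it wraps back to the first entry once every slot is used. The wrap-around is an assumption, and `host_index_for_rank` is a hypothetical helper, not part of MPICH2 or Hydra.

```c
#include <stddef.h>

/* Hypothetical helper (not an MPICH2/Hydra API): returns the 0-based index
 * of the hostfile entry a rank lands on, ASSUMING hosts are filled in the
 * order listed, slots_per_host ranks per host, wrapping around to the first
 * host once all slots are taken. */
size_t host_index_for_rank(size_t rank, size_t num_hosts, size_t slots_per_host)
{
    return (rank / slots_per_host) % num_hosts;
}
```

With the 11-entry hostfile (node00 to node10, two slots each), ranks 20 and 21 map to index 10, which is node10, so any run with more than 20 processes involves node10. With node10 removed (10 entries), rank 20 maps to index 0 under the wrap-around assumption, so the same process count would run without touching node10.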

On 12/14/2010 07:53 AM, Hisham Adel wrote:
> Hi All,
>
> I have installed the new MPICH2 version "1.3.1" with this configuration:
>
>     ./configure --without-mpe --disable-f77 --disable-fc
>
> After the installation, I started running some old programs I had
> written earlier with MPI.
> All of these programs hang when the number of cores is greater than 20.
> They hang whenever there is an *MPI_Bcast* call.
>
> So, I took the "Hello_world" example and ran it. It works well.
> Then I modified it to add a simple *MPI_Bcast* call, and the program
> starts to hang when the number of cores is greater than 20.
>
>
> I have also tried the new installation with the "*cpi*" example included
> in the package, and it hangs when the number of processes is greater than 20.
>
>
> Do you have any ideas about that?
>
>
> _Here is the "Hello World" example:_
>
> #include <stdio.h>
> #include "mpi.h"
> #include <string.h>
>
> int main(int argc, char **argv)
> {
>     int my_rank;
>     int source;
>     int dest;
>     int p, len;
>     int tag = 50;
>     char message[100];
>     char name[MPI_MAX_PROCESSOR_NAME];
>     MPI_Status status;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &p);
>
>     /* Rank 0 sets the value; every other rank receives it via the broadcast. */
>     int x = 0;
>     if (my_rank == 0)
>     {
>         x = 923;
>     }
>     MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
>     printf("\nI %d got %d from node 0\n", my_rank, x);
>
>     if (my_rank != 0) {
>         /* Non-root ranks send a greeting to rank 0. */
>         MPI_Get_processor_name(name, &len);
>         /* snprintf keeps a long processor name from overflowing message[]. */
>         snprintf(message, sizeof(message),
>                  "Greetings from process %d, I am %s !", my_rank, name);
>         dest = 0;
>         MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag,
>                  MPI_COMM_WORLD);
>     } else {
>         /* Rank 0 receives and prints one greeting per rank, in rank order. */
>         for (source = 1; source < p; source++) {
>             MPI_Recv(message, 100, MPI_CHAR, source, tag,
>                      MPI_COMM_WORLD, &status);
>             printf("%s\n", message);
>         }
>     }
>     MPI_Finalize();
>     return 0;
> }
>
> _Here is the error message I got when I ran the "Hello World" example:_
>
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff463d2ad4,
> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1150).................:
> MPIR_Bcast_intra(990).................:
> MPIR_Bcast_scatter_ring_allgather(840):
> MPIR_Bcast_binomial(187)..............:
> MPIC_Send(66).........................:
> MPIC_Wait(528)........................:
> MPIDI_CH3I_Progress(335)..............:
> MPID_nem_mpich2_blocking_recv(906)....:
> MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 20:
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff8c374d84,
> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1150).................:
> MPIR_Bcast_intra(990).................:
> MPIR_Bcast_scatter_ring_allgather(840):
> MPIR_Bcast_binomial(187)..............:
> MPIC_Send(66).........................:
> MPIC_Wait(528)........................:
> MPIDI_CH3I_Progress(335)..............:
> MPID_nem_mpich2_blocking_recv(906)....:
> MPID_nem_tcp_connpoll(1843)...........:
> state_commrdy_handler(1674)...........:
> MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 16
> MPID_nem_tcp_recv_handler(1554).......: socket closed
> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>
>
> _Here is the error message I got when I ran the "cpi" example:_
>
>
> Process 1 of 22 is on node00
> Process 0 of 22 is on node00
> Process 4 of 22 is on node02
> Process 5 of 22 is on node02
> Process 6 of 22 is on node03
> Process 7 of 22 is on node03
> Process 20 of 22 is on node10
> Process 21 of 22 is on node10
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff44bcfd3c,
> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1150).................:
> MPIR_Bcast_intra(990).................:
> MPIR_Bcast_scatter_ring_allgather(840):
> MPIR_Bcast_binomial(187)..............:
> MPIC_Send(66).........................:
> MPIC_Wait(528)........................:
> MPIDI_CH3I_Progress(335)..............:
> MPID_nem_mpich2_blocking_recv(906)....:
> MPID_nem_tcp_connpoll(1843)...........:
> state_commrdy_handler(1674)...........:
> MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 16
> MPID_nem_tcp_recv_handler(1554).......: socket closed
> Process 2 of 22 is on node01
> Process 3 of 22 is on node01
> [proxy:0:2 at node02] HYDT_dmxu_poll_wait_for_event
> (/home/k/mpich2-1.3.1/src/pm/hydra/tools/demux/demux_poll.c:70): assert
> (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
> [proxy:0:2 at node02] main
> (/home/k/mpich2-1.3.1/src/pm/hydra/pm/pmiserv/pmip.c:225): demux engine
> error waiting for event
> Process 8 of 22 is on node04
> Process 9 of 22 is on node04
> Process 18 of 22 is on node09
> Process 19 of 22 is on node09
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7ffff9d75dec,
> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1150).................:
> MPIR_Bcast_intra(990).................:
> MPIR_Bcast_scatter_ring_allgather(840):
> MPIR_Bcast_binomial(157)..............:
> MPIC_Recv(108)........................:
> MPIC_Wait(528)........................:
> MPIDI_CH3I_Progress(335)..............:
> MPID_nem_mpich2_blocking_recv(906)....:
> MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 0:
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff9645255c,
> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1150).................:
> MPIR_Bcast_intra(990).................:
> MPIR_Bcast_scatter_ring_allgather(840):
> MPIR_Bcast_binomial(187)..............:
> MPIC_Send(66).........................:
> MPIC_Wait(528)........................:
> MPIDI_CH3I_Progress(335)..............:
> MPID_nem_mpich2_blocking_recv(906)....:
> MPID_nem_tcp_connpoll(1843)...........:
> state_commrdy_handler(1674)...........:
> MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 0
> MPID_nem_tcp_recv_handler(1554).......: socket closed
> Process 16 of 22 is on node08
> Process 17 of 22 is on node08
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff02102e6c,
> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1150).................:
> MPIR_Bcast_intra(990).................:
> MPIR_Bcast_scatter_ring_allgather(840):
> MPIR_Bcast_binomial(187)..............:
> MPIC_Send(66).........................:
> MPIC_Wait(528)........................:
> MPIDI_CH3I_Progress(335)..............:
> MPID_nem_mpich2_blocking_recv(906)....:
> MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 20:
> Process 12 of 22 is on node06
> Process 13 of 22 is on node06
> Process 14 of 22 is on node07
> Process 15 of 22 is on node07
> [mpiexec at node00] HYDT_bscu_wait_for_completion
> (/home/k/mpich2-1.3.1/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:99): one
> of the processes terminated badly; aborting
> [mpiexec at node00] HYDT_bsci_wait_for_completion
> (/home/k/mpich2-1.3.1/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:18):
> bootstrap device returned error waiting for completion
> [mpiexec at node00] HYD_pmci_wait_for_completion
> (/home/k/mpich2-1.3.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:352):
> bootstrap server returned error waiting for completion
> [mpiexec at node00] main
> (/home/k/mpich2-1.3.1/src/pm/hydra/ui/mpich/mpiexec.c:302): process
> manager error waiting for completion
>
>
>
>
>
> _Here are the commands I ran:_
>
>  > mpiexec -f hosts -n 22 ./mpi-Hello.exe
>  > mpiexec.hydra -f hosts -n 22 ./mpi-Hello.exe
>
>
> When the number of cores is 20, the program runs correctly.
>
>
>
> _Here is the "hosts" file:_
> node00:2
> node01:2
> node02:2
> node03:2
> node04:2
> node05:2
> node06:2
> node07:2
> node08:2
> node09:2
> node10:2
>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji



      