[mpich-discuss] Problem with MPI_Bcast

Hisham Adel hosham2004 at yahoo.com
Tue Dec 14 07:53:40 CST 2010


Hi All,

I have installed the new MPICH2 version "1.3.1" with this configuration:

./configure --without-mpe --disable-f77 --disable-fc 

After the installation, I started running some old MPI programs I had written 
before. They all hang when the number of cores is greater than 20, and they 
hang at an MPI_Bcast call.

So I took the "Hello World" example and executed it, and it works well. I then 
modified it by adding a simple MPI_Bcast call, and the program starts to hang 
when the number of cores is greater than 20.


I also tried the new installation with the "cpi" example included in the 
package, and it hangs when the number of processes is greater than 20.
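
For what it's worth, the problem should be reproducible with nothing more than 
a single broadcast. Here is a minimal sketch of such a test (just the 
broadcast part of the "Hello World" example below, on its own):

#include <stdio.h>
#include "mpi.h"

/* Minimal test: a single MPI_Bcast and nothing else. */
int main(int argc, char **argv)
{
        int rank;
        int x = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
                x = 923;
        }
        MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d got %d\n", rank, x);
        MPI_Finalize();
        return 0;
}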


Do you have any ideas about that?


Here is the "Hello World" example:

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{
        int my_rank;                        /* rank of this process */
        int source;
        int dest;
        int p, len;                         /* number of processes; name length */
        int tag = 50;
        char message[100];
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Broadcast one integer from rank 0 to all ranks.
           This is the call that hangs when more than 20 processes are used. */
        int x = 0;
        if (my_rank == 0) {
                x = 923;
        }
        MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("\nI %d got %d from node 0\n", my_rank, x);

        if (my_rank != 0) {
                /* every non-root rank sends a greeting to rank 0 */
                MPI_Get_processor_name(name, &len);
                sprintf(message, "Greetings from process %d, I am %s !",
                        my_rank, name);
                dest = 0;
                MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag,
                        MPI_COMM_WORLD);
        } else {
                /* rank 0 collects and prints one greeting per rank */
                for (source = 1; source < p; source++) {
                        MPI_Recv(message, 100, MPI_CHAR, source, tag,
                                MPI_COMM_WORLD, &status);
                        printf("%s\n", message);
                }
        }
        MPI_Finalize();
        return 0;
}
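
For completeness, this is how the example is built (a sketch; it assumes the 
mpicc wrapper from the 1.3.1 installation is first in the PATH, and the source 
file name mpi-Hello.c is just my choice to match the executable name below):

> mpicc -o mpi-Hello.exe mpi-Hello.c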

Here is the error message I got when I ran the "Hello World" example:

Fatal error in PMPI_Bcast: Other MPI error, error stack:

PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff463d2ad4, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................: 
MPIR_Bcast_intra(990).................: 
MPIR_Bcast_scatter_ring_allgather(840): 
MPIR_Bcast_binomial(187)..............: 
MPIC_Send(66).........................: 
MPIC_Wait(528)........................: 
MPIDI_CH3I_Progress(335)..............: 
MPID_nem_mpich2_blocking_recv(906)....: 
MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 20: 
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff8c374d84, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................: 
MPIR_Bcast_intra(990).................: 
MPIR_Bcast_scatter_ring_allgather(840): 
MPIR_Bcast_binomial(187)..............: 
MPIC_Send(66).........................: 
MPIC_Wait(528)........................: 
MPIDI_CH3I_Progress(335)..............: 
MPID_nem_mpich2_blocking_recv(906)....: 
MPID_nem_tcp_connpoll(1843)...........: 
state_commrdy_handler(1674)...........: 
MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 16
MPID_nem_tcp_recv_handler(1554).......: socket closed
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)


Here is the error message I got when I ran the "cpi" example:


Process 1 of 22 is on node00
Process 0 of 22 is on node00
Process 4 of 22 is on node02
Process 5 of 22 is on node02
Process 6 of 22 is on node03
Process 7 of 22 is on node03
Process 20 of 22 is on node10
Process 21 of 22 is on node10
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff44bcfd3c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................: 
MPIR_Bcast_intra(990).................: 
MPIR_Bcast_scatter_ring_allgather(840): 
MPIR_Bcast_binomial(187)..............: 
MPIC_Send(66).........................: 
MPIC_Wait(528)........................: 
MPIDI_CH3I_Progress(335)..............: 
MPID_nem_mpich2_blocking_recv(906)....: 
MPID_nem_tcp_connpoll(1843)...........: 
state_commrdy_handler(1674)...........: 
MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 16
MPID_nem_tcp_recv_handler(1554).......: socket closed
Process 2 of 22 is on node01
Process 3 of 22 is on node01
[proxy:0:2@node02] HYDT_dmxu_poll_wait_for_event (/home/k/mpich2-1.3.1/src/pm/hydra/tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:2@node02] main (/home/k/mpich2-1.3.1/src/pm/hydra/pm/pmiserv/pmip.c:225): demux engine error waiting for event
Process 8 of 22 is on node04
Process 9 of 22 is on node04
Process 18 of 22 is on node09
Process 19 of 22 is on node09
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7ffff9d75dec, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................: 
MPIR_Bcast_intra(990).................: 
MPIR_Bcast_scatter_ring_allgather(840): 
MPIR_Bcast_binomial(157)..............: 
MPIC_Recv(108)........................: 
MPIC_Wait(528)........................: 
MPIDI_CH3I_Progress(335)..............: 
MPID_nem_mpich2_blocking_recv(906)....: 
MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 0: 
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff9645255c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................: 
MPIR_Bcast_intra(990).................: 
MPIR_Bcast_scatter_ring_allgather(840): 
MPIR_Bcast_binomial(187)..............: 
MPIC_Send(66).........................: 
MPIC_Wait(528)........................: 
MPIDI_CH3I_Progress(335)..............: 
MPID_nem_mpich2_blocking_recv(906)....: 
MPID_nem_tcp_connpoll(1843)...........: 
state_commrdy_handler(1674)...........: 
MPID_nem_tcp_recv_handler(1653).......: Communication error with rank 0
MPID_nem_tcp_recv_handler(1554).......: socket closed
Process 16 of 22 is on node08
Process 17 of 22 is on node08
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1306)......................: MPI_Bcast(buf=0x7fff02102e6c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1150).................: 
MPIR_Bcast_intra(990).................: 
MPIR_Bcast_scatter_ring_allgather(840): 
MPIR_Bcast_binomial(187)..............: 
MPIC_Send(66).........................: 
MPIC_Wait(528)........................: 
MPIDI_CH3I_Progress(335)..............: 
MPID_nem_mpich2_blocking_recv(906)....: 
MPID_nem_tcp_connpoll(1830)...........: Communication error with rank 20: 
Process 12 of 22 is on node06
Process 13 of 22 is on node06
Process 14 of 22 is on node07
Process 15 of 22 is on node07
[mpiexec@node00] HYDT_bscu_wait_for_completion (/home/k/mpich2-1.3.1/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:99): one of the processes terminated badly; aborting
[mpiexec@node00] HYDT_bsci_wait_for_completion (/home/k/mpich2-1.3.1/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@node00] HYD_pmci_wait_for_completion (/home/k/mpich2-1.3.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:352): bootstrap server returned error waiting for completion
[mpiexec@node00] main (/home/k/mpich2-1.3.1/src/pm/hydra/ui/mpich/mpiexec.c:302): process manager error waiting for completion





Here is the running command (tried with both mpiexec and mpiexec.hydra):

> mpiexec -f hosts -n 22 ./mpi-Hello.exe
> mpiexec.hydra -f hosts -n 22 ./mpi-Hello.exe


When the number of processes is 20, the program runs fine. (With 20 processes 
and the hosts file below, node10, which hosts ranks 20 and 21, is never used; 
note that the first error stack reports a communication error with rank 20.)



Here is also the "hosts" file:
node00:2
node01:2
node02:2
node03:2
node04:2
node05:2
node06:2
node07:2
node08:2
node09:2
node10:2


      