[MPICH] Communication problem on a small heterogeneous ring involving Red Hat Enterprise Linux

Salvatore Sorce sorce at unipa.it
Wed Nov 29 08:50:31 CST 2006


Dear all,

I have a small cluster composed of four machines: two dual-PIII machines running Scientific Linux CERN release 3.0.8 (kernel 2.4.21-37.EL), one dual-Xeon running Red Hat Enterprise Linux WS release 3 (Taroon update 4, kernel 2.4.21-27.EL), and one single-PIV running Red Hat Linux release 9 (Shrike, kernel 2.4.20-8). All machines have MPICH2 1.0.4p1 installed.
Tests on any ring I set up are OK, and processes are correctly spawned and started on the right hosts.

I experience problems when processes need to communicate with each other and one side of the communication is the machine running Red Hat Enterprise Linux. If I do not involve the Enterprise Linux machine in the communication, everything runs fine.

I am using a simple send-and-receive Fortran test program, in which process #1 sends an array of reals to process #0. Both processes use blocking communication calls (mpi_send and mpi_recv).
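For reference, here is a minimal sketch of the kind of test program I am using (not my exact source; the count of 2 and the tag of 17 match the values in the error stack below, and the buffer contents are placeholders):

      program sendrecv
      implicit none
      include 'mpif.h'
      integer ierr, rank, status(MPI_STATUS_SIZE)
      real buf(2)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

      if (rank .eq. 1) then
c        process #1 fills the array and sends it to process #0
         buf(1) = 1.0
         buf(2) = 2.0
         call mpi_send(buf, 2, MPI_REAL, 0, 17,
     &                 MPI_COMM_WORLD, ierr)
      else if (rank .eq. 0) then
c        process #0 blocks until the array arrives from process #1
         call mpi_recv(buf, 2, MPI_REAL, 1, 17,
     &                 MPI_COMM_WORLD, status, ierr)
         print *, 'received: ', buf(1), buf(2)
      end if

      call mpi_finalize(ierr)
      end

I launch it with something like "mpiexec -n 2 ./sendrecv", with the two hosts taken from the mpd ring.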

When process #0 (the receiver) runs on the Red Hat Enterprise Linux machine, everything hangs at the mpi_send (perhaps because the mpi_recv on the Enterprise Linux side never completes).
When process #1 (the sender) runs on the Red Hat Enterprise Linux machine, I get the following output:

[cli_0]: aborting job:
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186)................................: MPI_Recv(buf=0xbfff2368, count=2, MPI_REAL, src=1, tag=17, MPI_COMM_WORLD, status=0xbfff2010) failed
MPIDI_CH3_Progress_wait(217).................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(590)...: 
MPIDI_CH3_Sockconn_handle_connopen_event(791): unable to find the process group structure with id <>
[cli_1]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x7fbffebdf0, count=2, MPI_REAL, dest=0, tag=17, MPI_COMM_WORLD) failed
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415): 
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
rank 0 in job 1  mpitemp_32877   caused collective abort of all ranks
  exit status of rank 0: return code 1

I understand that in both cases mpi_recv causes the error. What could the problem be?

Thank you in advance for your attention.

Regards,
Salvatore.