[MPICH] Communication problem on a small heterogeneous ring involving Red Hat Linux - Enterprise Edition

Rajeev Thakur thakur at mcs.anl.gov
Wed Nov 29 13:11:00 CST 2006


MPICH2 does not work on heterogeneous systems yet, although we plan to
support it in the future.
 
Rajeev
 

  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Salvatore Sorce
Sent: Wednesday, November 29, 2006 8:51 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Communication problem on a small heterogeneous ring
involving Red Hat Linux - Enterprise Edition


Dear all,
 
I have a small cluster composed of four machines: two dual-PIII machines running
Scientific Linux CERN release 3.0.8 (kernel 2.4.21-37.EL), one dual-Xeon
running Red Hat Enterprise Linux WS release 3 (Taroon update 4, kernel
2.4.21-27 EL), and one single-PIV running Red Hat Linux Release 9 (Shrike,
kernel 2.4.20-8). All machines have MPICH2 1.0.4p1 installed.
Whatever ring I set up, the startup tests succeed: processes are correctly
spawned and started on the right hosts.
 
I run into problems when processes need to communicate with each other and
one side of the communication is the machine running Red Hat Enterprise
Linux. If I do not involve the Enterprise Linux machine in communications,
everything runs fine.
 
I am using a simple send-and-receive Fortran test program, in which process #1
sends an array of reals to process #0. Both processes use blocking
communication functions (mpi_send and mpi_recv).
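For reference, a minimal sketch of such a test program (a hypothetical reconstruction, not the poster's actual code; the buffer size of 2 and tag of 17 are taken from the error stack below):

```fortran
! Minimal blocking send/receive between ranks 1 and 0.
! Assumed details: count=2, tag=17 (from the error stack below).
      program sendrecv
      implicit none
      include 'mpif.h'
      integer ierr, rank, status(MPI_STATUS_SIZE)
      real buf(2)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

      if (rank .eq. 1) then
!        Rank 1 fills the buffer and sends it to rank 0.
         buf(1) = 1.0
         buf(2) = 2.0
         call mpi_send(buf, 2, MPI_REAL, 0, 17,
     &                 MPI_COMM_WORLD, ierr)
      else if (rank .eq. 0) then
!        Rank 0 blocks in mpi_recv until the message arrives.
         call mpi_recv(buf, 2, MPI_REAL, 1, 17,
     &                 MPI_COMM_WORLD, status, ierr)
      end if

      call mpi_finalize(ierr)
      end
```

Run with, e.g., `mpiexec -n 2 ./sendrecv` across the two hosts being tested.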
 
When process #0 (the receiving one) runs on the Red Hat Enterprise Linux
machine, everything hangs at mpi_send (perhaps because, on the Enterprise
Linux side, mpi_recv never completes).
When process #1 (the sending one) runs on the Red Hat Enterprise Linux
machine, I obtain the following output:
 
[cli_0]: aborting job:
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186)................................: MPI_Recv(buf=0xbfff2368,
count=2, MPI_REAL, src=1, tag=17, MPI_COMM_WORLD, status=0xbfff2010) failed
MPIDI_CH3_Progress_wait(217).................: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(590)...: 
MPIDI_CH3_Sockconn_handle_connopen_event(791): unable to find the process
group structure with id <>
[cli_1]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x7fbffebdf0,
count=2, MPI_REAL, dest=0, tag=17, MPI_COMM_WORLD) failed
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling
an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415): 
MPIDU_Socki_handle_read(670)..............: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)
rank 0 in job 1  mpitemp_32877   caused collective abort of all ranks
  exit status of rank 0: return code 1
 
I understand that in both cases mpi_recv causes the error. What could the
problem be?
 
Thank you in advance for your attention.
 
Regards,
Salvatore.


