Dear all,

I have a small cluster of four machines: two dual-PIII boxes running Scientific Linux CERN release 3.0.8 (kernel 2.4.21-37.EL), one dual-Xeon running Red Hat Enterprise Linux WS release 3 (Taroon update 4, kernel 2.4.21-27.EL), and one single-PIV running Red Hat Linux release 9 (Shrike, kernel 2.4.20-8). All machines have MPICH2 1.0.4p1 installed.

Tests on any ring I set up succeed, and processes are correctly spawned and started on the right hosts.
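For completeness, the ring checks are along these lines (the host file name here is only an example):

  mpdboot -n 4 -f mpd.hosts
  mpdtrace
  mpdringtest 100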
I run into problems when processes need to communicate with each other and one side of the communication is the Red Hat Enterprise Linux machine. If I do not involve the Enterprise Linux machine in the communication, everything runs fine.

I am using a simple send-and-receive Fortran test program in which process #1 sends an array of reals to process #0. Both processes use blocking communication calls (mpi_send and mpi_recv).
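A minimal sketch equivalent to the test follows (the program and variable names are made up here; the count of 2 and the tag of 17 match the error trace below):

  program sendrecv_test
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 2
    real :: buf(n)
    integer :: rank, ierr, status(MPI_STATUS_SIZE)

    call mpi_init(ierr)
    call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

    if (rank == 1) then
       ! process #1 sends an array of reals to process #0
       buf = 42.0
       call mpi_send(buf, n, MPI_REAL, 0, 17, MPI_COMM_WORLD, ierr)
    else if (rank == 0) then
       ! process #0 blocks until the array arrives
       call mpi_recv(buf, n, MPI_REAL, 1, 17, MPI_COMM_WORLD, status, ierr)
       print *, 'rank 0 received:', buf
    end if

    call mpi_finalize(ierr)
  end program sendrecv_test

I compile it with mpif90 and launch it with something like "mpiexec -n 2 ./sendrecv_test" once the ring is up.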
When process #0 (the receiving one) runs on the Red Hat Enterprise Linux machine, everything hangs at the mpi_send (perhaps because the mpi_recv on the Enterprise Linux side never completes).

When process #1 (the sending one) runs on the Red Hat Enterprise Linux machine, I obtain the following output:
[cli_0]: aborting job:
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186)................................: MPI_Recv(buf=0xbfff2368, count=2, MPI_REAL, src=1, tag=17, MPI_COMM_WORLD, status=0xbfff2010) failed
MPIDI_CH3_Progress_wait(217).................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(590)...:
MPIDI_CH3_Sockconn_handle_connopen_event(791): unable to find the process group structure with id <>
[cli_1]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x7fbffebdf0, count=2, MPI_REAL, dest=0, tag=17, MPI_COMM_WORLD) failed
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
rank 0 in job 1 mpitemp_32877 caused collective abort of all ranks
  exit status of rank 0: return code 1
I understand that in both cases the mpi_recv raises the error. What is the problem?

Thank you in advance for your attention.

Regards,
Salvatore.