MPICH2 does not work on heterogeneous systems yet, although we plan to
support it in the future.

Rajeev

________________________________
From: owner-mpich-discuss@mcs.anl.gov [mailto:owner-mpich-discuss@mcs.anl.gov] On Behalf Of Salvatore Sorce
Sent: Wednesday, November 29, 2006 8:51 AM
To: mpich-discuss@mcs.anl.gov
Subject: [MPICH] Communication problem on a small heterogeneous ring involving Red Hat Linux - Enterprise Edition

Dear all,

I have a small cluster composed of four machines: two dual-PIII machines
running Scientific Linux CERN release 3.0.8 (kernel 2.4.21-37.EL), one
dual-Xeon machine running Red Hat Enterprise Linux WS release 3 (Taroon
update 4, kernel 2.4.21-27.EL), and one single-PIV machine running Red Hat
Linux release 9 (Shrike, kernel 2.4.20-8). All machines have MPICH2 1.0.4p1
installed.

Whatever kind of ring I set up, the tests are OK: processes are correctly
spawned and started on the right hosts.

I run into problems when processes need to communicate with each other and
one side of the communication is the machine running Red Hat Enterprise
Linux. If I do not involve the Enterprise Linux machine in the
communications, everything runs fine.

I am using a simple send-and-receive Fortran test program, in which process
#1 sends an array of reals to process #0. Both processes use blocking
communication calls (mpi_send and mpi_recv).

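For reference, a stripped-down sketch of what the test does (the count of 2
REALs and the tag 17 match the error trace below; the program name and the
values sent are only illustrative):

program sendrecv
  implicit none
  include 'mpif.h'
  integer :: ierr, rank, stat(MPI_STATUS_SIZE)
  real :: buf(2)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 1) then
     ! process #1 fills the buffer and sends it to process #0
     buf = (/ 1.0, 2.0 /)
     call mpi_send(buf, 2, MPI_REAL, 0, 17, MPI_COMM_WORLD, ierr)
  else if (rank == 0) then
     ! process #0 blocks in mpi_recv until the message arrives
     call mpi_recv(buf, 2, MPI_REAL, 1, 17, MPI_COMM_WORLD, stat, ierr)
     print *, 'received', buf
  end if

  call mpi_finalize(ierr)
end program sendrecv

It is compiled with the mpif90 wrapper and launched with mpiexec -n 2, with
the two processes placed on two different hosts of the ring.
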
When process #0 (the receiving one) runs on the Red Hat Enterprise Linux
machine, everything hangs at the mpi_send (perhaps because on the Enterprise
Linux side the mpi_recv never completes).

When process #1 (the sending one) runs on the Red Hat Enterprise Linux
machine, I get the following output:

[cli_0]: aborting job:
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186)................................: MPI_Recv(buf=0xbfff2368, count=2, MPI_REAL, src=1, tag=17, MPI_COMM_WORLD, status=0xbfff2010) failed
MPIDI_CH3_Progress_wait(217).................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(590)...:
MPIDI_CH3_Sockconn_handle_connopen_event(791): unable to find the process group structure with id <>
[cli_1]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x7fbffebdf0, count=2, MPI_REAL, dest=0, tag=17, MPI_COMM_WORLD) failed
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
rank 0 in job 1 mpitemp_32877 caused collective abort of all ranks
  exit status of rank 0: return code 1

I understand that in both cases mpi_recv causes an error, but what is the
problem?

Thank you in advance for your attention.

Regards,
Salvatore.