[MPICH] Communication problem on a small heterogeneous ring involving Red Hat Linux - Enterprise Edition

Salvatore Sorce sorce at unipa.it
Fri Dec 1 03:18:41 CST 2006


Thank you all for your prompt replies.

I carried out some further trials, including in the ring another dual-Xeon machine running Scientific Linux CERN release 3.0.6 (kernel 2.4.21-37.EL). This does not introduce any new communication errors. Any ring that does not include the Red Hat Enterprise machine generates no communication errors.
All machines have the same byte ordering (little-endian).
I compared the "config.log" files of all the systems involved, and I found that the RH Enterprise machine has different sizes for the <long> and <long double> types. Specifically, the other machines report sizes of 4 and 12 bytes respectively, while the RH Enterprise machine reports 8 and 16 (which suggests that machine is running a 64-bit build). Maybe this is the reason I experience the mentioned problems.
To complete the report, I obtain the same error messages even if I use types whose size is the same on all machines (such as MPI_INTEGER).
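For reference, a quick way to confirm on each host what MPICH's configure measures is a small C program like the one below (a sketch of my own, not part of MPICH; the printed values should match the sizes found in config.log and the little-endian byte order mentioned above):

    #include <stdio.h>

    int main(void)
    {
        unsigned int one = 1;

        /* Type sizes as seen by the C compiler, i.e. what MPICH's
           configure records in config.log. */
        printf("sizeof(long)        = %lu\n", (unsigned long) sizeof(long));
        printf("sizeof(long double) = %lu\n", (unsigned long) sizeof(long double));

        /* If the first byte of the integer 1 is 1, the host is little-endian. */
        printf("byte order          = %s-endian\n",
               *(unsigned char *) &one == 1 ? "little" : "big");
        return 0;
    }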

Do you know if there is a solution, or do I simply have to exclude the RH Enterprise machine from my cluster (or perhaps install another OS on that machine)?

Thank you again.

Regards,
Salvatore.
  ----- Original Message ----- 
  From: Rajeev Thakur 
  To: 'Matthew Chambers' ; mpich-discuss at mcs.anl.gov 
  Sent: Wednesday, November 29, 2006 11:06 PM
  Subject: RE: [MPICH] Communication problem on a small heterogeneous ring involving Red Hat Linux - Enterprise Edition


  It should work if the byte ordering and type sizes are the same, unless there is something funky going on because of the different OSes. 

  Rajeev



----------------------------------------------------------------------------
    From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Matthew Chambers
    Sent: Wednesday, November 29, 2006 2:08 PM
    To: mpich-discuss at mcs.anl.gov
    Subject: RE: [MPICH] Communication problem on a small heterogeneous ring involving Red Hat Linux - Enterprise Edition


    Is he actually talking about a heterogeneous system?  All I see are different operating systems.  Unless the Xeon is 64 bit, it seems like byte ordering and type sizes should all be equal, which as far as I know is the only kind of heterogeneity that would affect MPI.  Am I wrong?


    Matt Chambers

    Vanderbilt Bioinformatics



----------------------------------------------------------------------------

    From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Rajeev Thakur
    Sent: Wednesday, November 29, 2006 1:11 PM
    To: 'Salvatore Sorce'; mpich-discuss at mcs.anl.gov
    Subject: RE: [MPICH] Communication problem on a small heterogeneous ring involving Red Hat Linux - Enterprise Edition


    MPICH2 does not work on heterogeneous systems yet, although we plan to support it in the future.


    Rajeev



--------------------------------------------------------------------------

      From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Salvatore Sorce
      Sent: Wednesday, November 29, 2006 8:51 AM
      To: mpich-discuss at mcs.anl.gov
      Subject: [MPICH] Communication problem on a small heterogeneous ring involving Red Hat Linux - Enterprise Edition

      Dear all,


      I have a small cluster composed of four machines: two dual-PIII machines running Scientific Linux CERN release 3.0.8 (kernel 2.4.21-37.EL), one dual-Xeon running Red Hat Enterprise Linux WS release 3 (Taroon update 4, kernel 2.4.21-27.EL), and one single-PIV running Red Hat Linux release 9 (Shrike, kernel 2.4.20-8). All machines have MPICH2 1.0.4p1 installed.

      Tests on any kind of ring I set up are OK, and processes are correctly spawned and started on the right hosts.


      I experience problems when processes need to communicate with each other and one side of the communication is the machine running Red Hat Enterprise Linux. If I do not involve the Enterprise Linux machine in communications, everything runs correctly.


      I am using a simple send-and-receive Fortran test program, in which process #1 sends an array of real values to process #0. Both processes use blocking communication functions (mpi_send and mpi_recv).
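
      For reference, a minimal C sketch of the same pattern (not the actual program, which is in Fortran and uses MPI_REAL; the count of 2 and the tag of 17 are taken from the error output below, the rest is illustrative):

          #include <stdio.h>
          #include <mpi.h>

          int main(int argc, char **argv)
          {
              int rank;
              float buf[2] = { 1.0f, 2.0f };   /* the array of reals being exchanged */
              MPI_Status status;

              MPI_Init(&argc, &argv);
              MPI_Comm_rank(MPI_COMM_WORLD, &rank);

              if (rank == 1) {
                  /* blocking send from process #1 ... */
                  MPI_Send(buf, 2, MPI_FLOAT, 0, 17, MPI_COMM_WORLD);
              } else if (rank == 0) {
                  /* ... blocking receive on process #0 */
                  MPI_Recv(buf, 2, MPI_FLOAT, 1, 17, MPI_COMM_WORLD, &status);
                  printf("received %f %f\n", buf[0], buf[1]);
              }

              MPI_Finalize();
              return 0;
          }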


      When process #0 (the receiving one) runs on the Red Hat Enterprise Linux machine, everything hangs at the mpi_send (maybe because, on the Enterprise Linux side, the mpi_recv does not complete its task).

      When process #1 (the sending one) runs on the Red Hat Enterprise Linux machine, I obtain the following output:


      [cli_0]: aborting job:
      Fatal error in MPI_Recv: Other MPI error, error stack:
      MPI_Recv(186)................................: MPI_Recv(buf=0xbfff2368, count=2, MPI_REAL, src=1, tag=17, MPI_COMM_WORLD, status=0xbfff2010) failed
      MPIDI_CH3_Progress_wait(217).................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
      MPIDI_CH3I_Progress_handle_sock_event(590)...: 
      MPIDI_CH3_Sockconn_handle_connopen_event(791): unable to find the process group structure with id <>
      [cli_1]: aborting job:
      Fatal error in MPI_Send: Other MPI error, error stack:
      MPI_Send(173).............................: MPI_Send(buf=0x7fbffebdf0, count=2, MPI_REAL, dest=0, tag=17, MPI_COMM_WORLD) failed
      MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
      MPIDI_CH3I_Progress_handle_sock_event(415): 
      MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
      rank 0 in job 1  mpitemp_32877   caused collective abort of all ranks
        exit status of rank 0: return code 1


      I understand that in both cases mpi_recv causes an error; what could the problem be?


      Thank you in advance for your attention.


      Regards,

      Salvatore.