[MPICH] Communication problem on a small heterogeneous ring involving Red Hat Linux - Enterprise Edition

Matthew Chambers matthew.chambers at vanderbilt.edu
Fri Dec 1 09:15:04 CST 2006


It sounds to me like your Xeon machine is 64-bit and has a 64-bit install of
Linux on it (using the LP64 data model, where longs and pointers are 64 bits
and ints are 32 bits).  You should not rely on MPICH2 functioning across both
64-bit and 32-bit nodes, even if you do not think you use longs or pointers
in your program.  If you want that kind of support, use MPICH1 instead.
Technically MPICH1 should also work across different byte orderings, but I
could never get my programs to run across both PowerPC and x86 nodes on our
HPC cluster.

 

Matt Chambers

Vanderbilt Bioinformatics

 

  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Salvatore Sorce
Sent: Friday, December 01, 2006 3:19 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [MPICH] Communication problem on a small heterogeneous ring
involving Red Hat Linux - Enterprise Edition

 

Thank you all for your prompt replies.

 

I carried out some further trials, adding to the ring another dual-Xeon
machine running Scientific Linux CERN release 3.0.6 (kernel 2.4.21-37.EL).
This does not introduce any additional communication errors. Any ring that
does not include the Red Hat Enterprise machine runs without communication
errors.

All machines have the same byte ordering (little-endian).
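
For anyone who wants to repeat the byte-order check on each node, here is a
minimal sketch in Python. It is not part of MPICH; the function names are
made up for illustration, and it only reports the byte order of the machine
the interpreter runs on.

```python
import struct
import sys


def byte_order():
    """Return 'little' or 'big' for the host running this interpreter."""
    return sys.byteorder


def first_byte_of_one():
    """Pack the integer 1 as a native-order 32-bit value and return its
    first byte: 1 on a little-endian host, 0 on a big-endian host."""
    return struct.pack("=I", 1)[0]


if __name__ == "__main__":
    print("byte order:", byte_order())
    print("first byte of packed 1:", first_byte_of_one())
```

Running the script on every machine in the ring and comparing the output is
enough to rule byte ordering in or out as a difference between nodes.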

I compared the "config.log" files of all involved systems, and I found that
the RH Enterprise machine has a different size for the <long> and
<long double> types. In more detail, the other machines have sizes of 4 and
12 bytes respectively, while the RH Enterprise machine has 8 and 16. Maybe
this is the reason I experience the mentioned problems.
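
The same comparison can be made without digging through "config.log". The
following is a minimal sketch, assuming Python with the standard ctypes
module is available on every node. The helper names are hypothetical, and
the sizes reported are those of the C ABI that the Python interpreter itself
was built for, which may differ from the compiler used to build MPICH (for
example, a 32-bit Python on a 64-bit OS).

```python
import ctypes


def c_type_sizes():
    """Return the size in bytes of a few C types on this host."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "long double": ctypes.sizeof(ctypes.c_longdouble),
        "void *": ctypes.sizeof(ctypes.c_void_p),
    }


def differing_types(a, b):
    """Return the sorted names of types whose sizes differ between two hosts."""
    return sorted(name for name in a if a[name] != b.get(name))


if __name__ == "__main__":
    for name, size in c_type_sizes().items():
        print(f"{name}: {size}")
```

Feeding in the sizes quoted above, differing_types({"long": 4, "long double":
12}, {"long": 8, "long double": 16}) returns both type names, which matches
the mismatch seen between the 32-bit machines and the RH Enterprise machine.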

To complete the report, I get the same error messages even when I use types
that have the same size on all machines (such as MPI_INTEGER).

 

Do you know if there is a solution, or do I simply have to exclude the RH
Enterprise machine from my cluster (or, maybe, install another OS on that
machine)?

 

Thank you again.

 

Regards,

Salvatore.
