[MPICH] heterogeneous x86_64 and i386 endian problem

Philip Sydney Lavers psl02 at uow.edu.au
Thu Jun 8 18:34:04 CDT 2006


Robert,

I have done what you are trying to do, but to succeed I had to load the 32-bit version of the OS onto the 64-bit machine. I used Mandrake (now Mandriva) 10. I initially put Windows XP (32-bit) onto the 64-bit machine as well, but it just gradually died.

I honestly don't think it is worth the effort in this day of dual-core Opterons and Athlon64s. My current clusters are all 64-bit, including dual-processor and dual-core boxes. One runs FreeBSD and the other Fedora 5. In Australia (self-built from computer-market parts) a well-equipped i386 costs about $500-$700 and a dual-core Athlon64 is about $800-$1000.

The improved performance mightily outweighs the cost considerations.

regards,

Philip Lavers
---- Original message ----
>Date: Thu, 8 Jun 2006 16:32:03 -0400 
>From: McMullen Robert W Ctr AFRL/VSBYH <Robert.McMullen.ctr at hanscom.af.mil>  
>Subject: [MPICH] heterogeneous x86_64 and i386 endian problem  
>To: "'mpich-discuss at mcs.anl.gov'" <mpich-discuss at mcs.anl.gov>
>
>I'm using mpich-1.2.7p1 because mpich2 doesn't yet support heterogeneous clusters.  I'm trying to have an x86_64 machine as the root node operating on a >4GB dataset, which gets farmed out to a bunch of cheap i386 nodes with 1GB of memory.  I can't use i386 machines alone because they don't support >4GB of memory.  I could use x86_64 only, but I don't have many of those, and I do have all these i386 machines begging to be used.
>
>It seems that for MPI_INT and MPI_FLOAT (not MPI_SHORT or MPI_DOUBLE), the i386 nodes return data that claims to be big endian!
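For reference, each node's native byte order can be confirmed with a small check like this (a standalone sketch, not part of the attached test program); both the x86_64 and i386 hosts should report little endian:

#include <stdio.h>

int main(void)
{
    unsigned int x = 1;
    /* If the first byte in memory is 1, the least significant byte is
       stored first, i.e. the host is little endian. */
    printf("%s\n", (*(unsigned char *)&x == 1) ? "little endian" : "big endian");
    return 0;
}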
>
>Has anyone ever seen this?  I'm not an MPI expert, so I may be doing something silly, but I'm also including a test program that shows the problem on my cluster.  There's not much to it beyond creating some test arrays, sharing the data with MPI_Gatherv, and checking the gathered data for min and max values outside the possible range.
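A minimal sketch in the spirit of the test described above (the attached heterogeneous.c also dumps the bad-proc files and exercises several datatypes; the variable names, the SIZE constant, and the value pattern below are illustrative assumptions, not taken from the attachment):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define SIZE 100000

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank fills a buffer with values in a known, rank-specific range. */
    int *sendbuf = malloc(SIZE * sizeof(int));
    for (i = 0; i < SIZE; i++)
        sendbuf[i] = rank * 10000 + 50 + i;

    int *recvbuf = NULL, *counts = NULL, *displs = NULL;
    if (rank == 0) {
        recvbuf = malloc((size_t)nprocs * SIZE * sizeof(int));
        counts  = malloc(nprocs * sizeof(int));
        displs  = malloc(nprocs * sizeof(int));
        for (i = 0; i < nprocs; i++) {
            counts[i] = SIZE;
            displs[i] = i * SIZE;
        }
    }

    /* Gather every rank's buffer to the root. */
    MPI_Gatherv(sendbuf, SIZE, MPI_INT,
                recvbuf, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Any value far outside the generated range indicates corruption,
           e.g. byte-swapped integers arriving from the i386 ranks. */
        int min = recvbuf[0], max = recvbuf[0];
        for (i = 1; i < nprocs * SIZE; i++) {
            if (recvbuf[i] < min) min = recvbuf[i];
            if (recvbuf[i] > max) max = recvbuf[i];
        }
        printf("at root: datatype=MPI_INT min=%d max=%d\n", min, max);
        free(recvbuf); free(counts); free(displs);
    }

    free(sendbuf);
    MPI_Finalize();
    return 0;
}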
>
>For example, the results clearly show the corruption is happening:
>
>size=100000 proc=0: status=0 datatype=MPI_INT min=50.000000 max=100049.000000
>size=100000 proc=2: status=0 datatype=MPI_INT min=20050.000000 max=120049.000000
>size=100000 proc=4: status=0 datatype=MPI_INT min=40050.000000 max=140049.000000
>size=100000 proc=1: status=0 datatype=MPI_INT min=10050.000000 max=110049.000000
>size=100000 proc=3: status=0 datatype=MPI_INT min=30050.000000 max=130049.000000
>at root: size=100000 datatype=MPI_INT min=-2147483392.000000 max=2147418368.000000
>
>Looking at the generated output files bad-proc?-<size>-MPI_INT.bin shows that the data from the bad procs is merely byte-swapped; manually swapping the bytes back shows that the processors did generate the data correctly.  Somewhere in the transfer something is getting confused.  Interestingly, the byte-swapping problem doesn't occur when the transfer count is only 10000, but it does at counts of 100000 or greater.
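The manual check described above can be scripted; here is a rough sketch that reads one of the dumped files, swaps each 4-byte value back, and recomputes min/max (the file name is an assumption following the bad-proc?-<size>-MPI_INT.bin pattern):

#include <stdio.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

int main(void)
{
    /* Hypothetical file name matching the bad-proc?-<size>-MPI_INT.bin pattern. */
    FILE *f = fopen("bad-proc1-100000-MPI_INT.bin", "rb");
    if (!f) { perror("fopen"); return 1; }

    uint32_t raw;
    int32_t v, min = 0, max = 0;
    int first = 1;
    while (fread(&raw, sizeof raw, 1, f) == 1) {
        v = (int32_t)swap32(raw);     /* undo the suspected byte swap */
        if (first || v < min) min = v;
        if (first || v > max) max = v;
        first = 0;
    }
    fclose(f);
    printf("after byte swap: min=%d max=%d\n", min, max);
    return 0;
}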
>
>Compile two versions, one on i386 named heterogeneous.i386 and one on x86_64 named heterogeneous.x86_64, and run with x86_64 as the root node:
>
>mpirun -machinefile machines.txt -arch x86_64 -np 3 -arch i386 -np 2 /path/to/heterogeneous.%a
>
>where the machinefile lists the x86_64 nodes first:
>
>64bit1.example.com
>64bit2.example.com
>64bit3.example.com
>32bit1.example.com
>32bit2.example.com
>
>Thanks for any help.
>
>Rob
>
>PS: The test program would have been much cleaner with C++ templates, but I didn't want to prevent C-only users from compiling it. :)
>
>________________
>heterogeneous.c (7k bytes)
