[MPICH] heterogeneous x86_64 and i386 endian problem
McMullen Robert W Ctr AFRL/VSBYH
Robert.McMullen.ctr at hanscom.af.mil
Thu Jun 8 15:32:03 CDT 2006
I'm using mpich-1.2.7p1 because mpich2 doesn't yet support heterogeneous clusters. I'm trying to use an x86_64 machine as the root node operating on a >4GB dataset, which is farmed out to a bunch of cheap i386 nodes with 1GB of memory each. I can't run on i386 only because those machines don't support >4GB of memory. I could run on x86_64 only, but I don't have many of those and do have all these i386 machines begging for use.
It seems that for MPI_INT and MPI_FLOAT (but not MPI_SHORT or MPI_DOUBLE), the data returned by the i386 nodes arrives at the root looking as if it were big endian!
Has anyone ever seen this? I'm not an MPI expert, so I may be doing something silly, but I'm also including a test program that shows the problem on my cluster. There's not much to the test program: it creates some test arrays, uses MPI_Gatherv to collect the data at the root, and checks the data for min and max values outside the possible range.
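To give a rough idea of the per-process part of that test without opening the attachment, here is a minimal sketch in C. It is not the attached heterogeneous.c: the function names fill_test_data and min_max are made up for illustration, and the fill pattern (50 + 10000*rank + i) is only an assumption inferred from the min/max values in the output below. The MPI_Gatherv call itself is left out so the fragment compiles without MPI.

```c
#include <stddef.h>

/* Hypothetical fill pattern, inferred from the printed ranges:
 * rank r holds 50 + 10000*r, 50 + 10000*r + 1, and so on. */
static void fill_test_data(int *buf, size_t n, int rank)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = 50 + 10000 * rank + (int)i;
}

/* Scan a non-empty buffer and report its extremes as doubles,
 * matching the %f-style values in the output. Requires n >= 1. */
static void min_max(const int *buf, size_t n, double *min, double *max)
{
    *min = *max = (double)buf[0];
    for (size_t i = 1; i < n; i++) {
        double v = (double)buf[i];
        if (v < *min) *min = v;
        if (v > *max) *max = v;
    }
}
```

Each process would call fill_test_data on its own chunk and print min_max before the gather; the root would run min_max again on the gathered array and compare against the range that the senders could possibly have produced.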
For example, the results clearly show the corruption is happening:
size=100000 proc=0: status=0 datatype=MPI_INT min=50.000000 max=100049.000000
size=100000 proc=2: status=0 datatype=MPI_INT min=20050.000000 max=120049.000000
size=100000 proc=4: status=0 datatype=MPI_INT min=40050.000000 max=140049.000000
size=100000 proc=1: status=0 datatype=MPI_INT min=10050.000000 max=110049.000000
size=100000 proc=3: status=0 datatype=MPI_INT min=30050.000000 max=130049.000000
at root: size=100000 datatype=MPI_INT min=-2147483392.000000 max=2147418368.000000
Looking at the generated output bad-proc?-<size>-MPI_INT.bin shows that the data from the bad procs is only endian swapped; manually swapping the bytes shows that the processors did correctly generate the data. Somewhere in the transfer something is getting confused. Interestingly, if the transfer count is only 10000 the byte-swapping problem doesn't occur. But at counts of 100000 or greater, it does.
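The manual byte-swap check can be sketched roughly as follows. This is an illustration only, assuming 32-bit MPI_INT values; swap32 and diagnose_chunk are hypothetical helper names, not functions from the attached test program or from MPICH.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Classify a received chunk against the expected range [lo, hi]:
 *   0  every word is in range as received
 *   1  every word is in range only after byte-swapping
 *  -1  neither interpretation fits
 * The buffer is not modified. */
static int diagnose_chunk(const uint32_t *buf, size_t n,
                          int32_t lo, int32_t hi)
{
    int ok_as_is = 1, ok_swapped = 1;

    assert(buf != NULL);
    for (size_t i = 0; i < n; i++) {
        int32_t raw = (int32_t)buf[i];
        int32_t swp = (int32_t)swap32(buf[i]);
        if (raw < lo || raw > hi) ok_as_is = 0;
        if (swp < lo || swp > hi) ok_swapped = 0;
    }
    if (ok_as_is)
        return 0;
    return ok_swapped ? 1 : -1;
}
```

For instance, the two extremes reported at the root, -2147483392 (0x80000100) and 2147418368 (0x7FFF0100), become 65664 and 130943 when swapped, both inside the 50 to 140049 range that the processes reported.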
Compile two versions, one on i386 named heterogeneous.i386 and one on x86_64 named heterogeneous.x86_64, and run with x86_64 as the root node:
mpirun -machinefile machines.txt -arch x86_64 -np 3 -arch i386 -np 2 /path/to/heterogeneous.%a
where the machinefile lists the x86_64 nodes first:
64bit1.example.com
64bit2.example.com
64bit3.example.com
32bit1.example.com
32bit2.example.com
Thanks for any help.
Rob
PS: Test program would have been much cleaner with C++ templates, but I didn't want to eliminate any C-only users from being able to compile the program. :)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heterogeneous.c
Type: application/octet-stream
Size: 5685 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20060608/26973ab0/attachment.obj>