[MPICH] Fatal error in MPI_Init using ch3:sock or ch3:ssm

Romain Dolbeau romain at dolbeau.org
Thu May 31 04:36:15 CDT 2007


Hello,

I hope this is the right place for my question; if not, I apologize
for the intrusion.

I have an MPI code written in Fortran 90. I have compiled mpich2-1.0.5p4
with the ch3:shm channel on both x86_64 (using the PGI compiler) and i686
(using the Intel compiler), and the code works fine on shared-memory
machines when built with the same compilers.

I have since recompiled mpich2 to use either ch3:ssm or ch3:sock on i686,
using the Intel compiler. The compilation worked fine, but when I try
to run my MPI code with either build on a cluster of 2 machines,
I get the following error:

#####
mpiexec -l -n 2 -wdir `pwd` <MyBinary> <MyParameters>
rank 1 in job 1  bombadil_66667   caused collective abort of all ranks
   exit status of rank 1: return code 1
1: Fatal error in MPI_Init: Other MPI error, error stack:
1: MPIR_Init_thread(247): Initialization failed
1: MPID_Init(82)........: channel initialization failed
#####

All directories involved are NFS-mounted and are readable and writable
on both machines at the same path.

The mpd ring seems to work fine:
#####
$ mpdtrace -l
bombadil_66667 (172.21.0.35)
dwalin_66666 (172.21.0.17)
#####

A program that doesn't call any MPI functions runs fine:
#####
mpiexec -l -n 2 -wdir /boot /bin/ls
0: System.map-2.6.18-4-686-bigmem
0: config-2.6.18-4-686-bigmem
0: grub
0: initrd.img-2.6.18-4-686-bigmem
0: vmlinuz-2.6.18-4-686-bigmem
1: System.map-2.6.18-4-686
1: config-2.6.18-4-686
1: grub
1: initrd.img-2.6.18-4-686
1: vmlinuz-2.6.18-4-686
#####
(The bigmem kernel is on bombadil and the regular kernel on dwalin, so
rank 0 ran on bombadil and rank 1 on dwalin.)
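
Since -l prefixes every output line with the rank that produced it,
the per-rank output can be separated with a small shell helper when
comparing the two nodes. This is only a sketch: rank_lines is a
hypothetical name of my own, not part of MPICH, and it assumes the
"<rank>: " prefix format seen in the output above.

```shell
# Hypothetical helper (not an MPICH tool): read `mpiexec -l` output on
# stdin and print only the lines labelled with the given rank, with the
# "<rank>: " prefix removed. Unlabelled lines are dropped.
rank_lines() {
    rank="$1"
    sed -n "s/^${rank}: //p"
}
```

For example, `mpiexec -l -n 2 -wdir /boot /bin/ls | rank_lines 1`
would keep only what rank 1 printed.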

The example cpi also works fine:
#####
mpiexec -l -n 8 -wdir `pwd` ./cpi
0: Process 0 of 8 is on bombadil
1: Process 1 of 8 is on dwalin
3: Process 3 of 8 is on dwalin
5: Process 5 of 8 is on dwalin
6: Process 6 of 8 is on bombadil
7: Process 7 of 8 is on dwalin
4: Process 4 of 8 is on bombadil
2: Process 2 of 8 is on bombadil
0: pi is approximately 3.1415926544231265, Error is 0.0000000008333334
0: wall clock time = 0.330268
#####

There must be some feature in the code that isn't supported or isn't
working correctly, but I didn't write the code, and I have no idea
where to look.

Any help would be greatly appreciated!

-- 
Romain Dolbeau



