[MPICH] Fatal error in MPI_Init using ch3:sock or ch3:ssm

Rajeev Thakur thakur at mcs.anl.gov
Thu May 31 10:56:06 CDT 2007


If you run "make testing" in the top-level mpich2 directory, it will run the
entire test suite in test/mpi. Do all those tests pass?
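
A minimal sketch of running the suite while keeping a log that can be posted
back to the list, assuming a POSIX shell. The helper name run_mpich2_tests,
the default log name testing.log, and any source path passed to it are
hypothetical; only the "make testing" target comes from the note above.

```shell
#!/bin/sh
# Hypothetical helper: run the mpich2 test suite and keep a copy of its output.
# $1 = top-level mpich2 source directory (assumed to be configured already)
# $2 = log file to write (optional, defaults to testing.log)
run_mpich2_tests() {
    src_dir="$1"
    log_file="${2:-testing.log}"

    # Refuse to run if the directory has no Makefile (not configured yet).
    if [ ! -f "$src_dir/Makefile" ]; then
        echo "no Makefile in $src_dir; run configure there first" >&2
        return 1
    fi

    # Run the target in a subshell so the caller's working directory is
    # unchanged; tee shows the output and saves it to the log file.
    (cd "$src_dir" && make testing 2>&1) | tee "$log_file"
}
```

It would be called as, for example, run_mpich2_tests /path/to/mpich2-1.0.5p4,
where the path is a placeholder for wherever the source tree was built.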

Rajeev 

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Romain Dolbeau
> Sent: Thursday, May 31, 2007 4:36 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] Fatal error in MPI_Init using ch3:sock or ch3:ssm
> 
> Hello,
> 
> I hope this is the right place for my question; if not, I apologize
> for the intrusion.
> 
> I have an MPI code written in Fortran 90. I have recompiled mpich2-1.0.5p4
> using channel ch3:shm on both x86_64 (using the PGI compiler) and i686
> (using the Intel compiler), and the code works fine on shared-memory
> machines, using the same compilers.
> 
> I have recompiled mpich to use either ch3:ssm or ch3:sock on i686,
> using the Intel compiler. The compilation worked fine, but when I try
> to use my MPI code with either version on a cluster of 2 machines,
> I get the following error:
> 
> #####
> mpiexec -l -n 2 -wdir `pwd` <MyBinary> <MyParameters>
> rank 1 in job 1  bombadil_66667   caused collective abort of all ranks
>    exit status of rank 1: return code 1
> 1: Fatal error in MPI_Init: Other MPI error, error stack:
> 1: MPIR_Init_thread(247): Initialization failed
> 1: MPID_Init(82)........: channel initialization failed
> #####
> 
> All directories involved are NFS mounted and accessible R/W
> on both machines at the same path.
> 
> The mpd ring seems to work fine:
> #####
> $ mpdtrace -l
> bombadil_66667 (172.21.0.35)
> dwalin_66666 (172.21.0.17)
> #####
> 
> If the program doesn't use any MPI functions, it works fine:
> #####
> mpiexec -l -n 2 -wdir /boot /bin/ls
> 0: System.map-2.6.18-4-686-bigmem
> 0: config-2.6.18-4-686-bigmem
> 0: grub
> 0: initrd.img-2.6.18-4-686-bigmem
> 0: vmlinuz-2.6.18-4-686-bigmem
> 1: System.map-2.6.18-4-686
> 1: config-2.6.18-4-686
> 1: grub
> 1: initrd.img-2.6.18-4-686
> 1: vmlinuz-2.6.18-4-686
> #####
> (the bigmem kernel is on bombadil, the regular kernel on dwalin).
> 
> The example cpi also works fine:
> #####
> mpiexec -l -n 8 -wdir `pwd` ./cpi
> 0: Process 0 of 8 is on bombadil
> 1: Process 1 of 8 is on dwalin
> 3: Process 3 of 8 is on dwalin
> 5: Process 5 of 8 is on dwalin
> 6: Process 6 of 8 is on bombadil
> 7: Process 7 of 8 is on dwalin
> 4: Process 4 of 8 is on bombadil
> 2: Process 2 of 8 is on bombadil
> 0: pi is approximately 3.1415926544231265, Error is 0.0000000008333334
> 0: wall clock time = 0.330268
> #####
> 
> There must be some feature in the code that isn't quite supported,
> right, or working, but I didn't write the code and I have no idea
> where to look.
> 
> Any help is greatly appreciated!
> 
> -- 
> Romain Dolbeau
> 
> 



