[mpich-discuss] MPI and slurm

Evan Patton pattoe at rpi.edu
Fri Oct 7 10:33:08 CDT 2011


Hello all,

I'm trying to configure MPICH2 1.4.1p1 and slurm 2.3.0 on a 4-node cluster. I've tried two different approaches, both without success, and could use some pointers. First, I configured MPICH2 to use slurm's PMI:

./configure --sysconfdir=/etc --localstatedir=/var --with-pmi=slurm --with-pm=none

When I used srun to launch 32 ranks across 2 of the machines in the cluster, MPI bailed out with this error:

Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(148)..........: MPI_Isend(buf=0x7ff41b42b010, count=2097152, MPI_DOUBLE, dest=0, tag=0, MPI_COMM_WORLD, request=0x6c8a38) failed
MPID_nem_lmt_RndvSend(81): 
MPIDI_CH3_RndvSend(63)...: failure occurred while attempting to send RTS packet
MPIDI_CH3_iStartMsg(36)..: Communication error with rank 0
srun: error: hercules-2: task 31: Exited with exit code 1
Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(148)..........: MPI_Isend(buf=0x7f5df0e30010, count=2097152, MPI_DOUBLE, dest=24, tag=0, MPI_COMM_WORLD, request=0x6c8a38) failed
MPID_nem_lmt_RndvSend(81): 
MPIDI_CH3_RndvSend(63)...: failure occurred while attempting to send RTS packet
MPIDI_CH3_iStartMsg(36)..: Communication error with rank 24
srun: error: hercules-1: task 23: Exited with exit code 1
srun: First task exited 30s ago
srun: tasks 0-22,24-30: running
srun: tasks 23,31: exited abnormally
srun: Terminating job step 26.0
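
For reference, the srun invocation looked roughly like this (I'm reconstructing it from memory, so the partition and exact flags may be off):

hercules-1# srun -p two -N 2 -n 32 ./mat_mul 8192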

Given that ranks 23 and 31 terminated, and that running the same command on a single machine works correctly, I assume it must be some inter-machine communication issue. I checked the FAQ (http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_How_do_I_use_MPICH2_with_slurm.3F) to see if there was a known solution. It suggests verifying that one can ssh between the machines and that the firewalls are off; both are already true in my configuration, so that appeared to be a dead end.
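
For what it's worth, these are roughly the checks I ran (the firewall commands assume an iptables-based setup, which is what these nodes use):

hercules-1# ssh hercules-2 hostname      # passwordless ssh in one direction
hercules-2# ssh hercules-1 hostname      # and back the other way
hercules-1# iptables -L -n               # no rules filtering inter-node traffic
hercules-1# service iptables status      # firewall service reported as stopped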

I went back and reconfigured MPICH2 to use Hydra, in case the problem was with the slurm PMI. I then wrapped the mpiexec call in the following bash script and submitted it with sbatch:

#!/bin/bash
# run.sh
# tell Hydra to launch the ranks through slurm
export HYDRA_BOOTSTRAP=slurm
mpiexec -n 16 ./mat_mul 8192
#END run.sh

hercules-1# sbatch -t 10 -p two -n 16 ./run.sh
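
(If I'm reading the Hydra documentation correctly, the same thing can be requested on the mpiexec command line instead of through the environment variable, i.e.

mpiexec -bootstrap slurm -n 16 ./mat_mul 8192

but in run.sh above I used HYDRA_BOOTSTRAP.)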

The sbatch run also failed with the same error as above. As an experiment, I then added a -hosts option to the mpiexec call (the invocation is quoted below). With that change the processes no longer crash, but they all block in an MPI_Recv, which suggests the MPI_Isend operations are still not completing correctly, much as before. At this point I've run out of ideas on how to proceed, so if anyone can point me in the right direction I would greatly appreciate it.
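
For completeness, the -hosts invocation looked roughly like this (the host list is reconstructed from memory):

mpiexec -hosts hercules-1,hercules-2 -n 16 ./mat_mul 8192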

Thanks for your time,
Evan

