[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks

Rajeev Thakur thakur at mcs.anl.gov
Fri Oct 31 13:21:19 CDT 2008


Does the code run elsewhere with 16 processes if you run it simply as
"mpiexec -n 16 a.out"?

Rajeev 

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary 
> Ellen Fitzpatrick
> Sent: Friday, October 31, 2008 10:53 AM
> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7 
> caused collective abort of all ranks
> 
> Hi,
> I am trying to run MPI jobs on my cluster, and they fail with 
> "rank 2 in job 7 caused collective abort of all ranks".
> It is a 48-node Linux cluster (two dual-core CPUs per node), 
> running Torque with Maui.  MPICH2 1.0.7 is installed, and mpds 
> are running on all nodes, started from the head node as root.
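> The ring was brought up roughly like this (the hosts file name 
> below is a placeholder, not my exact setup):
> 
>   mpdboot -n 48 -f mpd.hosts   # start an mpd on each node
>   mpdtrace | wc -l             # should report all 48 nodes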
> 
> What I want to do is submit an MPI job that puts one process on 
> each node, with that process requesting all 4 cores of the node, 
> and to run this across 4 nodes.
> 
> If I request 1 node with 4 processors in my PBS script 
> (#PBS -l nodes=1:ppn=4), it works fine: everything runs on one 
> node with 4 CPUs, and the MPI output says everything ran 
> perfectly.  If I request 4 nodes with 4 processors each 
> (#PBS -l nodes=4:ppn=4), it fails.  My Torque prologue/epilogue 
> output file says the job ran on 4 nodes and requested 16 
> processors.
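> Roughly, the submission script has this shape (the binary name 
> and working directory below are placeholders, not my exact 
> script):
> 
>   #!/bin/bash
>   #PBS -l nodes=4:ppn=4
>   cd $PBS_O_WORKDIR
>   mpiexec -machinefile $PBS_NODEFILE -n 16 ./a.out
> 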
> But the MPI output file has the following error:
> 
> -snippet-
> Initializing MPI Routines...
> Initializing MPI Routines...
> Initializing MPI Routines...
> rank 2 in job 7  node1043_55948   caused collective abort of all ranks
>   exit status of rank 2: killed by signal 9
> rank 1 in job 7  node1043_55948   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
> -snippet -
> 
> Does anyone know why my MPI job is crashing?
> 
> -- 
> Thanks
> Mary Ellen
> 
> 



