[MPICH] Re: collective abort of all ranks

Rajeev Thakur thakur at mcs.anl.gov
Tue Jun 12 12:03:52 CDT 2007


There might be a problem on one of the nodes. Can you try running the
program on each node individually (not across the nodes)?
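
One way to do that without rebooting the ring for every node is sketched
below. It is only a sketch under a few assumptions: the helper name
run_per_node is made up for illustration, the hosts file (the mpd.hosts
from your mpdboot line) is assumed to hold one hostname per line,
optionally followed by ":ncpus", and mpiexec's -host option is assumed to
place all ranks of the job on the named host. By default it only prints
the commands it would run; set DRY_RUN=0 to actually launch them.

```shell
#!/bin/sh
# Sketch: launch the same binary on one node at a time to spot a bad node.
# Assumptions (not from the original report): one hostname per line in the
# hosts file, optional ":ncpus" suffix, and "mpiexec -host <name>" putting
# every rank of that job on <name>.

# run_per_node HOSTS_FILE BINARY [NPROCS]
# Hypothetical helper: prints (DRY_RUN=1, the default) or runs (DRY_RUN=0)
# one single-node mpiexec job per host listed in HOSTS_FILE.
run_per_node() {
    hosts_file=$1
    binary=$2
    nprocs=${3:-4}
    while IFS= read -r line; do
        host=${line%%:*}                      # drop an optional ":ncpus" suffix
        case $host in ''|'#'*) continue ;; esac   # skip blank and comment lines
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "mpiexec -l -n $nprocs -host $host $binary"
        else
            echo "== $host =="
            # </dev/null keeps mpiexec from consuming the hosts file on stdin
            mpiexec -l -n "$nprocs" -host "$host" "$binary" </dev/null
        fi
    done < "$hosts_file"
}
```

With the ring already booted, "DRY_RUN=0 run_per_node mpd.hosts ./a.out 4"
would then run the 4-process job once per node. If only one host shows the
signal 11, that node is the place to look.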

Rajeev 

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of 
> Kamaraju Kusumanchi
> Sent: Monday, June 11, 2007 10:33 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] Re: collective abort of all ranks
> 
> > Consider the test.f90 attached in this email.
> >
> > $mpif90 test.f90
> >
> > compiles fine. However, when I run it gives the following error.
> >
> > $mpiexec -l -n 4 ./a.out
> > rank 3 in job 282  node1.jit.mae.cornell.edu_33436   caused collective
> > abort of all ranks
> >  exit status of rank 3: killed by signal 11
> > rank 1 in job 282  node1.jit.mae.cornell.edu_33436   caused collective
> > abort of all ranks
> >  exit status of rank 1: killed by signal 11
> >
> 
> 
> A couple of additional details that might or might not be useful.
> 
> When I use the mpich2 1.0.5p4 libs compiled with gcc 4.2 and absoft 8.0,
> the above error occurs only if I boot 4 different physical nodes.
> 
> For example, if I boot just one node
> 
> $mpdboot -n 1
> $mpdtrace -l
> node1.jit.mae.cornell.edu_39865 (192.168.1.1)
> 
> Then compilation and execution go through without any errors.
> $mpif90 test.f90
> $mpiexec -l -n 4 ./a.out
> 
> However, if I boot 4 different (physical) nodes
> 
> $mpdboot -n 4 -f mpd.hosts
> $mpdtrace -l
> node1.jit.mae.cornell.edu_39937 (192.168.1.1)
> node4.jit.mae.cornell.edu_53096 (192.168.1.4)
> node3.jit.mae.cornell.edu_33458 (192.168.1.3)
> node2.jit.mae.cornell.edu_33188 (192.168.1.2)
> 
> $mpiexec -l -n 4 ./a.out
> rank 2 in job 1  node1.jit.mae.cornell.edu_39937   caused collective
> abort of all ranks
>   exit status of rank 2: killed by signal 11
> rank 1 in job 1  node1.jit.mae.cornell.edu_39937   caused collective
> abort of all ranks
>   exit status of rank 1: killed by signal 11
> 
> 
> raju
> 
> 



