[MPICH] Re: collective abort of all ranks

Kamaraju Kusumanchi kamaraju at gmail.com
Mon Jun 11 22:33:17 CDT 2007


> Consider the test.f90 attached in this email.
>
> $mpif90 test.f90
>
> compiles fine. However, when I run it, it gives the following error.
>
> $mpiexec -l -n 4 ./a.out
> rank 3 in job 282  node1.jit.mae.cornell.edu_33436   caused collective
> abort of all ranks
>  exit status of rank 3: killed by signal 11
> rank 1 in job 282  node1.jit.mae.cornell.edu_33436   caused collective
> abort of all ranks
>  exit status of rank 1: killed by signal 11
>


Some additional information that may or may not be useful.

With the MPICH2 1.0.5p4 libraries compiled with gcc 4.2 and Absoft 8.0,
the above error occurs only if I boot 4 different physical nodes.

For example, if I boot just one node

$mpdboot -n 1
$mpdtrace -l
node1.jit.mae.cornell.edu_39865 (192.168.1.1)

Then compilation and execution go through without any errors.
$mpif90 test.f90
$mpiexec -l -n 4 ./a.out

However, if I boot 4 different (physical) nodes

$mpdboot -n 4 -f mpd.hosts
$mpdtrace -l
node1.jit.mae.cornell.edu_39937 (192.168.1.1)
node4.jit.mae.cornell.edu_53096 (192.168.1.4)
node3.jit.mae.cornell.edu_33458 (192.168.1.3)
node2.jit.mae.cornell.edu_33188 (192.168.1.2)

$mpiexec -l -n 4 ./a.out
rank 2 in job 1  node1.jit.mae.cornell.edu_39937   caused collective
abort of all ranks
  exit status of rank 2: killed by signal 11
rank 1 in job 1  node1.jit.mae.cornell.edu_39937   caused collective
abort of all ranks
  exit status of rank 1: killed by signal 11
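
For what it is worth, signal 11 in the exit status above is normally
SIGSEGV (a segmentation fault) on Linux. A minimal sketch for checking
that mapping from the shell, assuming the nodes run Linux and the shell's
kill builtin supports -l; the helper name signame is only for
illustration and is not part of MPICH2:

```shell
# Hypothetical helper: print the name of a numeric signal
# using the shell's kill builtin.
signame() {
    kill -l "$1"
}

# 11 is the number reported in "killed by signal 11".
signame 11
```

This only names the signal; it does not say where in test.f90 the ranks
crash.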


raju



