[MPICH] Re: collective abort of all ranks
Kamaraju Kusumanchi
kamaraju at gmail.com
Mon Jun 11 22:33:17 CDT 2007
> Consider the test.f90 attached in this email.
>
> $mpif90 test.f90
>
> compiles fine. However, when I run it gives the following error.
>
> $mpiexec -l -n 4 ./a.out
> rank 3 in job 282 node1.jit.mae.cornell.edu_33436 caused collective
> abort of all ranks
> exit status of rank 3: killed by signal 11
> rank 1 in job 282 node1.jit.mae.cornell.edu_33436 caused collective
> abort of all ranks
> exit status of rank 1: killed by signal 11
>
A couple of additional details that might or might not be useful.
When I use the mpich2 1.0.5p4 libraries compiled with gcc 4.2 and absoft 8.0,
the above error occurs only if I boot 4 different physical nodes.
For example, if I boot just one node
$mpdboot -n 1
$mpdtrace -l
node1.jit.mae.cornell.edu_39865 (192.168.1.1)
Then compilation and execution go through without any errors.
$mpif90 test.f90
$mpiexec -l -n 4 ./a.out
However, if I boot 4 different (physical) nodes
$mpdboot -n 4 -f mpd.hosts
$mpdtrace -l
node1.jit.mae.cornell.edu_39937 (192.168.1.1)
node4.jit.mae.cornell.edu_53096 (192.168.1.4)
node3.jit.mae.cornell.edu_33458 (192.168.1.3)
node2.jit.mae.cornell.edu_33188 (192.168.1.2)
$mpiexec -l -n 4 ./a.out
rank 2 in job 1 node1.jit.mae.cornell.edu_39937 caused collective
abort of all ranks
exit status of rank 2: killed by signal 11
rank 1 in job 1 node1.jit.mae.cornell.edu_39937 caused collective
abort of all ranks
exit status of rank 1: killed by signal 11
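To make it easier to compare runs, the "exit status" lines above can be summarized per rank. The following is only a rough sketch, not part of MPICH: the helper names killed_ranks and signal_name are made up for illustration, and it assumes the output has exactly the "exit status of rank N: killed by signal M" form shown above. On Linux, signal 11 is SIGSEGV (a segmentation fault).

```python
import re
import signal

# Matches the per-rank status lines printed by mpiexec in the output above.
# Assumption: the message format is exactly as it appears in this email.
EXIT_LINE = re.compile(r"exit status of rank (\d+): killed by signal (\d+)")


def killed_ranks(log_text):
    """Return {rank: signal_number} for every rank reported as killed."""
    return {int(rank): int(sig) for rank, sig in EXIT_LINE.findall(log_text)}


def signal_name(number):
    """Translate a signal number to its symbolic name, if the platform knows it."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Unknown signal number on this platform; fall back to the raw number.
        return "signal %d" % number


if __name__ == "__main__":
    # Sample text copied from the 4-node run above.
    sample = """\
rank 2 in job 1 node1.jit.mae.cornell.edu_39937 caused collective
abort of all ranks
exit status of rank 2: killed by signal 11
rank 1 in job 1 node1.jit.mae.cornell.edu_39937 caused collective
abort of all ranks
exit status of rank 1: killed by signal 11
"""
    for rank, sig in sorted(killed_ranks(sample).items()):
        print("rank %d: %s" % (rank, signal_name(sig)))
```

Running the same summary on the output of the 1-node and 4-node runs would show at a glance which ranks die and with which signal in each setup.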
raju