[mpich-discuss] rank 0 in job 5 -- caused collective abort of all ranks
Margaret Doll
Margaret_Doll at brown.edu
Tue May 13 09:26:29 CDT 2008
I am the system manager of a computer cluster running Red Hat Linux
(kernel 2.6.9-42.0.2.ELsmp), with Rocks 4.3 as the cluster software.
mpich2 was compiled with the Portland Group compilers, version 7-1.1,
and Fortran 95 support was enabled in the mpich2 build.
I have a user running a program that ends quickly with the following
error:
$ more output
**********warning**********
for atom 597 rhobar= NaN
NaN
**********warning**********
for atom 1 rhobar= NaN
NaN
rank 0 in job 1 compute-0-2.local_33364 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
The log file only contains:
$ more PtNi04000PT.log
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting execution at Tue May 13 09:25:07 EDT 2008
Finished at Tue May 13 09:25:08 EDT 2008
The script used contains:
/share/apps/mpich/bin/mpiexec -np 16 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output
The job is submitted using
qsub ./script
The default queue contains compute-0-1, compute-0-2, compute-0-3.
Each compute node has eight cores.
$ mpdtrace -l
ted.mmm.nnn.edu_49144 (128.148.nnn.nnn)
compute-0-3.local_33222 (10.255.255.249)
compute-0-2.local_33364 (10.255.255.253)
compute-0-1.local_42643 (10.255.255.251)
compute-0-0.local_58959 (10.255.255.250)
Is the problem in the way mpich2 was built, in how mpd is running, or in
the user's Fortran 95 code? How should I go about debugging it?
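For what it is worth, I was thinking of asking the user to add a guard
right after rhobar is computed, along these lines (only a rough sketch;
I am assuming the code already holds its MPI rank in a variable such as
myrank, that i is the atom loop index, and that ierr is an integer):

      ! rough sketch: rhobar and the atom index come from the warnings
      ! above; myrank, i, and ierr are assumed to exist in the code
      if (rhobar /= rhobar) then
         ! NaN is the only value that does not compare equal to itself
         write (*,*) 'rank ', myrank, ': rhobar = NaN for atom ', i
         call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
      end if

The idea is that the offending rank would identify itself and shut the
job down through MPI_ABORT instead of being killed by signal 9. If the
Portland compilers support the Fortran 2003 ieee_arithmetic module,
ieee_is_nan(rhobar) would be a cleaner test, since the rhobar /= rhobar
trick can be optimized away at high optimization levels.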
Thank you for your help.