[mpich-discuss] rank 0 in job 5 -- caused collective abort of all ranks

Tue May 13 09:26:29 CDT 2008

I am the system manager of a computer cluster running RedHat, kernel
2.6.9-42.0.2.ELsmp. Rocks 4.3 is the cluster software.
mpich2 was compiled with the Portland Group compilers 7-1.1, with F95
enabled in the mpich2 build.

I have a user whose program ends almost immediately with the following
error:

$ more output
  **********warning**********
  for atom           597  rhobar=                        NaN
                        NaN
  **********warning**********
  for atom             1  rhobar=                        NaN
                        NaN
rank 0 in job 1  compute-0-2.local_33364   caused collective abort of all ranks
   exit status of rank 0: killed by signal 9
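
A minimal sketch for condensing an output file like the one above into a few lines, which can help show whether the NaN warnings appear before the abort. The function name summarize_run is hypothetical, and the patterns are assumptions taken only from the text shown here ("NaN", "rhobar=", "exit status of rank").

```shell
# Hypothetical helper: summarize one run's output file.
# Usage: summarize_run /path/to/output
summarize_run() {
    out_file="$1"

    # Count the lines that mention NaN (grep -c prints 0 when there are none).
    nan_lines=$(grep -c 'NaN' "$out_file" || true)
    echo "lines containing NaN: $nan_lines"

    # Show the first atom for which rhobar was reported.
    grep 'rhobar=' "$out_file" | head -n 1

    # Show how the ranks exited, if mpiexec reported it.
    grep 'exit status of rank' "$out_file" || echo "no exit status line found"
}
```

For example, summarize_run "$HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output" would report the NaN count first, then the first rhobar line and the exit status line.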

The log file contains only:
$ more PtNi04000PT.log
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting execution at Tue May 13 09:25:07 EDT 2008
Finished at Tue May 13 09:25:08 EDT 2008

The script used contains:

/share/apps/mpich/bin/mpiexec -np 16 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output

The job is submitted using

qsub ./script

The default queue contains compute-0-1, compute-0-2, compute-0-3.   
Each compute node has eight cores.

$ mpdtrace -l
ted.mmm.nnn.edu_49144 (128.148.nnn.nnn)
compute-0-3.local_33222 (10.255.255.249)
compute-0-2.local_33364 (10.255.255.253)
compute-0-1.local_42643 (10.255.255.251)
compute-0-0.local_58959 (10.255.255.250)
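
One way to separate an mpd problem from a problem in the application is to exercise the ring with a trivial, non-MPI command. The sketch below assumes the MPICH2 mpd tools (mpdtrace, mpdringtest, mpiexec) are on the PATH; the function name check_mpd_ring and the default of 16 processes (taken from the -np value in the script) are illustrative only.

```shell
# Hypothetical helper: sanity-check the mpd ring without the user's program.
# Usage: check_mpd_ring [number_of_processes]
check_mpd_ring() {
    nprocs="${1:-16}"

    # List the hosts that are currently part of the ring.
    mpdtrace -l || return 1

    # Pass a message around the ring 10 times.
    mpdringtest 10 || return 1

    # Launch a trivial command and count how many ranks land on each host.
    mpiexec -n "$nprocs" hostname | sort | uniq -c
}
```

If this runs cleanly and the per-host counts look as expected, the ring itself is probably working, which would point more toward the application or its input than toward mpd.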

Is the problem in the way mpich2 was built, in the way mpd is running, or
in the Fortran 95 code?
How do I debug the problem?

Thank you for your help.