[mpich-discuss] Fwd: rank 0 in job 5 -- caused collective abort of all ranks

Margaret Doll Margaret_Doll at brown.edu
Tue May 13 10:31:35 CDT 2008


I changed the script so that it ends with:

/share/apps/mpich/bin/mpirun -gdb -np 2 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output

The contents of the log file:

more PtNi04000PT.log
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting execution at Tue May 13 11:27:17 EDT 2008
0: Traceback (most recent call last):
0:   File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?
0:     write(gdb_sin_fileno,'set confirm off\n')
0: OSError: [Errno 32] Broken pipe
1: Traceback (most recent call last):
1:   File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?
1:     write(gdb_sin_fileno,'set confirm off\n')
1: OSError: [Errno 32] Broken pipe
Finished at Tue May 13 11:27:17 EDT 2008
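
If I am reading the traceback correctly, mpdgdbdrv.py got the broken pipe while writing "set confirm off" to gdb's stdin, which suggests gdb had already exited (or never started) inside the batch job, where there is no tty. The following is only a minimal Python sketch of that failure mode, using a stand-in child process that exits immediately in place of gdb; it is not the mpd code itself:

import os
import subprocess

# Hypothetical stand-in for gdb: a child that exits right away, so the
# read end of its stdin pipe is closed before we try to write to it.
child = subprocess.Popen(["true"], stdin=subprocess.PIPE)
child.wait()

try:
    # Same raw-write pattern mpdgdbdrv.py uses on gdb's stdin fd.
    os.write(child.stdin.fileno(), b"set confirm off\n")
except OSError as err:
    print("caught:", err)  # OSError: [Errno 32] Broken pipe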


Begin forwarded message:

> From: Margaret Doll <Margaret_Doll at brown.edu>
> Date: May 13, 2008 10:26:29 AM EDT
> To: mpich-discuss at mcs.anl.gov
> Subject: rank 0 in job 5 -- caused collective abort of all ranks
>
> I am a system manager of a computer cluster running RedHat
> 2.6.9-42.0.2.ELsmp. Rocks 4.3 is the cluster software. mpich2 has
> been compiled using Portland Group compilers 7-1.1. F95 was enabled
> in the mpich2 build.
>
> I have a user running a program which ends quickly with the
> following error:
>
> $ more output
> **********warning**********
> for atom           597  rhobar=                        NaN
>                       NaN
> **********warning**********
> for atom             1  rhobar=                        NaN
>                       NaN
> rank 0 in job 1  compute-0-2.local_33364   caused collective abort of all ranks
>  exit status of rank 0: killed by signal 9
>
> The log file only contains:
> more PtNi04000PT.log
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> Starting execution at Tue May 13 09:25:07 EDT 2008
> Finished at Tue May 13 09:25:08 EDT 2008
>
> The script used contains:
>
> /share/apps/mpich/bin/mpiexec -np 16 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output
>
>
> The job is submitted using
>
> qsub ./script
>
> The default queue contains compute-0-1, compute-0-2, compute-0-3.   
> Each compute node has eight cores.
>
> $ mpdtrace -l
> ted.mmm.nnn.edu_49144 (128.148.nnn.nnn)
> compute-0-3.local_33222 (10.255.255.249)
> compute-0-2.local_33364 (10.255.255.253)
> compute-0-1.local_42643 (10.255.255.251)
> compute-0-0.local_58959 (10.255.255.250)
>
> Is there a problem with the way mpich2 was built, with the way mpd
> is running, or with the Fortran 95 code?
> How do I debug the problem?
>
> Thank you for your help.
>
>


