[mpich-discuss] Fwd: rank 0 in job 5 -- caused collective abort of all ranks
Margaret Doll
Margaret_Doll at brown.edu
Tue May 13 10:31:35 CDT 2008
I changed the script to end in
/share/apps/mpich/bin/mpirun -gdb -np 2 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output
The contents of the log file:
more PtNi04000PT.log
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting execution at Tue May 13 11:27:17 EDT 2008
0: Traceback (most recent call last):
0: File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?
0: write(gdb_sin_fileno,'set confirm off\n')
0: OSError: [Errno 32] Broken pipe
1: Traceback (most recent call last):
1: File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?
1: write(gdb_sin_fileno,'set confirm off\n')
1: OSError: [Errno 32] Broken pipe
Finished at Tue May 13 11:27:17 EDT 2008
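
Since the gdb driver exits before it ever attaches, a quick way to separate the mpich2 installation from the application would be a trivial MPI test program run through the same mpirun. A minimal sketch (the mpif90 wrapper path is an assumption, chosen to match the mpirun path above; this program is not part of ccp_meam):

! hello.f90 -- minimal MPI sanity check (illustrative sketch only)
! build:  /share/apps/mpich/bin/mpif90 -o hello hello.f90
! run:    /share/apps/mpich/bin/mpirun -np 2 ./hello
program hello
   implicit none
   include 'mpif.h'
   integer :: ierr, rank, nprocs
   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
   print *, 'rank ', rank, ' of ', nprocs, ' is alive'
   call MPI_FINALIZE(ierr)
end program hello

If every rank prints, the mpich2 build and the mpd ring are probably fine and the NaN/abort described below comes from the application itself.
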
Begin forwarded message:
> From: Margaret Doll <Margaret_Doll at brown.edu>
> Date: May 13, 2008 10:26:29 AM EDT
> To: mpich-discuss at mcs.anl.gov
> Subject: rank 0 in job 5 -- caused collective abort of all ranks
>
> I am a system manager of a computer cluster running RedHat
> 2.6.9-42.0.2.ELsmp; Rocks 4.3 is the cluster software.
> mpich2 has been compiled with the Portland Group compilers, version
> 7-1.1, and F95 was enabled in the mpich2 build.
>
> I have a user running a program which ends quickly with the
> following error:
>
> $ more output
> **********warning**********
> for atom 597 rhobar= NaN
> NaN
> **********warning**********
> for atom 1 rhobar= NaN
> NaN
> rank 0 in job 1 compute-0-2.local_33364 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
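>
> (Since rhobar is the first thing to go bad, a self-comparison test in
> the Fortran source would at least show where the NaN first appears:
> under IEEE arithmetic a NaN is the only value that is not equal to
> itself.  This is only a sketch -- the names rhobar and i are taken
> from the warning text above, not from the actual code:
>
>    ! illustrative sketch: 'rhobar' and the atom index 'i' are guesses
>    if (rhobar /= rhobar) then
>       write(*,*) 'rhobar is NaN for atom ', i
>    end if
>
> That would pinpoint the first bad atom before MPI kills the job.)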
>
> The log file only contains:
> more PtNi04000PT.log
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> Starting execution at Tue May 13 09:25:07 EDT 2008
> Finished at Tue May 13 09:25:08 EDT 2008
>
> The script used contains:
>
> /share/apps/mpich/bin/mpiexec -np 16 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output
>
>
> The job is submitted using
>
> qsub ./script
>
> The default queue contains compute-0-1, compute-0-2, compute-0-3.
> Each compute node has eight cores.
>
> $ mpdtrace -l
> ted.mmm.nnn.edu_49144 (128.148.nnn.nnn)
> compute-0-3.local_33222 (10.255.255.249)
> compute-0-2.local_33364 (10.255.255.253)
> compute-0-1.local_42643 (10.255.255.251)
> compute-0-0.local_58959 (10.255.255.250)
>
> Is there a problem in the way that mpich2 was built, in the way mpd
> is running, or in the Fortran 95 code?
> How do I debug the problem?
>
> Thank you for your help.
>
>