[mpich-discuss] RE: [MPI #11279] Fwd: rank 0 in job 5 -- caused collective abort of all ranks
Rajeev Thakur
thakur at mcs.anl.gov
Tue May 13 13:24:47 CDT 2008
Hard to say what the problem might be, but it is most likely in the application itself, or perhaps a mismatch of compilers. For debugging with gdb, don't redirect the output to a file; just run

/share/apps/mpich/bin/mpiexec -gdb -n 2 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam
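If you run it interactively like that (from a shell where the mpd ring is up, rather than through the batch script), you get a prompt at which gdb commands are broadcast to all the ranks, and the output comes back prefixed with the rank number. With the mpd-based mpiexec a session looks roughly like this (the exact prompt may vary with your build):

(mpigdb) run
(mpigdb) bt
0:  #0  ... backtrace from rank 0 ...
1:  #0  ... backtrace from rank 1 ...

The broken-pipe tracebacks from mpdgdbdrv.py in your log are most likely the gdb driver losing its terminal when stdout is redirected, which is why the redirection has to go.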
Rajeev
_____
From: Margaret Doll [mailto:Margaret_Doll at brown.edu]
Sent: Tuesday, May 13, 2008 10:32 AM
To: mpich-discuss at mcs.anl.gov; mpi-maint at mcs.anl.gov
Cc: mpi-maint at mcs.anl.gov
Subject: [MPI #11279] Fwd: rank 0 in job 5 -- caused collective abort of all ranks
I changed the script to end in

/share/apps/mpich/bin/mpirun -gdb -np 2 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output
The contents of the log file:
more PtNi04000PT.log
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting execution at Tue May 13 11:27:17 EDT 2008
0: Traceback (most recent call last):
0: File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?
0: write(gdb_sin_fileno,'set confirm off\n')
0: OSError: [Errno 32] Broken pipe
1: Traceback (most recent call last):
1: File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?
1: write(gdb_sin_fileno,'set confirm off\n')
1: OSError: [Errno 32] Broken pipe
Finished at Tue May 13 11:27:17 EDT 2008
Begin forwarded message:
From: Margaret Doll <Margaret_Doll at brown.edu>
Date: May 13, 2008 10:26:29 AM EDT
To: mpich-discuss at mcs.anl.gov
Subject: rank 0 in job 5 -- caused collective abort of all ranks
I am the system manager of a computer cluster running Red Hat (kernel 2.6.9-42.0.2.ELsmp); Rocks 4.3 is the cluster software. mpich2 was compiled with the Portland Group 7-1.1 compilers, with F95 enabled in the mpich2 build.
I have a user running a program which ends quickly with the following
error:
$ more output
**********warning**********
for atom 597 rhobar= NaN
NaN
**********warning**********
for atom 1 rhobar= NaN
NaN
rank 0 in job 1 compute-0-2.local_33364 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
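The rhobar = NaN warnings suggest the arithmetic has already gone wrong before the abort. Purely as an illustration (the application is Fortran 95 and these names are made up, not taken from its source), a guard of roughly this shape would turn the later signal-9 kill into an explicit abort at the first bad atom:

#include <math.h>
#include <stdio.h>
#include <mpi.h>

/* Illustrative only: check_finite, rhobar and atom are invented names. */
void check_finite(double rhobar, int atom, MPI_Comm comm)
{
    if (isnan(rhobar) || isinf(rhobar)) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        fprintf(stderr, "rank %d: rhobar for atom %d is not finite (%g)\n",
                rank, atom, rhobar);
        MPI_Abort(comm, 1);   /* bring down all ranks with a clear exit code */
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double zero = 0.0;
    check_finite(zero / zero, 597, MPI_COMM_WORLD);  /* deliberately NaN */
    MPI_Finalize();                    /* not reached: check_finite aborts */
    return 0;
}

If the Portland compiler supports it, the Fortran equivalent is ieee_is_nan from the ieee_arithmetic module, or the old rhobar /= rhobar test.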
The log file only contains:
more PtNi04000PT.log
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting execution at Tue May 13 09:25:07 EDT 2008
Finished at Tue May 13 09:25:08 EDT 2008
The script used contains:
/share/apps/mpich/bin/mpiexec -np 16 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output
The job is submitted using
qsub ./script
The default queue contains compute-0-1, compute-0-2, compute-0-3. Each
compute node has eight cores.
$ mpdtrace -l
ted.mmm.nnn.edu_49144 (128.148.nnn.nnn)
compute-0-3.local_33222 (10.255.255.249)
compute-0-2.local_33364 (10.255.255.253)
compute-0-1.local_42643 (10.255.255.251)
compute-0-0.local_58959 (10.255.255.250)
Is there a problem with the way mpich2 was built, with the way mpd is running, or with the Fortran 95 code?
How do I debug the problem?
Thank you for your help.