[mpich-discuss] RE: [MPI #11279] Fwd: rank 0 in job 5 -- caused collective abort of all ranks

Rajeev Thakur thakur at mcs.anl.gov
Tue May 13 13:24:47 CDT 2008


Hard to say what the problem might be, but it is likely in the application
or perhaps some mismatch of compilers. For debugging with gdb, don't
redirect the output to a file; just do

/share/apps/mpich/bin/mpiexec -gdb -n 2 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam
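
For example, run it from an interactive terminal and type ordinary gdb
commands at the prompt (a sketch, assuming the mpd-based mpiexec from this
mpich2 install; the prompt is prefixed with the ranks the command goes to and
may look slightly different on your system):

/share/apps/mpich/bin/mpiexec -gdb -n 2 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam
(gdb) run          # start both ranks under gdb
(gdb) bt           # after the abort, print a backtrace from each rank

Because the session is interactive, redirecting stdout to a file (as in the
log below, where mpdgdbdrv.py exits with a broken pipe) keeps it from working.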
 
Rajeev
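
P.S. Since the output below shows rhobar becoming NaN, it may also help to
rebuild the application with floating-point trapping enabled, so the run
aborts at the first invalid operation instead of propagating NaNs. A sketch,
assuming the mpif90 wrapper from your mpich2 install and the Portland Group
compilers mentioned below; the source file names are hypothetical, and the
flag spelling should be checked against your pgf95 documentation:

/share/apps/mpich/bin/mpif90 -g -Ktrap=fp -o ccp_meam meam*.f90   # -g keeps debug symbols;
                                                                  # -Ktrap=fp traps invalid operations,
                                                                  # divide by zero, and overflow

Running that binary under "mpiexec -gdb" as above should then stop at the
point where the first NaN is generated.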


  _____  

From: Margaret Doll [mailto:Margaret_Doll at brown.edu] 
Sent: Tuesday, May 13, 2008 10:32 AM
To: mpich-discuss at mcs.anl.gov; mpi-maint at mcs.anl.gov
Cc: mpi-maint at mcs.anl.gov
Subject: [MPI #11279] Fwd: rank 0 in job 5 -- caused collective abort of all
ranks


I changed the script to end in 

/share/apps/mpich/bin/mpirun -gdb -np 2 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output

The contents of the log file:

more PtNi04000PT.log
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting execution at Tue May 13 11:27:17 EDT 2008
0: Traceback (most recent call last):
0:   File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?
0:     write(gdb_sin_fileno,'set confirm off\n')
0: OSError: [Errno 32] Broken pipe
1: Traceback (most recent call last):
1:   File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?
1:     write(gdb_sin_fileno,'set confirm off\n')
1: OSError: [Errno 32] Broken pipe
Finished at Tue May 13 11:27:17 EDT 2008


Begin forwarded message:


From: Margaret Doll <Margaret_Doll at brown.edu>
Date: May 13, 2008 10:26:29 AM EDT
To: mpich-discuss at mcs.anl.gov
Subject: rank 0 in job 5 -- caused collective abort of all ranks

I am the system manager of a computer cluster running RedHat
2.6.9-42.0.2.ELsmp; Rocks 4.3 is the cluster software. mpich2 has been
compiled using the Portland Group compilers 7-1.1, with F95 enabled in the
mpich2 build.

I have a user running a program that ends quickly with the following
error:

$ more output
**********warning**********
for atom           597  rhobar=                        NaN
                      NaN
**********warning**********
for atom             1  rhobar=                        NaN
                      NaN
rank 0 in job 1  compute-0-2.local_33364   caused collective abort of all ranks
 exit status of rank 0: killed by signal 9

The log file only contains:
more PtNi04000PT.log
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting execution at Tue May 13 09:25:07 EDT 2008
Finished at Tue May 13 09:25:08 EDT 2008

The script used contains:

/share/apps/mpich/bin/mpiexec -np 16 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output


The job is submitted using

qsub ./script

The default queue contains compute-0-1, compute-0-2, compute-0-3.  Each
compute node has eight cores.

$ mpdtrace -l
ted.mmm.nnn.edu_49144 (128.148.nnn.nnn)
compute-0-3.local_33222 (10.255.255.249)
compute-0-2.local_33364 (10.255.255.253)
compute-0-1.local_42643 (10.255.255.251)
compute-0-0.local_58959 (10.255.255.250)

Is there a problem with the way mpich2 was built, with the way mpd is
running, or with the Fortran 95 code itself?
How do I debug the problem?

Thank you for your help.




