<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.6000.16640" name=GENERATOR></HEAD>
<BODY
style="WORD-WRAP: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space">
<DIV dir=ltr align=left><SPAN class=120192318-13052008><FONT face=Arial
color=#0000ff size=2>It is hard to say what the problem might be, but it is likely in
the application or perhaps a mismatch of compilers. To debug with gdb,
don't redirect the output to a file; just run <FONT face="Times New Roman"
color=#000000 size=3>/share/apps/mpich/bin/mpiexec -gdb -n 2
$HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam
</FONT></FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=120192318-13052008></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=120192318-13052008><FONT face=Arial
color=#0000ff size=2>Rajeev</FONT></SPAN></DIV><BR>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> Margaret Doll
[mailto:Margaret_Doll@brown.edu] <BR><B>Sent:</B> Tuesday, May 13, 2008 10:32
AM<BR><B>To:</B> mpich-discuss@mcs.anl.gov;
mpi-maint@mcs.anl.gov<BR><B>Cc:</B> mpi-maint@mcs.anl.gov<BR><B>Subject:</B>
[MPI #11279] Fwd: rank 0 in job 5 -- caused collective abort of all
ranks<BR></FONT><BR></DIV>
<DIV></DIV>I changed the script to end in
<DIV><BR></DIV>
<DIV>
<DIV>/share/apps/mpich/bin/mpirun -gdb -np 2
$HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam >
$HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output</DIV>
<DIV><BR></DIV>
<DIV>The contents of the log file:</DIV>
<DIV><BR></DIV>
<DIV>
<DIV>more PtNi04000PT.log</DIV>
<DIV>Warning: no access to tty (Bad file descriptor).</DIV>
<DIV>Thus no job control in this shell.</DIV>
<DIV>Starting execution at Tue May 13 11:27:17 EDT 2008</DIV>
<DIV>0: Traceback (most recent call last):</DIV>
<DIV>0: File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?</DIV>
<DIV>0: write(gdb_sin_fileno,'set confirm off\n')</DIV>
<DIV>0: OSError: [Errno 32] Broken pipe</DIV>
<DIV>1: Traceback (most recent call last):</DIV>
<DIV>1: File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?</DIV>
<DIV>1: write(gdb_sin_fileno,'set confirm off\n')</DIV>
<DIV>1: OSError: [Errno 32] Broken pipe</DIV>
<DIV>Finished at Tue May 13 11:27:17 EDT 2008</DIV>
<DIV><BR></DIV></DIV>
<DIV><BR>
<DIV>Begin forwarded message:</DIV><BR class=Apple-interchange-newline>
<BLOCKQUOTE type="cite">
<DIV>
<DIV style="MARGIN: 0px"><FONT style="FONT: 12px Helvetica; COLOR: #000000"
face=Helvetica size=3><B>From: </B></FONT><FONT style="FONT: 12px Helvetica"
face=Helvetica size=3>Margaret Doll &lt;<A
href="mailto:Margaret_Doll@brown.edu">Margaret_Doll@brown.edu</A>&gt;</FONT></DIV>
<DIV style="MARGIN: 0px"><FONT style="FONT: 12px Helvetica; COLOR: #000000"
face=Helvetica color=#000000 size=3><B>Date: </B></FONT><FONT
style="FONT: 12px Helvetica" face=Helvetica size=3>May 13, 2008 10:26:29 AM
EDT</FONT></DIV>
<DIV style="MARGIN: 0px"><FONT style="FONT: 12px Helvetica; COLOR: #000000"
face=Helvetica color=#000000 size=3><B>To: </B></FONT><FONT
style="FONT: 12px Helvetica" face=Helvetica size=3><A
href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</A></FONT></DIV>
<DIV style="MARGIN: 0px"><FONT style="FONT: 12px Helvetica; COLOR: #000000"
face=Helvetica color=#000000 size=3><B>Subject: </B></FONT><FONT
style="FONT: 12px Helvetica" face=Helvetica size=3><B>rank 0 in job 5 --
caused collective abort of all ranks</B></FONT></DIV>
<DIV style="MIN-HEIGHT: 14px; MARGIN: 0px"><BR></DIV></DIV>I am a system
manager of a computer cluster running Red Hat Linux (kernel 2.6.9-42.0.2.ELsmp)
with Rocks 4.3 as the cluster software.<BR>mpich2 has been compiled using
Portland Group compilers 7-1.1. F95 was enabled in the mpich2
build.<BR><BR>I have a user running a program which ends quickly with
the following error:<BR><BR>$ more
output<BR>**********warning**********<BR>for atom
597
rhobar=
NaN<BR> NaN<BR>**********warning**********<BR>for
atom
1
rhobar=
NaN<BR> NaN<BR>rank
0 in job 1 compute-0-2.local_33364 caused collective abort
of all ranks<BR> exit status of rank 0: killed by signal
9<BR><BR>The log file only contains:<BR>more PtNi04000PT.log<BR>Warning: no
access to tty (Bad file descriptor).<BR>Thus no job control in this
shell.<BR>Starting execution at Tue May 13 09:25:07 EDT 2008<BR>Finished at
Tue May 13 09:25:08 EDT 2008<BR><BR>The script used
contains:<BR><BR>/share/apps/mpich/bin/mpiexec -np 16
$HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam >
$HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output<BR><BR><BR>The job is
submitted using<BR><BR>qsub ./script<BR><BR>The default queue contains
compute-0-1, compute-0-2, compute-0-3. Each compute node has eight
cores.<BR><BR>$ mpdtrace -l<BR>ted.mmm.nnn.edu_49144
(128.148.nnn.nnn)<BR>compute-0-3.local_33222
(10.255.255.249)<BR>compute-0-2.local_33364
(10.255.255.253)<BR>compute-0-1.local_42643
(10.255.255.251)<BR>compute-0-0.local_58959 (10.255.255.250)<BR><BR>Is the
problem in the way that mpich2 was built, in how mpd is running, or in the
Fortran 95 code?<BR>How do I debug the problem?<BR><BR>Thank you for your
help.<BR><BR><BR></BLOCKQUOTE></DIV><BR></DIV></BLOCKQUOTE></BODY></HTML>