<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">I changed the script to end with the following command:<div><br></div><div><div>/share/apps/mpich/bin/mpirun -gdb -np 2 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output</div><div><br></div><div>The log file now contains:</div><div><br></div><div><div>more PtNi04000PT.log</div><div>Warning: no access to tty (Bad file descriptor).</div><div>Thus no job control in this shell.</div><div>Starting execution at Tue May 13 11:27:17 EDT 2008</div><div>0: Traceback (most recent call last):</div><div>0: File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?</div><div>0: write(gdb_sin_fileno,'set confirm off\n')</div><div>0: OSError: [Errno 32] Broken pipe</div><div>1: Traceback (most recent call last):</div><div>1: File "/share/apps/bin/mpdgdbdrv.py", line 75, in ?</div><div>1: write(gdb_sin_fileno,'set confirm off\n')</div><div>1: OSError: [Errno 32] Broken pipe</div><div>Finished at Tue May 13 11:27:17 EDT 2008</div><div><br></div></div><div><br><div>Begin forwarded message:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" color="#000000" style="font: 12.0px Helvetica; color: #000000"><b>From: </b></font><font face="Helvetica" size="3" style="font: 12.0px Helvetica">Margaret Doll &lt;<a href="mailto:Margaret_Doll@brown.edu">Margaret_Doll@brown.edu</a>&gt;</font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" color="#000000" style="font: 12.0px Helvetica; color: #000000"><b>Date: </b></font><font face="Helvetica" size="3" style="font: 12.0px Helvetica">May 13, 2008 10:26:29 AM EDT</font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" 
color="#000000" style="font: 12.0px Helvetica; color: #000000"><b>To: </b></font><font face="Helvetica" size="3" style="font: 12.0px Helvetica"><a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" color="#000000" style="font: 12.0px Helvetica; color: #000000"><b>Subject: </b></font><font face="Helvetica" size="3" style="font: 12.0px Helvetica"><b>rank 0 in job 5 -- caused collective abort of all ranks</b></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><br></div> </div>I am the system manager of a computer cluster running RedHat, kernel 2.6.9-42.0.2.ELsmp. Rocks 4.3 is the cluster software.<br>mpich2 was compiled with the Portland Group compilers 7-1.1. F95 was enabled in the mpich2 build.<br><br>I have a user running a program which ends quickly with the following error:<br><br>$ more output<br> **********warning**********<br> for atom 597 rhobar= NaN<br> NaN<br> **********warning**********<br> for atom 1 rhobar= NaN<br> NaN<br>rank 0 in job 1 compute-0-2.local_33364 caused collective abort of all ranks<br> exit status of rank 0: killed by signal 9<br><br>The log file only contains:<br>more PtNi04000PT.log<br>Warning: no access to tty (Bad file descriptor).<br>Thus no job control in this shell.<br>Starting execution at Tue May 13 09:25:07 EDT 2008<br>Finished at Tue May 13 09:25:08 EDT 2008<br><br>The script used contains:<br><br>/share/apps/mpich/bin/mpiexec -np 16 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam > $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output<br><br><br>The job is submitted using<br><br>qsub ./script<br><br>The default queue contains compute-0-1, compute-0-2, compute-0-3. 
Each compute node has eight cores.<br><br>$ mpdtrace -l<br>ted.mmm.nnn.edu_49144 (128.148.nnn.nnn)<br>compute-0-3.local_33222 (10.255.255.249)<br>compute-0-2.local_33364 (10.255.255.253)<br>compute-0-1.local_42643 (10.255.255.251)<br>compute-0-0.local_58959 (10.255.255.250)<br><br>Is the problem in the way mpich2 was built, in how mpd is running, or in the Fortran 95 code?<br>How can I debug it?<br><br>Thank you for your help.<br><br><br></blockquote></div><br></div></body></html>