<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2800.1498" name=GENERATOR></HEAD>
<BODY>
<DIV><FONT face=Arial size=2><SPAN class=921445721-14062005>I hope everyone can
excuse a question that may be off-topic for this list (it is not MPICH2
related).</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=921445721-14062005></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=921445721-14062005>Due to application
requirements I must use MPICH 1.2.x or LAM to run parallel jobs. Jobs run on
RedHat 9 x86 SMP nodes and are launched via rsh by an internally developed
queuing program. In the past we used LAM, but I am trying to migrate to MPICH
due to other technical requirements. The problem I'm having is that MPICH's
error handling does not seem to be as clean as LAM's. If someone queues a job
that crashes (not because of an MPICH error), the following is
produced:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>[1] MPI Abort by user Aborting program !<BR>[1]
Aborting program!</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=921445721-14062005>This error is fine,
but I need the command prompt to return after this condition for the queuing
program to work correctly. The prompt returns when I log in locally on the
cluster and launch the job, but not when I start the job through rsh or ssh.
I tested this with both Linux and Solaris rsh/ssh
clients.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=921445721-14062005></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=921445721-14062005>Running the same job
with LAM still causes the termination (as it should), but the prompt always
returns (local login or rsh launch). With MPICH one process is always left
hung, while LAM seems to terminate all the processes. This might explain why
the prompt behaves differently with MPICH over rsh than with LAM over
rsh.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>I tested this
condition with MPICH 1.2.5.1a, 1.2.5.3 and 1.2.6. </FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>Jobs are launched
with this command:</FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>mpirun -machinefile
nodelist -np 2 program.sh </FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>The following
process is launched:</FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2>program.sh -p4pg /datadir/P124384 -p4wd /datadir
<BR></FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>I tried creating a
bash wrapper script for the mpirun command that uses the trap command
like this:</FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>trap "kill
`pgrep -f program.sh -l | grep datadir| awk '{print $1}'`" 1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 19 20</FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>I thought the [1]
response from MPICH would be caught by trap, but maybe I'm not using trap
correctly, or I'm using the wrong approach/command for the problem. The trap
command works fine if 0 (the EXIT condition) is added, which at least proves
the trap itself is working.</FONT></SPAN></DIV>
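<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>One thing I am
unsure about in the trap line above: because the backquoted pgrep sits inside
double quotes, bash expands it when the trap is defined (before mpirun has
started anything), not when the trap fires. Signals 9 (SIGKILL) and 19
(SIGSTOP) also cannot be trapped. The "[1] Aborting program!" text is only
output from MPICH, so as far as I can tell it would not trigger a signal trap
in the wrapper by itself. Below is a minimal sketch of the wrapper I would try
next, assuming bash; run_and_reap and reap_child are names I made up for
illustration, and it only kills the local child it started, so it may not
reach a process that is hung on another node.</FONT></SPAN></DIV>
<PRE>
```shell
#!/bin/bash
# Sketch only: run a command in the background, wait for it, and kill it if
# the wrapper itself receives a catchable signal.
# run_and_reap and reap_child are hypothetical helper names.

child_pid=""

reap_child() {
    # Called when the trap fires, so it sees the PID recorded after launch.
    if [ -n "$child_pid" ]; then
        kill "$child_pid" 2>/dev/null
    fi
}

run_and_reap() {
    # Single quotes: the trap body is stored as text and evaluated later,
    # when the signal arrives. SIGKILL and SIGSTOP are left out because
    # they cannot be trapped.
    trap 'reap_child; exit 1' HUP INT QUIT TERM

    "$@" &                 # start the job (for example the mpirun command)
    child_pid=$!           # remember its PID for reap_child

    wait "$child_pid"      # block until the job ends or a signal interrupts
    local status=$?

    trap - HUP INT QUIT TERM   # restore default signal handling
    child_pid=""
    return "$status"       # hand the job's exit status back to the caller
}
```
</PRE>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>The launch line
would then become "run_and_reap mpirun -machinefile nodelist -np 2
program.sh", and run_and_reap returns mpirun's exit status so the queuing
program can inspect it.</FONT></SPAN></DIV>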
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2><SPAN
class=921445721-14062005><FONT face=Arial
size=2></FONT></SPAN></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2><SPAN
class=921445721-14062005><FONT face=Arial size=2>Has anyone encountered this
type of problem, or can anyone suggest a way to avoid
it?</FONT></SPAN></FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2><SPAN
class=921445721-14062005></SPAN></FONT></SPAN><SPAN
class=921445721-14062005><FONT face=Arial size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial size=2>Thanks for any
suggestions,</FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2>Marcelino</FONT></SPAN></DIV>
<DIV><SPAN class=921445721-14062005><FONT face=Arial
size=2>Multimatic</FONT></SPAN></DIV></BODY></HTML>