[MPICH] MPI Abort by user Aborting program

Marcelino Mata mmata at multimatic.com
Tue Jun 14 17:44:43 CDT 2005


I hope everyone can excuse a question that may not be the best fit for
this list (it is not MPICH2 related).
 
Due to application requirements I must use MPICH 1.2.x or LAM to run
parallel jobs.  Jobs run on RedHat 9 x86 SMP nodes and are launched via
rsh by an internally developed queuing program.  In the past we used
LAM, but I am trying to migrate to MPICH due to other technical
requirements.  The problem I'm having is that MPICH's error handling
does not seem to be as clean as LAM's.  If someone queues a job that
crashes (not due to MPICH errors), the following is produced:
 
[1] MPI Abort by user Aborting program !
[1] Aborting program!
 
This error is fine, but I need the command prompt to return after this
condition for the queuing program to work correctly.  The prompt returns
when I log in locally on the cluster and launch the job, but not if I
start the job via rsh or ssh.  I tested this with both Linux and Solaris
rsh/ssh clients.
 
Running the same job with LAM still terminates it (as it should), but
the prompt always returns (local login or rsh launch).  With MPICH one
process is always left hung, while LAM seems to terminate all the
processes.  This might explain why the prompt behaves differently with
MPICH over rsh than with LAM over rsh.
 
I tested this condition with MPICH 1.2.5.1a, 1.2.5.3 and 1.2.6.  
 
Jobs are launched with this command:
 
mpirun -machinefile nodelist -np 2 program.sh 
 
The following process is launched :
 
program.sh -p4pg /datadir/P124384 -p4wd /datadir 

I tried creating a bash wrapper script for the mpirun command, using
the trap command like this:
 
trap "kill `pgrep -f program.sh -l | grep datadir| awk '{print $1}'`" 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 19 20
 
I thought the [1] response from MPICH would be caught by trap, but maybe
I'm not using trap correctly, or I'm using the wrong approach or command
for the problem.  The trap does fire if 0 is added to the signal list
(which just proves the trap itself works).
 
Has anyone encountered this type of problem, or can anyone suggest how
to avoid it?
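
One possible approach, sketched under assumptions: if mpirun itself
returns inside the wrapper (as the local-login case suggests), the
wrapper could clean up stragglers after the fact instead of relying on
a signal.  The helper names kill_matching and run_with_cleanup are
hypothetical, and the pattern in the usage comment is only a guess
based on the command line shown earlier.

```shell
#!/bin/bash
# Sketch only: kill_matching and run_with_cleanup are hypothetical names.

# Kill every process whose full command line matches the given pattern.
# The pgrep lookup happens when this function runs, so it sees the
# processes that exist at cleanup time.
kill_matching() {
    local pattern=$1 pids
    pids=$(pgrep -f "$pattern") || return 0   # no match: nothing to do
    kill $pids 2>/dev/null                    # unquoted: one PID per word
    return 0
}

# Run a command, then remove any leftover processes matching the
# pattern, and hand the command's exit status back to the caller.
run_with_cleanup() {
    local pattern=$1 status
    shift
    "$@"
    status=$?
    kill_matching "$pattern"
    return "$status"
}

# Possible use in the wrapper (pattern is an assumption):
#   run_with_cleanup 'program.sh.*datadir' \
#       mpirun -machinefile nodelist -np 2 program.sh
```

pgrep -f matches against the whole command line, so the pattern should
be narrow enough that it cannot match other jobs running on the same
node.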
 
Thanks for any suggestions,
 
Marcelino
Multimatic