[MPICH] MPI Abort by user Aborting program

Wed Jun 15 07:54:15 CDT 2005

Hi,

As we don't know which signal(s) your self-developed queuing system 
sends to which node, it's not easy to answer. Are you sending SIGKILL, 
SIGTERM, ...? Are you killing all the process groups on the nodes?
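
For reference, a minimal sketch of what "killing the process group" 
could look like in a bash wrapper. It assumes a non-interactive bash 
script and the setsid utility from util-linux; run_in_group is a 
hypothetical helper name, and the sketch only reaches processes on the 
local node, not ones started on other nodes via rsh:

```shell
#!/bin/bash
# run_in_group CMD [ARGS...]  (hypothetical helper, not part of MPICH)
# Runs CMD as the leader of a new session/process group, waits for it,
# then signals the whole group so no local child is left behind.
run_in_group() {
    setsid "$@" &
    # Assumption: in a non-interactive script setsid does not fork,
    # so this PID is also the ID of the new process group.
    pgid=$!
    # Forward common termination signals to the whole group.
    trap 'kill -- "-$pgid" 2>/dev/null' HUP INT QUIT TERM
    wait "$pgid"
    status=$?
    # Sweep up anything in the group that is still running.
    kill -- "-$pgid" 2>/dev/null
    trap - HUP INT QUIT TERM
    return "$status"
}

# Example: run_in_group mpirun -machinefile nodelist -np 2 program.sh
```

The exit status of the command is passed through, so a queuing program 
can still tell a failed job from a successful one.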

I'd suggest looking into a well-established queuing system, preferably 
SUN GridEngine. It offers a Tight Integration of parallel jobs (whether 
you use MPICH, MPICH2, LAM/MPI or others), and failed jobs will be 
removed from the system regardless of their own trap handling. As it's 
an open source project, you can look into it to check whether it fits 
your requirements:

http://gridengine.sunsource.net
http://gridengine.sunsource.net/howto/mpich-integration.html
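
One remark on the trap line quoted below: inside double quotes the 
backtick substitution is expanded once, when the trap is defined, not 
when a signal arrives; single quotes defer it. Signals 9 (SIGKILL) and 
19 (SIGSTOP on Linux) cannot be trapped at all, and a trap only fires 
when the wrapper shell itself receives a signal; a message printed by a 
child does not trigger it. A minimal sketch of the quoting difference, 
using made-up variable names:

```shell
#!/bin/bash
# Double quotes: expansions happen immediately, when trap is defined.
# Single quotes: the command text is stored and expanded at signal time.
state="before"
trap "early=$state" USR1      # "early=before" is baked in right now
trap 'late=$state' USR2       # $state is looked up when USR2 arrives
state="after"
kill -USR1 $$                 # deliver both signals to this shell
kill -USR2 $$
echo "double-quoted handler saw: $early"
echo "single-quoted handler saw: $late"
```

The double-quoted handler reports the value from definition time, the 
single-quoted one the value at signal time. The same applies to a 
pgrep lookup placed inside the trap string.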

Cheers - Reuti

Marcelino Mata wrote:
> I hope everyone can excuse what might not be the best type of question 
> for this list (not MPICH2 related)
>  
> Due to application requirements I must use MPICH 1.2.x or LAM to run 
> parallel jobs.  Jobs are run on RedHat 9 x86 SMP nodes.  Jobs are 
> launched via rsh by an internally developed queuing program.  In the 
> past we used LAM, but I am trying to migrate to MPICH due to other 
> technical requirements.  The problem I'm having is that MPICH's error 
> handling does not seem to be as clean as LAM's.  If someone queues a 
> job that crashes (not due to MPICH errors), the following is 
> produced:
>  
> [1] MPI Abort by user Aborting program !
> [1] Aborting program!
>  
> This error is fine but my problem is that I need the command prompt to 
> return after this condition for the queuing program to work 
> correctly.  The prompt returns when I log in locally on the cluster and 
> launch the job, but not if I start the job via rsh or ssh.  I tested 
> this with both Linux and Solaris rsh/ssh clients. 
>  
> Running the same job with LAM still causes the termination (as it 
> should) but the prompt always returns (local login or rsh launch).  With 
> MPICH one process is always hung while LAM seems to terminate all the 
> processes.  This might explain why the prompt behaves differently with 
> MPICH rsh vs LAM rsh.
>  
> I tested this condition with MPICH 1.2.5.1a, 1.2.5.3 and 1.2.6. 
>  
> Jobs are launched with this command:
>  
> mpirun -machinefile nodelist -np 2 program.sh
>  
> The following process is launched :
>  
> program.sh -p4pg /datadir/P124384 -p4wd /datadir
>  
> I tried creating a bash shell wrapper for the mpirun command and 
> tried to use the trap command like this :
>  
> trap "kill `pgrep -f program.sh -l | grep datadir| awk '{print $1}'`" 1 
> 2 3 4 5 6 7 8 9 10 11 12 13 14 15 19 20
>  
> I thought the [1] response from MPICH would be caught by trap, but maybe 
> I'm not using trap correctly.  Or am I using the wrong approach/command 
> for the problem?  The trap command works fine if 0 (shell exit) is 
> added, which just proves the handler itself runs.
>  
> Has anyone encountered this type of problem, or can anyone offer 
> suggestions to avoid this type of condition? 
>  
> Thanks for any suggestion,
>  
> Marcelino
> Multimatic