[MPICH2-dev] Long jobs dying with sigkill (signal 9)

Rajeev Thakur thakur at mcs.anl.gov
Mon Sep 19 09:54:22 CDT 2005


That's because when you run "make testing", the test/mpi/runtests script sets
MPIEXEC_TIMEOUT to 180 seconds, so any test that runs longer than that is
killed. You can edit that script (or the runtests.in file) to increase that
number.
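
For long runs launched outside the test suite, the same limit can be raised by setting MPIEXEC_TIMEOUT in the environment that mpiexec inherits. The following is only a minimal sketch of that idea, not something taken from the MPICH2 scripts: the helper names `mpiexec_env` and `run_long_job`, the 1800-second value, and the `./long_test` binary mentioned afterwards are all hypothetical.

```python
import os
import subprocess


def mpiexec_env(timeout_seconds, base_env=None):
    """Return a copy of the environment with MPIEXEC_TIMEOUT set (in seconds).

    The caller's environment is left untouched; only the copy is modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MPIEXEC_TIMEOUT"] = str(int(timeout_seconds))
    return env


def run_long_job(cmd, timeout_seconds):
    """Launch cmd (normally an mpiexec command line) with the raised limit.

    Returns the exit status reported for the launched command.
    """
    return subprocess.run(cmd, env=mpiexec_env(timeout_seconds)).returncode
```

For example, `run_long_job(["mpiexec", "-n", "32", "./long_test"], 1800)` would give a 32-process job thirty minutes instead of three, assuming mpiexec is on the PATH and reads the variable as described above.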

Rajeev
 

> -----Original Message-----
> From: owner-mpich2-dev at mcs.anl.gov 
> [mailto:owner-mpich2-dev at mcs.anl.gov] On Behalf Of Michael Heinz
> Sent: Monday, September 19, 2005 9:41 AM
> To: mpich-discuss at mcs.anl.gov; mpich2-dev at mcs.anl.gov
> Subject: [MPICH2-dev] Long jobs dying with sigkill (signal 9)
> 
> "rank 3 in job 1  st47_38138   caused collective abort of all ranks
>    exit status of rank 3: killed by signal 9"
> 
> When trying to do 32-processor tests of MPICH2, the longer tests are
> frequently failing with this error message.
> 
> Looking through MPICH2 itself, I can't find any place where a SIGKILL
> is triggered. MPD sends SIGKILLs under some conditions, but I can't
> figure out why it would be doing this at this point (the job appears
> to be running normally).
> 
> Which rank gets killed varies from run to run, but the jobs always  
> appear to be running correctly when they are killed.
> 
> Has anyone seen anything like this?
> 
> 



