[MPICH2-dev] Long jobs dying with sigkill (signal 9)

Michael Heinz mheinz at silverstorm.com
Mon Sep 19 09:41:00 CDT 2005


"rank 3 in job 1  st47_38138   caused collective abort of all ranks
   exit status of rank 3: killed by signal 9"

When running 32-processor tests of MPICH2, the longer tests
frequently fail with this error message.

Looking through MPICH2 itself, I can't find any place where a SIGKILL
is sent. MPD sends SIGKILLs under some conditions, but I can't
figure out why it would do so at this point, since the job appears
to be running normally.
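
For context on the message itself: a process cannot catch SIGKILL, so the
death is only visible to whatever parent reaps it. The following is a
minimal sketch in Python of how a launcher can tell a signal death from
a normal exit. It is not MPD's actual code; describe_exit and
run_and_kill are hypothetical names, and the sleeping child merely
stands in for a long-running rank.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable status string."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal as the negated
        # signal number (e.g. -9 for SIGKILL).
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def run_and_kill():
    """Start a sleeping child, send it SIGKILL, and describe how it ended."""
    # Hypothetical stand-in for a rank: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # SIGKILL cannot be caught or ignored, so the child gets no chance
    # to log anything before it dies.
    child.send_signal(signal.SIGKILL)
    child.wait()
    return describe_exit(child.returncode)


if __name__ == "__main__":
    print(run_and_kill())
```

Because the killed process cannot report anything itself, the sender has
to be identified from the outside, which is why the source of the signal
is hard to pin down from the application's side.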

Which rank gets killed varies from run to run, but the jobs always  
appear to be running correctly when they are killed.

Has anyone seen anything like this?



