[MPICH2-dev] Long jobs dying with sigkill (signal 9)
Michael Heinz
mheinz at silverstorm.com
Mon Sep 19 09:41:00 CDT 2005
"rank 3 in job 1 st47_38138 caused collective abort of all ranks
exit status of rank 3: killed by signal 9"
When running 32-processor tests of MPICH2, the longer tests
frequently fail with this error message.
Looking through MPICH2 itself, I can't find any place where a SIGKILL
is triggered. MPD sends SIGKILLs under some conditions, but I can't
figure out why it would be doing this at this point (the job appears
to be running normally).
Which rank gets killed varies from run to run, but the jobs always
appear to be running correctly when they are killed.
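For reference, a message like "killed by signal 9" is normally derived from the wait status of the dead process, which records which signal ended it but not who sent it. The following is a minimal sketch in Python, not MPD's actual code; the names `describe_exit` and `run_and_kill` are hypothetical, and it assumes a POSIX system where SIGKILL is signal 9.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable status string."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


def run_and_kill():
    """Start a long-running child, SIGKILL it, and describe how it ended."""
    # The child would sleep for a minute if left alone.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # SIGKILL cannot be caught or ignored by the child.
    child.send_signal(signal.SIGKILL)
    child.wait()
    return describe_exit(child.returncode)


if __name__ == "__main__":
    print(run_and_kill())
```

The parent sees only the signal number in the wait status, so the same message appears no matter which process delivered the SIGKILL.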
Has anyone seen anything like this?