[MPICH] RE: [MPICH2-dev] Long jobs dying with sigkill (signal 9)
Rajeev Thakur
thakur at mcs.anl.gov
Mon Sep 19 09:54:22 CDT 2005
That's because when you run "make testing", the test/mpi/runtests script sets
MPIEXEC_TIMEOUT to 180 seconds, so any test that runs longer than that is
killed. You can edit that script (or the runtests.in file it is generated
from) to increase that number.
Rajeev
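
For running a single long test by hand, outside the runtests script, the same
limit can be raised through the environment. The following is only a minimal
sketch, assuming that mpiexec reads MPIEXEC_TIMEOUT (in seconds) from its
environment as described above. The helper names env_with_timeout and
run_long_test are hypothetical, as is the 600-second value, and the command
passed in is left to the caller.

```python
import os
import subprocess


def env_with_timeout(seconds, base=None):
    """Return a copy of the environment with MPIEXEC_TIMEOUT set, in seconds."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    # Copy so the caller's environment (or os.environ) is left untouched.
    env = dict(os.environ if base is None else base)
    env["MPIEXEC_TIMEOUT"] = str(int(seconds))
    return env


def run_long_test(command, seconds):
    """Run one test command (for example an mpiexec invocation) under the limit."""
    # check=False: a non-zero exit status is returned to the caller, not raised.
    return subprocess.run(command, env=env_with_timeout(seconds), check=False)


if __name__ == "__main__":
    # Show the value that a child process would see (illustrative value only).
    print(env_with_timeout(600, base={})["MPIEXEC_TIMEOUT"])
```

This does not change what "make testing" does, since runtests sets the
variable itself. For the test suite, the value in runtests (or runtests.in)
still has to be edited.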
> -----Original Message-----
> From: owner-mpich2-dev at mcs.anl.gov
> [mailto:owner-mpich2-dev at mcs.anl.gov] On Behalf Of Michael Heinz
> Sent: Monday, September 19, 2005 9:41 AM
> To: mpich-discuss at mcs.anl.gov; mpich2-dev at mcs.anl.gov
> Subject: [MPICH2-dev] Long jobs dying with sigkill (signal 9)
>
> "rank 3 in job 1 st47_38138 caused collective abort of all ranks
> exit status of rank 3: killed by signal 9"
>
> When running 32-processor tests of MPICH2, the longer tests
> frequently fail with this error message.
>
> Looking through MPICH2 itself, I can't find any place where a SIGKILL
> is triggered. MPD sends SIGKILLs under some conditions, but I can't
> figure out why it would be doing this at this point (the job appears
> to be running normally).
>
> Which rank gets killed varies from run to run, but the jobs always
> appear to be running correctly when they are killed.
>
> Has anyone seen anything like this?
>
>