[mpich-discuss] Suspend jobs that use MPICH2 with Hydra
Reuti
reuti at staff.uni-marburg.de
Mon Jun 11 11:47:40 CDT 2012
On 07.05.2012 at 14:57, Shan-ho Tsai wrote:
> Pavan, thank you so much for creating a ticket to add this
> support to Hydra. I really appreciate it.
>
> Ju, thank you very much for your suggestion. We currently use
> a variant of SGE as our job scheduler. However, when we suspend
> an MPICH2/Hydra job, the master process and the slave processes
> that are on the same host as the master get suspended, but the
> slave processes on other hosts continue to run (they do not get
> suspended). If someone is aware of a way to get SGE to suspend all
> processes properly in such a case, I would appreciate hearing how
> that is done.
Which version of SGE are you using? IIRC, only a minimal patch was necessary to also suspend slave tasks on other nodes.
-- Reuti
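
For the SIGSTOP attempt described in the original question quoted further down, a minimal Python sketch of the local half of the problem may help: a signal sent to a single PID stops only that process, whereas a signal sent to a process group reaches every local member of that group. This is only an illustration under stated assumptions, not a Hydra or SGE feature. The helper names (launch_in_own_group, suspend_group, resume_group) are hypothetical, a sleeping Python child stands in for a launcher, and whether Hydra keeps its local ranks in the launcher's process group is not confirmed anywhere in this thread. Processes started on other hosts are never part of a local process group, so this cannot replace scheduler-side or Hydra-side support.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(argv):
    """Start argv as the leader of a new session and process group."""
    # start_new_session=True makes the child call setsid(), so its
    # process group id equals its pid.
    return subprocess.Popen(argv, start_new_session=True)


def suspend_group(pgid):
    """Send SIGSTOP to every process in the local process group pgid."""
    os.killpg(pgid, signal.SIGSTOP)


def resume_group(pgid):
    """Send SIGCONT to every process in the local process group pgid."""
    os.killpg(pgid, signal.SIGCONT)


if __name__ == "__main__":
    # Stand-in for a local launcher; this is NOT mpiexec.
    child = launch_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(5)"]
    )
    suspend_group(child.pid)
    # WUNTRACED makes waitpid return once the child has stopped.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    print("stopped:", os.WIFSTOPPED(status))
    resume_group(child.pid)
    child.terminate()
    child.wait()
```

From a shell, the equivalent is a negative PID argument to kill (for example, kill -STOP -- -PGID). In both cases only processes on the local host are affected; ranks on remote nodes keep running unless something on those nodes signals them as well.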
> Thank you very much again!
> Shan-Ho
>
> ----------------------------------------------------
> Shan-Ho Tsai
> University of Georgia, Athens GA
>
> From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] on behalf of Ju JiaJia [jujj603 at gmail.com]
> Sent: Friday, May 04, 2012 9:37 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Suspend jobs that use MPICH2 with Hydra
>
> I think you can use a resource manager and scheduler to do this, like torque + maui. You can suspend and resume jobs.
>
> On Sat, May 5, 2012 at 8:46 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> Hello,
>
> We don't support this right now. I've created a ticket for it.
>
> https://trac.mcs.anl.gov/projects/mpich2/ticket/1627
>
> Please add yourself to the cc list of this ticket, if you'd like to be informed about updates on this issue.
>
> -- Pavan
>
>
> On 05/04/2012 12:54 PM, Shan-ho Tsai wrote:
> Hello all,
> We have mpich2 1.4.1p1 installed on a RHEL5 cluster
> and sometimes have the need to suspend all jobs clusterwide.
>
> Is there a way to suspend MPICH2 jobs that use Hydra, in
> such a way that the master process and all slave processes
> (on multiple nodes) get properly suspended?
>
> If there is a way to do this, what is the procedure? Is there
> a signal that we could send to mpiexec?
>
> I tried sending a SIGSTOP to mpiexec, but only mpiexec
> got suspended; the actual a.out processes continued to run.
>
> I really appreciate any suggestions.
> thank you,
> Shan-Ho
>
> ----------------------------------------------------
> Shan-Ho Tsai
> University of Georgia, Athens GA
>
>
>
> _______________________________________________
> mpich-discuss mailing list mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji