[mpich-discuss] Suspend jobs that use MPICH2 with Hydra

Reuti reuti at staff.uni-marburg.de
Mon Jun 11 11:47:40 CDT 2012


On 07.05.2012 at 14:57, Shan-ho Tsai wrote:

> Pavan, thank you so much for creating a ticket to add this 
> support to Hydra. I really appreciate it.
> 
> Ju, thank you very much for your suggestion. We currently use 
> a variant of SGE as our job scheduler. However, when we suspend
> an MPICH2/Hydra job, the master process and the slave processes
> that are on the same host as the master get suspended, but the 
> slave processes on other hosts continue to run (they do not get
> suspended). If someone is aware of a way to get SGE to suspend all
> processes properly in such a case, I would appreciate hearing how 
> that is done.

Which version of SGE are you using? IIRC, only a minimal patch was necessary to also suspend slave tasks on other nodes.

-- Reuti


> Thank you very much again!
> Shan-Ho
> 
> ----------------------------------------------------
> Shan-Ho Tsai
> University of Georgia, Athens GA
> 
> From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] on behalf of Ju JiaJia [jujj603 at gmail.com]
> Sent: Friday, May 04, 2012 9:37 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Suspend jobs that use MPICH2 with Hydra
> 
> I think you can use a resource manager and scheduler to do this, like torque + maui. You can suspend and resume jobs.
> 
> On Sat, May 5, 2012 at 8:46 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> Hello,
> 
> We don't support this right now.  I've created a ticket for it.
> 
> https://trac.mcs.anl.gov/projects/mpich2/ticket/1627
> 
> Please add yourself to the cc list of this ticket, if you'd like to be informed about updates on this issue.
> 
>  -- Pavan
> 
> 
> On 05/04/2012 12:54 PM, Shan-ho Tsai wrote:
> Hello all,
> We have mpich2 1.4.1p1 installed on a RHEL5 cluster
> and sometimes have the need to suspend all jobs clusterwide.
> 
> Is there a way to suspend MPICH2 jobs that use Hydra, in
> such a way that the master process and all slave processes
> (on multiple nodes) get properly suspended?
> 
> If there is a way to do this, what is the procedure? Is there
> a signal that we could send to mpiexec?
> 
> I tried sending a SIGSTOP to mpiexec, but only mpiexec
> got suspended; the actual a.out processes continued to run.
> 
> I really appreciate any suggestions.
> Thank you,
> Shan-Ho
> 
> ----------------------------------------------------
> Shan-Ho Tsai
> University of Georgia, Athens GA
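
The SIGSTOP observation in the original question above matches ordinary POSIX signal delivery: a signal sent to one PID stops only that process, not its children. The following is a minimal single-host sketch of that effect, not a description of how Hydra or SGE work internally. It assumes Linux (it reads /proc/<pid>/stat), and the launcher script is a hypothetical stand-in for mpiexec that starts one sleeping worker. It compares signalling the launcher PID with signalling the launcher's whole process group. Processes on other hosts are outside any local process group, so this technique alone cannot reach them.

```python
import os
import signal
import subprocess
import sys
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only), e.g. 'S' or 'T'."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field follows the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until the process is in one of the wanted states or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = proc_state(pid)
        if state in wanted:
            return state
        time.sleep(0.05)
    return proc_state(pid)


# Hypothetical stand-in for mpiexec: starts one worker, prints its PID, waits for it.
LAUNCHER = (
    "import subprocess, sys\n"
    "w = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(w.pid, flush=True)\n"
    "w.wait()\n"
)


def demo():
    launcher = subprocess.Popen(
        [sys.executable, "-c", LAUNCHER],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # own process group, so killpg does not hit this script
    )
    worker_pid = int(launcher.stdout.readline())
    results = {}
    try:
        # 1) Signal the launcher PID only: the worker keeps running.
        os.kill(launcher.pid, signal.SIGSTOP)
        results["launcher_after_kill"] = wait_for_state(launcher.pid, "Tt")
        results["worker_after_kill"] = proc_state(worker_pid)
        # 2) Signal the whole process group: the worker stops as well.
        os.killpg(os.getpgid(launcher.pid), signal.SIGSTOP)
        results["worker_after_killpg"] = wait_for_state(worker_pid, "Tt")
    finally:
        # SIGKILL also terminates stopped processes, so cleanup always works.
        os.killpg(os.getpgid(launcher.pid), signal.SIGKILL)
        launcher.wait()
    return results


if __name__ == "__main__":
    print(demo())
```

In this sketch a state of `T` means the process is stopped. The worker should report a non-stopped state after the first signal and a stopped state after the process-group signal.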
> 
> 
> 
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 


