[mpich-discuss] Suspend jobs that use MPICH2 with Hydra

Shan-ho Tsai shtsai at uga.edu
Mon May 7 07:57:43 CDT 2012


Hello,

Pavan, thank you so much for creating a ticket to add this
support to Hydra. I really appreciate it.

Ju, thank you very much for your suggestion. We currently use
a variant of SGE as our job scheduler. However, when we suspend
an MPICH2/Hydra job, the master process and the slave processes
that are on the same host as the master get suspended, but the
slave processes on other hosts continue to run (they do not get
suspended). If someone is aware of a way to get SGE to suspend all
processes properly in such a case, I would appreciate hearing how
that is done.
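
One possible workaround, sketched here rather than taken from this thread, is to send the stop signal on every host of the job separately. The Python sketch below only builds the per-host commands and does not run them. It assumes passwordless ssh to the compute nodes, a host file whose first field on each line is a host name (the layout SGE writes to $PE_HOSTFILE, which may differ in an SGE variant), and the procps pkill utility on the nodes. The function names, host names and user name are hypothetical.

```python
import shlex


def hosts_from_hostfile(text):
    """Return unique host names, taking the first field of each non-empty line."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if fields and not fields[0].startswith("#") and fields[0] not in hosts:
            hosts.append(fields[0])
    return hosts


def signal_commands(hosts, user, program, sig="STOP"):
    """Build one ssh + pkill command per host. Nothing is executed here."""
    # pkill -<SIGNAL> -u <user> -x <name> signals that user's processes
    # whose name matches <name> exactly.
    remote = f"pkill -{sig} -u {shlex.quote(user)} -x {shlex.quote(program)}"
    return [["ssh", host, remote] for host in hosts]


if __name__ == "__main__":
    # Hypothetical host file contents: host, slots, queue, processor range.
    sample = (
        "node01 4 all.q@node01 UNDEFINED\n"
        "node02 4 all.q@node02 UNDEFINED\n"
    )
    hosts = hosts_from_hostfile(sample)
    for cmd in signal_commands(hosts, "alice", "a.out", sig="STOP"):
        print(" ".join(cmd))
    # Passing sig="CONT" builds the matching resume commands.
```

Note that matching by user and program name would also stop any other job of the same user that runs the same binary on that host, so a real script would need a narrower selection.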

Thank you very much again!
Shan-Ho

----------------------------------------------------
Shan-Ho Tsai
University of Georgia, Athens GA

________________________________
From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] on behalf of Ju JiaJia [jujj603 at gmail.com]
Sent: Friday, May 04, 2012 9:37 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Suspend jobs that use MPICH2 with Hydra

I think you can do this with a resource manager and scheduler, such as TORQUE with Maui, which can suspend and resume jobs.

On Sat, May 5, 2012 at 8:46 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
Hello,

We don't support this right now.  I've created a ticket for it.

https://trac.mcs.anl.gov/projects/mpich2/ticket/1627

Please add yourself to the cc list of this ticket, if you'd like to be informed about updates on this issue.

 -- Pavan


On 05/04/2012 12:54 PM, Shan-ho Tsai wrote:
Hello all,
We have MPICH2 1.4.1p1 installed on a RHEL5 cluster
and sometimes have the need to suspend all jobs clusterwide.

Is there a way to suspend MPICH2 jobs that use Hydra, in
such a way that the master process and all slave processes
(on multiple nodes) get properly suspended?

If there is a way to do this, what is the procedure? Is there
a signal that we could send to mpiexec?

I tried sending a SIGSTOP to mpiexec, but only mpiexec
got suspended; the actual a.out processes continued to run.
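
This matches general POSIX behaviour: a SIGSTOP sent to one PID stops only that process, not its children. On a single host, a signal can instead be sent to a whole process group. The Python sketch below is not Hydra-specific. It uses a shell and two sleep commands as stand-ins for a launcher and its local processes. Whether the processes Hydra starts share mpiexec's process group is an assumption this thread does not confirm, and a process-group signal cannot reach processes on other nodes in any case.

```python
import os
import signal
import subprocess


def suspend_and_resume_group():
    """Stop and resume a stand-in launcher together with its local children."""
    # Start the stand-in launcher in its own session so that it leads a new
    # process group; the sleeps stand in for local application processes.
    launcher = subprocess.Popen(
        ["sh", "-c", "sleep 30 & sleep 30 & wait"],
        start_new_session=True,
    )
    pgid = os.getpgid(launcher.pid)

    # Stop every process in the group, not only the launcher itself.
    os.killpg(pgid, signal.SIGSTOP)
    # WUNTRACED makes waitpid return once the launcher has stopped.
    _, status = os.waitpid(launcher.pid, os.WUNTRACED)
    stopped = os.WIFSTOPPED(status)

    # Resume the group, then terminate it and reap the launcher.
    os.killpg(pgid, signal.SIGCONT)
    os.killpg(pgid, signal.SIGTERM)
    launcher.wait()
    return stopped


if __name__ == "__main__":
    print("launcher stopped by group signal:", suspend_and_resume_group())
```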

I really appreciate any suggestions.
Thank you,
Shan-Ho

----------------------------------------------------
Shan-Ho Tsai
University of Georgia, Athens GA



_______________________________________________
mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
To manage subscription options or unsubscribe:
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji
