[mpich-discuss] Suspend jobs that use MPICH2 with Hydra

Rayson Ho raysonlogin at gmail.com
Mon Jun 11 16:26:31 CDT 2012


Hmm, interesting... I was reading this thread, and found the same
discussion on the Grid Engine mailing list.

Reuti - We did not check in the code because Sun Microsystems at that time
did not think that it was a good idea to suspend parallel jobs. If other
batch systems handle parallel job suspension differently, then we
should look into it.

Rayson



On Mon, Jun 11, 2012 at 4:42 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 11.06.2012 at 21:31, Shan-ho Tsai wrote:
>
>> Thanks for your response! We are using Univa Grid Engine 8.0.1p4.
>> Is the patch freely available?
>
> Ask the vendor, as it's commercial software.
>
> There was a discussion about it some time ago, but suspending slave tasks never made it into any release:
>
> https://arc.liv.ac.uk/trac/SGE/ticket/577
>
> -- Reuti
>
>
>> Thanks so much,
>> Shan-Ho
>>
>> ----------------------------------------------------
>> Shan-Ho Tsai
>> University of Georgia, Athens GA
>>
>> ________________________________________
>> From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] on behalf of Reuti [reuti at staff.uni-marburg.de]
>> Sent: Monday, June 11, 2012 12:47 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Suspend jobs that use MPICH2 with Hydra
>>
>> On 07.05.2012 at 14:57, Shan-ho Tsai wrote:
>>
>>> Pavan, thank you so much for creating a ticket to include this
>>> support to Hydra. I really appreciate it.
>>>
>>> Ju, thank you very much for your suggestion. We currently use
>>> a variant of SGE as our job scheduler. However, when we suspend
>>> an MPICH2/Hydra job, the master process and the slave processes
>>> that are on the same host as the master get suspended, but the
>>> slave processes on other hosts continue to run (they do not get
>>> suspended). If someone is aware of a way to get SGE to suspend all
>>> processes properly in such a case, I would appreciate hearing how
>>> that is done.
>>
>> Which version of SGE are you using? IIRC, only a minimal patch was necessary to also suspend slave tasks on other nodes.
>>
>> -- Reuti
>>
>>
>>> Thank you very much again!
>>> Shan-Ho
>>>
>>> ----------------------------------------------------
>>> Shan-Ho Tsai
>>> University of Georgia, Athens GA
>>>
>>> From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] on behalf of Ju JiaJia [jujj603 at gmail.com]
>>> Sent: Friday, May 04, 2012 9:37 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] Suspend jobs that use MPICH2 with Hydra
>>>
>>> I think you can use a resource manager and scheduler to do this, like torque + maui. You can suspend and resume jobs.
>>>
>>> On Sat, May 5, 2012 at 8:46 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>> Hello,
>>>
>>> We don't support this right now.  I've created a ticket for it.
>>>
>>> https://trac.mcs.anl.gov/projects/mpich2/ticket/1627
>>>
>>> Please add yourself to the cc list of this ticket, if you'd like to be informed about updates on this issue.
>>>
>>> -- Pavan
>>>
>>>
>>> On 05/04/2012 12:54 PM, Shan-ho Tsai wrote:
>>> Hello all,
>>> We have mpich2 1.4.1p1 installed on a RHEL5 cluster
>>> and sometimes have the need to suspend all jobs clusterwide.
>>>
>>> Is there a way to suspend MPICH2 jobs that use Hydra, in
>>> such a way that the master process and all slave processes
>>> (on multiple nodes) get properly suspended?
>>>
>>> If there is a way to do this, what is the procedure? Is there
>>> a signal that we could send to mpiexec?
>>>
>>> I tried sending a SIGSTOP to mpiexec, but only mpiexec
>>> got suspended; the actual a.out processes continued to run.
>>>
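The behaviour described above can be reproduced on a single host with a small sketch. It assumes a POSIX system and uses two placeholder sleeper processes: "launcher" stands in for mpiexec and "worker" for a local a.out rank. These names, and the helper wait_for_stop, are illustrative only and are not part of MPICH2 or Hydra. The sketch shows that SIGSTOP sent to one PID stops only that process, while a signal sent to the whole process group stops every member. It says nothing about ranks on other hosts, which a local signal cannot reach.

```python
import os
import signal
import subprocess
import sys
import time

# Placeholder workload: a Python child that just sleeps (an assumption for the demo).
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]


def wait_for_stop(pid, timeout=5.0):
    """Poll a direct child until it reports a stop, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reaped, status = os.waitpid(pid, os.WUNTRACED | os.WNOHANG)
        if reaped == pid and os.WIFSTOPPED(status):
            return True
        time.sleep(0.05)
    return False


def demo():
    # "launcher" creates a new process group and becomes its leader.
    launcher = subprocess.Popen(SLEEPER, preexec_fn=os.setpgrp)
    pgid = launcher.pid
    # "worker" joins the launcher's process group before it execs.
    worker = subprocess.Popen(SLEEPER, preexec_fn=lambda: os.setpgid(0, pgid))
    results = {}
    try:
        # Case 1: signal a single PID, as in the report above.
        os.kill(launcher.pid, signal.SIGSTOP)
        results["launcher_only"] = (
            wait_for_stop(launcher.pid),
            wait_for_stop(worker.pid, timeout=0.5),
        )
        # Resume the launcher, then signal the whole process group.
        os.kill(launcher.pid, signal.SIGCONT)
        os.killpg(pgid, signal.SIGSTOP)
        results["whole_group"] = (
            wait_for_stop(launcher.pid),
            wait_for_stop(worker.pid),
        )
    finally:
        # SIGKILL also terminates stopped processes, so cleanup never hangs.
        os.killpg(pgid, signal.SIGKILL)
        launcher.wait()
        worker.wait()
    return results


if __name__ == "__main__":
    for case, (launcher_stopped, worker_stopped) in demo().items():
        print(case, "launcher stopped:", launcher_stopped,
              "worker stopped:", worker_stopped)
```

Whether the local Hydra proxies and a.out processes actually share a process group with mpiexec depends on how they are launched, so this is a sketch of the signalling mechanism under that assumption and not a working suspend procedure for MPICH2 jobs.
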
>>> I really appreciate any suggestions.
>>> thank you,
>>> Shan-Ho
>>>
>>> ----------------------------------------------------
>>> Shan-Ho Tsai
>>> University of Georgia, Athens GA
>>>
>>>
>>>
>>> _______________________________________________
>>> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>>> To manage subscription options or unsubscribe:
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji



-- 
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

