[mpich-discuss] mpich2 does not work with SGE
Reuti
reuti at staff.uni-marburg.de
Mon Jul 25 06:47:01 CDT 2011
Hi,
Am 25.07.2011 um 11:16 schrieb tilakraj dattaram:
> Thanks a lot for all your help.
>
> Now we can run parallel jobs through SGE (using a script file and
> qsub). We submitted some test jobs and kept records of the timings to
> compare the speeds with mpirun executed from the command line.
>
> For some reason (which I can't figure out), running jobs through SGE is
> much slower than from the command line. Is it expected that the command
> line works faster than SGE?
>
> This is the comparison table,
>
> CASE 1
> =======
> mpiexec (both mpiexec and mpirun point to the same link, i.e.
> mpiexec.hydra) from the command line with a hostfile
>
> # mpiexec -f hostfile -np n ./a.out < input
> ..................................................................................
> 16 cores per node
>
> N CPUs    time (in secs)    speedup
>
>  1            262.27          1
>  2            161.34          1.63
>  4             82.41          3.18
>  8             42.45          6.18
> 16             33.56          7.82
> 24             25.33         10.35
> 32             20.38         12.87
> ..................................................................................
>
> CASE 2
> =======
> mpich2 integrated with SGE
>
> mpirun executed from within the job script
>
> This is our PE
>
> pe_name mpich2
> slots 999
> user_lists NONE
> xuser_lists NONE
> start_proc_args NONE
> stop_proc_args NONE
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
>
> # qsub -q queue ./a.out < input > output (the PE is requested inside the
> job script with the option #$ -pe mpich2 n)
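>
> For illustration, a sketch of such a job script (the MPICH2 install
> prefix, queue name, slot count and file names below are placeholders,
> not our actual values) could look like:
>
> #!/bin/sh
> #$ -S /bin/sh
> #$ -cwd
> #$ -q queue
> #$ -pe mpich2 16
> # put the MPICH2 mpirun first in PATH (the install prefix is an assumption)
> export PATH=/opt/mpich2/bin:$PATH
> # with a tight SGE integration no hostfile and no -np are needed;
> # mpirun starts $NSLOTS processes on the nodes granted by SGE
> mpirun ./a.out < input > output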
> .................................................................................
> qsub [ 16 cores per node ]
>
> N CPUs    time (in secs)    speedup
>
>  1            383.5           1
>  2            205.1           1.87
>  4            174.3           2.2
>  8            159.3           2.4
> 16            123.2           3.1
> 24            136.8           2.8
> 32            124.6           3.1
> .................................................................................
>
>
> As you can see, the speedup is only about 3-fold for a job run on
> 16 processors and submitted through the SGE interface, whereas it is
> nearly 8-fold when the parallel job is started from the command
> line using mpirun. Another thing to note is that the speedup nearly
> saturates around 3 for SGE, whereas it keeps increasing to around 13
> on 32 processors for command-line execution. In fact, we had earlier
> found that the speedup keeps increasing up to about 144 processors,
> where it reaches a maximum of about 20-fold over the serial job.
No, this shouldn't be the case. There might be a small delay for the very first startup, as SGE's integrated `qrsh` startup will be used instead of a plain `ssh`, but this shouldn't create such a huge difference. Once the job is started, the usual communication inside MPICH2 is used and there shouldn't be any noticeable difference.
Can you please check on the various nodes whether the allocation of your job's processes is correct:

$ ps -e f

(note: f without a leading -) and that all processes are bound to the sge_shepherd on the intended nodes and no node is overloaded? Did you define a special queue for this (in SGE it's not necessary, but possible depending on your personal taste), and did you limit the available cores per node across all queues to avoid oversubscription?
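
As a rough sketch (queue and host names are placeholders, not taken from your cluster), the slots of a queue can be limited per execution host in the queue configuration opened by `qconf -mq all.q`, so that the total over all queues does not exceed the 16 cores of a node:

slots                 0,[node01=16],[node02=16]

or, non-interactively and for all hosts of the queue at once:

$ qconf -rattr queue slots 16 all.q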
-- Reuti
> Your help will be greatly appreciated!
>
> Thank you
>
> Regards
> Tilak
>
>>>
>>> Message: 7
>>> Date: Wed, 13 Jul 2011 12:30:18 +0200
>>> From: Reuti <reuti at staff.uni-marburg.de>
>>> Subject: Re: [mpich-discuss] mpich2 does not work with SGE
>>> To: mpich-discuss at mcs.anl.gov
>>> Message-ID:
>>> <33B507D9-C8E7-4F10-89E9-942633D81547 at staff.uni-marburg.de>
>>> Content-Type: text/plain; charset=us-ascii
>>>
>>> Hi,
>>>
>>> Am 13.07.2011 um 12:13 schrieb tilakraj dattaram:
>>>
>>>> Hi Reuti
>>>>
>>>> Thanks for your reply. I forgot to mention in my previous message that I had already tried adding a Parallel Environment in SGE using qconf -ap. I did the following:
>>>>
>>>> qconf -Ap mpich
>>>
>>> `qconf -Ap mpich` is the command to add a parallel environment after you have already edited the configuration file beforehand. The PE will then be stored in SGE's own configuration, so editing the original file later on won't be honored. If you want to edit a configuration which is already stored in SGE, you can use:
>>>
>>> `qconf -mp mpich`
>>>
>>> or start from scratch after deleting the old one with:
>>>
>>> `qconf -ap mpich`
>>>
>>> To see a list of all PEs:
>>>
>>> `qconf -spl`
>>>
>>> Don't forget to add the PE to any of your queues by editing them with `qconf -mq all.q` or the like.
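>>>
>>> For illustration (the existing pe_list entries will differ on your installation), the relevant line in the queue configuration opened by `qconf -mq all.q` would then read something like:
>>>
>>> pe_list               make mpich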
>>>
>>> General rule: the uppercase variants of the qconf options read the configuration from a file, while the lowercase variants open an editor for interactive setup.
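>>>
>>> A purely file-based round trip (the file name is just an example, and it assumes your qconf offers the file-based modify option -Mp) would be:
>>>
>>> $ qconf -sp mpich > mpich.conf   # dump the stored PE to a file
>>> $ vi mpich.conf                  # adjust slots, allocation_rule, ...
>>> $ qconf -Mp mpich.conf           # load the modified PE back into SGE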
>>>
>>>
>>>> and then edited the pe file to,
>>>>
>>>> pe_name mpich
>>>> slots 999
>>>> user_lists NONE
>>>> xuser_lists NONE
>>>> start_proc_args NONE
>>>> stop_proc_args NONE
>>>> allocation_rule $fill_up
>>>> control_slaves TRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots min
>>>> accounting_summary FALSE
>>>>
>>>> This did not work. However, here I don't see how SGE would know where to look for mpich2/hydra. I do see an mpi directory in the $SGE_ROOT directory, where there is a rocks-mpich.template file that reads as follows:
>>>>
>>>> pe_name mpich
>>>> slots 9999
>>>> user_lists NONE
>>>> xuser_lists NONE
>>>> start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
>>>> stop_proc_args /opt/gridengine/mpi/stopmpi.sh
>>>> allocation_rule $fill_up
>>>> control_slaves TRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots min
>>>> accounting_summary TRUE
>>>
>>> This was for the old MPICH1. I don't know what ROCKS delivers; I would ignore it.
>>>
>>> SGE doesn't have to know anything about MPICH2. You only have to set up an appropriate PATH in your jobscript, so that you access the `mpirun` of MPICH2 (and not that of any other MPI implementation) without the need for a hostlist, and that's all. So your jobscript will call `mpirun`, and a properly built MPICH2 will detect that it's running under SGE by checking some environment variables, and it will then start the processes on the slave nodes by calling `qrsh -inherit ...` for the granted nodes.
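>>>
>>> As a quick sanity check you can print, from inside the jobscript, the environment variables SGE exports into a job (which of them MPICH2 actually inspects is an assumption here):
>>>
>>> env | grep -E 'SGE_ROOT|JOB_ID|NSLOTS|PE_HOSTFILE'
>>> cat "$PE_HOSTFILE"   # lists the granted nodes and the slots per node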
>>>
>>> On the master node of the parallel job you can check the invocation chain by:
>>>
>>> `ps -e f`
>>>
>>> (note: f without a leading -).
>>>
>>> -- Reuti
>>>
>>>
>>>> Does SGE need re-configuration after the mpich2 install?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Regards
>>>> Tilak
>>>>
>>>> Message: 6
>>>> Date: Tue, 12 Jul 2011 13:19:18 +0200
>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>> Subject: Re: [mpich-discuss] mpich2 does not work with SGE
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Message-ID:
>>>> <8768BE3D-BE2D-498C-98A6-D3A72F397291 at staff.uni-marburg.de>
>>>> Content-Type: text/plain; charset=us-ascii
>>>>
>>>> Hi,
>>>>
>>>> Am 12.07.2011 um 13:03 schrieb tilakraj dattaram:
>>>>
>>>>> We have a Rocks cluster with 10 nodes, with Sun Grid Engine installed and running. I then installed the most recent version of mpich2 (1.4) on the master and compute nodes. However, we are unable to run parallel jobs through SGE (we can submit serial jobs without a problem). I am an SGE newbie, and most of the installation we have done is based on step-by-step tutorials from the web.
>>>>>
>>>>> The mpich2 manual says that hydra is the default process manager for mpich2, and I have checked that the mpiexec command points to mpiexec.hydra. Also, `which mpicc` and `which mpiexec` point to the desired mpich2 location. I understand that in this version of mpich2, hydra should be integrated with SGE by default. But maybe I am missing something here.
>>>>>
>>>>> We are able to run parallel jobs from the command line by specifying a host file (e.g. mpiexec -f hostfile -np 16 ./a.out), but would like the resource manager to take care of allocating resources on the cluster.
>>>>
>>>> it's necessary to set up a so-called parallel environment (i.e. a PE) in SGE and to request it during job submission. Then a plain mpirun without any hostfile or -np specification will do, as everything is delivered directly by SGE. If everything is set up properly, you could even switch off `rsh` and `ssh` inside the cluster completely, as SGE's internal startup mechanism is then used to start processes on the other nodes. In fact, disabling `ssh` or limiting it to admin staff is a good way to check whether your parallel application has a tight integration into the queuing system, where all slave processes are also accounted correctly and are under full SGE control for deletion by `qdel`.
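>>>>
>>>> For example (the PE name, slot count and script name are only placeholders), requesting such a PE at submission time would look like:
>>>>
>>>> $ qsub -pe mpich 16 jobscript.sh
>>>>
>>>> or, equivalently, with the line `#$ -pe mpich 16` inside the jobscript itself.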
>>>>
>>>> For SGE there is also a mailing list: http://gridengine.org/mailman/listinfo/users
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>
>>>>