[mpich-discuss] mpich2 does not work with SGE
tilakraj dattaram
tilakraj1985 at gmail.com
Tue Jul 19 01:59:06 CDT 2011
Hi Reuti,
Thanks again for your reply. We were able to add a PE using
qconf -ap. As I had mentioned earlier, mpirun points to the correct
location, and now I can get the job script to execute the mpirun command.
However, it seems to disregard the number of processes specified. Instead,
the job is run on one processor only.
Here is what the PE looks like:
[root@Spinode ~]# qconf -sp mpich2
pe_name mpich2
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
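For the record, we created this PE roughly as follows (the file name below is
just an example we made up):

qconf -ap mpich2        # opens an editor with a PE template; we filled in the values shown above
qconf -Ap mpich2.conf   # alternative: load the same configuration non-interactively from a file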
The list of all the PEs on the cluster is as follows,
[root@Spinode ~]# qconf -spl
make
mpich2
Next, I added the PE to one of the queues running on the cluster. It looks
like this:
[root@Spinode ~]# qconf -sq molsim.q
qname molsim.q
hostlist @molsimhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpich2
rerun FALSE
slots 1,[compute-0-1.local=16],[compute-0-2.local=16], \
[compute-0-3.local=16],[compute-0-4.local=16], \
[compute-0-5.local=16],[compute-0-6.local=16]
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
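To attach the PE I edited the queue interactively, changing only the pe_list
entry (roughly as below):

qconf -mq molsim.q
# in the editor, set:
#   pe_list               make mpich2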
Then, we tried to submit a job using the following job script:
--------------------------------------------------------------
#!/bin/sh
# request Bourne shell as shell for job
#$ -S /bin/sh
# Name of the job
#$ -N mpich2_lammps_test
# Name of the output log file
#$ -o lammps_test.log
# Combine output/ error messages into one file
#$ -j y
# Use current working directory
#$ -cwd
# Commands to be executed
mpirun -np 8 ./a.out < input > output
-------------------------------------------------------------
The problem is that no matter what number I specify after the -np option,
the job is always executed on a single processor.
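From your earlier reply (quoted below) I gather the script should instead
request the PE at submission time and leave the slot count to SGE, something
along these lines (the slot count 8 and the MPICH2 path are only placeholders,
and I have not verified this variant yet):

#!/bin/sh
#$ -S /bin/sh
#$ -N mpich2_lammps_test
#$ -o lammps_test.log
#$ -j y
#$ -cwd
# request 8 slots from the mpich2 PE
#$ -pe mpich2 8
# make sure the MPICH2 mpirun is found first (placeholder path)
export PATH=/path/to/mpich2/bin:$PATH
# with a tight SGE integration, mpirun picks up the granted slots; no -np or hostfile needed
mpirun ./a.out < input > output

Is that the correct way to request the PE, or am I still missing something?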
Your help would be appreciated.
Thanks in advance
Regards
Tilak
> Message: 7
> Date: Wed, 13 Jul 2011 12:30:18 +0200
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [mpich-discuss] mpich2 does not work with SGE
> To: mpich-discuss at mcs.anl.gov
> Message-ID:
> <33B507D9-C8E7-4F10-89E9-942633D81547 at staff.uni-marburg.de>
> Content-Type: text/plain; charset=us-ascii
>
> Hi,
>
> Am 13.07.2011 um 12:13 schrieb tilakraj dattaram:
>
> > Hi Reuti
> >
> > Thanks for your reply. I forgot to mention in my previous message that I
> had already tried adding a Parallel Environment in SGE using qconf -ap. I did
> the following:
> >
> > qconf -Ap mpich
>
> `qconf -Ap mpich` is the command to add a parallel environment from a
> configuration file you have already edited beforehand. It will then be stored
> in SGE's configuration; editing the file later on won't be honored. If you
> want to edit a configuration that is already stored in SGE, you can use:
>
> `qconf -mp mpich`
>
> or start from scratch after deleting the old one with:
>
> `qconf -ap mpich`
>
> To see a list of all PEs:
>
> `qconf -spl`
>
> Don't forget to add the PE to any of your queues by editing them with
> `qconf -mq all.q` or the like.
>
> General rule: uppercase commands will read configuration from files, while
> lowercase commands will open an editor for interactive setup.
>
>
> > and then edited the PE configuration to:
> >
> > pe_name mpich
> > slots 999
> > user_lists NONE
> > xuser_lists NONE
> > start_proc_args NONE
> > stop_proc_args NONE
> > allocation_rule $fill_up
> > control_slaves TRUE
> > job_is_first_task FALSE
> > urgency_slots min
> > accounting_summary FALSE
> >
> > This did not work. However, I don't see how SGE would know where to
> look for mpich2/hydra. I do see an mpi directory in the $SGE_ROOT directory,
> where there is a rocks-mpich.template file that reads as follows:
> >
> > pe_name mpich
> > slots 9999
> > user_lists NONE
> > xuser_lists NONE
> > start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > stop_proc_args /opt/gridengine/mpi/stopmpi.sh
> > allocation_rule $fill_up
> > control_slaves TRUE
> > job_is_first_task FALSE
> > urgency_slots min
> > accounting_summary TRUE
>
> This was for the old MPICH1. I don't know what ROCKS delivers; I would
> ignore it.
>
> SGE doesn't have to know anything about MPICH2. You only have to set up an
> appropriate PATH in your jobscript, so that you can access the `mpirun` of
> MPICH2 (and not that of any other MPI implementation) without the need of a
> hostlist, and that's all. So, your jobscript will call `mpirun`, and a
> properly built MPICH2 will detect that it's running under SGE by checking
> some environment variables, and then start processes on the slave nodes by
> calling `qrsh -inherit ...` for the granted nodes.
>
> On the master node of the parallel job you can check the invocation chain
> by:
>
> `ps -e f`
>
> (i.e. f without a leading -).
>
> -- Reuti
>
>
> > Does SGE need re-configuration after the mpich2 install?
> >
> > Thanks in advance!
> >
> > Regards
> > Tilak
> >
> > Message: 6
> > Date: Tue, 12 Jul 2011 13:19:18 +0200
> > From: Reuti <reuti at staff.uni-marburg.de>
> > Subject: Re: [mpich-discuss] mpich2 does not work with SGE
> > To: mpich-discuss at mcs.anl.gov
> > Message-ID:
> > <8768BE3D-BE2D-498C-98A6-D3A72F397291 at staff.uni-marburg.de>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Hi,
> >
> > Am 12.07.2011 um 13:03 schrieb tilakraj dattaram:
> >
> > > We have a Rocks cluster with 10 nodes, with Sun Grid Engine installed
> and running. I then installed the most recent version of mpich2 (1.4) on the
> master and compute nodes. However, we are unable to run parallel jobs
> through SGE (we can submit serial jobs without a problem). I am an SGE
> newbie, and most of the installation we have done was by following
> step-by-step tutorials on the web.
> > >
> > > The mpich2 manual says that hydra is the default process manager for
> mpich2, and I have checked that the mpiexec command points to mpiexec.hydra.
> Also, `which mpicc` and `which mpiexec` point to the desired mpich2 location.
> I understand that in this version of mpich2, hydra should be integrated with
> SGE by default, but maybe I am missing something here.
> > >
> > > We are able to run parallel jobs from the command line by specifying a
> host file (e.g., mpiexec -f hostfile -np 16 ./a.out), but would like the
> resource manager to take care of allocating resources on the cluster.
> >
> > it's necessary to set up a so-called parallel environment (i.e. a PE) in
> SGE and request it during job submission. Then a plain mpirun without
> any hostfile or -np specification will do, as everything is delivered
> directly by SGE. If all is set up properly, you could even switch off `rsh`
> and `ssh` inside the cluster completely, as SGE's internal startup mechanism
> is then used to start processes on other nodes. In fact, disabling `ssh` or
> limiting it to admin staff is a good way to check whether your parallel
> application has a tight integration into the queuing system, where all slave
> processes are also accounted correctly and are under full SGE control for
> deletion by `qdel`.
> >
> > For SGE there is also a mailing list:
> http://gridengine.org/mailman/listinfo/users
> >
> > -- Reuti