[mpich-discuss] mpich2 does not work with SGE
Reuti
reuti at staff.uni-marburg.de
Tue Jul 26 04:45:29 CDT 2011
Hi,
On 26.07.2011 at 08:37, tilakraj dattaram wrote:
> Here is how we submitted the jobs through a defined SGE queue.
>
> 1. Submit the job using a job script (job_lammps.sh)
> $ qsub -q molsim.q job_lammps.sh
> (The previous message contained a typo: I had mistakenly typed
> $ qsub -q queue ./a.out<input>output )
>
> The job script looks like this
> ----------------------------------------------------------
> #!/bin/sh
> # request Bourne shell as shell for job
> #$ -S /bin/sh
> # Name of the job
> #$ -N mpich2_lammps_test
> # Name of the output log file
> #$ -o lammps_test.log
> # Combine output/ error messages into one file
> #$ -j y
> # Use current working directory
> #$ -cwd
> # Specify the parallel environment (pe)
> #$ -pe mpich2 8
> # Commands to be executed
> mpirun ./lmp_g++ < in.shear > thermo_shear
> ----------------------------------------------------------
>
> 2. Below is the output of qstat -f, which shows the job running on compute-0-5 in molsim.q
> $ qstat -f
>
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-0-1.local        BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-10.local       BIP   0/0/16         0.09     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-2.local        BIP   0/0/16         0.03     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-3.local        BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-4.local        BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-5.local        BIP   0/0/16         0.02     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-6.local        BIP   0/0/16         0.01     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-7.local        BIP   0/0/16         0.02     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-8.local        BIP   0/0/16         0.04     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-9.local        BIP   0/0/16         0.04     lx26-amd64
> ---------------------------------------------------------------------------------
> guru.q@compute-0-10.local      BIP   0/0/16         0.09     lx26-amd64
> ---------------------------------------------------------------------------------
> guru.q@compute-0-7.local       BIP   0/0/16         0.02     lx26-amd64
> ---------------------------------------------------------------------------------
> guru.q@compute-0-8.local       BIP   0/0/16         0.04     lx26-amd64
> ---------------------------------------------------------------------------------
> guru.q@compute-0-9.local       BIP   0/0/16         0.04     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-1.local     BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-2.local     BIP   0/0/16         0.03     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-3.local     BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-4.local     BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-5.local     BIP   0/8/16         0.02     lx26-amd64
>     301 0.55500 mpich2_lam ajay         r     07/26/2011 19:40:11     8
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-6.local     BIP   0/0/16         0.01     lx26-amd64
> ---------------------------------------------------------------------------------
> test_mpi.q@compute-0-10.local  BIP   0/0/8          0.09     lx26-amd64
> ---------------------------------------------------------------------------------
> test_mpi.q@compute-0-9.local   BIP   0/0/8          0.04     lx26-amd64
>
> 3. I log into compute-0-5 and do a ps -e f
>
> compute-0-5 ~]$ ps -e f
> It seems to show the job's processes bound to sge_shepherd:
>
> 3901 ? S 0:00 \_ hald-runner
> 3909 ? S 0:00 \_ hald-addon-acpi: listening on acpid socket /v
> 3915 ? S 0:00 \_ hald-addon-keyboard: listening on /dev/input/
> 4058 ? Ssl 0:00 automount
> 4122 ? Sl 0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
> 5466 ? S 0:00 \_ sge_shepherd-301 -bg
> 5467 ? Ss 0:00 \_ -sh /opt/gridengine/default/spool/execd/compute-0-5/job_scripts/301
> 5609 ? S 0:00 \_ mpirun ./lmp_g++
> 5610 ? S 0:00 \_ /opt/mpich2/gnu/bin//hydra_pmi_proxy --control-port compute-0-5.loc
> 5611 ? R 0:25 \_ ./lmp_g++
> 5612 ? R 0:25 \_ ./lmp_g++
> 5613 ? R 0:25 \_ ./lmp_g++
> 5614 ? R 0:25 \_ ./lmp_g++
> 5615 ? R 0:25 \_ ./lmp_g++
> 5616 ? R 0:25 \_ ./lmp_g++
> 5617 ? R 0:25 \_ ./lmp_g++
> 5618 ? R 0:25 \_ ./lmp_g++
Fine.
> 4143 ? Sl 0:00 /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.
> 4158 ? Ss 0:00 /usr/sbin/sshd
> 5619 ? Ss 0:00 \_ sshd: ajay [priv]
> 5621 ? S 0:00 \_ sshd: ajay@pts/0
> 5622 pts/0 Ss 0:00 \_ -bash
> 5728 pts/0 R+ 0:00 \_ ps -e f
>
> The following line shows that mpirun resolves to the correct installation, so the mpirun called inside the
> job script is the right one.
> 5610 ? S 0:00 \_ /opt/mpich2/gnu/bin//hydra_pmi_proxy --control-port compute-0-5.loc
>
> 4. However, the job still takes longer (159 seconds) than the same job run
> from the command line with mpiexec (42 seconds, mpiexec -f hostfile -np 8 ./lmp_g++ < in.shear > thermo.shear).
Okay, this is something to investigate, especially as no processes on slave nodes are involved. I vaguely remember a similar discussion some time ago (3 or 4 years), but I don't remember how it ended, and at that time slave processes were also involved.
What we can check is whether there is a different setting in:
- the environment (maybe an `env` in the script will show it)?
- different ulimits?
- can you go interactively to a node and run the job inside SGE in an interactive queue?
- did you issue the command beforehand on the headnode, with a machinefile that contained only one node of the cluster, so that no process was running locally on the headnode?
- The syntax you used, "mpiexec -f hostfile -np = ...": is the "=" therein a placeholder or a valid (undocumented) syntax for `mpiexec`?
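The first two checks above could be done with a small helper like the following. This is only a sketch: the function name `snapshot` and the output file names are made up for illustration, and `ulimit -a` is assumed to be supported by the shell in use (bash and dash support it, a strict POSIX sh might not).

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or MPICH2): write the environment
# and the resource limits of the current shell to a file, so that the
# batch run and the interactive run can be compared with diff.
snapshot() {
    # $1: output file; any writable path will do
    {
        echo "== env =="
        env | sort          # sorted so that diff output stays readable
        echo "== ulimit -a =="
        ulimit -a           # assumption: the shell implements "ulimit -a"
    } > "$1"
}

# Inside job_lammps.sh, before the mpirun line:
#   snapshot "$HOME/snapshot.sge"
# In the interactive shell used for the fast run:
#   snapshot "$HOME/snapshot.cli"
# Afterwards:
#   diff "$HOME/snapshot.sge" "$HOME/snapshot.cli"
```

Differences in PATH, LD_LIBRARY_PATH or in the memory and stack limits would be the first things to look at in the diff.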
-- Reuti
>
> I can't understand why there should be such a large difference between the plain mpiexec and the one started through SGE.
>
> Thanks in advance
>
> Regards
> Tilak
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 25 Jul 2011 13:47:01 +0200
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [mpich-discuss] mpich2 does not work with SGE
> To: mpich-discuss at mcs.anl.gov
> Message-ID:
> <6724EB30-A6E2-4172-A073-222820574652 at staff.uni-marburg.de>
> Content-Type: text/plain; charset=us-ascii
>
> Hi,
>
> On 25.07.2011 at 11:16, tilakraj dattaram wrote:
>
> > Thanks a lot for all your help.
> >
> > Now we can run parallel jobs through the SGE (using a script file and
> > qsub). We submitted some test jobs and were keeping all the records
> > about the timings to compare the speeds with respect to mpirun
> > executed from the command line
> >
> > For some reason (which I can't find out) running jobs through SGE is
> > much slower than running them from the command line. Is it expected
> > that the command line is faster than SGE?
> >
> > This is the comparison table,
> >
> > CASE 1
> > =======
> > mpiexec (both mpiexec and mpirun point to the same link, i.e.
> > mpiexec.hydra) from command line with a hostfile
> >
> > # mpiexec -f hostfile -np = n ./a.out<input
> > ..................................................................................
> > 16 cores per node
> >
> > N cpus   time (in secs)   speedup
> >
> >      1           262.27      1
> >      2           161.34      1.63
> >      4            82.41      3.18
> >      8            42.45      6.18
> >     16            33.56      7.82
> >     24            25.33     10.35
> >     32            20.38     12.87
> > ..................................................................................
> >
> > CASE 2
> > =======
> > mpich2 integrated with SGE
> >
> > mpirun executed from within the job script
> >
> > This is our PE
> >
> > pe_name mpich2
> > slots 999
> > user_lists NONE
> > xuser_lists NONE
> > start_proc_args NONE
> > stop_proc_args NONE
> > allocation_rule $fill_up
> > control_slaves TRUE
> > job_is_first_task FALSE
> > urgency_slots min
> > accounting_summary FALSE
> >
> > # qsub -q queue ./a.out<input>output (the PE is defined inside the
> > script file using the following option, #$ -pe mpich2 n)
> > .................................................................................
> > qsub [ 16 cores per node ]
> >
> > N cpus   time (in secs)   speedup
> >
> >      1            383.5      1
> >      2            205.1      1.87
> >      4            174.3      2.2
> >      8            159.3      2.4
> >     16            123.2      3.1
> >     24            136.8      2.8
> >     32            124.6      3.1
> > .................................................................................
> >
> >
> > As you can see, the speedup is only about 3-fold for a job run on
> > 16 processors and submitted through the SGE interface, whereas it is
> > nearly 8-fold when the parallel job is started from the command
> > line using mpirun. Another thing to note is that the speedup nearly
> > saturates around 3 for SGE, whereas it keeps increasing to around 13
> > for 32 processors for command line execution. In fact, we had earlier
> > found that the speedup keeps increasing up to about 144 processors,
> > where it reaches a maximum of about 20-fold over the serial
> > job.
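For reference, the speedup column in the two tables above is the serial time divided by the parallel time, S(N) = T(1) / T(N); dividing that by N gives the parallel efficiency. A minimal sketch that recomputes both from a few of the quoted timings (the function name `speedup_table` is made up for illustration):

```shell
#!/bin/sh
# Speedup S(N) = T(1) / T(N); parallel efficiency E(N) = S(N) / N.
speedup_table() {
    # stdin: lines of "N seconds"; the first line must be the serial run
    # stdout: "N speedup efficiency", one line per input line
    awk 'NR == 1 { t1 = $2 }
         { s = t1 / $2; printf "%d %.2f %.2f\n", $1, s, s / $1 }'
}

# Timings copied from the tables above (1, 8 and 32 processes)
echo "command line:"
printf '1 262.27\n8 42.45\n32 20.38\n' | speedup_table
echo "through SGE:"
printf '1 383.5\n8 159.3\n32 124.6\n' | speedup_table
```

Note that the two serial times already differ (262.27 s versus 383.5 s), so part of the gap is present before any parallel communication takes place.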
>
> No, this shouldn't be the case. There might be a small delay for the very first startup, as SGE's integrated `qrsh` startup will be used instead of a plain `ssh`, but this shouldn't create such a huge difference. Once the job is started, the usual communication inside MPICH2 will be used and there shouldn't be any noticeable difference.
>
> Can you please check on the various nodes, whether the allocation of the processes of your job is correct:
>
> $ ps -e f
>
> (f without a leading -), that all processes are bound to the sge_shepherd on the intended nodes, and that no node is overloaded? Did you define a special queue for this (in SGE it's not necessary, but possible depending on your personal taste), and limit the available cores per node across all queues to avoid oversubscription?
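A rough way to test a node for overload is to count the job's processes that are in the run state and compare the number with the core count. The sketch below is an illustration only: `count_running` is a made-up helper, and it assumes a procps-style `ps` that accepts `-o stat=,comm=` and a `getconf` that knows `_NPROCESSORS_ONLN` (both are usual on Linux).

```shell
#!/bin/sh
# Hypothetical helper: count processes with a given command name that
# are in run state (STAT starts with R).
count_running() {
    # stdin: lines of "STAT COMMAND", as printed by: ps -e -o stat=,comm=
    # $1: command name to look for, e.g. lmp_g++
    awk -v name="$1" '$1 ~ /^R/ && $2 == name { n++ } END { print n + 0 }'
}

# To be run on a compute node while the job is active:
#   cores=$(getconf _NPROCESSORS_ONLN)
#   running=$(ps -e -o stat=,comm= | count_running lmp_g++)
#   echo "running=$running cores=$cores"
#   [ "$running" -le "$cores" ] || echo "node looks oversubscribed"
```

If several queues each offer all cores of the same host, the sum of running processes across jobs can exceed the core count even though every single queue stays within its slot limit.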
>
> -- Reuti
>
>
> > Your help will be greatly appreciated!
> >
> > Thank you
> >
> > Regards
> > Tilak
> >
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>
> End of mpich-discuss Digest, Vol 34, Issue 32
> *********************************************
>
>