[mpich-discuss] mpich2 does not work with SGE

Reuti reuti at staff.uni-marburg.de
Tue Jul 26 04:45:29 CDT 2011


Hi,

Am 26.07.2011 um 08:37 schrieb tilakraj dattaram:

> Here is how we submitted the jobs through a defined SGE queue. 
> 
> 1. Submit the job using a job script (job_lammps.sh)
> $ qsub -q molsim.q job_lammps.sh   (in the previous message there was a typo;
> I had mistakenly written: $ qsub -q queue ./a.out<input>output)
> 
> The job script looks like this
> ----------------------------------------------------------
> #!/bin/sh
> # request Bourne shell as shell for job
> #$ -S /bin/sh
> # Name of the job
> #$ -N mpich2_lammps_test
> # Name of the output log file
> #$ -o lammps_test.log 
> # Combine output/ error messages into one file
> #$ -j y
> # Use current working directory
> #$ -cwd
> # Specify the parallel environment (pe)
> #$ -pe mpich2 8
> # Commands to be executed
> mpirun ./lmp_g++ < in.shear > thermo_shear
> ----------------------------------------------------------
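
(A side remark on the script: with the tight MPICH2/SGE integration the bare `mpirun` above apparently picks up the granted slots by itself, as the ps output under point 3 shows all eight lmp_g++ processes. If one wants to make the slot count explicit, SGE also sets the standard $NSLOTS variable inside the job, so something like

   mpirun -np $NSLOTS ./lmp_g++ < in.shear > thermo_shear

should be equivalent here; this by itself shouldn't change the timing, since all eight processes were started either way.)
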
> 
> 2. Below is the output of qstat; it shows the job running on compute-0-5 in molsim.q
> $ qstat -f
> 
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-0-1.local        BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-10.local       BIP   0/0/16         0.09     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-2.local        BIP   0/0/16         0.03     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-3.local        BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-4.local        BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-5.local        BIP   0/0/16         0.02     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-6.local        BIP   0/0/16         0.01     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-7.local        BIP   0/0/16         0.02     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-8.local        BIP   0/0/16         0.04     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-0-9.local        BIP   0/0/16         0.04     lx26-amd64
> ---------------------------------------------------------------------------------
> guru.q@compute-0-10.local      BIP   0/0/16         0.09     lx26-amd64
> ---------------------------------------------------------------------------------
> guru.q@compute-0-7.local       BIP   0/0/16         0.02     lx26-amd64
> ---------------------------------------------------------------------------------
> guru.q@compute-0-8.local       BIP   0/0/16         0.04     lx26-amd64
> ---------------------------------------------------------------------------------
> guru.q@compute-0-9.local       BIP   0/0/16         0.04     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-1.local     BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-2.local     BIP   0/0/16         0.03     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-3.local     BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-4.local     BIP   0/0/16         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-5.local     BIP   0/8/16         0.02     lx26-amd64
>     301 0.55500 mpich2_lam ajay         r     07/26/2011 19:40:11     8
> ---------------------------------------------------------------------------------
> molsim.q@compute-0-6.local     BIP   0/0/16         0.01     lx26-amd64
> ---------------------------------------------------------------------------------
> test_mpi.q@compute-0-10.local  BIP   0/0/8          0.09     lx26-amd64
> ---------------------------------------------------------------------------------
> test_mpi.q@compute-0-9.local   BIP   0/0/8          0.04     lx26-amd64
> 
> 3.  I log into compute-0-5 and do a ps -e f
> 
> compute-0-5 ~]$ ps -e f
> It shows the job's processes bound to sge_shepherd
> 
> 3901 ?        S      0:00  \_ hald-runner
>  3909 ?        S      0:00      \_ hald-addon-acpi: listening on acpid socket /v
>  3915 ?        S      0:00      \_ hald-addon-keyboard: listening on /dev/input/
>  4058 ?        Ssl    0:00 automount
>  4122 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
>  5466 ?        S      0:00  \_ sge_shepherd-301 -bg
>  5467 ?        Ss     0:00      \_ -sh /opt/gridengine/default/spool/execd/compute-0-5/job_scripts/301
>  5609 ?        S      0:00          \_ mpirun ./lmp_g++
>  5610 ?        S      0:00              \_ /opt/mpich2/gnu/bin//hydra_pmi_proxy --control-port compute-0-5.loc
>  5611 ?        R      0:25                  \_ ./lmp_g++
>  5612 ?        R      0:25                  \_ ./lmp_g++
>  5613 ?        R      0:25                  \_ ./lmp_g++
>  5614 ?        R      0:25                  \_ ./lmp_g++
>  5615 ?        R      0:25                  \_ ./lmp_g++
>  5616 ?        R      0:25                  \_ ./lmp_g++
>  5617 ?        R      0:25                  \_ ./lmp_g++
>  5618 ?        R      0:25                  \_ ./lmp_g++

fine.
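
(A side note: assuming `pstree` and `pgrep` are available on the node, a one-liner like

   pstree -ap $(pgrep -f sge_execd)

shows the same tree rooted at the execd/shepherd in one shot; `ps -e f` as above is of course just as good.)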


>  4143 ?        Sl     0:00 /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.
>  4158 ?        Ss     0:00 /usr/sbin/sshd
>  5619 ?        Ss     0:00  \_ sshd: ajay [priv]
>  5621 ?        S      0:00      \_ sshd: ajay@pts/0 
>  5622 pts/0    Ss     0:00          \_ -bash
>  5728 pts/0    R+     0:00              \_ ps -e f
> 
> The following line shows that mpirun indeed points to the correct location, i.e. the mpirun invoked inside the
> job script is the right one. 
> 5610 ?        S      0:00              \_ /opt/mpich2/gnu/bin//hydra_pmi_proxy --control-port compute-0-5.loc
> 
> 4. However, the run time is still much longer (159 seconds) than for the same job run 
> from the command line with mpiexec (42 seconds; mpiexec -f hostfile -np 8 ./lmp_g++ < in.shear > thermo.shear).

Okay, this is something to investigate, especially as no processes on slave nodes are involved. I vaguely remember a similar discussion some time ago (by this I mean 3 or 4 years back), but I don't recall how it ended, and at that time slave processes were involved as well.

What we can check is a different setting in:

- the environment (maybe an `env` in the script will show it; see the snippet below this list)?

- different ulimits?

- can you go to a node interactively (i.e. inside SGE via an interactive queue) and run the job there?

- for the command-line runs you did beforehand on the headnode: did the machinefile contain only one node of the cluster, and was no process running locally on the headnode?

- the syntax you used, "mpiexec -f hostfile -np = ...": is the "=" therein a placeholder, or valid (undocumented) syntax for `mpiexec`?
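
For the first two points, one (hypothetical) way to capture what the job really sees is to add a few lines near the top of job_lammps.sh; the file names below are just illustrative:

   # dump environment and resource limits as seen inside the SGE job
   env | sort > env_sge.$JOB_ID
   ulimit -a  > ulimit_soft_sge.$JOB_ID
   ulimit -aH > ulimit_hard_sge.$JOB_ID

($JOB_ID is set by SGE inside the job.) Running the same three commands by hand on compute-0-5 and diffing the files should show whether the environment or the limits differ between the two cases.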

-- Reuti


> 
> I can't understand why there should be such a large difference between plain mpiexec and the same job started through SGE. 
> 
> Thanks in advance
> 
> Regards
> Tilak
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 25 Jul 2011 13:47:01 +0200
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [mpich-discuss] mpich2 does not work with SGE
> To: mpich-discuss at mcs.anl.gov
> 
> Hi,
> 
> Am 25.07.2011 um 11:16 schrieb tilakraj dattaram:
> 
> > Thanks a lot for all your help.
> >
> > Now we can run parallel jobs through the SGE (using a script file and
> > qsub). We submitted some test jobs and were keeping all the records
> > about the timings to compare the speeds with respect to mpirun
> > executed from the command line
> >
> > For some reason (which I can't find out) running jobs through SGE is
> > much slower than the command line. Is it so that command line works
> > faster then the SGE ?
> >
> > This is the comparison table,
> >
> > CASE 1
> > =======
> > mpiexec (both mpiexec and mpirun point to the same link, i.e.
> > mpiexec.hydra) from command line with a hostfile
> >
> > # mpiexec -f hostfile -np = n ./a.out<input
> > ..................................................................................
> > 16 cores per node
> >
> > N cpus                   time (in secs)               speedup
> >
> >    1                             262.27                         1
> >    2                             161.34                         1.63
> >    4                               82.41                         3.18
> >    8                               42.45                         6.18
> >   16                              33.56                         7.82
> >   24                              25.33                         10.35
> >   32                              20.38                         12.87
> > ..................................................................................
> >
> > CASE 2
> > =======
> > mpich2 integrated with SGE
> >
> > mpirun executed from within the job script
> >
> > This is our PE
> >
> > pe_name            mpich2
> > slots              999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    NONE
> > stop_proc_args     NONE
> > allocation_rule    $fill_up
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > # qsub -q queue ./a.out<input>output (the PE is defined inside the
> > script file using the following option, #$ -pe mpich2 n)
> > .................................................................................
> > qsub [ 16 cores per node ]
> >
> > N cpus                   time (in secs)                 speedup
> >
> >   1                             383.5                             1
> >   2                             205.1                             1.87
> >   4                             174.3                             2.2
> >   8                             159.3                             2.4
> >  16                            123.2                             3.1
> >  24                            136.8                             2.8
> >  32                            124.6                             3.1
> > .................................................................................
> >
> >
> > As you can notice, the speed up is only about 3-times for a job run on
> > 16 processors and submitted through the SGE interface, whereas it is
> > nearly 8-fold when the parallel jobs are submitted from the command
> > line using mpirun. Another thing to note is that the speedup nearly
> > saturates around 3 for SGE, whereas it keeps increasing to around 13
> > at 32 processors for command-line execution. In fact, we had earlier
> > found that the speed up keeps increasing till about 144 processors,
> > where it gives a maximum speed up of about 20-fold over the serial
> > job.
> 
> no, this shouldn't be. There might be a small delay during the very first startup, as SGE's integrated `qrsh` startup will be used instead of a plain `ssh`, but this shouldn't create such a huge difference. Once the job is started, the usual communication inside MPICH2 is used and there shouldn't be any noticeable difference.
> 
> Can you please check on the various nodes, whether the allocation of the processes of your job is correct:
> 
> $ ps -e f
> 
> (f w/o -) and that all are bound to the sge_shepherd on the intended nodes and no node is overloaded? Did you define a special queue for this (in SGE it's not necessary, but possible depending on your personal taste), and limit the available cores per node across all queues to avoid oversubscription?
> 
> -- Reuti
> 
> 
> > Your help will be greatly appreciated!
> >
> > Thank you
> >
> > Regards
> > Tilak
> >
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
> End of mpich-discuss Digest, Vol 34, Issue 32
> *********************************************
> 
> 


