[mpich-discuss] mpich2 does not work with SGE

Reuti reuti at staff.uni-marburg.de
Tue Jul 19 05:11:29 CDT 2011


Hi,

Am 19.07.2011 um 08:59 schrieb tilakraj dattaram:

>              Thanks again for your reply. We were able to add a PE using qconf -ap. As I had mentioned earlier, mpirun points to the correct location, and now I can get the job script to execute the mpirun command. However, it seems to disregard the number of processes specified. Instead, the job is run on one processor only. 
> 
> Here is how the PE looks:
> 
> [root at Spinode ~]# qconf -sp mpich2
> pe_name            mpich2
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE

Looks fine. Which version of mpich2 are you using? Version 1.4 at least should indeed work out-of-the-box.
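You can check this, for example, with Hydra's mpiexec itself (assuming MPICH2's bin directory is first in your PATH):

$ mpiexec --version

which prints the version and build settings of the MPICH2 installation it belongs to.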


> The list of all the PEs on the cluster is as follows, 
> 
> [root at Spinode ~]# qconf -spl
> make
> mpich2
> 
> Next, I added the PE to one of the queues running on the cluster. It looks like this, 
> 
> [root at Spinode ~]# qconf -sq molsim.q
> qname                 molsim.q
> hostlist              @molsimhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75

I would suggest setting it to:

load_thresholds NONE

Recent kernels count uninterruptible kernel tasks as running, so np_load_avg no longer reflects only the load of the running jobs; it can be much higher and, with this threshold set, block otherwise idle cores. I would really like it if another value were available from the kernel.

It might be possible to create a load sensor in SGE that scans `ps` only for truly running tasks, skipping kernel processes in "D" state.
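A rough, untested sketch of such a load sensor (it assumes the standard SGE load-sensor protocol and a custom complex named "run_procs", which you would first have to define with `qconf -mc` and then reference in load_thresholds):

#!/bin/sh
# Minimal load-sensor sketch: report only processes really in state "R",
# ignoring uninterruptible ("D") kernel tasks that inflate np_load_avg.
HOST=`hostname`
while read REQUEST; do
    [ "$REQUEST" = "quit" ] && exit 0
    RUNNING=`ps -eo state= | grep -c '^R'`   # count running tasks only
    echo "begin"
    echo "$HOST:run_procs:$RUNNING"          # "run_procs" is a made-up complex
    echo "end"
done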


> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpich2
> rerun                 FALSE
> slots                 1,[compute-0-1.local=16],[compute-0-2.local=16], \
>                       [compute-0-3.local=16],[compute-0-4.local=16], \
>                       [compute-0-5.local=16],[compute-0-6.local=16]

All your nodes have 16 cores? Then you could shorten it to:

slots 16

> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant

shell_start_mode unix_behavior

Then the first line of the script (its shebang) determines the shell, just as you are used to from the command prompt.
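For example, with unix_behavior a job script beginning with

#!/bin/bash

is simply executed by bash, without any -S option needed.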


> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> Then, we tried to submit a job using a job script
> --------------------------------------------------------------
> #!/bin/sh
> 
> # request Bourne shell as shell for job
> #$ -S /bin/sh

With the above setting this -S shouldn't be necessary: it is only honored with shell_start_mode posix_compliant, and even then you already defined /bin/bash at the queue level. It only matters if you keep posix_compliant and want to override the queue's default shell for a particular job.


> # Name of the job
> #$ -N mpich2_lammps_test
> 
> # Name of the output log file
> #$ -o lammps_test.log
> 
> # Combine output/ error messages into one file
> #$ -j y
> 
> # Use current working directory
> #$ -cwd
> 
> # Commands to be executed
> mpirun -np 8 ./a.out < input > output

Don't use -np 8; mpirun should detect the number of granted slots automatically.
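With the PE requested (see below), the line becomes simply:

mpirun ./a.out < input > output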


> -------------------------------------------------------------
> 
> The problem is that no matter what number I specify after the -np option, the job is always executed on a single processor. 

Do you also request the PE at job submission? Either on the command line or in the job script you need:

#$ -pe mpich2 8

to request 8 slots using the PE mpich2. The idea behind this is that in the PE you could, for example, reformat the machinefile, set up directories on the involved nodes, and so on.
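Putting it together, a minimal version of your job script could look like this (the PATH line is only an example; adjust it to wherever your MPICH2 1.4 is installed):

#!/bin/sh
#$ -N mpich2_lammps_test
#$ -o lammps_test.log
#$ -j y
#$ -cwd
#$ -pe mpich2 8
# make sure MPICH2's mpirun is found first (install path is an example)
export PATH=/opt/mpich2-1.4/bin:$PATH
mpirun ./a.out < input > output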

Then, on the involved nodes, you can check with:

$ ps -e f

(note: f without a leading dash) whether all processes of this job are attached to the sge_shepherd.

-- Reuti


> Your help would be appreciated. 
> 
> Thanks in advance
> 
> Regards
> Tilak
> 
> 
> 
> Message: 7
> Date: Wed, 13 Jul 2011 12:30:18 +0200
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [mpich-discuss] mpich2 does not work with SGE
> To: mpich-discuss at mcs.anl.gov
> 
> Hi,
> 
> Am 13.07.2011 um 12:13 schrieb tilakraj dattaram:
> 
> > Hi Reuti
> >
> > Thanks for your reply. I forgot to mention in my previous message, but I had tried adding a Parallel Environment in SGE using qconf -ap. I did the following,
> >
> > qconf -Ap mpich
> 
> `qconf -Ap mpich` adds a parallel environment from a configuration file that you have already edited beforehand. The settings are then stored in SGE's own configuration, so editing the file later on won't be honored. If you want to edit a configuration that is already stored in SGE, you can use:
> 
> `qconf -mp mpich`
> 
> or start from scratch, after deleting the old one, with:
> 
> `qconf -ap mpich`
> 
> To see a list of all PEs:
> 
> `qconf -spl`
> 
> Don't forget to add the PE to any of your queues by editing them with `qconf -mq all.q` or the like.
> 
> General rule: uppercase commands will read configuration from files, while lowercase commands will open an editor for interactive setup.
> 
> 
> > and then edited the pe file to,
> >
> > pe_name            mpich
> > slots              999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    NONE
> > stop_proc_args    NONE
> > allocation_rule    $fill_up
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > This did not work. However, here I don't see how SGE would know where to look for mpich2/hydra. I do see an mpi directory in the $SGE_ROOT directory, where there is a rocks-mpich.template file that reads as follows:
> >
> > pe_name          mpich
> > slots            9999
> > user_lists       NONE
> > xuser_lists      NONE
> > start_proc_args  /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > stop_proc_args   /opt/gridengine/mpi/stopmpi.sh
> > allocation_rule  $fill_up
> > control_slaves   TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
> > accounting_summary TRUE
> 
> This was for the old MPICH1. I don't know what ROCKS delivers; I would ignore it.
> 
> SGE doesn't have to know anything about MPICH2. You only have to set up an appropriate PATH in your job script, so that you pick up the `mpirun` of MPICH2 (and not that of any other MPI implementation) without needing a hostlist, and that's all. So, your job script will call `mpirun`, and a properly built MPICH2 will detect that it's running under SGE by checking some environment variables and will then start processes on the slave nodes by calling `qrsh -inherit ...` for the granted nodes.
> 
> On the master node of the parallel job you can check the invocation chain by:
> 
> `ps -e f`
> 
> (f without a leading dash).
> 
> -- Reuti
> 
> 
> > Does SGE need re-configuration after the mpich2 install?
> >
> > Thanks in advance!
> >
> > Regards
> > Tilak
> >
> > Message: 6
> > Date: Tue, 12 Jul 2011 13:19:18 +0200
> > From: Reuti <reuti at staff.uni-marburg.de>
> > Subject: Re: [mpich-discuss] mpich2 does not work with SGE
> > To: mpich-discuss at mcs.anl.gov
> >
> > Hi,
> >
> > Am 12.07.2011 um 13:03 schrieb tilakraj dattaram:
> >
> > > We have a rocks cluster with 10 nodes, with sun grid engine installed and running. I then installed the most recent version of mpich2 (1.4) on the master and compute nodes. However, we are unable to run parallel jobs through SGE (we can submit serial jobs without a problem). I am a sge newbie, and most of the installation that we have done is by reading step-by-step tutorials on the web.
> > >
> > > The mpich2 manual says that hydra is the default process manager for mpich2, and I have checked that the mpiexec command points to mpiexec.hydra. Also, `which mpicc` and `which mpiexec` point to the desired location of mpich2. I understand that in this version of mpich2, hydra should be integrated with SGE by default. But maybe I am missing something here.
> > >
> > > We are able to run parallel jobs using command line by specifying a host file (e.g, mpiexec -f hostfile -np 16 ./a.out), but would like the resource manager to take care of allocating resources on the cluster.
> >
> > it's necessary to set up a so-called parallel environment (i.e. a PE) in SGE and request it during job submission. Then a plain mpirun without any hostfile or -np specification will do, as everything is delivered directly by SGE. If all is set up properly, you could even switch off `rsh` and `ssh` inside the cluster completely, as SGE's internal startup mechanism is then used to start processes on the other nodes. In fact, disabling `ssh`, or limiting it to admin staff, is a good way to check whether your parallel application is tightly integrated into the queuing system, so that all slave processes are accounted correctly and remain under full SGE control for deletion by `qdel`.
> >
> > For SGE there is also a mailing list: http://gridengine.org/mailman/listinfo/users
> >
> > -- Reuti
> >
> >
> >
> >
> 
> 
> 
> 
> 
> 
> 


