[mpich-discuss] mpich2 does not work with SGE

Reuti reuti at staff.uni-marburg.de
Tue Aug 9 05:15:14 CDT 2011


Hi,

Am 09.08.2011 um 11:36 schrieb tilakraj dattaram:

> Thanks for your help. I investigated the execution of parallel jobs using both plain mpiexec (issued from the head node) and jobs sent through the SGE setup, and found similar run times in both modes. I think we had most likely used a different number of time steps in our earlier attempt to compare plain mpiexec with SGE: the plain mpiexec timing data was old, whereas the SGE jobs were run after sorting out all the problems (with your help) we were facing with SGE. Here is a new comparison table with up to 32 processes (8 per node; we have disabled hyperthreading for now).

Yes, it depends on the application whether hyperthreading makes a big difference; when I tried it some time ago with our applications the gain was more like a factor of 1.5, not 2. So 8 cores plus HT would mean setting up something like 12 slots in SGE (rather than the 16 that are detected). The only way to get a good number for the slot count is to load the CPUs with your application and set up more and more slots until the jobs run slower again due to overloading. I prefer leaving HT disabled.
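
For example, a rough timing loop along these lines can be used to find that point (only a sketch; it assumes your usual LAMMPS input and one otherwise idle node):

----------------------------------------------------------
#!/bin/sh
# Hypothetical benchmark: run the same input with an increasing
# number of processes and note where the wall time stops improving.
for n in 1 2 4 8 12 16; do
    echo "== $n processes =="
    time mpiexec -np $n ./lmp_g++ < in.shear > /dev/null
done
----------------------------------------------------------

The last value of n before the runtime gets worse again is a sensible slot count for that host.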


> Also, we are facing some problems with a couple of nodes, so we haven't carried out more extensive comparisons involving all the nodes (8 x 10 processors).
> 
> # Procs   mpiexec time (s)   Speed-up   SGE time (s)   Speed-up
>     1          277.9            1           263.6          1
>     2          175              1.59        161.6          1.63
>     4           90.1            3.08         86.3          3.05
>     8           48.6            5.72         49             5.38
>    16           32.2            8.63         31.5           8.37
>    24           27.3           10.18         27.4           9.62
>    32           22.5           12.35         22.3          11.82

This looks reasonable again :-) Maybe with longer running jobs the offset due to the different startup mechanisms will become even less important.

But it also looks like this job does not scale well beyond 8 processes.

-- Reuti


> Regards
> Tilak
> 
> Message: 2
> Date: Tue, 26 Jul 2011 11:45:29 +0200
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [mpich-discuss] mpich2 does not work with SGE
> To: mpich-discuss at mcs.anl.gov
> Message-ID:
>        <37977AA2-12B5-4DF6-A032-A33B1A2A49F7 at staff.uni-marburg.de>
> Content-Type: text/plain; charset=us-ascii
> 
> Hi,
> 
> Am 26.07.2011 um 08:37 schrieb tilakraj dattaram:
> 
> > Here is how we submitted the jobs through a defined SGE queue.
> >
> > 1. Submit the job using a job script (job_lammps.sh)
> > $ qsub -q molsim.q job_lammps.sh (in the previous message there was a typo:
> > I had mistakenly typed $ qsub -q queue ./a.out<input>output)
> >
> > The job script looks like this
> > ----------------------------------------------------------
> > #!/bin/sh
> > # request Bourne shell as shell for job
> > #$ -S /bin/sh
> > # Name of the job
> > #$ -N mpich2_lammps_test
> > # Name of the output log file
> > #$ -o lammps_test.log
> > # Combine output/ error messages into one file
> > #$ -j y
> > # Use current working directory
> > #$ -cwd
> > # Specify the parallel environment (pe)
> > #$ -pe mpich2 8
> > # Commands to be executed
> > mpirun ./lmp_g++ < in.shear > thermo_shear
> > ----------------------------------------------------------
> >
> > 2. Below is output for qstat and shows the job running on compute-0-5 corresponding to molsim.q
> > $ qstat -f
> >
> > queuename                      qtype resv/used/tot. load_avg arch          states
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-1.local        BIP   0/0/16         0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-10.local       BIP   0/0/16         0.09     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-2.local        BIP   0/0/16         0.03     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-3.local        BIP   0/0/16         0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-4.local        BIP   0/0/16         0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-5.local        BIP   0/0/16         0.02     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-6.local        BIP   0/0/16         0.01     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-7.local        BIP   0/0/16         0.02     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-8.local        BIP   0/0/16         0.04     lx26-amd64
> > ---------------------------------------------------------------------------------
> > all.q at compute-0-9.local        BIP   0/0/16         0.04     lx26-amd64
> > ---------------------------------------------------------------------------------
> > guru.q at compute-0-10.local      BIP   0/0/16         0.09     lx26-amd64
> > ---------------------------------------------------------------------------------
> > guru.q at compute-0-7.local       BIP   0/0/16         0.02     lx26-amd64
> > ---------------------------------------------------------------------------------
> > guru.q at compute-0-8.local       BIP   0/0/16         0.04     lx26-amd64
> > ---------------------------------------------------------------------------------
> > guru.q at compute-0-9.local       BIP   0/0/16         0.04     lx26-amd64
> > ---------------------------------------------------------------------------------
> > molsim.q at compute-0-1.local     BIP   0/0/16         0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > molsim.q at compute-0-2.local     BIP   0/0/16         0.03     lx26-amd64
> > ---------------------------------------------------------------------------------
> > molsim.q at compute-0-3.local     BIP   0/0/16         0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > molsim.q at compute-0-4.local     BIP   0/0/16         0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > molsim.q at compute-0-5.local     BIP   0/8/16         0.02     lx26-amd64
> >     301 0.55500 mpich2_lam ajay         r     07/26/2011 19:40:11     8
> > ---------------------------------------------------------------------------------
> > molsim.q at compute-0-6.local     BIP   0/0/16         0.01     lx26-amd64
> > ---------------------------------------------------------------------------------
> > test_mpi.q at compute-0-10.local  BIP   0/0/8          0.09     lx26-amd64
> > ---------------------------------------------------------------------------------
> > test_mpi.q at compute-0-9.local   BIP   0/0/8          0.04     lx26-amd64
> >
> > 3.  I log into compute-0-5 and do a ps -e f
> >
> > compute-0-5 ~]$ ps -e f
> > It seems to show the jobs bound to sge_shepherd:
> >
> > 3901 ?        S      0:00  \_ hald-runner
> >  3909 ?        S      0:00      \_ hald-addon-acpi: listening on acpid socket /v
> >  3915 ?        S      0:00      \_ hald-addon-keyboard: listening on /dev/input/
> >  4058 ?        Ssl    0:00 automount
> >  4122 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
> >  5466 ?        S      0:00  \_ sge_shepherd-301 -bg
> >  5467 ?        Ss     0:00      \_ -sh /opt/gridengine/default/spool/execd/compute-0-5/job_scripts/301
> >  5609 ?        S      0:00          \_ mpirun ./lmp_g++
> >  5610 ?        S      0:00              \_ /opt/mpich2/gnu/bin//hydra_pmi_proxy --control-port compute-0-5.loc
> >  5611 ?        R      0:25                  \_ ./lmp_g++
> >  5612 ?        R      0:25                  \_ ./lmp_g++
> >  5613 ?        R      0:25                  \_ ./lmp_g++
> >  5614 ?        R      0:25                  \_ ./lmp_g++
> >  5615 ?        R      0:25                  \_ ./lmp_g++
> >  5616 ?        R      0:25                  \_ ./lmp_g++
> >  5617 ?        R      0:25                  \_ ./lmp_g++
> >  5618 ?        R      0:25                  \_ ./lmp_g++
> 
> fine.
> 
> 
> >  4143 ?        Sl     0:00 /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.
> >  4158 ?        Ss     0:00 /usr/sbin/sshd
> >  5619 ?        Ss     0:00  \_ sshd: ajay [priv]
> >  5621 ?        S      0:00      \_ sshd: ajay at pts/0
> >  5622 pts/0    Ss     0:00          \_ -bash
> >  5728 pts/0    R+     0:00              \_ ps -e f
> >
> > The following line shows that mpirun is indeed pointing to the correct location, and that the mpirun used inside the
> > job script file is the right one.
> > 5610 ?        S      0:00              \_ /opt/mpich2/gnu/bin//hydra_pmi_proxy --control-port compute-0-5.loc
> >
> > 4. However, the compute time (159 seconds) is still much slower than that of a job run
> > from the command line with mpiexec (42 seconds; mpiexec -f hostfile -np 8 ./lmp_g++ < in.shear > thermo.shear).
> 
> Okay, this is something to investigate, especially as no processes on slave nodes are involved. I vaguely remember a similar discussion some time ago (3 or 4 years, that is), but I don't remember how it ended, and in that case slave processes were involved as well.
> 
> What we can check is whether there is a different setting in:
> 
> - the environment (maybe an `env` in the script will show a difference; see the sketch after this list)?
> 
> - different ulimits?
> 
> - can you log in to a node interactively and run the job inside SGE in an interactive queue?
> 
> - when you issued the command by hand on the head node, did the machinefile contain only one node of the cluster, with no process running locally on the head node?
> 
> - the syntax you used, "mpiexec -f hostfile -np = ...": is the "=" therein a placeholder, or a valid (undocumented) syntax for `mpiexec`?
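>
> For the first two points, something along these lines could be added to the job script (only a sketch; paths and PE name are taken from your earlier mail) and the output compared with the same commands run interactively on the node:
>
> ----------------------------------------------------------
> #!/bin/sh
> #$ -S /bin/sh
> #$ -cwd
> #$ -j y
> #$ -pe mpich2 8
> # record the environment and the limits as seen inside SGE ...
> env | sort > sge_env.txt
> ulimit -a  > sge_ulimit.txt
> # ... then run the job as before
> mpirun ./lmp_g++ < in.shear > thermo_shear
> ----------------------------------------------------------
>
> Afterwards, on the node itself: env | sort > plain_env.txt; ulimit -a > plain_ulimit.txt, and diff the two pairs of files.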
> 
> -- Reuti
> 
> 
> >
> > I can't understand why there should be such a large difference between plain mpiexec and the run started through SGE.
> >
> > Thanks in advance
> >
> > Regards
> > Tilak
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Mon, 25 Jul 2011 13:47:01 +0200
> > From: Reuti <reuti at staff.uni-marburg.de>
> > Subject: Re: [mpich-discuss] mpich2 does not work with SGE
> > To: mpich-discuss at mcs.anl.gov
> > Message-ID:
> >        <6724EB30-A6E2-4172-A073-222820574652 at staff.uni-marburg.de>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Hi,
> >
> > Am 25.07.2011 um 11:16 schrieb tilakraj dattaram:
> >
> > > Thanks a lot for all your help.
> > >
> > > Now we can run parallel jobs through SGE (using a script file and
> > > qsub). We submitted some test jobs and kept records of the
> > > timings to compare the speeds with those of mpirun
> > > executed from the command line.
> > >
> > > For some reason (which I can't figure out), running jobs through SGE is
> > > much slower than from the command line. Is it the case that the command
> > > line works faster than SGE?
> > >
> > > This is the comparison table,
> > >
> > > CASE 1
> > > =======
> > > mpiexec (both mpiexec and mpirun point to the same link, i.e.
> > > mpiexec.hydra) from command line with a hostfile
> > >
> > > # mpiexec -f hostfile -np = n ./a.out<input
> > > ..................................................................................
> > > 16 cores per node
> > >
> > > N cpus                   time (in secs)               speedup
> > >
> > >    1                             262.27                         1
> > >    2                             161.34                         1.63
> > >    4                               82.41                         3.18
> > >    8                               42.45                         6.18
> > >   16                              33.56                         7.82
> > >   24                              25.33                         10.35
> > >   32                              20.38                         12.87
> > > ..................................................................................
> > >
> > > CASE 2
> > > =======
> > > mpich2 integrated with SGE
> > >
> > > mpirun executed from within the job script
> > >
> > > This is our PE
> > >
> > > pe_name            mpich2
> > > slots              999
> > > user_lists         NONE
> > > xuser_lists        NONE
> > > start_proc_args    NONE
> > > stop_proc_args     NONE
> > > allocation_rule    $fill_up
> > > control_slaves     TRUE
> > > job_is_first_task  FALSE
> > > urgency_slots      min
> > > accounting_summary FALSE
> > >
> > > # qsub -q queue ./a.out<input>output (the PE is defined inside the
> > > script file using the following option, #$ -pe mpich2 n)
> > > .................................................................................
> > > qsub [ 16 cores per node ]
> > >
> > > N cpus                   time (in secs)                 speedup
> > >
> > >   1                             383.5                             1
> > >   2                             205.1                             1.87
> > >   4                             174.3                             2.2
> > >   8                             159.3                             2.4
> > >  16                            123.2                             3.1
> > >  24                            136.8                             2.8
> > >  32                            124.6                             3.1
> > > .................................................................................
> > >
> > >
> > > As you can notice, the speed-up is only about 3-fold for a job run on
> > > 16 processors and submitted through the SGE interface, whereas it is
> > > nearly 8-fold when the parallel jobs are submitted from the command
> > > line using mpirun. Another thing to note is that the speed-up nearly
> > > saturates around 3 for SGE, whereas it keeps increasing to around 13
> > > at 32 processors for command-line execution. In fact, we had earlier
> > > found that the speed-up keeps increasing up to about 144 processors,
> > > where it reaches a maximum of about 20-fold over the serial job.
> >
> > No, this shouldn't be the case. There might be a small delay for the very first startup, as SGE's integrated `qrsh` startup will be used instead of a plain `ssh`, but this shouldn't create such a huge difference. Once the job is started, the usual communication inside MPICH2 is used and no difference should be noticeable.
> >
> > Can you please check on the various nodes, whether the allocation of the processes of your job is correct:
> >
> > $ ps -e f
> >
> > (f without a leading -) and check that all processes are bound to the sge_shepherd on the intended nodes and that no node is overloaded. Did you define a special queue for this (in SGE it's not necessary, but possible, depending on your personal taste), and did you limit the available cores per node across all queues to avoid oversubscription?
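> >
> > For the last point, one possible way (only an illustration; host names are from your `qstat -f` output, adjust the slot count to the number of cores you want to allow per node) is to set a per-host slot limit in the exec host configuration, which SGE then enforces across all queues on that host:
> >
> > $ qconf -mattr exechost complex_values slots=8 compute-0-5.local
> >
> > or interactively with `qconf -me compute-0-5.local`, adding "slots=8" to the complex_values line. Repeat for each node.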
> >
> > -- Reuti
> >
> >
> > > Your help will be greatly appreciated!
> > >
> > > Thank you
> > >
> > > Regards
> > > Tilak
> > >
> >
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> >
> >
> > End of mpich-discuss Digest, Vol 34, Issue 32
> > *********************************************
> >
> >
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 26 Jul 2011 08:45:23 -0300
> From: Charles Sartori <charles.sartori at gmail.com>
> Subject: Re: [mpich-discuss] mpdboot error
> To: Pavan Balaji <balaji at mcs.anl.gov>
> Cc: mpich-discuss at mcs.anl.gov
> Message-ID:
>        <CALBvAyv8qojPZBxYUUsMa0ccg45nr0iyWt877LoHwEcFv3q=ZQ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> I changed the location and now it works o/
> 
> pexe-note at pexe-note:~$ mpiexec -n 5 -f machinefile ./mpich2-1.4/examples/cpi
> > Process 1 of 5 is on pexe-note
> > Process 3 of 5 is on pexe-note
> > Process 0 of 5 is on pexe-pc
> > Process 2 of 5 is on pexe-pc
> > Process 4 of 5 is on pexe-pc
> > pi is approximately 3.1415926544231225, Error is 0.0000000008333294
> > wall clock time = 0.002060
> >
> 
> Pavan, Reuti thank you so much for support :D
> 
> 
> --
> Charles Sartori
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 26 Jul 2011 08:35:41 -0500
> From: Pavan Balaji <balaji at mcs.anl.gov>
> Subject: Re: [mpich-discuss] Fatal error in PMPI_Bcast: Other MPI
>        error, error stack:
> To: mpich-discuss at mcs.anl.gov
> Message-ID: <4E2EC2AD.9040509 at mcs.anl.gov>
> Content-Type: text/plain; charset=UTF-8; format=flowed
> 
> 
> Did you do the checks listed on this FAQ entry?
> 
> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes
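>
> Common culprits behind this kind of failure (only an illustration, not necessarily the exact FAQ list; the host names are from your mail) are passwordless ssh in both directions, consistent hostname resolution on both nodes, and a firewall blocking the ports MPICH2 picks at random:
>
> # passwordless ssh in both directions?
> ssh hksbs-s11.com hostname
> ssh hksbs-s13.com hostname
> # do both nodes resolve each other consistently?
> getent hosts hksbs-s11.com hksbs-s13.com
> # any firewall rules in the way (check on both nodes)?
> /sbin/iptables -L -n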
> 
>  -- Pavan
> 
> On 07/26/2011 01:55 AM, ???? wrote:
> > Hi,
> > My hosts:
> > hksbs-s13.com:8
> > hksbs-s11.com:8
> > When I run on one node, it is OK:
> > [root at hksbs-s13 examples_collchk]# mpiexec -f hosts -n 8 ./time_bcast_nochk
> > time taken by 1X1 MPI_Bcast() at rank 0 = 0.000005
> > time taken by 1X1 MPI_Bcast() at rank 1 = 0.000002
> > time taken by 1X1 MPI_Bcast() at rank 2 = 0.000003
> > time taken by 1X1 MPI_Bcast() at rank 3 = 0.000002
> > time taken by 1X1 MPI_Bcast() at rank 4 = 0.000004
> > time taken by 1X1 MPI_Bcast() at rank 5 = 0.000002
> > time taken by 1X1 MPI_Bcast() at rank 6 = 0.000003
> > time taken by 1X1 MPI_Bcast() at rank 7 = 0.000002
> > But when I connect to the other node, it fails:
> > [root at hksbs-s13 examples_logging]# mpiexec -f hosts -n 9 ./srtest
> > Fatal error in PMPI_Bcast: Other MPI error, error stack:
> > PMPI_Bcast(1478)......................: MPI_Bcast(buf=0x16fc2aa8,
> > count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> > MPIR_Bcast_impl(1321).................:
> > MPIR_Bcast_intra(1119)................:
> > MPIR_Bcast_scatter_ring_allgather(961):
> > MPIR_Bcast_binomial(213)..............: Failure during collective
> > MPIR_Bcast_scatter_ring_allgather(952):
> > MPIR_Bcast_binomial(189)..............:
> > MPIC_Send(63).........................:
> > MPIDI_EagerContigShortSend(262).......: failure occurred while
> > attempting to send an eager message
> > MPIDI_CH3_iStartMsg(36)...............: Communication error with rank 8
> > When I ssh to the other node, for example,
> >
> > [root at hksbs-s13 examples_logging]# ssh hksbs-s11.com
> > Last login: Tue Jul 26 15:45:22 2011 from 10.33.15.233
> > [root at hksbs-s11 ~]#
> > it works.
> > How can I check the reason?
> >
> >
> >
> >
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
> End of mpich-discuss Digest, Vol 34, Issue 38
> *********************************************
> 
> 


