[mpich-discuss] HP-XC 3000 cluster issues

Anthony Chan chan at mcs.anl.gov
Tue Mar 3 11:26:31 CST 2009


Does your LSF setup support an interactive or debugging session?
If so, try running srun inside such a session.
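
For example, something along these lines (a sketch: -Is asks LSF for
an interactive shell, and the core count is arbitrary):

  $ bsub -Is -n 4 /bin/bash
  $ srun hostname -s

If srun works inside that allocation, the batch case is easier to
debug.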

A.Chan
----- "Gauri Kulkarni" <gaurivk at gmail.com> wrote:

> Please bear with me, it is a long query.
> 
> I don't think those instructions are particularly useful to me (see
> Rajeev's reply below). First of all, I cannot use 'srun' from the
> command line; I can only use it as an option to mpirun when I am
> submitting the job through LSF. What I mean is, when I use srun from
> the command line, this is what I get (the command is from the script
> mentioned at the bottom of the webpage you provided, Rajeev):
> 
> [So What?? ~]$ srun hostname -s | sort -u
> srun: error: Unable to allocate resources: No partition specified or
> system default partition
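> 
> Naming a partition explicitly might get around this, assuming any
> partition is defined and visible to ordinary users (sinfo should list
> them); <partition> here is just a placeholder:
> 
> [So What?? ~]$ sinfo -s
> [So What?? ~]$ srun -p <partition> hostname -s | sort -u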
> 
> But when I submit it through LSF, this is what I get:
> [So What?? ~]$ bsub -n4 -o srun.%J.out mpirun -srun hostname -s | sort -u
> Job <14474> is submitted to default queue <normal>.
> 
> <output>
> Your job looked like:
> 
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -srun hostname -s
> ------------------------------------------------------------
> 
> Successfully completed.
> 
> Resource usage summary:
> 
>     CPU time   :      0.14 sec.
>     Max Memory :         2 MB
>     Max Swap   :       103 MB
> 
> 
> The output (if any) follows:
> 
> n4
> n4
> n4
> n4
> </output>
> 
> Now this is true when I am using HP-MPI. When I switch to MPICH1, the
> output is like this:
> [So What?? ~]$ bsub -n15 -o srun.%J.out mpirun -srun -np 15 -machinefile mpd.hosts hostname
> Job <14479> is submitted to default queue <normal>.
> 
> <output>
> Your job looked like:
> 
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -srun -np 15 -machinefile mpd.hosts hostname
> ------------------------------------------------------------
> 
> Exited with exit code 1.
> 
> Resource usage summary:
> 
>     CPU time   :      0.12 sec.
>     Max Memory :         2 MB
>     Max Swap   :       103 MB
> 
> 
> The output (if any) follows:
> 
> Warning: Command line arguments for program should be given
> after the program name.  Assuming that hostname is a
> command line argument for the program.
> Missing: program name
> Program -srun either does not exist, is not
> executable, or is an erroneous argument to mpirun.
> </output>
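> 
> The warning makes sense in hindsight: -srun is an HP-MPI flag (see
> point 3 below), so MPICH1's mpirun parses it as the program name. The
> plain MPICH1 form would be something like the following, where ./hello
> stands in for any MPI binary; but that launch method needs rsh/ssh
> access to the compute nodes, which is exactly what is blocked here:
> 
> [So What?? ~]$ bsub -n15 -o srun.%J.out mpirun -np 15 -machinefile mpd.hosts ./hello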
> 
> The SLURM version that we are using here is:
> [So What?? ~]$ srun --version
> slurm 1.0.15
> 
> That means the patch the website mentions for the SLURM+MPICH1 combo
> doesn't apply here, as it is for SLURM version 1.2.11 or higher.
> 
> If I go to MPICH2 and use it through bsub, it obviously fails,
> probably because it wasn't configured with the options that Rajeev
> had suggested earlier.
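> 
> For what it's worth, the SLURM documentation describes a SLURM-aware
> MPICH2 build, which I am guessing is what was suggested (a sketch, not
> yet tried on this cluster):
> 
> $ ./configure --with-pmi=slurm --with-pm=no
> $ make && make install
> 
> With such a build there is no mpd ring to start; srun launches the MPI
> processes directly.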
> 
> The problem boils down to this:
> 1. The cluster is NOT configured to let users access each node
> individually; that is forbidden. I cannot launch my tasks (including
> starting mpd) on any node other than the head node.
> 2. This is done to prevent users from ssh-ing to individual nodes and
> submitting jobs, thereby hogging resources. Users can only submit jobs
> to other nodes via LSF (i.e. when bsub [options] mpirun -srun
> ./executable is used).
> 3. Since only the HP-MPI implementation allows mpirun to take the
> -srun option when used with bsub, only with that implementation can I
> get my programs to run on multiple nodes.
> 
> So it is not just MPICH+SLURM that I need; I also need help with
> MPICH+(LSF+SLURM).
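> 
> If a SLURM-aware MPICH2 can be built, my hope is that the submission
> reduces to something like the following, modeled on the HP-MPI case
> above (./myprog is a placeholder, and this is untested):
> 
> [So What?? ~]$ bsub -n4 -o srun.%J.out srun ./myprog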
> 
> Thank you for your patience.
> 
> Gauri.
> ---------
> 
> Date: Wed, 25 Feb 2009 12:34:29 -0600
> From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> Subject: Re: [mpich-discuss] HP-XC 3000 cluster issues
> To: <mpich-discuss at mcs.anl.gov>
> Message-ID: <9273167066F94A0391E51E21D045C0FB at mcs.anl.gov>
> Content-Type: text/plain; charset="us-ascii"
> 
> Gauri,
>          For MPICH-1, the instructions at the bottom of
> https://computing.llnl.gov/linux/slurm/quickstart.html may be
> sufficient (I
> don't know).
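> 
> From memory, the script at the bottom of that page builds a machine
> file from the SLURM allocation and feeds it to MPICH1's mpirun,
> roughly like this (a sketch; SLURM_JOBID and SLURM_NPROCS are the
> variable names SLURM of that vintage sets inside an allocation):
> 
> #!/bin/sh
> # list the allocated hosts, one entry per node
> MACHINEFILE="nodes.$SLURM_JOBID"
> srun hostname -s | sort -u > $MACHINEFILE
> mpirun -np $SLURM_NPROCS -machinefile $MACHINEFILE ./a.out
> rm $MACHINEFILE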
> 
> Rajeev

