[mpich-discuss] HP-XC 3000 cluster issues

Gauri Kulkarni gaurivk at gmail.com
Tue Mar 3 05:48:42 CST 2009


Please bear with me, it is a long query.

I don't think those instructions are particularly useful to me (see Rajeev's
reply below). First of all, I cannot use 'srun' from the command line; I can
only use it as an option to mpirun when I am submitting the job through LSF.
Here is what I get when I use srun from the command line (the command is from
the script mentioned at the bottom of the webpage you provided, Rajeev):

[So What?? ~]$ srun hostname -s | sort -u
srun: error: Unable to allocate resources: No partition specified or system
default partition
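
(As far as I can tell, that error just means there is no default SLURM
partition configured for interactive use. If I knew a valid partition name, I
assume something like

    sinfo
    srun -p <partition> hostname -s | sort -u

would work, but interactive srun may simply be disabled for users here.)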

But when I submit it through LSF, this is what I get:
[So What?? ~]$ bsub -n4 -o srun.%J.out mpirun -srun hostname -s | sort -u
Job <14474> is submitted to default queue <normal>.

<output>
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -srun hostname -s
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      0.14 sec.
    Max Memory :         2 MB
    Max Swap   :       103 MB


The output (if any) follows:

n4
n4
n4
n4
</output>
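
So with HP-MPI, mpirun apparently hands the launch off to SLURM's srun inside
the LSF allocation, and all four tasks land on the allocated node. Presumably
a real MPI binary would be run the same way, something along the lines of

    bsub -n4 -o out.%J mpirun -srun ./myprog

(./myprog being a stand-in for an actual HP-MPI-compiled executable).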

That, at least, is how it behaves when I am using HP-MPI. When I switch to
MPICH1, the output looks like this:
[So What?? ~]$ bsub -n15 -o srun.%J.out mpirun -srun -np 15 -machinefile mpd.hosts hostname
Job <14479> is submitted to default queue <normal>.

<output>
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -srun -np 15 -machinefile mpd.hosts hostname
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      0.12 sec.
    Max Memory :         2 MB
    Max Swap   :       103 MB


The output (if any) follows:

Warning: Command line arguments for program should be given
after the program name.  Assuming that hostname is a
command line argument for the program.
Missing: program name
Program -srun either does not exist, is not
executable, or is an erroneous argument to mpirun.
</output>
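
As I read that error, MPICH1's mpirun knows nothing about -srun (that flag
belongs to HP-MPI's mpirun), so it treats '-srun' as the program to launch
and gives up. The plain MPICH1 invocation would presumably be something like

    bsub -n15 -o srun.%J.out mpirun -np 15 -machinefile mpd.hosts hostname

without -srun, but that launcher wants to rsh/ssh into the nodes listed in
mpd.hosts, which is exactly what is forbidden on this cluster.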

The SLURM version that we are using here is:
[So What?? ~]$ srun --version
slurm 1.0.15

That means the patch that website mentions for the SLURM and MPICH1
combination doesn't apply here, as it is for SLURM version 1.2.11 or higher.

If I go to MPICH2 and use it through bsub, it obviously fails, probably
because it wasn't configured with the options that Rajeev had suggested
earlier.
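
If I understand the quickstart page correctly, building MPICH2 against
SLURM's PMI would look roughly like

    ./configure --with-pmi=slurm --with-pm=no
    make && make install

after which the executable is launched with srun directly instead of through
mpirun. I haven't tried this yet, and I don't know whether our SLURM 1.0.15
is recent enough for it.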

The problem boils down to this:
1. The cluster is NOT configured for users to access each node individually;
that is forbidden. I cannot launch my tasks (including starting mpd) on any
node other than the head node.
2. This is done to prevent users from ssh-ing to individual nodes and
submitting jobs there, thereby hogging resources. Users can only submit jobs
to the other nodes via LSF (i.e., with bsub [options] mpirun -srun
./executable).
3. Obviously, since only the HP-MPI implementation allows mpirun to take the
-srun option when used with bsub, only with that implementation can I get my
programs to run on multiple nodes.

So it is not just MPICH+SLURM that I need help with; I also need help with
MPICH+(LSF+SLURM).
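
One thing I may try, if anyone can confirm it is sensible: on HP-XC, LSF is
supposed to create a SLURM allocation for the job, so a batch script that
calls srun directly might inherit that allocation. Something like

    bsub -n8 -o out.%J ./runit.sh

where runit.sh is simply

    #!/bin/sh
    srun ./myprog

(runit.sh and ./myprog are just illustrative names). Whether this works with
an MPICH-built binary presumably depends on the PMI question above.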

I salute your patience.

Gauri.
---------

Date: Wed, 25 Feb 2009 12:34:29 -0600
From: "Rajeev Thakur" <thakur at mcs.anl.gov>
Subject: Re: [mpich-discuss] HP-XC 3000 cluster issues
To: <mpich-discuss at mcs.anl.gov>

Gauri,
         For MPICH-1, the instructions at the bottom of
https://computing.llnl.gov/linux/slurm/quickstart.html may be sufficient (I
don't know).

Rajeev