[mpich-discuss] HP-XC 3000 cluster issues
Dave Goodell
goodell at mcs.anl.gov
Thu Mar 5 07:57:39 CST 2009
Gauri,
Unless I'm misunderstanding your situation, I don't believe that you
should be trying to run MPD. MPICH2 should work directly with SLURM
(at least with srun) when configured with --with-pmi=slurm.
Can you try something like this instead:
% bsub -n 16 srun ./helloworld.mpich2
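(If you want to confirm how an existing MPICH2 install was configured,
one quick check, assuming the standard mpich2version utility is on
your PATH, is:

% mpich2version

which should print the configure options the build used, so you can
verify that --with-pmi=slurm was among them.)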
-Dave
On Mar 5, 2009, at 1:58 AM, Gauri Kulkarni wrote:
> Thanks Dave! This worked. At least I can use bsub, and mpirun is
> accepting the -srun option. Of course, the job is still failing. I
> issued the following command:
> bsub -n15 -o helloworld.mympi.%J.out mpirun -srun ./helloworld.mpich2
> When I checked where the job was running, it showed me two nodes
> assigned. Our cluster is 16 nodes with 8 processors per node, so
> this is exactly what I wanted. But the problem is that no mpd is
> running on those assigned nodes, hence the program cannot run. I
> cannot start mpd on those nodes because users have no ssh/rsh access
> to any nodes other than the head node and the last node. So I
> started mpd on the head node (n0) and the last node (n53) and issued
> the above command with an additional option:
> bsub -n15 -o helloworld.mympi.%J.out -ext "SLURM[nodelist=n0,n53]"
> mpirun -srun ./helloworld.mpich2
> Now the job just remains in the pending state instead of running. I
> cannot do anything with it but kill it.
>
> Since only root can access all the nodes, only root can start mpds
> on all of them. Can root start mpd once, say using "mpdboot -n 16 -f
> <mpd.hosts>", and then let it be? Or do I need to terminate mpd once
> the job is done? Is there a way for users to start mpd on a node as
> it gets assigned through bsub?
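> (For illustration, a hedged sketch of that mpd ring lifecycle using
> the standard MPICH2 mpd commands, assuming root runs them from a
> node with ssh access to the others and mpd.hosts lists one node per
> line:
>
> mpdboot -n 16 -f mpd.hosts   # start one mpd on each node
> mpdtrace                     # verify all nodes joined the ring
> mpdallexit                   # shut the ring down when desired
>
> A ring started this way stays up across jobs until mpdallexit is
> run.)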
>
> Thank you for your replies so far; they are really helping.
>
> Gauri.
> ---------
>
>
> On Wed, Mar 4, 2009 at 10:27 PM, Dave Goodell <goodell at mcs.anl.gov>
> wrote:
> Gauri,
>
> Do you know where your slurm headers and libraries are located? You
> can specify a root for the slurm installation via the
> "--with-slurm=/path/to/slurm/prefix" option to configure.
>
> For example, if you have the following files:
>
> /foo/bar/baz/lib/libpmi.a
> /foo/bar/baz/include/slurm/pmi.h
>
> Then pass "--with-slurm=/foo/bar/baz" to configure. If "/foo/bar/
> baz" is "/usr" or "/" then this should have worked without the "--
> with-slurm" option. Almost any other prefix will require this option.
>
> If you have a nonstandard layout for your slurm installation there
> are other configure arguments you can pass to make everything work
> too. But let's hold off on discussing those until we know that you
> need them.
>
> -Dave
>
>
> On Mar 4, 2009, at 6:40 AM, Gauri Kulkarni wrote:
>
> Ok, I have tried to recompile MPICH2 with the following options. I
> cannot recompile the 'global version', so I have tried to install it
> in my home dir and will update the PATH accordingly. But compiling
> is failing at the 'configure' step with the following error:
>
> command: ./configure --prefix=/data1/visitor/cgaurik/mympi/
> --with-pmi=slurm --with-pm=no
> The end of the output:
> RUNNING CONFIGURE FOR THE SLURM PMI
> checking for make... make
> checking whether clock skew breaks make... no
> checking whether make supports include... yes
> checking whether make allows comments in actions... yes
> checking for virtual path format... VPATH
> checking whether make sets CFLAGS... yes
> checking for gcc... gcc
> checking for C compiler default output file name... a.out
> checking whether the C compiler works... yes
> checking whether we are cross compiling... no
> checking for suffix of executables...
> checking for suffix of object files... o
> checking whether we are using the GNU C compiler... yes
> checking whether gcc accepts -g... yes
> checking for gcc option to accept ANSI C... none needed
> checking how to run the C preprocessor... gcc -E
> checking for slurm/pmi.h... no
> configure: error: could not find slurm/pmi.h. Configure aborted
> configure: error: Configure of src/pmi/slurm failed!
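> (To track down the missing header, something like the following
> should help; the rpm package name is a guess and may differ on your
> system:
>
> find / -name pmi.h 2>/dev/null
> rpm -ql slurm-devel 2>/dev/null | grep pmi.h
>
> The prefix containing include/slurm/pmi.h is what --with-slurm
> should point to, per Dave's note above.)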
>
>
> Gauri.
> ---------
>
>
>
> > > > Message: 4
> > > > Date: Mon, 23 Feb 2009 23:38:06 -0600
> > > > From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> > > > Subject: Re: [mpich-discuss] HP-XC 3000 cluster issues
> > > > To: <mpich-discuss at mcs.anl.gov>
> > > > Message-ID: <72376B2D10EC43F9A0A433C960F951B6 at thakurlaptop>
> > > > Content-Type: text/plain; charset="us-ascii"
> > > >
> > > > To run MPICH2 with SLURM, configure with the options
> > > > "--with-pmi=slurm --with-pm=no" as described in the MPICH2
> > > > README file. Also see the instructions on how to run MPICH2
> > > > with SLURM at
> > > > https://computing.llnl.gov/linux/slurm/quickstart.html .
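> > > > (A minimal sketch of the workflow those instructions describe,
> > > > assuming the SLURM-enabled mpicc is on the PATH:
> > > >
> > > > mpicc -o helloworld helloworld.c
> > > > srun -n 4 ./helloworld
> > > >
> > > > Here srun launches the MPI processes directly through SLURM's
> > > > PMI, so no mpd ring is involved.)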
> > > >
> > > > Rajeev
> > > >
> > > >
> > > >
> > > > _____
> > > >
> > > > From: mpich-discuss-bounces at mcs.anl.gov
> > > > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gauri
> > > > Kulkarni
> > > > Sent: Monday, February 23, 2009 11:19 PM
> > > > To: mpich-discuss at mcs.anl.gov
> > > > Subject: [mpich-discuss] HP-XC 3000 cluster issues
> > > >
> > > >
> > > > Hi,
> > > >
> > > > I am a newbie to MPI in general. Currently in our institute, we
> > > > have a cluster of 16 nodes with 8 processors per node. It is an
> > > > HP-XC 3000 cluster, which basically means it's quite
> > > > proprietary. It has its own MPI implementation - HP-MPI - in
> > > > which the parallelization is managed by SLURM (Simple Linux
> > > > Utility for Resource Management). There is also a batch job
> > > > scheduler - LSF (Load Sharing Facility) - which works in tandem
> > > > with SLURM to parallelize the batch jobs. We have installed
> > > > both MPICH and MPICH2 and are testing them, but we are running
> > > > into compatibility issues. For a simple helloworld.c program:
> > > > 1. For HP-MPI: Compiled with this implementation's mpicc and
> > > > executed with its mpirun, "mpirun -np 4 helloworld" works
> > > > correctly. For batch scheduling, we need to issue "bsub -n4
> > > > [other options] mpirun -srun helloworld" and it runs fine too.
> > > > "srun" is the SLURM utility that parallelizes the jobs.
> > > > 2. For MPICH and MPICH2: Again, compiled with each respective
> > > > implementation's mpicc and executed with its own mpirun:
> > > > i) mpirun -np 4 helloworld: Works.
> > > > ii) mpirun -np 15 helloworld: The parallelization is limited to
> > > > just a single node - that is, 8 processes run first on the 8
> > > > processors of a single node, and then the remaining ones.
> > > > iii) bsub -n4 [options] mpirun -srun helloworld: Job
> > > > terminated; the srun option was not recognized.
> > > > iv) bsub [options] mpirun -np 4 helloworld: Works.
> > > > v) bsub [options] mpirun -np 15 helloworld: Same as iii).
> > > >
> > > > Anybody aware of HP cluster issues with MPICH? Am I
> > > > misinterpreting something? Any help is appreciated.
> > > >
> > > > Gauri.
> > > > ---------
> > >
>
>
>