[mpich-discuss] HP-XC 3000 cluster issues

Gauri Kulkarni gaurivk at gmail.com
Thu Mar 5 01:58:47 CST 2009


Thanks, Dave! This worked. At least I can use bsub, and mpirun is accepting
the -srun option. Of course, the job is still failing. I issued the following
command:
bsub -n15 -o helloworld.mympi.%J.out mpirun -srun ./helloworld.mpich2
When I checked where the job was running, it showed me two nodes assigned.
Our cluster has 16 nodes with 8 processors per node, so this is exactly what
I wanted. But the problem is, no mpd is running on those assigned nodes.
Hence the program cannot run. I cannot start mpd on those nodes because
users have no ssh/rsh access to any node other than the head node and the last
node. So I started mpd on the head node (n0) and the last node (n53) and issued
the above command with an additional option:
bsub -n15 -o helloworld.mympi.%J.out -ext "SLURM[nodelist=n0,n53]" mpirun
-srun ./helloworld.mpich2
Now the job just remains in the pending state instead of running. I cannot do
anything with it except kill it.
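One way to see why LSF is holding the job is to ask it directly. These are
standard LSF commands, but the exact pending reason printed is site-specific,
and <jobid> is a placeholder for whatever id bsub reported:

```shell
# Ask LSF why the job is stuck in PEND; replace <jobid> with the real id.
bjobs -p <jobid>    # short pending reason for the job
bjobs -l <jobid>    # long form, including the external SLURM[] scheduler string
bkill <jobid>       # remove the job once the reason is understood
```

One possible explanation, assuming typical cluster setups: if nodelist=n0,n53
names hosts that LSF does not schedule batch slots on (head/login nodes often
are excluded), the request can never be satisfied and the job pends forever.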

Since only root can access all the nodes, only root can start mpds on all of
them. Can root start mpd once, say using mpdboot -n 16 -f <mpd.hosts>, and
then let it be? Or do I need to terminate mpd once the job is done? Is there
a way for users to start mpd on the node as it gets assigned through bsub?
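If root does start the ring once, a sketch like the following could work. The
file name and node names here are hypothetical, and it is an assumption of this
sketch that the mpd ring is left running between jobs rather than tied to any
one LSF job:

```shell
# Hypothetical mpd.hosts listing the compute nodes, one hostname per line.
# The real node names on this cluster (n0 ... n53) would go here.
printf '%s\n' n0 n1 n2 n53 > mpd.hosts
wc -l < mpd.hosts        # sanity check: one line per node listed

# As root, once, from the head node:
#   mpdboot -n 16 -f mpd.hosts   # start one mpd per node, forming a ring
#   mpdtrace                     # list the nodes that joined the ring
# The ring can stay up between jobs; mpdallexit tears it down when needed.
```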

Thank you for your replies so far, they are really helping.

Gauri.
---------


On Wed, Mar 4, 2009 at 10:27 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:

> Gauri,
>
> Do you know where your slurm headers and libraries are located?  You can
> specify a root for the slurm installation via the
> "--with-slurm=/path/to/slurm/prefix" option to configure.
>
> For example, if you have the following files:
>
> /foo/bar/baz/lib/libpmi.a
> /foo/bar/baz/include/slurm/pmi.h
>
> Then pass "--with-slurm=/foo/bar/baz" to configure.  If "/foo/bar/baz" is
> "/usr" or "/" then this should have worked without the "--with-slurm"
> option.  Almost any other prefix will require this option.
>
> If you have a nonstandard layout for your slurm installation there are
> other configure arguments you can pass to make everything work too.  But
> let's hold off on discussing those until we know that you need it.
>
> -Dave
>
>
> On Mar 4, 2009, at 6:40 AM, Gauri Kulkarni wrote:
>
>  Ok, I have tried to recompile MPICH2 with the following options. I cannot
>> recompile the 'global version', so I have tried to install it in my home dir
>> and will update the PATH accordingly. But compilation is failing at the
>> 'configure' step with the following error:
>>
>> command: ./configure --prefix=/data1/visitor/cgaurik/mympi/
>> --with-pmi=slurm --with-pm=no
>> End part of the output:
>> RUNNING CONFIGURE FOR THE SLURM PMI
>> checking for make... make
>> checking whether clock skew breaks make... no
>> checking whether make supports include... yes
>> checking whether make allows comments in actions... yes
>> checking for virtual path format... VPATH
>> checking whether make sets CFLAGS... yes
>> checking for gcc... gcc
>> checking for C compiler default output file name... a.out
>> checking whether the C compiler works... yes
>> checking whether we are cross compiling... no
>> checking for suffix of executables...
>> checking for suffix of object files... o
>> checking whether we are using the GNU C compiler... yes
>> checking whether gcc accepts -g... yes
>> checking for gcc option to accept ANSI C... none needed
>> checking how to run the C preprocessor... gcc -E
>> checking for slurm/pmi.h... no
>> configure: error: could not find slurm/pmi.h.  Configure aborted
>> configure: error: Configure of src/pmi/slurm failed!
>>
>>
>> Gauri.
>> ---------
>>
>>
>>
>> > > > Message: 4
>> > > > Date: Mon, 23 Feb 2009 23:38:06 -0600
>> > > > From: "Rajeev Thakur" <thakur at mcs.anl.gov>
>> > > > Subject: Re: [mpich-discuss] HP-XC 3000 cluster issues
>> > > > To: <mpich-discuss at mcs.anl.gov>
>> > > > Message-ID: <72376B2D10EC43F9A0A433C960F951B6 at thakurlaptop>
>> > > > Content-Type: text/plain; charset="us-ascii"
>> > > >
>> > > > To run MPICH2 with SLURM, configure with the options
>> > > > "--with-pmi=slurm
>> > > > --with-pm=no" as described in the MPICH2 README file. Also see
>> > the
>> > > > instructions on how to run MPICH2 with SLURM at
>> > > > https://computing.llnl.gov/linux/slurm/quickstart.html .
>> > > >
>> > > > Rajeev
>> > > >
>> > > >
>> > > >
>> > > >  _____
>> > > >
>> > > > From: mpich-discuss-bounces at mcs.anl.gov
>> > > > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gauri
>> > > > Kulkarni
>> > > > Sent: Monday, February 23, 2009 11:19 PM
>> > > > To: mpich-discuss at mcs.anl.gov
>> > > > Subject: [mpich-discuss] HP-XC 3000 cluster issues
>> > > >
>> > > >
>> > > > Hi,
>> > > >
>> > > > I am a newbie to MPI in general. Currently in our institute, we
>> > > > have a cluster of 16 nodes with 8 processors per node. It is an
>> > > > HP-XC 3000 cluster, which basically means it's quite proprietary.
>> > > > It has its own MPI implementation - HP-MPI - in which the
>> > > > parallelization is managed by SLURM (Simple Linux Utility for
>> > > > Resource Management). There is also a batch job scheduler - LSF
>> > > > (Load Sharing Facility) - which works in tandem with SLURM to
>> > > > parallelize batch jobs. We have installed both MPICH and MPICH2
>> > > > and are testing them, but we are running into compatibility
>> > > > issues. For a simple helloworld.c program:
>> > > > 1. For HP-MPI: Compiled with the mpicc of this implementation and
>> > > > executed with its mpirun: "mpirun -np 4 helloworld" works
>> > > > correctly. For batch scheduling, we need to issue "bsub -n4
>> > > > [other options] mpirun -srun helloworld" and it runs fine too.
>> > > > "srun" is the SLURM utility that parallelizes the jobs.
>> > > > 2. For MPICH and MPICH2: Again, compiled with the mpicc of these
>> > > > respective implementations and executed with their own mpirun:
>> > > >    i) mpirun -np 4 helloworld : Works.
>> > > >   ii) mpirun -np 15 helloworld : The parallelization is limited
>> > > > to a single node - that is, 8 processes run first on the 8
>> > > > processors of a single node, and then the remaining ones run.
>> > > >  iii) bsub -n4 [options] mpirun -srun helloworld : Job
>> > > > terminated; the srun option is not recognized.
>> > > >   iv) bsub [options] mpirun -np 4 helloworld : Works.
>> > > >    v) bsub [options] mpirun -np 15 helloworld : (Same as iii)
>> > > >
>> > > > Anybody aware of HP cluster issues with MPICH? Am I
>> > > > misinterpreting something? Any help is appreciated.
>> > > >
>> > > > Gauri.
>> > > > ---------
>> > >
>>
>>
>