[mpich-discuss] HP-XC 3000 cluster issues

Gauri Kulkarni gaurivk at gmail.com
Thu Mar 5 08:23:28 CST 2009


Dave,

The output of "bsub -n 16 srun ./helloworld.mpich2" is as follows: (exactly
as it should be)

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
srun ./helloworld.mpich2
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      0.21 sec.
    Max Memory :        12 MB
    Max Swap   :       178 MB


The output (if any) follows:

Hello world! I'm 0 of 1 on n4
Hello world! I'm 0 of 1 on n5
Hello world! I'm 0 of 1 on n4
Hello world! I'm 0 of 1 on n4
Hello world! I'm 0 of 1 on n4
Hello world! I'm 0 of 1 on n4
Hello world! I'm 0 of 1 on n4
Hello world! I'm 0 of 1 on n4
Hello world! I'm 0 of 1 on n4
Hello world! I'm 0 of 1 on n5
Hello world! I'm 0 of 1 on n5
Hello world! I'm 0 of 1 on n5
Hello world! I'm 0 of 1 on n5
Hello world! I'm 0 of 1 on n5
Hello world! I'm 0 of 1 on n5
Hello world! I'm 0 of 1 on n5
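
For reference, helloworld.mpich2 is just the standard MPI hello world,
essentially along these lines (a minimal sketch - the exact source may differ
slightly):

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI hello world: each process reports its rank, the size of
 * MPI_COMM_WORLD, and the node it is running on. */
int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hello world! I'm %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}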

Of course, using the same command - only with "mpirun -srun
./helloworld.mpich2" - still gives me the error that no mpd is running on the
node to which the job was assigned.

Does this mean I do not need to use mpirun at all when MPICH2 is configured
with SLURM? What about software that makes specific calls to mpirun or
mpiexec?
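
(For reference, the exact form that worked above was "bsub -n 16 srun
./helloworld.mpich2", so I am guessing that software which insists on calling
mpirun or mpiexec would have to be adapted to launch through srun in the same
way - but that is only my assumption.)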

Gauri.
---------


On Thu, Mar 5, 2009 at 7:27 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:

> Gauri,
>
> Unless I'm misunderstanding your situation, I don't believe that you should
> be trying to run MPD.  MPICH2 should work directly with SLURM (at least
> srun) when configured with --with-pmi=slurm.
>
> Can you try something like this instead:
>
> % bsub -n 16 srun ./helloworld.mpich2
>
> -Dave
>
>
> On Mar 5, 2009, at 1:58 AM, Gauri Kulkarni wrote:
>
>> Thanks Dave! This worked. At least I can use bsub, and mpirun is accepting
>> the srun option. Of course, the job is still failing. I issued the following
>> command:
>> bsub -n15 -o helloworld.mympi.%J.out mpirun -srun ./helloworld.mpich2
>> When I checked where the job was running, it showed me two nodes assigned.
>> Our cluster is 16 nodes with 8 processors per node, so this is exactly what
>> I wanted. But the problem is, no mpd is running on those assigned nodes.
>> Hence the program cannot run. I cannot start mpd on those nodes because
>> users have no ssh/rsh access to any nodes other than the head node and the
>> last node. So I started mpd on the head node (n0) and the last node (n53)
>> and issued the above command with an additional option:
>> bsub -n15 -o helloworld.mympi.%J.out -ext "SLURM[nodelist=n0,n53]" mpirun
>> -srun ./helloworld.mpich2
>> Now the job just remains in the pending state instead of running. I cannot
>> do anything with it but kill it.
>>
>> Since only root can access all the nodes, only root can start mpds on all
>> of them. Can root start mpd once, say using mpdboot -n 16 -f <mpd.hosts>,
>> and then let it be? Or do I need to terminate mpd once the job is done? Is
>> there a way for users to start mpd on the nodes as they get assigned through
>> bsub?
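>>
>> (For completeness, the <mpd.hosts> file I have in mind would just be a plain
>> text list of the compute node names, one per line - something like
>>
>> n1
>> n2
>> n3
>>
>> and so on for all 16 nodes, with whatever the real node names are - though
>> since users cannot reach the compute nodes, I have not been able to test
>> this myself.)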
>>
>> Thank you for your replies so far; they are really helping.
>>
>> Gauri.
>> ---------
>>
>>
>> On Wed, Mar 4, 2009 at 10:27 PM, Dave Goodell <goodell at mcs.anl.gov>
>> wrote:
>> Gauri,
>>
>> Do you know where your slurm headers and libraries are located?  You can
>> specify a root for the slurm installation via the
>> "--with-slurm=/path/to/slurm/prefix" option to configure.
>>
>> For example, if you have the following files:
>>
>> /foo/bar/baz/lib/libpmi.a
>> /foo/bar/baz/include/slurm/pmi.h
>>
>> Then pass "--with-slurm=/foo/bar/baz" to configure.  If "/foo/bar/baz" is
>> "/usr" or "/" then this should have worked without the "--with-slurm"
>> option.  Almost any other prefix will require this option.
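>>
>> For your installation, that would be something like the following, with the
>> real slurm prefix substituted for the placeholder "/foo/bar/baz":
>>
>> ./configure --prefix=/data1/visitor/cgaurik/mympi/ --with-pmi=slurm \
>>     --with-pm=no --with-slurm=/foo/bar/baz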
>>
>> If you have a nonstandard layout for your slurm installation there are
>> other configure arguments you can pass to make everything work too.  But
>> let's hold off on discussing those until we know that you need it.
>>
>> -Dave
>>
>>
>> On Mar 4, 2009, at 6:40 AM, Gauri Kulkarni wrote:
>>
>> OK, I have tried to recompile MPICH2 with the following options. I cannot
>> recompile the 'global version', so I have tried to install it in my home
>> directory and will update the PATH accordingly. But the build is failing at
>> the 'configure' step with the following error:
>>
>> command: ./configure --prefix=/data1/visitor/cgaurik/mympi/
>> --with-pmi=slurm --with-pm=no
>> End part of the output:
>> RUNNING CONFIGURE FOR THE SLURM PMI
>> checking for make... make
>> checking whether clock skew breaks make... no
>> checking whether make supports include... yes
>> checking whether make allows comments in actions... yes
>> checking for virtual path format... VPATH
>> checking whether make sets CFLAGS... yes
>> checking for gcc... gcc
>> checking for C compiler default output file name... a.out
>> checking whether the C compiler works... yes
>> checking whether we are cross compiling... no
>> checking for suffix of executables...
>> checking for suffix of object files... o
>> checking whether we are using the GNU C compiler... yes
>> checking whether gcc accepts -g... yes
>> checking for gcc option to accept ANSI C... none needed
>> checking how to run the C preprocessor... gcc -E
>> checking for slurm/pmi.h... no
>> configure: error: could not find slurm/pmi.h.  Configure aborted
>> configure: error: Configure of src/pmi/slurm failed!
>>
>>
>> Gauri.
>> ---------
>>
>>
>>
>> > > > Message: 4
>> > > > Date: Mon, 23 Feb 2009 23:38:06 -0600
>> > > > From: "Rajeev Thakur" <thakur at mcs.anl.gov>
>> > > > Subject: Re: [mpich-discuss] HP-XC 3000 cluster issues
>> > > > To: <mpich-discuss at mcs.anl.gov>
>> > > > Message-ID: <72376B2D10EC43F9A0A433C960F951B6 at thakurlaptop>
>> > > > Content-Type: text/plain; charset="us-ascii"
>> > > >
>> > > > To run MPICH2 with SLURM, configure with the options "--with-pmi=slurm
>> > > > --with-pm=no" as described in the MPICH2 README file. Also see the
>> > > > instructions on how to run MPICH2 with SLURM at
>> > > > https://computing.llnl.gov/linux/slurm/quickstart.html .
>> > > >
>> > > > Rajeev
>> > > >
>> > > >
>> > > >
>> > > >  _____
>> > > >
>> > > > From: mpich-discuss-bounces at mcs.anl.gov
>> > > > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gauri
>> > > > Kulkarni
>> > > > Sent: Monday, February 23, 2009 11:19 PM
>> > > > To: mpich-discuss at mcs.anl.gov
>> > > > Subject: [mpich-discuss] HP-XC 3000 cluster issues
>> > > >
>> > > >
>> > > > Hi,
>> > > >
>> > > > I am a newbie to MPI in general. Currently in our institute, we have
>> > > > a cluster of 16 nodes with 8 processors per node. It is an HP-XC 3000
>> > > > cluster, which basically means it's quite proprietary. It has its own
>> > > > MPI implementation - HP-MPI - in which the parallelization is managed
>> > > > by SLURM (Simple Linux Utility for Resource Management). There is
>> > > > also a batch job scheduler - LSF (Load Sharing Facility) - which
>> > > > works in tandem with SLURM to parallelize the batch jobs. We have
>> > > > installed both MPICH and MPICH2 and are testing them, but we are
>> > > > running into compatibility issues. For a simple helloworld.c program:
>> > > > 1. For HP-MPI: Compiled with the mpicc of this implementation and
>> > > > executed with its mpirun: "mpirun -np 4 helloworld" works correctly.
>> > > > For batch scheduling, we need to issue "bsub -n4 [other options]
>> > > > mpirun -srun helloworld" and it runs fine too. "srun" is the SLURM
>> > > > utility that parallelizes the jobs.
>> > > > 2. For MPICH and MPICH2: Again, compiled with the mpicc of these
>> > > > respective implementations and executed with their own mpirun:
>> > > >    i) mpirun -np 4 helloworld : Works.
>> > > >   ii) mpirun -np 15 helloworld : The parallelization is limited to
>> > > > just a single node - that is, 8 processes run first on the 8
>> > > > processors of a single node and then the remaining ones follow.
>> > > >  iii) bsub -n4 [options] mpirun -srun helloworld : Job terminated;
>> > > > the srun option is not recognized.
>> > > >   iv) bsub [options] mpirun -np 4 helloworld : Works.
>> > > >    v) bsub [options] mpirun -np 15 helloworld : (Same as iii)
>> > > >
>> > > > Anybody aware of HP cluster issues with MPICH? Am I misinterpreting
>> > > > something? Any help is appreciated.
>> > > >
>> > > > Gauri.
>> > > > ---------
>> > >
>>
>>
>>
>>
>