Dave,<br><br>The output of "bsub -n 16 srun ./helloworld.mpich2" is as follows (exactly as it should be):<br><br>Your job looked like:<br><br>------------------------------------------------------------<br># LSBATCH: User input<br>
srun ./helloworld.mpich2<br>------------------------------------------------------------<br><br>Successfully completed.<br><br>Resource usage summary:<br><br> CPU time : 0.21 sec.<br> Max Memory : 12 MB<br>
Max Swap : 178 MB<br><br><br>The output (if any) follows:<br><br>Hello world! I'm 0 of 1 on n4<br>Hello world! I'm 0 of 1 on n5<br>Hello world! I'm 0 of 1 on n4<br>Hello world! I'm 0 of 1 on n4<br>
Hello world! I'm 0 of 1 on n4<br>Hello world! I'm 0 of 1 on n4<br>Hello world! I'm 0 of 1 on n4<br>Hello world! I'm 0 of 1 on n4<br>Hello world! I'm 0 of 1 on n4<br>Hello world! I'm 0 of 1 on n5<br>
Hello world! I'm 0 of 1 on n5<br>Hello world! I'm 0 of 1 on n5<br>Hello world! I'm 0 of 1 on n5<br>Hello world! I'm 0 of 1 on n5<br>Hello world! I'm 0 of 1 on n5<br>Hello world! I'm 0 of 1 on n5<br>
<br>Of course, using the same command - only with mpirun -srun ./helloworld.mpich2 - still gives me the error that no mpd is running on the node to which the job was assigned.<br><br clear="all">Does this mean I do not need to use mpirun when MPICH2 is configured with SLURM? What about software packages that make specific calls to mpirun or mpiexec?<br>
<br>Gauri.<br>---------<br>
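Taken together, the submission pattern being tested in this thread can be sketched as a short job script; this is a hedged summary of the commands above, assuming an MPICH2 build configured with --with-pmi=slurm --with-pm=no (the binary name, core count, and output file are the thread's own examples, not fixed requirements):

```shell
#!/bin/sh
# Sketch of the LSF + SLURM submission discussed in this thread.
# Assumes MPICH2 was configured with --with-pmi=slurm --with-pm=no,
# so srun itself launches the ranks and provides PMI: no mpirun,
# no mpiexec, and no mpd daemons are involved.

# Ask LSF for 16 slots and let srun start the MPI processes.
bsub -n 16 -o helloworld.%J.out srun ./helloworld.mpich2
```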
<br><br><div class="gmail_quote">On Thu, Mar 5, 2009 at 7:27 PM, Dave Goodell <span dir="ltr"><<a href="mailto:goodell@mcs.anl.gov">goodell@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Gauri,<br>
<br>
Unless I'm misunderstanding your situation, I don't believe that you should be trying to run MPD. MPICH2 should work directly with SLURM (at least srun) when configured with --with-pmi=slurm.<br>
<br>
Can you try something like this instead:<br>
<br>
% bsub -n 16 srun ./helloworld.mpich2<br><font color="#888888">
<br>
-Dave</font><div><div></div><div class="h5"><br>
<br>
On Mar 5, 2009, at 1:58 AM, Gauri Kulkarni wrote:<br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Thanks Dave! This worked. At least I can use bsub, and mpirun is accepting the srun option. Of course, the job is still failing. I issued the following command:<br>
bsub -n15 -o helloworld.mympi.%J.out mpirun -srun ./helloworld.mpich2<br>
When I checked where the job was running, it showed me two nodes assigned. Our cluster is 16 nodes with 8 processors per node, so this is exactly what I wanted. But the problem is that no mpd is running on those assigned nodes, hence the program cannot run. I cannot start mpd on those nodes because users have no ssh/rsh access to any nodes other than the head node and the last node. So I started mpd on the head node (n0) and the last node (n53) and issued the above command with an additional option:<br>
bsub -n15 -o helloworld.mympi.%J.out -ext "SLURM[nodelist=n0,n53]" mpirun -srun ./helloworld.mpich2<br>
Now the job just remains in the pending state instead of running. I cannot do anything with it but kill it.<br>
<br>
Since only root can access all the nodes, only root can start mpds on all of them. Can root start mpd once, say using mpdboot -n 16 -f <mpd.hosts>, and then let it be? Or does mpd need to be terminated once the job is done? Is there a way for users to start mpd on a node as it gets assigned through bsub?<br>
<br>
Thank you for your replies so far, they are really helping.<br>
<br>
Gauri.<br>
---------<br>
<br>
<br>
On Wed, Mar 4, 2009 at 10:27 PM, Dave Goodell <<a href="mailto:goodell@mcs.anl.gov" target="_blank">goodell@mcs.anl.gov</a>> wrote:<br>
Gauri,<br>
<br>
Do you know where your slurm headers and libraries are located? You can specify a root for the slurm installation via the "--with-slurm=/path/to/slurm/prefix" option to configure.<br>
<br>
For example, if you have the following files:<br>
<br>
/foo/bar/baz/lib/libpmi.a<br>
/foo/bar/baz/include/slurm/pmi.h<br>
<br>
Then pass "--with-slurm=/foo/bar/baz" to configure. If "/foo/bar/baz" is "/usr" or "/" then this should have worked without the "--with-slurm" option. Almost any other prefix will require this option.<br>
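In other words, the prefix is just the header's path with the trailing include/slurm/pmi.h component removed; a minimal sketch using Dave's hypothetical /foo/bar/baz layout (the path is his placeholder, not a real installation):

```shell
# Derive the --with-slurm prefix from the pmi.h location.
# /foo/bar/baz is the hypothetical path from the message above.
header=/foo/bar/baz/include/slurm/pmi.h

# Strip the known suffix to recover the installation prefix.
prefix=${header%/include/slurm/pmi.h}

echo "--with-slurm=$prefix"
```

The same prefix must also contain lib/libpmi.a for the link step to succeed.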
<br>
If you have a nonstandard layout for your slurm installation, there are other configure arguments you can pass to make everything work too. But let's hold off on discussing those until we know that you need them.<br>
<br>
-Dave<br>
<br>
<br>
On Mar 4, 2009, at 6:40 AM, Gauri Kulkarni wrote:<br>
<br>
Ok, I have tried to recompile MPICH2 with the following options. I cannot recompile the 'global version', so I have tried to install it in my home directory and will update the PATH accordingly. But compilation is failing at the 'configure' step with the following error:<br>
<br>
command: ./configure --prefix=/data1/visitor/cgaurik/mympi/ --with-pmi=slurm --with-pm=no<br>
End part of the output:<br>
RUNNING CONFIGURE FOR THE SLURM PMI<br>
checking for make... make<br>
checking whether clock skew breaks make... no<br>
checking whether make supports include... yes<br>
checking whether make allows comments in actions... yes<br>
checking for virtual path format... VPATH<br>
checking whether make sets CFLAGS... yes<br>
checking for gcc... gcc<br>
checking for C compiler default output file name... a.out<br>
checking whether the C compiler works... yes<br>
checking whether we are cross compiling... no<br>
checking for suffix of executables...<br>
checking for suffix of object files... o<br>
checking whether we are using the GNU C compiler... yes<br>
checking whether gcc accepts -g... yes<br>
checking for gcc option to accept ANSI C... none needed<br>
checking how to run the C preprocessor... gcc -E<br>
checking for slurm/pmi.h... no<br>
configure: error: could not find slurm/pmi.h. Configure aborted<br>
configure: error: Configure of src/pmi/slurm failed!<br>
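This failure means configure could not see slurm/pmi.h on the default search paths. One way to locate the header before re-running configure is a small helper like the following; the candidate roots are guesses, not known HP-XC install locations, so adjust them for your system:

```shell
#!/bin/sh
# find_slurm_prefix: print the first root (from the arguments) that
# contains include/slurm/pmi.h. The roots are caller-supplied guesses.
find_slurm_prefix() {
    for root in "$@"; do
        if [ -f "$root/include/slurm/pmi.h" ]; then
            echo "$root"
            return 0
        fi
    done
    return 1
}

# Example: print a configure line only if a prefix was found.
if prefix=$(find_slurm_prefix /usr /usr/local /opt/hptc /opt/slurm); then
    echo "./configure --with-pmi=slurm --with-pm=no --with-slurm=$prefix"
fi
```

If none of the guessed roots match, `rpm -ql slurm-devel` (where applicable) or asking the cluster administrator for the SLURM development package location are the usual next steps.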
<br>
<br>
Gauri.<br>
---------<br>
<br>
<br>
<br>
> > > Message: 4<br>
> > > Date: Mon, 23 Feb 2009 23:38:06 -0600<br>
> > > From: "Rajeev Thakur" <<a href="mailto:thakur@mcs.anl.gov" target="_blank">thakur@mcs.anl.gov</a>><br>
> > > Subject: Re: [mpich-discuss] HP-XC 3000 cluster issues<br>
> > > To: <<a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a>><br>
> > > Message-ID: <72376B2D10EC43F9A0A433C960F951B6@thakurlaptop><br>
> > > Content-Type: text/plain; charset="us-ascii"<br>
> > ><br>
> > > To run MPICH2 with SLURM, configure with the options<br>
> > > "--with-pmi=slurm --with-pm=no" as described in the MPICH2 README<br>
> > > file. Also see the instructions on how to run MPICH2 with SLURM at<br>
> > > <a href="https://computing.llnl.gov/linux/slurm/quickstart.html" target="_blank">https://computing.llnl.gov/linux/slurm/quickstart.html</a> .<br>
> > ><br>
> > > Rajeev<br>
> > ><br>
> > ><br>
> > ><br>
> > > _____<br>
> > ><br>
> > > From: <a href="mailto:mpich-discuss-bounces@mcs.anl.gov" target="_blank">mpich-discuss-bounces@mcs.anl.gov</a><br>
> > > [mailto:<a href="mailto:mpich-discuss-bounces@mcs.anl.gov" target="_blank">mpich-discuss-bounces@mcs.anl.gov</a>] On Behalf Of Gauri<br>
> > > Kulkarni<br>
> > > Sent: Monday, February 23, 2009 11:19 PM<br>
> > > To: <a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a><br>
> > > Subject: [mpich-discuss] HP-XC 3000 cluster issues<br>
> > ><br>
> > ><br>
> > > Hi,<br>
> > ><br>
> > > I am a newbie to MPI in general. Our institute currently has a<br>
> > > cluster of 16 nodes with 8 processors each. It is an HP-XC 3000<br>
> > > cluster, which basically means it's quite proprietary. It has its<br>
> > > own MPI implementation, HP-MPI, in which parallelization is managed<br>
> > > by SLURM (Simple Linux Utility for Resource Management). There is<br>
> > > also a batch job scheduler, LSF (Load Sharing Facility), which<br>
> > > works in tandem with SLURM to parallelize batch jobs. We have<br>
> > > installed both MPICH and MPICH2 and are testing them, but we are<br>
> > > running into compatibility issues. For a simple helloworld.c<br>
> > > program:<br>
> > > 1. For HP-MPI: Compiled with this implementation's mpicc and<br>
> > > executed with its mpirun: "mpirun -np 4 helloworld" works<br>
> > > correctly. For batch scheduling, we need to issue "bsub -n4 [other<br>
> > > options] mpirun -srun helloworld", and it runs fine too. "srun" is<br>
> > > the SLURM utility that parallelizes the jobs.<br>
> > > 2. For MPICH and MPICH2: Again, compiled with the mpicc of these<br>
> > > respective implementations and executed with their own mpirun:<br>
> > > i) mpirun -np 4 helloworld: Works.<br>
> > > ii) mpirun -np 15 helloworld: The parallelization is limited to a<br>
> > > single node - that is, 8 processes run first on the 8 processors<br>
> > > of a single node, and then the remaining ones run.<br>
> > > iii) bsub -n4 [options] mpirun -srun helloworld: Job terminated;<br>
> > > the srun option is not recognized.<br>
> > > iv) bsub [options] mpirun -np 4 helloworld: Works.<br>
> > > v) bsub [options] mpirun -np 15 helloworld: (Same as iii)<br>
> > ><br>
> > > Is anybody aware of HP cluster issues with MPICH? Am I<br>
> > > misinterpreting something? Any help is appreciated.<br>
> > ><br>
> > > Gauri.<br>
> > > ---------<br>
> ><br>
<br>
<br>
<br>
</blockquote>
<br>
</div></div></blockquote></div><br>