[mpich-discuss] What do these errors mean?
Pavan Balaji
balaji at mcs.anl.gov
Thu Apr 2 01:07:29 CDT 2009
Gauri,
This might be a repetition of things you already know, but here's a
brief description of the relationship between PMI and the process
manager (each process manager has its own mpiexec).
MPI applications talk to a process manager to get launched, as well as
to get information such as their rank, the size of the job, etc. The
protocol they use for this communication is called PMI. Unfortunately,
everyone went out and implemented their own PMI library, e.g., (a)
simple PMI (MPICH2's default PMI library), (b) smpd PMI (for
Linux/Windows compatibility; will be deprecated soon) and (c) slurm PMI
(implemented by the slurm guys).
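If you are not sure which PMI library a particular build or binary
ended up with, a rough way to check is something like the following
(this assumes the mpich2version script from that installation is in
your PATH and that the binary is not stripped; paths are taken from
your script and are just examples):

  /data1/visitor/cgaurik/mympi/bin/mpich2version   # prints the configure options the build used
  nm ./helloworld.mympi | grep PMI_                # shows which PMI symbols got linked in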
MPD, Gforker, Remshell, Hydra, OSC mpiexec, OSU mpirun, and probably
many other process managers use the simple PMI wire protocol. So, as
long as the MPI application is linked with the simple PMI library, you
can use any of these process managers interchangeably. The simple PMI
library is what you are linked against by default when you build MPICH2
using the default options.
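For example, with a default build, any of these launchers should work
for the same binary. A rough sketch with MPD (the host file and counts
are just placeholders):

  mpdboot -n 4 -f mpd.hosts          # start an MPD ring on 4 hosts
  mpiexec -n 8 ./helloworld.mympi    # MPD's mpiexec, speaking simple PMI
  mpdallexit                         # shut the ring down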
srun uses slurm PMI. When you configure MPICH2 using --with-pmi=slurm,
it should link with the slurm PMI library (I didn't go and check the
code, but I think this is correct). Only srun is compatible with this
slurm PMI library, so that is the only launcher you can use. Note that
the slurm folks came out with their own "mpiexec" executable, which
essentially wraps around srun, so it uses the slurm PMI as well and
should work, though I've never used it.
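So, with a slurm-PMI build, the launch has to go through slurm; roughly
(the -n values are just examples):

  srun -n 8 ./helloworld.mympi        # directly through slurm
  bsub -n 8 srun ./helloworld.mympi   # or under LSF, as in your script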
So, in some sense, mpiexec or srun is just a user interface for you to
talk in the appropriate PMI wire protocol. If you have a mismatch, the
MPI processes will not be able to detect their rank, the job size, etc.,
so all processes think they are rank 0 (this is what you are seeing).
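In your setup that roughly corresponds to:

  # mismatch: binary linked with slurm PMI, launched by MPD's mpiexec
  mpd &
  mpiexec -np 8 ./helloworld.mympi   # every process thinks it is rank 0
  # match: the same binary launched through slurm
  srun -n 8 ./helloworld.mympi       # ranks 0..7, as expected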
What is your exact requirement? If your requirement is to use the
"mpiexec" interface, you can either use MPD (which will require MPICH2's
default configuration), or try to download slurm's mpiexec (which will
require MPICH2 to be configured using --with-slurm=<foo>).
On the other hand, if your requirement is to use slurm, but you don't
care about the actual user interface, you can configure MPICH2 using
--with-slurm=<foo> and just use srun.
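In configure terms, the two options look roughly like this (<foo>
stands for your slurm installation path, and the prefix is just a
placeholder):

  # option 1: MPICH2's default configuration (simple PMI + MPD), use mpiexec
  ./configure --prefix=/path/to/install
  # option 2: slurm build, use srun (or slurm's mpiexec)
  ./configure --with-pmi=slurm --with-pm=no --with-slurm=<foo> --prefix=/path/to/install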
As a final note, we are working on defining PMI-2, which will hopefully
unify all of this into a single library so that these issues go away.
Hope that helps.
-- Pavan
Gauri Kulkarni wrote:
> I cannot, because I get the SAME errors. Following is the LSF script I
> used to launch the job.
>
> #!/bin/bash
> #BSUB -L /bin/bash
> #BSUB -n 8
> #BSUB -N
> #BSUB -o /data1/visitor/cgaurik/testmpi/helloworld.mympi.mpiexec.%J.out
>
> cd /data1/visitor/cgaurik/testmpi
> /data1/visitor/cgaurik/mympi/bin/mpiexec -np 8 ./helloworld.mympi
>
> The job is NOT parallelized, i.e., every process is rank 0, and the
> errors are the same. Of course, if I change the last line of the script
> (as, I think, Dave pointed out) to srun ./helloworld.mympi, everything
> is all rosy. My question is (maybe it's obvious...): if my mpich2 is
> configured with the options "--with-pmi=slurm --with-pm=no
> --with-slurm=/path/to/slurm/lib", can I still use mpiexec?
>
> Gauri.
> ---------
>
>
> On Wed, Apr 1, 2009 at 11:42 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>
> You need to use the mpicc and mpiexec from the MPICH2 installation
> that was built to use MPD.
>
> Rajeev
>
>
> ------------------------------------------------------------------------
> *From:* mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of *Gauri
> Kulkarni
> *Sent:* Wednesday, April 01, 2009 8:56 AM
> *To:* mpich-discuss at mcs.anl.gov
> *Subject:* [mpich-discuss] What do these errors mean?
>
> Hi,
>
> I am using MPICH2-1.0.7 (I cannot go to 1.0.8 right now), which
> is configured to be used with SLURM. That is, the process
> manager is SLURM and NOT mpd. When I submit my job through bsub
> (bsub [options] srun ./helloworld.mympi), it works perfectly. I
> cannot use mpiexec as it is not the one spawning the jobs; I must
> use srun. My question is, can I still use mpiexec from the
> command line? Well... I tried. Here is the output:
> mpiexec -n 2 ./helloworld.mympi
> mpiexec_n53: cannot connect to local mpd
> (/tmp/mpd2.console_cgaurik); possible causes:
> 1. no mpd is running on this host
> 2. an mpd is running but was started without a "console" (-n
> option)
> In case 1, you can start an mpd on this host with:
> mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
>
> Then:
>
> mpd &
> mpiexec -n 2 ./helloworld.mympi
>
> *Hello world! I'm 0 of 2 on n53*
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(255)...................: MPI_Finalize failed
> MPI_Finalize(154)...................:
> MPID_Finalize(94)...................:
> MPI_Barrier(406)....................:
> MPI_Barrier(comm=0x44000002) failed
> MPIR_Barrier(77)....................:
> MPIC_Sendrecv(120)..................:
> MPID_Isend(103).....................: failure occurred while
> attempting to send an eager message
> MPIDI_CH3_iSend(172)................:
> MPIDI_CH3I_VC_post_sockconnect(1090):
> MPIDI_PG_SetConnInfo(615)...........: PMI_KVS_Get failedFatal
> error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(255)...................: MPI_Finalize failed
> MPI_Finalize(154)...................:
> MPID_Finalize(94)...................:
> MPI_Barrier(406)....................:
> MPI_Barrier(comm=0x44000002) failed
> MPIR_Barrier(77)....................:
> MPIC_Sendrecv(120)..................:
> MPID_Isend(103).....................: failure occurred while
> attempting to send an eager message
> MP*Hello world! I'm 1 of 2 on n53*
> IDI_CH3_iSend(172)................:
> MPIDI_CH3I_VC_post_sockconnect(1090):
> MPIDI_PG_SetConnInfo(615)...........: PMI_KVS_Get failed
>
> The bold text shows that the job gets executed, but there is a
> lot of other garbage. It seems to me that I can either configure
> MPICH2 to be used with the cluster job scheduler or to be used
> from the command line. I cannot have both.
>
> Am I right?
>
> -Gauri.
> ----------
>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji