[mpich-discuss] What do these errors mean?

Pavan Balaji balaji at mcs.anl.gov
Thu Apr 2 01:07:29 CDT 2009


Gauri,

This might be a repetition of things you already know, but here's a 
brief description of the relationship between PMI and the process 
manager (each process manager has its own mpiexec).

MPI applications talk to the process manager to get launched as well as 
to obtain information such as their rank, the size of the job, etc. The 
protocol used for this communication is called PMI. Unfortunately, everyone 
went out and implemented their own PMI library, e.g., (a) simple PMI 
(MPICH2's default PMI library), (b) smpd PMI (for Linux/Windows 
compatibility; it will be deprecated soon) and (c) slurm PMI (implemented 
by the slurm guys).

MPD, Gforker, Remshell, Hydra, OSC mpiexec, OSU mpirun, and probably many 
other process managers use the simple PMI wire protocol. So, as long as 
the MPI application is linked with the simple PMI library, you can use 
any of these process managers interchangeably. The simple PMI library is 
what you link against by default when you build MPICH2 with the default 
options.
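
For example, with a default build, the same binary can be launched 
through MPD's mpiexec (or any other simple-PMI process manager). This is 
just a rough sketch; the install prefix and process count are placeholders:

   # default configuration: links the simple PMI library and builds MPD
   ./configure --prefix=/path/to/install
   make && make install

   # start a local mpd and launch through its mpiexec
   mpd &
   mpiexec -n 4 ./helloworld.mympi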

srun uses slurm PMI. When you configure MPICH2 using --with-pmi=slurm, 
it should link with the slurm PMI library (I didn't go and check the 
code, but I think this is correct). Only srun is compatible with this 
slurm PMI library, so it is the only launcher that can be used. Note that 
the slurm folks came out with their own "mpiexec" executable, which 
essentially wraps around srun, so it uses slurm PMI as well and should 
work, though I've never used it.
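
To make the contrast concrete, a slurm-PMI build (mirroring the configure 
options you listed; the slurm library path is whatever your site uses) can 
only be launched through srun, e.g. under LSF the way you already do:

   # link against the slurm PMI library; no built-in process manager
   ./configure --with-pmi=slurm --with-pm=no --with-slurm=/path/to/slurm/lib
   make && make install

   # launch only through slurm's srun (here submitted via bsub)
   bsub [options] srun ./helloworld.mympi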

So, in some sense, mpiexec or srun is just the user interface through 
which you speak the appropriate PMI wire protocol. If there is a mismatch, 
the MPI processes will not be able to determine their rank, the job size, 
etc., so every process thinks it is rank 0 (which is what you are seeing).

What is your exact requirement? If your requirement is to use the 
"mpiexec" interface, you can either use MPD (which requires MPICH2's 
default configuration), or try to download slurm's mpiexec (which 
requires MPICH2 to be configured with --with-slurm=<foo>).

On the other hand, if your requirement is to use slurm but you don't care 
about the actual user interface, you can configure MPICH2 with 
--with-slurm=<foo> and just use srun.
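
Concretely, with your current slurm-PMI build that would mean keeping 
your LSF script exactly as it is and only replacing the mpiexec line with 
srun (this is just your script below with the last line changed):

   #!/bin/bash
   #BSUB -L /bin/bash
   #BSUB -n 8
   #BSUB -N
   #BSUB -o /data1/visitor/cgaurik/testmpi/helloworld.mympi.mpiexec.%J.out

   cd /data1/visitor/cgaurik/testmpi
   srun ./helloworld.mympi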

As a final note, we are working on defining PMI-2, which will hopefully 
unify all of this into a single library so that these issues go away.

Hope that helps.

  -- Pavan

Gauri Kulkarni wrote:
> I cannot, because I get the SAME errors. Following is the LSF script I 
> used to launch the job.
> 
> #!/bin/bash
> #BSUB -L /bin/bash
> #BSUB -n 8
> #BSUB -N
> #BSUB -o /data1/visitor/cgaurik/testmpi/helloworld.mympi.mpiexec.%J.out
> 
> cd /data1/visitor/cgaurik/testmpi
> /data1/visitor/cgaurik/mympi/bin/mpiexec -np 8 ./helloworld.mympi
> 
> The job is NOT parallelized, i.e., every process is rank 0, and the errors 
> are the same. Of course, if I change the last line of the script (as, I 
> think, Dave pointed out) to srun ./helloworld.mympi, everything is all 
> rosy. My question is (maybe it's obvious...): if my mpich2 is 
> configured with the options "--with-pmi=slurm --with-pm=no 
> --with-slurm=/path/to/slurm/lib", can I still use mpiexec?
> 
> Gauri.
> ---------
> 
> 
> On Wed, Apr 1, 2009 at 11:42 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> 
>     You need to use the mpicc and mpiexec from the MPICH2 installation
>     that was built to use MPD.
>      
>     Rajeev
>      
> 
>         ------------------------------------------------------------------------
>         *From:* mpich-discuss-bounces at mcs.anl.gov
>         [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* Gauri
>         Kulkarni
>         *Sent:* Wednesday, April 01, 2009 8:56 AM
>         *To:* mpich-discuss at mcs.anl.gov
>         *Subject:* [mpich-discuss] What do these errors mean?
> 
>         Hi,
> 
>         I am using MPICH2-1.0.7 (I cannot go to 1.0.8 right now), which
>         is configured to be used with SLURM. That is, the process
>         manager is SLURM and NOT mpd. When I submit my job through bsub
>         (bsub [options] srun ./helloworld.mympi), it works perfectly. I
>         cannot use mpiexec, as it is not the one spawning the jobs; I
>         must use srun. My question is: can I still use mpiexec from the
>         command line? Well... I tried. Here is the output:
> 
>         mpiexec -n 2 ./helloworld.mympi
>         mpiexec_n53: cannot connect to local mpd
>         (/tmp/mpd2.console_cgaurik); possible causes:
>           1. no mpd is running on this host
>           2. an mpd is running but was started without a "console" (-n
>         option)
>         In case 1, you can start an mpd on this host with:
>             mpd &
>         and you will be able to run jobs just on this host.
>         For more details on starting mpds on a set of hosts, see
>         the MPICH2 Installation Guide.
> 
>         Then:
> 
>         mpd &
>         mpiexec -n 2 ./helloworld.mympi
> 
>         *Hello world! I'm 0 of 2 on n53*
>         Fatal error in MPI_Finalize: Other MPI error, error stack:
>         MPI_Finalize(255)...................: MPI_Finalize failed
>         MPI_Finalize(154)...................:
>         MPID_Finalize(94)...................:
>         MPI_Barrier(406)....................:
>         MPI_Barrier(comm=0x44000002) failed
>         MPIR_Barrier(77)....................:
>         MPIC_Sendrecv(120)..................:
>         MPID_Isend(103).....................: failure occurred while
>         attempting to send an eager message
>         MPIDI_CH3_iSend(172)................:
>         MPIDI_CH3I_VC_post_sockconnect(1090):
>         MPIDI_PG_SetConnInfo(615)...........: PMI_KVS_Get failedFatal
>         error in MPI_Finalize: Other MPI error, error stack:
>         MPI_Finalize(255)...................: MPI_Finalize failed
>         MPI_Finalize(154)...................:
>         MPID_Finalize(94)...................:
>         MPI_Barrier(406)....................:
>         MPI_Barrier(comm=0x44000002) failed
>         MPIR_Barrier(77)....................:
>         MPIC_Sendrecv(120)..................:
>         MPID_Isend(103).....................: failure occurred while
>         attempting to send an eager message
>         MP*Hello world! I'm 1 of 2 on n53*
>         IDI_CH3_iSend(172)................:
>         MPIDI_CH3I_VC_post_sockconnect(1090):
>         MPIDI_PG_SetConnInfo(615)...........: PMI_KVS_Get failed
> 
>         The bold text shows that the job gets executed, but there is a
>         lot of other garbage. It seems to me that I can either configure
>         MPICH2 to be used with the cluster job scheduler or to be used
>         from the command line; I cannot have both.
> 
>         Am I right?
> 
>         -Gauri.
>         ----------
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

