[mpich-discuss] What do these errors mean?

Gauri Kulkarni gaurivk at gmail.com
Thu Apr 2 05:27:56 CDT 2009


No, I didn't know all this! But yes, it does make sense now. Thanks so much.
Our cluster has slurm version 1.0.15 installed on it. From what I can gather
from this page (https://computing.llnl.gov/linux/slurm/news.html), the
mpiexec wrapper for SLURM is available for versions 1.2 onwards. I will try
to install SLURM 1.2 and see if we can get it to work. Thanks again.

Gauri.
---------


On Thu, Apr 2, 2009 at 11:37 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

> Gauri,
>
> This might be a repetition of things you already know, but here's a
> brief description of the relationship between PMI and the process manager
> (each process manager has its own mpiexec).
>
> MPI applications talk to their process manager both to get launched and to
> get information such as their rank, the size of the job, etc. The protocol
> used for this communication is called PMI. Unfortunately, everyone went out
> and implemented their own PMI library, e.g., (a) simple PMI (MPICH2's
> default PMI library), (b) smpd PMI (for linux/windows compatibility; will be
> deprecated soon) and (c) slurm PMI (implemented by the slurm guys).
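>
> (A quick way to sanity-check which PMI library a dynamically linked binary
> picked up, just as a suggestion and not something MPICH2 requires, is:
>
>     ldd ./helloworld.mympi | grep -i pmi
>
> SLURM's PMI usually shows up as libpmi.so from the slurm install, whereas
> MPICH2's simple PMI is typically compiled into the MPICH2 libraries and may
> not appear as a separate shared object.)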
>
> MPD, Gforker, Remshell, Hydra, OSC mpiexec, OSU mpirun and probably many
> other process managers use the simple PMI wire protocol. So, as long as the
> MPI application is linked with the simple PMI library, you can use any of
> these process managers interchangeably. The simple PMI library is what you
> link against by default when you build MPICH2 with the default options.
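>
> For illustration, a default (simple PMI + MPD) build and launch would look
> roughly like the following; the install prefix is just a placeholder:
>
>     ./configure --prefix=/opt/mpich2-default
>     make && make install
>     mpd &                               # start the MPD process manager
>     mpiexec -n 4 ./helloworld.mympi     # processes get rank/size over simple PMI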
>
> srun uses slurm PMI. When you configure MPICH2 using --with-pmi=slurm, it
> should link with the slurm PMI library (I didn't go and check the code, but
> I think this is correct). Only srun is compatible with this slurm PMI
> library, so only that can be used. Note that the slurm folks came out with
> their own "mpiexec" executable, which essentially wraps around srun, so that
> uses the slurm PMI as well and should work. But I've never used it.
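>
> Concretely, a slurm-PMI build along those lines would look roughly like this
> (the slurm path is a placeholder):
>
>     ./configure --with-pmi=slurm --with-pm=no --with-slurm=/path/to/slurm
>     make && make install
>     srun -n 8 ./helloworld.mympi        # srun supplies rank/size via slurm PMI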
>
> So, in some sense, mpiexec or srun is just a user interface for you to talk
> in the appropriate PMI wire protocol. If you have a mismatch, the MPI
> processes will not be able to detect their rank, the job size, etc., so every
> process thinks it is rank 0 (this is what you are seeing).
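>
> With a matching PMI library, a 2-process run prints distinct ranks, e.g.:
>
>     Hello world! I'm 0 of 2 on n53
>     Hello world! I'm 1 of 2 on n53
>
> With a mismatch, every process reports rank 0 (the reported size may vary),
> so a hypothetical run would print something like:
>
>     Hello world! I'm 0 of 1 on n53
>     Hello world! I'm 0 of 1 on n53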
>
> What is your exact requirement? If your requirement is to use the "mpiexec"
> interface, you can either use MPD (which will require MPICH2's default
> configuration), or try to download slurm's mpiexec (which will require
> MPICH2 to be configured using --with-slurm=<foo>).
>
> On the other hand, if your requirement is to use slurm but you don't care
> about the actual user interface, you can configure MPICH2 using
> --with-slurm=<foo> and just use srun.
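>
> In other words, purely as a sketch with placeholder paths and host counts,
> the two setups would be driven like this:
>
>     # Option 1: default MPICH2 build, MPD process manager
>     mpdboot -n <num_hosts>              # or just "mpd &" on a single host
>     mpiexec -n 8 ./helloworld.mympi
>
>     # Option 2: slurm-PMI build, launched under the scheduler
>     bsub -n 8 srun ./helloworld.mympi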
>
> As a final note, we are working on defining PMI-2, which will hopefully
> unify all this into a single library, and these issues will go away.
>
> Hope that helps.
>
>  -- Pavan
>
> Gauri Kulkarni wrote:
>
>> I cannot, because I get the SAME errors. Below is the LSF script I used to
>> launch the job.
>>
>> #!/bin/bash
>> #BSUB -L /bin/bash
>> #BSUB -n 8
>> #BSUB -N
>> #BSUB -o /data1/visitor/cgaurik/testmpi/helloworld.mympi.mpiexec.%J.out
>>
>> cd /data1/visitor/cgaurik/testmpi
>> /data1/visitor/cgaurik/mympi/bin/mpiexec -np 8 ./helloworld.mympi
>>
>> The job is NOT parallelized, i.e., every process is rank 0, and the errors
>> are the same. Of course, if I change the last line of the script to srun
>> ./helloworld.mympi (as, I think, Dave pointed out), everything is all rosy;
>> that working variant is shown below for reference. My question is (maybe
>> it's obvious): if my MPICH2 is configured with the options
>> "--with-pmi=slurm --with-pm=no --with-slurm=/path/to/slurm/lib", can I
>> still use mpiexec?
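>>
>> (For reference, the working variant is the same script with only the last
>> line changed, everything else identical:)
>>
>> #!/bin/bash
>> #BSUB -L /bin/bash
>> #BSUB -n 8
>> #BSUB -N
>> #BSUB -o /data1/visitor/cgaurik/testmpi/helloworld.mympi.mpiexec.%J.out
>>
>> cd /data1/visitor/cgaurik/testmpi
>> srun ./helloworld.mympi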
>>
>> Gauri.
>> ---------
>>
>>
>> On Wed, Apr 1, 2009 at 11:42 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>
>>    You need to use the mpicc and mpiexec from the MPICH2 installation
>>    that was built to use MPD.
>>        Rajeev
>>
>>
>>  ------------------------------------------------------------------------
>>        *From:* mpich-discuss-bounces at mcs.anl.gov
>>        [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of *Gauri Kulkarni
>>        *Sent:* Wednesday, April 01, 2009 8:56 AM
>>        *To:* mpich-discuss at mcs.anl.gov
>>
>>        *Subject:* [mpich-discuss] What do these errors mean?
>>
>>        Hi,
>>
>>        I am using MPICH2-1.0.7 (I cannot go to 1.0.8 right now), which
>>        is configured to be used with SLURM. That is, the process
>>        manager is SLURM and NOT mpd. When I submit my job through bsub
>>        (bsub [options] srun ./helloworld.mympi), it works perfectly. I
>>        cannot use mpiexec since it is not the one spawning the jobs; I
>>        must use srun. My question is: can I still use mpiexec from the
>>        command line? Well... I tried. Here is the output:
>>
>>        mpiexec -n 2 ./helloworld.mympi
>>        mpiexec_n53: cannot connect to local mpd
>>        (/tmp/mpd2.console_cgaurik); possible causes:
>>          1. no mpd is running on this host
>>          2. an mpd is running but was started without a "console" (-n
>>        option)
>>        In case 1, you can start an mpd on this host with:
>>            mpd &
>>        and you will be able to run jobs just on this host.
>>        For more details on starting mpds on a set of hosts, see
>>        the MPICH2 Installation Guide.
>>
>>        Then:
>>
>>        mpd &
>>        mpiexec -n 2 ./helloworld.mympi
>>
>>        *Hello world! I'm 0 of 2 on n53*
>>        Fatal error in MPI_Finalize: Other MPI error, error stack:
>>        MPI_Finalize(255)...................: MPI_Finalize failed
>>        MPI_Finalize(154)...................:
>>        MPID_Finalize(94)...................:
>>        MPI_Barrier(406)....................:
>>        MPI_Barrier(comm=0x44000002) failed
>>        MPIR_Barrier(77)....................:
>>        MPIC_Sendrecv(120)..................:
>>        MPID_Isend(103).....................: failure occurred while
>>        attempting to send an eager message
>>        MPIDI_CH3_iSend(172)................:
>>        MPIDI_CH3I_VC_post_sockconnect(1090):
>>        MPIDI_PG_SetConnInfo(615)...........: PMI_KVS_Get failedFatal
>>        error in MPI_Finalize: Other MPI error, error stack:
>>        MPI_Finalize(255)...................: MPI_Finalize failed
>>        MPI_Finalize(154)...................:
>>        MPID_Finalize(94)...................:
>>        MPI_Barrier(406)....................:
>>        MPI_Barrier(comm=0x44000002) failed
>>        MPIR_Barrier(77)....................:
>>        MPIC_Sendrecv(120)..................:
>>        MPID_Isend(103).....................: failure occurred while
>>        attempting to send an eager message
>>        MP*Hello world! I'm 1 of 2 on n53*
>>        IDI_CH3_iSend(172)................:
>>        MPIDI_CH3I_VC_post_sockconnect(1090):
>>        MPIDI_PG_SetConnInfo(615)...........: PMI_KVS_Get failed
>>
>>        The bold text shows that the job does get executed, but there is
>>        a lot of other garbage. It seems to me that I can either configure
>>        MPICH2 to be used with the cluster job scheduler or to be used
>>        from the command line. I cannot have both.
>>
>>        Am I right?
>>
>>        -Gauri.
>>        ----------
>>
>>
>>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>

