[mpich-discuss] What do these errors mean?
Gauri Kulkarni
gaurivk at gmail.com
Thu Apr 2 05:27:56 CDT 2009
No, I didn't know all this! But yes, it does make sense now. Thanks so much.
Our cluster has SLURM version 1.0.15 installed. From what I can gather
from this page (https://computing.llnl.gov/linux/slurm/news.html), the
mpiexec wrapper for SLURM is available from version 1.2 onwards. I will try
to install SLURM 1.2 and see if we can get it to work. Thanks again.
Gauri.
---------
On Thu, Apr 2, 2009 at 11:37 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> Gauri,
>
> This might be a repetition of things you already know, but here's a
> brief description of the relationship between PMI and the process manager
> (each process manager has its own mpiexec).
>
> MPI applications talk to process managers to get launched, as well as to get
> information such as their rank, the size of the job, etc. The protocol used
> for this communication is called PMI. Unfortunately, everyone went out
> and implemented their own PMI library, e.g., (a) simple PMI (MPICH2's
> default PMI library), (b) smpd PMI (for Linux/Windows compatibility; will be
> deprecated soon) and (c) slurm PMI (implemented by the slurm folks).
>
> MPD, Gforker, Remshell, Hydra, OSC mpiexec, OSU mpirun, and probably many
> other process managers use the simple PMI wire protocol. So, as long as the
> MPI application is linked with the simple PMI library, you can use any of
> these process managers interchangeably. The simple PMI library is what you
> are linked to by default when you build MPICH2 using the default options.
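>
> As a minimal sketch (the install prefix here is just a placeholder), a
> default build looks roughly like this and links applications against the
> simple PMI library:
>
> # default MPICH2 build: simple PMI, with MPD as the default process manager
> ./configure --prefix=/opt/mpich2
> make && make install
> # any of the simple-PMI process managers can then launch the job, e.g.:
> mpd &
> mpiexec -n 2 ./helloworld.mympi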
>
> srun uses slurm PMI. When you configure MPICH2 using --with-pmi=slurm, it
> should link with the slurm PMI library (I didn't go and check the code, but
> I think this is correct). Only srun is compatible with this slurm PMI
> library, so only that can be used. Note that the slurm folks came out with
> their own "mpiexec" executable, which essentially wraps around srun, so that
> uses the slurm PMI as well and should work. But I've never used it.
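>
> As a rough sketch (the slurm library path is a placeholder), the slurm-PMI
> configuration discussed in this thread and a matching launch would be:
>
> # MPICH2 linked against slurm's PMI library, with no built-in process manager
> ./configure --with-pmi=slurm --with-pm=no --with-slurm=/path/to/slurm/lib
> make && make install
> # only srun (or slurm's mpiexec wrapper) speaks this PMI flavor
> srun -n 2 ./helloworld.mympi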
>
> So, in some sense, mpiexec or srun is just a user interface for you to talk
> in the appropriate PMI wire protocol. If you have a mismatch, the MPI
> processes will not be able to detect their rank, the job size, etc., so all
> processes think they are rank 0 (this is what you are seeing).
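>
> As a hedged aside (this assumes the PMI library was linked dynamically; a
> statically linked one will not show up), one way to sanity-check which PMI
> library a binary picked up is:
>
> # list the binary's shared libraries and look for slurm's libpmi (or its absence)
> ldd ./helloworld.mympi | grep -i pmi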
>
> What is your exact requirement? If your requirement is to use the "mpiexec"
> interface, you can either use MPD (which will require MPICH2's default
> configuration), or try to download slurm's mpiexec (which will require
> MPICH2 to be configured using --with-slurm=<foo>).
>
> On the other hand, if your requirement is to use slurm but you don't care
> about the actual user interface, you can configure MPICH2 using
> --with-slurm=<foo> and just use srun.
>
> As a final note, we are working on defining PMI-2, which will hopefully
> unify all of this into a single library so that these issues go away.
>
> Hope that helps.
>
> -- Pavan
>
> Gauri Kulkarni wrote:
>
>> I cannot, because I get the SAME errors. Following is the LSF script I
>> used to launch the job.
>>
>> #!/bin/bash
>> #BSUB -L /bin/bash
>> #BSUB -n 8
>> #BSUB -N
>> #BSUB -o /data1/visitor/cgaurik/testmpi/helloworld.mympi.mpiexec.%J.out
>>
>> cd /data1/visitor/cgaurik/testmpi
>> /data1/visitor/cgaurik/mympi/bin/mpiexec -np 8 ./helloworld.mympi
>>
>> The job is NOT parallelized, i.e., every process is rank 0, and the errors
>> are the same. Of course, if I change the last line of the script to srun
>> ./helloworld.mympi (as I think Dave pointed out), everything is all rosy;
>> that working variant is shown below. My question is (maybe it's obvious):
>> if my MPICH2 is configured with the options "--with-pmi=slurm --with-pm=no
>> --with-slurm=/path/to/slurm/lib", can I still use mpiexec?
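>>
>> For reference, here is the srun variant of the same script, which does run
>> in parallel for me (only the launch line differs):
>>
>> #!/bin/bash
>> #BSUB -L /bin/bash
>> #BSUB -n 8
>> #BSUB -N
>> #BSUB -o /data1/visitor/cgaurik/testmpi/helloworld.mympi.mpiexec.%J.out
>>
>> cd /data1/visitor/cgaurik/testmpi
>> # srun speaks the slurm PMI this MPICH2 build was configured with
>> srun ./helloworld.mympi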
>>
>> Gauri.
>> ---------
>>
>>
>> On Wed, Apr 1, 2009 at 11:42 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>
>> You need to use the mpicc and mpiexec from the MPICH2 installation
>> that was built to use MPD.
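>>
>> For example, roughly (the installation path and source file name below are
>> only placeholders):
>>
>> # compile and launch with the tools from the MPD-based MPICH2 installation
>> /path/to/mpd-mpich2/bin/mpicc -o helloworld.mympi helloworld.c
>> /path/to/mpd-mpich2/bin/mpiexec -n 2 ./helloworld.mympi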
>> Rajeev
>>
>>
>> ------------------------------------------------------------------------
>> *From:* mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* Gauri Kulkarni
>> *Sent:* Wednesday, April 01, 2009 8:56 AM
>> *To:* mpich-discuss at mcs.anl.gov
>>
>> *Subject:* [mpich-discuss] What do these errors mean?
>>
>> Hi,
>>
>> I am using MPICH2-1.0.7 (I cannot go to 1.0.8 right now), which
>> is configured to be used with SLURM. That is, the process
>> manager is SLURM and NOT MPD. When I submit my job through bsub
>> (bsub [options] srun ./helloworld.mympi), it works perfectly. I
>> cannot use mpiexec since it is not the one spawning the jobs; I
>> must use srun. My question is: can I still use mpiexec from the
>> command line? Well, I tried. Here is the output:
>>
>> mpiexec -n 2 ./helloworld.mympi
>> mpiexec_n53: cannot connect to local mpd
>> (/tmp/mpd2.console_cgaurik); possible causes:
>> 1. no mpd is running on this host
>> 2. an mpd is running but was started without a "console" (-n
>> option)
>> In case 1, you can start an mpd on this host with:
>> mpd &
>> and you will be able to run jobs just on this host.
>> For more details on starting mpds on a set of hosts, see
>> the MPICH2 Installation Guide.
>>
>> Then:
>>
>> mpd &
>> mpiexec -n 2 ./helloworld.mympi
>>
>> *Hello world! I'm 0 of 2 on n53*
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(255)...................: MPI_Finalize failed
>> MPI_Finalize(154)...................:
>> MPID_Finalize(94)...................:
>> MPI_Barrier(406)....................:
>> MPI_Barrier(comm=0x44000002) failed
>> MPIR_Barrier(77)....................:
>> MPIC_Sendrecv(120)..................:
>> MPID_Isend(103).....................: failure occurred while
>> attempting to send an eager message
>> MPIDI_CH3_iSend(172)................:
>> MPIDI_CH3I_VC_post_sockconnect(1090):
>> MPIDI_PG_SetConnInfo(615)...........: PMI_KVS_Get failedFatal
>> error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(255)...................: MPI_Finalize failed
>> MPI_Finalize(154)...................:
>> MPID_Finalize(94)...................:
>> MPI_Barrier(406)....................:
>> MPI_Barrier(comm=0x44000002) failed
>> MPIR_Barrier(77)....................:
>> MPIC_Sendrecv(120)..................:
>> MPID_Isend(103).....................: failure occurred while
>> attempting to send an eager message
>> MP*Hello world! I'm 1 of 2 on n53*
>> IDI_CH3_iSend(172)................:
>> MPIDI_CH3I_VC_post_sockconnect(1090):
>> MPIDI_PG_SetConnInfo(615)...........: PMI_KVS_Get failed
>>
>> The bold text shows that the job gets executed, but there is a
>> lot of other garbage. It seems to me that I can either configure
>> MPICH2 to be used with the cluster job scheduler or to be used
>> from the command line; I cannot have both.
>>
>> Am I right?
>>
>> -Gauri.
>> ----------
>>
>>
>>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>