[mpich-discuss] problem with mpiexec while running parallel execution

Jeff Hammond jhammond at alcf.anl.gov
Mon Feb 6 08:48:22 CST 2012


You still haven't told me which version of MPICH2 you are using, nor
have you addressed the underlying issue: MPD, which is no longer
supported.
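
A quick way to find out, assuming the mpibull2 wrappers follow stock
MPICH2 conventions (the install path is the one from your job script,
and the exact flags are a guess, since MPD-era launchers predate -info):

  # one of these usually reports the version on an MPICH2-derived install
  /opt/mpi/mpibull2-1.3.9-18.s/bin/mpiexec -version
  /opt/mpi/mpibull2-1.3.9-18.s/bin/mpich2version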

Your colleague appears to have given you a script that switches to
dynamic libraries, but this solves nothing and only makes your problem
harder to debug.
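
For what it's worth, the "undefined symbol: mpi_init_" failure you
quote below usually means pw.x was linked expecting an MPI library that
the dynamic loader is not finding (or is finding the wrong one) at run
time. Two quick checks, using the binary path from your own error
message:

  # which MPI shared library, if any, the binary resolves at run time
  ldd /home_cluster/fis718/eliemouj/espresso-4.3.2/bin/pw.x | grep -i mpi
  # confirm mpi_init_ appears as undefined ("U") in the dynamic symbols
  nm -D /home_cluster/fis718/eliemouj/espresso-4.3.2/bin/pw.x | grep -i mpi_init

If ldd reports "not found", or shows a library from a different MPI
installation than the one pw.x was built with, that mismatch is the
thing to fix, not mpiexec.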

Please refer to my original email and provide the version of MPICH2
you are using.  Second, ask your sysadmins to install a more recent
version of MPICH2/MVAPICH2 that supports Hydra.
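
For comparison, once a Hydra-based MPICH2 is installed there are no mpd
daemons to start or connect to; the launch line in your batch script
would look something like this (a sketch only: the process count is
illustrative, and Hydra can normally pick up the SLURM allocation on
its own):

  # no mpdboot required; -n sets the number of MPI processes
  mpiexec -n 8 /home_cluster/fis718/eliemouj/espresso-4.3.2/bin/pw.x \
      <GB72ph.scf.in >GB72ph.scf.out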

Jeff

On Mon, Feb 6, 2012 at 8:33 AM, Elie M <elie.moujaes at hotmail.co.uk> wrote:
> Thanks very much for the help. Actually, I could not run mpiexec -info
> (there was no -info option); however, a colleague gave me another (rather
> more complicated) script to do the parallel run. Now I get a different
> error:
>
> "/var/log/slurm/slurmd/job785314/slurm_script: line 95: cd:
> /storage/fis718/ananias/GB72scfPH.785314: Input/output error
>
> /home_cluster/fis718/eliemouj/espresso-4.3.2/bin/./pw.x: symbol lookup
> error: /home_cluster/fis718/eliemouj/espresso-4.3.2/bin/./pw.x: undefined
> symbol: mpi_init_
> /home_cluster/fis718/eliemouj/espresso-4.3.2/bin/./pw.x: symbol lookup
> error: /home_cluster/fis718/eliemouj/espresso-4.3.2/bin/./pw.x: undefined
> symbol: mpi_init_"
>
>
> I have googled that error but could not make much sense of the possible
> solutions. I am sorry to bother you with this, but I am a Linux newbie
> and these problems are complicated for me at this stage. Could you please
> help me by posting a detailed solution of what can be done, if possible?
>
>
> Regards
>
>
> Elie
>
>
>> From: jhammond at alcf.anl.gov
>> Date: Sun, 5 Feb 2012 16:37:58 -0600
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] problem with mpiexec while running parallel
>> execution
>
>>
>> You're using mpibull2-1.3.9-18.s, which is not readily identified as a
>> version of MPICH2 (maybe the developers know it), although it seems to
>> be a derivative thereof. Can you run
>> "/opt/mpi/mpibull2-1.3.9-18.s/bin/mpiexec -info" to generate detailed
>> version information on your MPICH2 installation?
>>
>> Regardless of the version of MPICH2 you are using, your problem has to
>> do with MPD, which is no longer supported. You can refer to
>>
>> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_I_don.27t_like_.3CWHATEVER.3E_about_mpd.2C_or_I.27m_having_a_problem_with_mpdboot.2C_can_you_fix_it.3F
>> for more information.
>>
>> Hydra is the replacement for MPD and it is an excellent process
>> manager. The system administrators at your site should install a more
>> recent version of MPICH2 that will have Hydra as the default process
>> manager. If your machine has Infiniband, recent versions of MVAPICH2
>> (which is derived from MPICH2) will also have Hydra support.
>>
>> Best,
>>
>> Jeff
>>
>>
>> On Sun, Feb 5, 2012 at 4:03 PM, Elie M <elie.moujaes at hotmail.co.uk> wrote:
>> > Dear sir/madam,
>> >
>> >
>> >
>> > I am running a parallel execution (pw.x) on a SLURM-managed Linux
>> > cluster, and once I run the command sbatch filename.srm, the
>> > calculation starts running and then stops with the following error:
>> >
>> >
>> >
>> > "mpiexec_veredas5: cannot connect to local mpd
>> >
>> >  (/tmp/mpd2.console_sushil); possible causes:
>> >
>> >   1. no mpd is running on this host
>> >
>> >  2. an mpd is running but was started without a "console" (-n option)
>> >
>> >  In case 1, you can start an mpd on this host with:
>> >
>> >     mpd &
>> >
>> >  and you will be able to run jobs just on this host.
>> >
>> >  For more details on starting mpds on a set of hosts, see
>> >
>> >  the MPICH2 Installation Guide."
>> >
>> >
>> > The script is executed using the Quantum ESPRESSO (QE) package. You
>> > will find below the script I am using to run QE. The architecture is
>> > an Intel-based cluster.
>> >
>> >
>> > " #!/bin/bash
>> >
>> > #SBATCH -o
>> > /home_cluster/fis718/eliemouj/espresso-4.3.2/GB72/GB72-script.scf.out
>> > #SBATCH -N 1
>> > #SBATCH --nodelist=veredas13
>> > #SBATCH -J scf-GB72-ph
>> > #SBATCH --account=fis718
>> > #SBATCH --partition=long
>> > #SBATCH --get-user-env
>> > #SBATCH -e GB72ph.scf.fit.err
>> >
>> > /opt/mpi/mpibull2-1.3.9-18.s/bin/mpiexec
>> > /home_cluster/fis718eliemouj/espresso-4.3-2/bin/pw.x <GB72ph.scf.in
>> >>GB72ph.scf.out
>> >
>> > "
>> >
>> > Can anyone please tell me what might be going wrong and how to fix
>> > this? I am not very experienced with Linux; I would appreciate a
>> > rather detailed solution to the problem, or a pointer to where I can
>> > find one. Hope to hear from you soon.
>> >
>> >
>> > Regards
>> >
>> >
>> > Elie Moujaes
>> >
>> >
>> >
>> >



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/old/index.php/User:Jhammond

