[mpich-discuss] mpich2 MPI_TEST errors

Samir Khanal skhanal at bgsu.edu
Sun Mar 15 21:24:22 CDT 2009


Hi Pavan,
I was running mpdboot from my submit machine (to check whether hostname, cpi, etc. get executed).
I tried what you suggested and the program executes, but as I said, with more than 1 processor it just gets stuck.
This was not the case with mpich1.2.7 on x86 or with mpich2 on x86.
The problem occurs only on this x86_64 system.
What am I missing?

One thing I do know is that this application is heavily threaded.
Does that ring a bell?
:-(
samir

________________________________________
From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Pavan Balaji [balaji at mcs.anl.gov]
Sent: Sunday, March 15, 2009 9:30 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] mpich2 MPI_TEST errors

Where are you doing the mpdboot from? If you are doing it from a node
other than one of the nodes in your mpd.hosts list, your local node will
be a part of the ring. The best way, IMO, is to start and stop the MPD
ring in your PBS script itself.
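
Starting the ring inside the PBS script needs a host file built from the nodes PBS assigned to the job. A minimal sketch of that conversion follows, assuming the PBS node file lists each host once per allocated processor slot (verify this on your installation). The function name pbs_nodes_to_mpd_hosts is made up for illustration and is not part of MPICH2 or PBS.

```python
def pbs_nodes_to_mpd_hosts(nodefile_text):
    """Collapse repeated PBS node lines into MPD-style 'host:count' lines.

    Assumes one line per allocated processor slot, so a host with four
    slots appears four times in the input.
    """
    counts = {}  # dicts keep insertion order, so hosts stay in first-seen order
    for line in nodefile_text.splitlines():
        host = line.strip()
        if not host:
            continue  # ignore blank lines
        counts[host] = counts.get(host, 0) + 1
    return [f"{host}:{n}" for host, n in counts.items()]


if __name__ == "__main__":
    sample = "compute-0-0\ncompute-0-0\ncompute-0-1\ncompute-0-1\n"
    print("\n".join(pbs_nodes_to_mpd_hosts(sample)))
```

In a job script, the result could be written to a file and passed to mpdboot with -f, with mpdallexit at the end of the script to shut the ring down again.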

Also, one more quick gotcha (in case you run into it): check the output
of "hostname". For example, if the output for that is
"compute-0-0.somedomain.edu", and your mpd.hosts file contains
"compute-0-0", then MPD will think it's two different nodes.
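
To make that check concrete, here is a rough sketch that compares a reported hostname against the entries of an mpd.hosts file. It assumes entries look like "host" or "host:ncpus", as in the file quoted below. The helper names hosts_in_file and check_hostname are hypothetical, not part of MPD.

```python
def hosts_in_file(mpd_hosts_text):
    """Return the host names from mpd.hosts text, without the ':ncpus' suffix."""
    names = []
    for line in mpd_hosts_text.splitlines():
        entry = line.strip()
        if entry:
            # Keep the first token and drop anything after the first colon.
            names.append(entry.split()[0].split(":", 1)[0])
    return names


def check_hostname(reported, mpd_hosts_text):
    """Classify how the reported hostname relates to the host file.

    'exact'           - the name appears verbatim in the file
    'short-name-only' - only the part before the first dot matches,
                        which is the mismatch described above
    'missing'         - no entry matches at all
    """
    names = hosts_in_file(mpd_hosts_text)
    if reported in names:
        return "exact"
    short = reported.split(".", 1)[0]
    if any(name.split(".", 1)[0] == short for name in names):
        return "short-name-only"
    return "missing"
```

On a compute node this could be called as check_hostname(socket.gethostname(), open("mpd.hosts").read()), after importing socket. A "short-name-only" result suggests making the file and the hostname output agree.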

  -- Pavan

Samir Khanal wrote:
> Hi Pavan
>
> My mpd.hosts file already contains
> the following entries:
>
> compute-0-0:4
> compute-0-1:4
> compute-0-2:4
> compute-0-3:4
> compute-0-4:4
> compute-0-5:4
>
> and I have already started mpd on all the nodes:
>
> mpdboot -n 7
>
> Do I need to specify this again in the PBS submit script?
>
> Again, I tried this, and
>
> mpiexec -n 1 ./Ring works
> but
> mpiexec -n 2 ./Ring does not work.
>
>
> MPICH2 Version:         1.0.8
> MPICH2 Release date:    Unknown, built on Fri Feb 20 12:36:01 EST 2009
> MPICH2 Device:          ch3:nemesis
> MPICH2 configure:       --prefix=/home/skhanal/mpich2 --with-device=ch3:nemesis
> MPICH2 CC:      gcc  -O2
> MPICH2 CXX:     c++  -O2
> MPICH2 F77:     gfortran  -O2
> MPICH2 F90:     f95  -O2
>
> Please help.
> Samir
> ________________________________________
> From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Pavan Balaji [balaji at mcs.anl.gov]
> Sent: Sunday, March 15, 2009 5:19 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpich2 MPI_TEST errors
>
>> I found the culprit function.
>> It was indeed a problem with the mpi_test call; I tracked it down, and the program works now.
>
> Great!
>
>> But now I am having a hard time getting the same program to run with mpich2 1.0.8/PBS on an x86_64 system.
>> It compiles and runs perfectly as a single process,
>> i.e., mpiexec -n 1 ./Ring
>> executes and generates output.
>>
>> But as soon as I do mpiexec -n 2 or more, it just waits and eventually the job is thrown out of the queue.
>
> Did you launch your mpd daemons correctly? See section 5.7.1 in the
> MPICH2 users' guide:
> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-userguide.pdf
>
> PBS uses a slightly different node name representation than MPICH2's
> MPD, but it should be trivial to convert between the two formats.
>
>> Does mpich2 have any special configuration for multi-core machines?
>> Any tips on job submission or compiling,
>> if I just used
>>
>> ./configure --with-device=ch3:nemesis
>
> It'll automatically detect multi-core systems and optimize inter-core
> communication.
>
>   -- Pavan
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji

