[mpich-discuss] mpich2 MPI_TEST errors

Anthony Chan chan at mcs.anl.gov
Mon Mar 16 01:03:17 CDT 2009


> One thing I know is that this application is highly threaded.

Since mpich-1 has only thread_serial support, I am curious whether your
app is really using threads + MPI, or whether you simply mean that your
app can use a lot of MPI processes.
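
If you are not sure, one rough check is whether the binary links in a
threads library at all; this only hints at threading (it says nothing
about how MPI is called from those threads), but it is cheap to try:

    ldd ./Ring | grep -i pthread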

I am not sure whether Pavan has mentioned this already, but you could use
OSC's mpiexec with PBS to launch MPICH2 apps:
http://www.osc.edu/~pw/mpiexec/index.php
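
OSC's mpiexec talks to PBS directly, so no mpd ring is needed. If I
remember its interface correctly, the job script reduces to something
like this (the resource request and ./Ring are just the values from
this thread):

    #PBS -l nodes=2:ppn=4
    cd $PBS_O_WORKDIR
    # OSC mpiexec learns the allocated nodes from PBS itself.
    mpiexec -n 8 ./Ring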

A.Chan

----- "Samir Khanal" <skhanal at bgsu.edu> wrote:

> Hi Pavan
> I was doing the mpdboot from my submit machine (to check whether
> hostname, cpi, etc. get executed).
> I tried as you had suggested, and the program executes, but as I said,
> with more than one process it just gets stuck.
> This was not the case with MPICH 1.2.7 on x86 or MPICH2 on x86;
> the problem is only with this system on x86_64.
> What am I missing?
> 
> One thing I know is that this application is highly threaded.
> Does that ring a bell?
> :-(
> Samir
> 
> ________________________________________
> From: mpich-discuss-bounces at mcs.anl.gov
> [mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Pavan Balaji
> [balaji at mcs.anl.gov]
> Sent: Sunday, March 15, 2009 9:30 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpich2 MPI_TEST errors
> 
> Where are you doing the mpdboot from? If you are doing it from a node
> other than one of the nodes in your mpd.hosts list, your local node
> will be a part of the ring. The best way, IMO, is to start and stop
> the mpd ring in your PBS script itself.
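> 
> For example, a minimal sketch (untested; the resource request, the
> process count, and ./Ring are just the values from this thread):
> 
>     #PBS -l nodes=2:ppn=4
>     cd $PBS_O_WORKDIR
>     # PBS lists one line per allocated CPU slot; collapse that into
>     # MPD's "host:ncpus" format.
>     sort $PBS_NODEFILE | uniq -c | awk '{print $2 ":" $1}' > mpd.hosts.$PBS_JOBID
>     NNODES=$(sort -u $PBS_NODEFILE | wc -l)
>     # mpdboot counts the local host, and PBS puts it in the node file
>     # too, so -n $NNODES boots exactly the allocated nodes.
>     mpdboot -n $NNODES -f mpd.hosts.$PBS_JOBID
>     mpiexec -n 8 ./Ring
>     mpdallexit
>     rm -f mpd.hosts.$PBS_JOBID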
> 
> Also, one more quick gotcha (in case you run into it): check the
> output of "hostname". For example, if the output of that is
> "compute-0-0.somedomain.edu" and your mpd.hosts file contains
> "compute-0-0", then MPD will think they are two different nodes.
> 
>   -- Pavan
> 
> Samir Khanal wrote:
> > Hi Pavan
> >
> > My mpd.hosts file already contains
> > the following entries:
> >
> > compute-0-0:4
> > compute-0-1:4
> > compute-0-2:4
> > compute-0-3:4
> > compute-0-4:4
> > compute-0-5:4
> >
> > and I have already started mpd on all the nodes with
> >
> > mpdboot -n 7
> >
> > Do I need to specify this again in the PBS submit script?
> >
> > Again, I tried this:
> >
> > mpiexec -n 1 ./Ring works,
> > but
> > mpiexec -n 2 ./Ring does not work.
> >
> >
> > MPICH2 Version:         1.0.8
> > MPICH2 Release date:    Unknown, built on Fri Feb 20 12:36:01 EST 2009
> > MPICH2 Device:          ch3:nemesis
> > MPICH2 configure:       --prefix=/home/skhanal/mpich2 --with-device=ch3:nemesis
> > MPICH2 CC:      gcc  -O2
> > MPICH2 CXX:     c++  -O2
> > MPICH2 F77:     gfortran  -O2
> > MPICH2 F90:     f95  -O2
> >
> > Please help.
> > Samir
> > ________________________________________
> > From: mpich-discuss-bounces at mcs.anl.gov
> > [mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Pavan Balaji
> > [balaji at mcs.anl.gov]
> > Sent: Sunday, March 15, 2009 5:19 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: Re: [mpich-discuss] mpich2 MPI_TEST errors
> >
> >> I found the culprit function. It was indeed a problem with the
> >> MPI_Test call; I tracked it down, and the program works now.
> >
> > Great!
> >
> >> But now I am having a hard time getting the same program to run
> >> under MPICH2 1.0.8/PBS on an x86_64 system.
> >> It compiles and runs perfectly as a single process, i.e.,
> >> mpiexec -n 1 ./Ring
> >> executes and generates output.
> >>
> >> But as soon as I do mpiexec -n 2 or more, it just waits, and
> >> eventually the job is thrown out of the queue.
> >
> > Did you launch your mpd daemons correctly? See section 5.7.1 in the
> > MPICH2 users' guide:
> >
> > http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-userguide.pdf
> >
> > PBS uses a slightly different node-name representation than MPICH2's
> > MPD, but it should be trivial to convert between the two formats.
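> >
> > For example, a rough sketch of the conversion ($PBS_NODEFILE repeats
> > each host once per allocated CPU slot, while MPD wants each host
> > once, optionally as "host:ncpus"):
> >
> >     sort $PBS_NODEFILE | uniq -c | awk '{print $2 ":" $1}' > mpd.hosts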
> >
> >> Does MPICH2 have any special configuration for multi-core
> >> machines?
> >> Any tips on job submission or compiling? I just used
> >>
> >> ./configure --with-device=ch3:nemesis
> >
> > It will automatically detect multi-core systems and optimize
> > inter-core communication.
> >
> >   -- Pavan
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji

