[MPICH] mpich and pbs

Steve Young chemadm at hamilton.edu
Wed Jun 20 13:29:32 CDT 2007


Well, in VASP, when the job runs in parallel you get output like the
following:

 vasp.4.6.28 25Jul05 complex
 executed on             LinuxIFC date 2007.06.20  10:20:59
 running on    8 nodes
 distr:  one band on    2 nodes,    4 groups

That is what we expect to see, as it shows that the job is using
both of the two nodes it was allocated.

Using the OSC mpiexec I get the following with the same job. I do see
that there are 8 processes running, but they seem to be 8 independent
serial processes.


vasp.4.6.28 25Jul05 complex
 executed on             LinuxIFC date 2007.06.20  12:32:48
 running on    1 nodes
 distr:  one band on    1 nodes,    1 groups
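
A rough way to automate the comparison above: the sketch below pulls the
count out of the "running on N nodes" header line and flags the case where
several processes were launched but the header reports only one. It assumes
the header format shown in the two outputs above; reported_ranks and
looks_serial are hypothetical helper names, not part of VASP or MPICH.

```python
import re

# Matches the header line shown above, e.g. " running on    8 nodes".
RUNNING_ON = re.compile(r"running on\s+(\d+)\s+nodes")

def reported_ranks(output_text):
    """Return the count printed in the header, or None if the line is absent."""
    match = RUNNING_ON.search(output_text)
    return int(match.group(1)) if match else None

def looks_serial(output_text, launched):
    """True when several processes were launched but the header reports one."""
    return reported_ranks(output_text) == 1 and launched > 1

# Header lines copied from the two runs quoted in this message.
GOOD = """ vasp.4.6.28 25Jul05 complex
 running on    8 nodes
 distr:  one band on    2 nodes,    4 groups
"""
BAD = """vasp.4.6.28 25Jul05 complex
 running on    1 nodes
 distr:  one band on    1 nodes,    1 groups
"""

print(looks_serial(GOOD, 8))  # False
print(looks_serial(BAD, 8))   # True
```

If looks_serial returns True, each process has probably initialized MPI as
its own one-process job instead of joining a single 8-process job, which is
what the second output suggests.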


For Amber, if I run a typical job it acts like it is running properly and
I see all the proper processes started. But I believe it is also running
8 serial Amber processes. However, when we try running a different part
of the Amber code it fails, complaining:

 Error: specified more groups ( 4 ) than the number of processors ( 1 ) !

This makes me believe that it, too, is running 8 serial processes. When I
use the mpiexec or mpirun from MPICH the job works fine. I only see this
when using the mpiexec from OSC.

I'll try running some of the examples and see what I can come up with. 
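
When running the cpi example, the thing to check is both where the ranks
land and what job size each rank reports. The sketch below tallies that from
captured output. It assumes cpi prints one line per rank of the form
"Process R of N is on HOST", which is believed to match the MPICH2 example
but should be checked against your copy. The hostnames nodeA and nodeB, the
sample outputs, and the function name summarize are made up for illustration.

```python
import re
from collections import Counter

# Assumed cpi line format: "Process 3 of 8 is on nodeA"
LINE = re.compile(r"Process (\d+) of (\d+) is on (\S+)")

def summarize(output_text):
    """Return (set of reported job sizes, Counter of ranks per host)."""
    sizes = set()
    hosts = Counter()
    for _rank, size, host in LINE.findall(output_text):
        sizes.add(int(size))
        hosts[host] += 1
    return sizes, hosts

# Hypothetical output from a healthy 8-process run on two 4-CPU machines.
healthy = "\n".join(
    f"Process {r} of 8 is on {'nodeA' if r < 4 else 'nodeB'}" for r in range(8)
)
# Hypothetical output when every process starts as its own 1-process job.
broken = "\n".join(
    f"Process 0 of 1 is on {'nodeA' if i < 4 else 'nodeB'}" for i in range(8)
)

print(summarize(healthy))
print(summarize(broken))
```

A size set of {1} alongside 8 lines of output would point to 8 independent
one-process jobs, even though the processes are on the right hosts.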

-Steve


On Wed, 2007-06-20 at 12:47 -0500, Rajeev Thakur wrote:
> > However, now it appears that the program being run is in serial. 
> > For example, an 8 CPU job gets started on two nodes (each node
> > has 4 CPUs - 2 dual-core Opterons). We see all 8 processes running on
> > the nodes. But in looking at the output it appears like a 
> > serial job. I get the same results trying to use vasp and amber. 
> 
> What do you mean by "it appears like a serial job"? Do you mean
> performance-wise?
> 
> Try running the cpi example from the examples directory on 8 processes. If
> you see 4 hostnames from 1 machine and 4 from the other, the job should be
> running ok. It's up to the OS to schedule the 4 processes on each machine.
> MPI doesn't do that.
> 
> Rajeev
> 
>    
> 
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov 
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Steve Young
> > Sent: Wednesday, June 20, 2007 10:58 AM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [MPICH] mpich and pbs
> > 
> > Hello everyone,
> > 	I still seem to be having an issue with getting mpich to work
> > properly. I have version mpich2-1.0.5 compiled. This works as expected
> > when I use mpiexec or mpirun. However, the nodes that jobs run on
> > aren't in sync with the nodes that PBS allocates to the job. In posting
> > to the list before I was advised to use the mpiexec from OSC that works
> > with PBS. I installed that and jobs now get started on the proper nodes
> > that PBS allocates. However, now it appears that the program being run
> > is running in serial. For example, an 8 CPU job gets started on two
> > nodes (each node has 4 CPUs - 2 dual-core Opterons). We see all 8
> > processes running on the nodes. But in looking at the output it appears
> > like a serial job. I get the same results trying to use vasp and amber.
> > So I'm not sure what I could do to correct this. Any ideas?
> > 
> > -Steve
> > 
> > 
> > 
> 



