[MPICH] mpich and pbs

Steve Young chemadm at hamilton.edu
Wed Jun 20 14:16:07 CDT 2007


OK, I ran the cpi test (launched through the OSC mpiexec under PBS) and got
the following results:

Process 0 of 1 is on node0038
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000107
Process 0 of 1 is on node0038
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000101
Process 0 of 1 is on node0038
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000118
Process 0 of 1 is on node0038
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000107
Process 0 of 1 is on node0037
Process 0 of 1 is on node0037
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000108
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000126
Process 0 of 1 is on node0037
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000102
Process 0 of 1 is on node0037
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000101
mpiexec: Warning: tasks 0-7 exited before completing MPI startup.
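
For reference, that run was launched from inside the PBS job with the OSC
mpiexec, along these lines (a sketch from memory, not our exact script):

    #PBS -l nodes=2:ppn=4
    cd $PBS_O_WORKDIR
    # the OSC mpiexec starts tasks through the PBS TM interface,
    # so no machinefile is passed here
    mpiexec -n 8 ./cpi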


Now, when I start up an mpd ring on the same two nodes and run the same
binary with mpirun, I get:

[clutest@herc0037 ~/cpi_test]% mpirun -np 8 ./cpi
Process 0 of 8 is on node0037
Process 1 of 8 is on node0038
Process 2 of 8 is on node0038
Process 3 of 8 is on node0038
Process 5 of 8 is on node0037
Process 4 of 8 is on node0038
Process 6 of 8 is on node0038
Process 7 of 8 is on node0037
pi is approximately 3.1415926544231247, Error is 0.0000000008333316
wall clock time = 0.010308
[clutest@herc0037 ~/cpi_test]%
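
The ring itself I had brought up by hand beforehand, roughly like this
(going from the mpd man pages here, not our exact command history):

    # mpd.hosts lists node0037 and node0038, one per line
    mpdboot -n 2 -f mpd.hosts
    mpdtrace    # verify both nodes joined the ring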


I'd expect to get the same results using mpiexec from the mpich2 distro
too. Looking back at the first run, every process reports itself as
"0 of 1", so under the OSC mpiexec each task seems to come up as its own
singleton MPI job instead of joining one 8-process job, which would
explain the serial-looking output. The remaining issue is that mpirun and
mpiexec from the mpich2 distro won't keep the job on the nodes that PBS
allocates.
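
The only workaround I can think of is to build the ring from
$PBS_NODEFILE inside the job script itself, something like this (an
untested sketch; flags are from the mpdboot man page):

    #!/bin/sh
    #PBS -l nodes=2:ppn=4
    cd $PBS_O_WORKDIR
    # mpd wants one daemon per host, so collapse the nodefile
    # (one line per CPU) down to unique hostnames
    sort -u $PBS_NODEFILE > mpd.hosts
    mpdboot -n $(wc -l < mpd.hosts) -f mpd.hosts
    mpiexec -n 8 ./cpi
    mpdallexit

But then the processes are started by mpd rather than by PBS, so PBS
can't track or clean them up, which is what I was hoping the OSC mpiexec
would take care of.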

Any ideas?

-Steve


On Wed, 2007-06-20 at 12:47 -0500, Rajeev Thakur wrote:
> > However, now it appears that the program being run is serial.
> > For example, an 8-CPU job gets started on two nodes (each node
> > has 4 CPUs: two dual-core Opterons). We see all 8 processes running on
> > the nodes. But looking at the output, it appears to be a
> > serial job. I get the same results trying to use VASP and Amber.
> 
> What do you mean by "it appears like a serial job"? Do you mean
> performance-wise?
> 
> Try running the cpi example from the examples directory on 8 processes. If
> you see 4 hostnames from 1 machine and 4 from the other, the job should be
> running ok. It's up to the OS to schedule the 4 processes on each machine.
> MPI doesn't do that.
> 
> Rajeev
> 
>    
> 
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov 
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Steve Young
> > Sent: Wednesday, June 20, 2007 10:58 AM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [MPICH] mpich and pbs
> > 
> > Hello everyone,
> > 	I still seem to be having an issue with getting mpich to work
> > properly. I have mpich2-1.0.5 compiled, and it works as expected when
> > I use mpiexec or mpirun. However, the nodes that jobs run on aren't in
> > sync with the nodes that PBS allocates to the job. In an earlier post
> > to the list I was advised to use the mpiexec from OSC, which works
> > with PBS. I installed that, and jobs now start on the proper nodes
> > that PBS allocates. However, now it appears that the program being run
> > is serial. For example, an 8-CPU job gets started on two nodes (each
> > node has 4 CPUs: two dual-core Opterons). We see all 8 processes
> > running on the nodes, but looking at the output it appears to be a
> > serial job. I get the same results trying to use VASP and Amber. So
> > I'm not sure what I could do to correct this. Any ideas?
> > 
> > -Steve
> > 
> > 
> > 
> 



