[mpich-discuss] Unexpected CPU usage shown by Linux top command

Dave Goodell goodell at mcs.anl.gov
Wed Nov 10 14:20:13 CST 2010


On Nov 10, 2010, at 1:38 PM CST, Jeyapandian Kottalam wrote:

> I see a difference in behavior between MPICH2 versions 1.0.8p1 and 1.2.1p1 (also 1.3) on Linux. Our program alternates between a significant chunk of serial computing and a chunk of parallel computing. During the serial phase, the rank > 0 processes wait in an MPI_Bcast until rank 0 reaches that broadcast.
> 
> With MPICH2 version 1.0.8p1, the Linux 'top' command shows only one process accumulating CPU time, but with 1.2.1p1 and 1.3, all processes do. In both cases I am building from exactly the same code base. I have verified with print statements that the rank > 0 processes are indeed waiting in MPI_Bcast.
> 
> Do you have any possible explanation or suggestion?

Starting with MPICH2 1.1, "nemesis" is the default communication channel. Nemesis uses only busy polling, so a process sitting in a communication call (such as MPI_Bcast or MPI_Wait) keeps spinning and consumes CPU cycles even when no communication is actually in progress. In 1.0.8 and earlier the default channel was "sock", which does not busy wait.
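Why a busy-polling wait shows up in 'top' while a blocking wait does not can be seen in a small toy model. This is only an illustration under stated assumptions, not how nemesis or sock are actually implemented: Python threads stand in for MPI processes, a threading.Event stands in for the broadcast, and the function name cpu_seconds_while_waiting is hypothetical.

```python
import threading
import time


def cpu_seconds_while_waiting(wait_s: float, busy_poll: bool) -> float:
    """Toy model (hypothetical helper, not MPICH code): return the process CPU
    time consumed while one thread waits wait_s seconds for a signal."""
    ready = threading.Event()  # stands in for "the broadcast has arrived"

    def waiter() -> None:
        if busy_poll:
            # Busy polling: keep checking the flag without ever sleeping,
            # so this thread burns a CPU core for the whole wait.
            while not ready.is_set():
                pass
        else:
            # Blocking wait: the OS suspends the thread until it is signalled,
            # so it accumulates (almost) no CPU time while waiting.
            ready.wait()

    t = threading.Thread(target=waiter)
    start = time.process_time()  # CPU time of all threads in this process
    t.start()
    time.sleep(wait_s)  # the "serial phase" on rank 0 (sleeping uses no CPU)
    ready.set()         # rank 0 finally "reaches the broadcast"
    t.join()
    return time.process_time() - start


if __name__ == "__main__":
    for mode in (True, False):
        used = cpu_seconds_while_waiting(0.5, busy_poll=mode)
        print(f"busy_poll={mode}: {used:.2f} s of CPU during a 0.5 s wait")
```

With busy polling the waiting thread accumulates roughly as much CPU time as the wait lasts, which is what 'top' reports for the rank > 0 processes; with the blocking wait it stays near zero, as with the sock channel.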

We intend to modify nemesis to support a blocking mode, but that is probably still six months to a year away. Progress is tracked in this ticket: https://trac.mcs.anl.gov/projects/mpich2/ticket/79

-Dave
