[MPICH] An idle communication process uses the same CPU as a computation process on multi-core chips

Yusong Wang ywang25 at aps.anl.gov
Mon Sep 17 14:22:45 CDT 2007


I verified the expected result on the quad-core machine, but this is not the case on either the dual-core or the eight-core machines, which are the ones we are particularly interested in, since dual-core laptops and 8-core personal clusters are becoming popular. The master consumes real CPU time no matter how I set the CPU affinity on those machines.
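
For what it's worth, one way to double-check that an affinity setting really
took effect is to have each process print its allowed-CPU mask and the core it
is currently running on, e.g. with sched_getaffinity(2) and sched_getcpu(3) on
Linux/glibc. A minimal sketch, not one of the test programs from this thread:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    int i;

    if (sched_getaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
        perror("sched_getaffinity");
        return 1;
    }
    printf("allowed CPUs:");
    for (i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &set))
            printf(" %d", i);
    printf("\ncurrently on CPU %d\n", sched_getcpu());
    return 0;
}

Run under "taskset -c 2", it should report only CPU 2.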

Yusong



----- Original Message -----
From: Darius Buntinas <buntinas at mcs.anl.gov>
Date: Monday, September 17, 2007 12:39 pm
Subject: Re: [MPICH] An idle communication process uses the same CPU as a computation process on multi-core chips

> 
> I can verify that I saw the same problem Yusong did when starting the
> master first on a dual quad-core machine.  But assigning each slave to
> its own core (using taskset) fixed that.
> 
> Interestingly, when there are fewer than 8 slaves, top shows that the
> master has 100% usage (when top is in "irix mode", and 12.5% (1/8) when
> not in irix mode).  When I have 8 slaves, the usage of the master
> process goes to 0.
> 
> Yusong, I'm betting that if you set the CPU affinity for the slaves,
> you'll see no impact of the master on the slaves.  Can you try that?
> 
> e.g.,:
>   ./master &
>   for i in `seq 0 3` ; do taskset -c $i ./slave & done
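> 
> If taskset isn't convenient, a slave can also pin itself before it starts
> spinning, using sched_setaffinity(2) (Linux-specific).  A rough sketch,
> with the core number taken from the command line:
> 
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main(int argc, char *argv[])
> {
>     int core = (argc > 1) ? atoi(argv[1]) : 0;
>     cpu_set_t set;
> 
>     CPU_ZERO(&set);
>     CPU_SET(core, &set);
>     if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
>         perror("sched_setaffinity");
>         return 1;
>     }
>     while (1)
>         ;  /* spin forever on the chosen core */
>     return 0;
> }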
> 
> -d
> 
> On 09/17/2007 02:31 AM, Sylvain Jeaugey wrote:
> > This seems to be the key of the problem.  When the master is launched
> > before the others, it takes one CPU and this won't change until, for
> > some scheduling reason, it comes to share its CPU (with a slave).  It
> > then falls to 0% and we're saved.
> > 
> > So, to conduct your experiment, you definitely need to taskset your
> > slaves.  Just launch them with
> > taskset -c <cpu> ./slave   (1 process per cpu)
> > or use the -p option of taskset to do it after launch, and ensure that
> > each slave _will_ take one CPU.  Thus, the master will be obliged to
> > share the CPU with the others and sched_yield() will be effective.
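> > 
> > For reference, the same thing can be done from a small program:
> > sched_setaffinity(2) called with a nonzero pid changes the affinity of
> > an already-running process, much like taskset -p.  A sketch (a
> > hypothetical helper, not part of taskset or MPICH):
> > 
> > #define _GNU_SOURCE
> > #include <sched.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <sys/types.h>
> > 
> > int main(int argc, char *argv[])
> > {
> >     cpu_set_t set;
> > 
> >     if (argc != 3) {
> >         fprintf(stderr, "usage: %s <cpu> <pid>\n", argv[0]);
> >         return 1;
> >     }
> >     CPU_ZERO(&set);
> >     CPU_SET(atoi(argv[1]), &set);
> >     /* nonzero pid: change that process's affinity, like taskset -p */
> >     if (sched_setaffinity((pid_t) atoi(argv[2]), sizeof(set), &set) != 0) {
> >         perror("sched_setaffinity");
> >         return 1;
> >     }
> >     return 0;
> > }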
> > 
> > Sylvain
> > 
> > On Sun, 16 Sep 2007, Yusong Wang wrote:
> > 
> >> I did the experiments on four types of multi-core chips (2 dual-core,
> >> 1 quad-core and 1 eight-core).  All of my tests show that the idle
> >> master process has a big impact on the other slave processes, except
> >> for the test on the quad-core, in which I found the order does matter:
> >> when the master was launched after the slave processes, there was no
> >> effect, while if the master started first, two slave processes would
> >> end up on the same core and slow down significantly compared to the
> >> others.
> >>
> >> Yusong
> >>
> >> ----- Original Message -----
> >> From: Darius Buntinas <buntinas at mcs.anl.gov>
> >> Date: Friday, September 14, 2007 12:55 pm
> >> Subject: Re: [MPICH] An idle communication process uses the same CPU
> >> as a computation process on multi-core chips
> >>
> >>>
> >>> It's possible that different versions of the kernel/OS/top compute
> >>> %cpu differently.  "CPU utilization" is really a nebulous term.  What
> >>> you really want to know is whether the master is stealing significant
> >>> cycles from the slaves.  A test of this would be to replace Sylvain's
> >>> slave code with this:
> >>>
> >>> /* Repeatedly time an empty busy loop.  Compile without optimization
> >>>  * (e.g. gcc -O0), or the compiler may remove the loop entirely. */
> >>> #include <stdio.h>
> >>> #include <sys/time.h>
> >>>
> >>> int main() {
> >>>     while (1) {
> >>>         int i;
> >>>         struct timeval t0, t1;
> >>>         double usec;
> >>>
> >>>         gettimeofday(&t0, 0);
> >>>         for (i = 0; i < 100000000; ++i)
> >>>             ;
> >>>         gettimeofday(&t1, 0);
> >>>
> >>>         usec = (t1.tv_sec * 1e6 + t1.tv_usec) - (t0.tv_sec * 1e6 + t0.tv_usec);
> >>>         printf("%8.0f\n", usec);
> >>>     }
> >>>     return 0;
> >>> }
> >>>
> >>> This will repeatedly time the inner loop.  On an N-core system, run N
> >>> of these, and look at the times reported.  Then start the master and
> >>> see if the timings change.  If the master does steal significant
> >>> cycles from the slaves, then you'll see the timings reported by the
> >>> slaves increase.
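> >>>
> >>> If it helps, the N copies can also be started, one pinned per core,
> >>> from a small driver program (a sketch, assuming the timing program
> >>> above has been built as ./slave):
> >>>
> >>> #define _GNU_SOURCE
> >>> #include <sched.h>
> >>> #include <stdio.h>
> >>> #include <stdlib.h>
> >>> #include <unistd.h>
> >>> #include <sys/wait.h>
> >>>
> >>> int main(int argc, char *argv[])
> >>> {
> >>>     int n = (argc > 1) ? atoi(argv[1]) : 4;   /* number of slaves */
> >>>     int i;
> >>>
> >>>     for (i = 0; i < n; i++) {
> >>>         if (fork() == 0) {
> >>>             cpu_set_t set;
> >>>             CPU_ZERO(&set);
> >>>             CPU_SET(i, &set);
> >>>             sched_setaffinity(0, sizeof(set), &set);   /* pin child to core i */
> >>>             execl("./slave", "slave", (char *) NULL);
> >>>             perror("execl");
> >>>             _exit(1);
> >>>         }
> >>>     }
> >>>     while (wait(NULL) > 0)
> >>>         ;   /* the slaves run forever; stop them with kill when done */
> >>>     return 0;
> >>> }
> >>>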
> >>> On my single-processor laptop (fc6, 2.6.20), running one slave, I
> >>> see no impact from the master.
> >>>
> >>> Please let me know what you find.
> >>>
> >>> As far as slave processes hopping around on processors goes, you can
> >>> set processor affinity on the slaves
> >>> (http://www.linuxjournal.com/article/6799 has a good description).
> >>>
> >>> -d
> >>>
> >>> On 09/14/2007 12:11 PM, Bob Soliday wrote:
> >>>> Sylvain Jeaugey wrote:
> >>>>> That's unfortunate.
> >>>>>
> >>>>> Still, I did two programs. A master:
> >>>>> ----------------------
> >>>>> #include <sched.h>
> >>>>>
> >>>>> int main() {
> >>>>>         while (1) {
> >>>>>             sched_yield();   /* give the CPU back right away */
> >>>>>         }
> >>>>>         return 0;
> >>>>> }
> >>>>> ----------------------
> >>>>> and a slave:
> >>>>> ----------------------
> >>>>> int main() {
> >>>>>         while (1);
> >>>>>         return 0;
> >>>>> }
> >>>>> ----------------------
> >>>>>
> >>>>> I launch 4 slaves and 1 master on a machine with two dual-core
> >>>>> CPUs. Here is the result in top:
> >>>>>
> >>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>>>> 12361 sylvain   25   0  2376  244  188 R  100  0.0   0:18.26 slave
> >>>>> 12362 sylvain   25   0  2376  244  188 R  100  0.0   0:18.12 slave
> >>>>> 12360 sylvain   25   0  2376  244  188 R  100  0.0   0:18.23 slave
> >>>>> 12363 sylvain   25   0  2376  244  188 R  100  0.0   0:18.15 slave
> >>>>> 12364 sylvain   20   0  2376  248  192 R    0  0.0   0:00.00 master
> >>>>> 12365 sylvain   16   0  6280 1120  772 R    0  0.0   0:00.08 top
> >>>>>
> >>>>> If you are seeing 66% each, I guess that your master is not
> >>>>> sched_yield'ing as much as expected. Maybe you should look at
> >>>>> environment variables to force a yield when no message is
> >>>>> available, and maybe your master isn't so idle after all and has
> >>>>> messages to send continuously, thus not yield'ing.
> >>>>>
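> >>>>> To make "yield when no message is available" concrete, a polite
> >>>>> idle loop would look roughly like this: poll with MPI_Iprobe and
> >>>>> give the CPU away whenever nothing is pending (an illustrative
> >>>>> sketch only, not MPICH's actual progress engine):
> >>>>>
> >>>>> #include <mpi.h>
> >>>>> #include <sched.h>
> >>>>> #include <stdio.h>
> >>>>>
> >>>>> int main(int argc, char *argv[])
> >>>>> {
> >>>>>     int flag = 0;
> >>>>>     MPI_Status status;
> >>>>>
> >>>>>     MPI_Init(&argc, &argv);
> >>>>>     while (!flag) {
> >>>>>         MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
> >>>>>                    &flag, &status);
> >>>>>         if (!flag)
> >>>>>             sched_yield();   /* nothing pending: let a slave run */
> >>>>>     }
> >>>>>     printf("message pending from rank %d\n", status.MPI_SOURCE);
> >>>>>     MPI_Finalize();
> >>>>>     return 0;
> >>>>> }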
> >>>>
> >>>> On our FC5 nodes with 4 cores we get similar results. But on our FC7
> >>>> nodes with 8 cores we don't. The kernel seems to think that all 9
> >>>> jobs require 100% and they end up jumping from one core to another.
> >>>> Often the master job is left on its own core while two slaves run on
> >>>> another.
> >>>>
> >>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
> >>>> 20127 ywang25   20   0  106m  22m 4168 R   68  0.5   0:06.84 0 slave
> >>>> 20131 ywang25   20   0  106m  22m 4184 R   73  0.5   0:07.26 1 slave
> >>>> 20133 ywang25   20   0  106m  22m 4196 R   75  0.5   0:07.49 2 slave
> >>>> 20129 ywang25   20   0  106m  22m 4176 R   84  0.5   0:08.44 3 slave
> >>>> 20135 ywang25   20   0  106m  22m 4176 R   73  0.5   0:07.29 4 slave
> >>>> 20132 ywang25   20   0  106m  22m 4188 R   70  0.5   0:07.04 4 slave
> >>>> 20128 ywang25   20   0  106m  22m 4180 R   78  0.5   0:07.79 5 slave
> >>>> 20130 ywang25   20   0  106m  22m 4180 R   74  0.5   0:07.45 6 slave
> >>>> 20134 ywang25   20   0  106m  24m 6708 R   80  0.6   0:07.98 7 master
> >>>>
> >>>> 20135 ywang25   20   0  106m  22m 4176 R   75  0.5   0:14.75 0 slave
> >>>> 20132 ywang25   20   0  106m  22m 4188 R   79  0.5   0:14.96 1 slave
> >>>> 20130 ywang25   20   0  106m  22m 4180 R   99  0.5   0:17.32 2 slave
> >>>> 20129 ywang25   20   0  106m  22m 4176 R  100  0.5   0:18.44 3 slave
> >>>> 20127 ywang25   20   0  106m  22m 4168 R   75  0.5   0:14.36 4 slave
> >>>> 20133 ywang25   20   0  106m  22m 4196 R   96  0.5   0:17.09 5 slave
> >>>> 20131 ywang25   20   0  106m  22m 4184 R   78  0.5   0:15.02 6 slave
> >>>> 20128 ywang25   20   0  106m  22m 4180 R   99  0.5   0:17.70 6 slave
> >>>> 20134 ywang25   20   0  106m  24m 6708 R  100  0.6   0:17.97 7 master
> >>>>
> >>>> 20130 ywang25   20   0  106m  22m 4180 R   87  0.5   0:25.99 0 slave
> >>>> 20132 ywang25   20   0  106m  22m 4188 R   79  0.5   0:22.83 0 slave
> >>>> 20127 ywang25   20   0  106m  22m 4168 R   75  0.5   0:21.89 1 slave
> >>>> 20133 ywang25   20   0  106m  22m 4196 R   98  0.5   0:26.94 2 slave
> >>>> 20129 ywang25   20   0  106m  22m 4176 R  100  0.5   0:28.45 3 slave
> >>>> 20135 ywang25   20   0  106m  22m 4176 R   74  0.5   0:22.12 4 slave
> >>>> 20134 ywang25   20   0  106m  24m 6708 R   98  0.6   0:27.73 5 master
> >>>> 20128 ywang25   20   0  106m  22m 4180 R   90  0.5   0:26.72 6 slave
> >>>> 20131 ywang25   20   0  106m  22m 4184 R   99  0.5   0:24.96 7 slave
> >>>>
> >>>> 20133 ywang25   20   0 91440 5756 4852 R   87  0.1   0:44.20 0 slave
> >>>> 20132 ywang25   20   0 91436 5764 4860 R   80  0.1   0:39.32 0 slave
> >>>> 20134 ywang25   20   0  112m  36m  11m R   96  0.9   0:47.35 5 master
> >>>> 20129 ywang25   20   0 91440 5736 4832 R   91  0.1   0:46.84 1 slave
> >>>> 20130 ywang25   20   0 91440 5748 4844 R   83  0.1   0:43.07 3 slave
> >>>> 20131 ywang25   20   0 91432 5744 4840 R   84  0.1   0:41.20 4 slave
> >>>> 20134 ywang25   20   0  112m  36m  11m R   96  0.9   0:47.35 5 master
> >>>> 20128 ywang25   20   0 91432 5752 4844 R   93  0.1   0:45.36 5 slave
> >>>> 20127 ywang25   20   0 91440 5724 4824 R   94  0.1   0:40.56 6 slave
> >>>> 20135 ywang25   20   0 91440 5736 4832 R   92  0.1   0:39.75 7 slave
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> > 
> 



