[MPICH] An idle communication process uses the same CPU as a computation process on multi-core chips

Bob Soliday soliday at aps.anl.gov
Tue Sep 25 13:22:40 CDT 2007


Pretty cool to see my bug report made it all the way up to Linus Torvalds.

Darius Buntinas wrote:
> Hmm.  Maybe things aren't as bad as I thought.  It looks like Linus is 
> pushing for the previous yield() behavior.
> 
> http://kerneltrap.org/Linux/CFS_and_sched_yield
> 
> -d
> 
> On 09/18/2007 01:23 PM, Darius Buntinas wrote:
> 
>>
>>  From the discussion on lkml and the fact that they see programs that 
>> use sched_yield() this way as "fundamentally broken", it seems that 
>> this patch is only temporary, and eventually the pre-2.6.22 kernel 
>> behavior won't be supported.
>>
>> -d
>>
>> On 09/18/2007 12:45 PM, Bob Soliday wrote:
>>
>>> Well, I reported the bug, and it turns out they already have a patch
>>> for it that will be included in a future release, making it possible
>>> to emulate the old scheduler's behavior.
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=295071
>>>
>>> http://lkml.org/lkml/2007/9/14/157
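>>>
>>> Once that patch is in, something like the following should switch the
>>> old behavior back on. This is just an untested sketch on my part, and
>>> it assumes the knob ends up being the kernel.sched_compat_yield sysctl
>>> mentioned in the lkml discussion (it needs root, and is equivalent to
>>> echo 1 > /proc/sys/kernel/sched_compat_yield):
>>>
>>> #include <stdio.h>
>>>
>>> int main(void)
>>> {
>>>     /* assumed knob name; re-enables the pre-CFS sched_yield() behavior */
>>>     FILE *f = fopen("/proc/sys/kernel/sched_compat_yield", "w");
>>>     if (f == NULL) {
>>>         perror("sched_compat_yield");
>>>         return 1;
>>>     }
>>>     fputs("1\n", f);
>>>     fclose(f);
>>>     return 0;
>>> }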
>>>
>>> --Bob
>>>
>>> Bob Soliday wrote:
>>>
>>>> It turns out the problem is not related to the number of cores. Only 
>>>> the newest versions of the Fedora 7 kernel show the problem. I think 
>>>> it is related to the CFS scheduler in these kernels.
>>>>
>>>> When I run one slave and one master on the same core with 
>>>> kernel-2.6.21-1.3194 using Darius's slave code I see the slave task 
>>>> use 100% of the CPU and see the same timing values as when I run the 
>>>> slave on a different core.
>>>>
>>>> When I do the same test with kernel-2.6.22.4-65 or kernel-2.6.22.5-76,
>>>> the timing values double, as the slave can only get 50% of the CPU
>>>> time when on the same core.
>>>>
>>>> --Bob
>>>>
>>>> Darius Buntinas wrote:
>>>>
>>>>>
>>>>> I can verify that I saw the same problem Yusong did when starting
>>>>> the master first on a dual quad-core machine.  But assigning each
>>>>> slave to its own core (using taskset) fixed that.
>>>>>
>>>>> Interestingly, when there are fewer than 8 slaves, top shows the
>>>>> master at 100% usage (when top is in "irix mode"; 12.5%, i.e. 1/8,
>>>>> when it is not).  When I have 8 slaves, the usage of the master
>>>>> process goes to 0.
>>>>>
>>>>> Yusong, I'm betting that if you set the cpu affinity for the 
>>>>> slaves, you'll see no impact of the master on the slaves.  Can you 
>>>>> try that?
>>>>>
>>>>> e.g.,:
>>>>>   ./master &
>>>>>   for i in `seq 0 3` ; do taskset -c $i ./slave & done
>>>>>
>>>>> -d
>>>>>
>>>>> On 09/17/2007 02:31 AM, Sylvain Jeaugey wrote:
>>>>>
>>>>>> This seems to be the key to the problem. When the master is
>>>>>> launched before the others, it takes one CPU, and this won't change
>>>>>> until, for whatever scheduling reason, it ends up sharing its CPU
>>>>>> with a slave. It then falls to 0% and we're saved.
>>>>>>
>>>>>> So, to conduct your experiment, you definitely need to taskset your
>>>>>> slaves. Just launch them with
>>>>>> taskset -c <cpu> ./slave (one process per CPU)
>>>>>> or use the -p option of taskset to do it after launch, and ensure
>>>>>> that each slave _will_ take one CPU. The master will then be obliged
>>>>>> to share a CPU with the others, and sched_yield() will be effective.
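>>>>>>
>>>>>> (If taskset isn't available, here is a rough, untested sketch of
>>>>>> what its -p mode does, calling sched_setaffinity() directly; the
>>>>>> <cpu> and <pid> arguments are purely illustrative:)
>>>>>>
>>>>>> #define _GNU_SOURCE
>>>>>> #include <sched.h>
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>>
>>>>>> /* pin an already-running process to one CPU, roughly what
>>>>>>  * `taskset -p -c <cpu> <pid>` does */
>>>>>> int main(int argc, char **argv)
>>>>>> {
>>>>>>     cpu_set_t set;
>>>>>>     int cpu;
>>>>>>     pid_t pid;
>>>>>>
>>>>>>     if (argc != 3) {
>>>>>>         fprintf(stderr, "usage: %s <cpu> <pid>\n", argv[0]);
>>>>>>         return 1;
>>>>>>     }
>>>>>>     cpu = atoi(argv[1]);
>>>>>>     pid = (pid_t) atoi(argv[2]);
>>>>>>
>>>>>>     CPU_ZERO(&set);
>>>>>>     CPU_SET(cpu, &set);
>>>>>>     if (sched_setaffinity(pid, sizeof(set), &set) != 0) {
>>>>>>         perror("sched_setaffinity");
>>>>>>         return 1;
>>>>>>     }
>>>>>>     return 0;
>>>>>> }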
>>>>>>
>>>>>> Sylvain
>>>>>>
>>>>>> On Sun, 16 Sep 2007, Yusong Wang wrote:
>>>>>>
>>>>>>> I did the experiments on four types of multi-core chips (2
>>>>>>> dual-core, 1 quad-core, and 1 eight-core).  All of my tests show
>>>>>>> that the idle master process has a big impact on the other slave
>>>>>>> processes, except for the test on the quad-core, where I found that
>>>>>>> the order matters: when the master was launched after the slave
>>>>>>> processes, there was no effect, while if the master started first,
>>>>>>> two slave processes would end up on the same core and slow down
>>>>>>> significantly compared to the others.
>>>>>>>
>>>>>>> Yusong
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: Darius Buntinas <buntinas at mcs.anl.gov>
>>>>>>> Date: Friday, September 14, 2007 12:55 pm
>>>>>>> Subject: Re: [MPICH] An idle communication process uses the same
>>>>>>>   CPU as a computation process on multi-core chips
>>>>>>>
>>>>>>>>
>>>>>>>> It's possible that different versions of the kernel/os/top compute
>>>>>>>> %cpu differently.  "CPU utilization" is really a nebulous term.
>>>>>>>> What you really want to know is whether the master is stealing
>>>>>>>> significant cycles from the slaves.  A test of this would be to
>>>>>>>> replace Sylvain's slave code with this:
>>>>>>>>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <sys/time.h>
>>>>>>>>
>>>>>>>> int main() {
>>>>>>>>     while (1) {
>>>>>>>>         int i;
>>>>>>>>         struct timeval t0, t1;
>>>>>>>>         double usec;
>>>>>>>>
>>>>>>>>         /* time a fixed amount of busy work */
>>>>>>>>         gettimeofday(&t0, 0);
>>>>>>>>         for (i = 0; i < 100000000; ++i)
>>>>>>>>             ;
>>>>>>>>         gettimeofday(&t1, 0);
>>>>>>>>
>>>>>>>>         usec = (t1.tv_sec * 1e6 + t1.tv_usec)
>>>>>>>>              - (t0.tv_sec * 1e6 + t0.tv_usec);
>>>>>>>>         printf("%8.0f\n", usec);
>>>>>>>>     }
>>>>>>>>     return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>> This will repeatedly time the inner loop.  On an N core system,
>>>>>>>> run N of these, and look at the times reported.  Then start the
>>>>>>>> master and see if the timings change.  If the master does steal
>>>>>>>> significant cycles from the slaves, then you'll see the timings
>>>>>>>> reported by the slaves increase.  On my single processor laptop
>>>>>>>> (fc6, 2.6.20), running one slave, I see no impact from the master.
>>>>>>>>
>>>>>>>> Please let me know what you find.
>>>>>>>>
>>>>>>>> As far as slave processes hopping around on processors, you can
>>>>>>>> set processor affinity on the slaves
>>>>>>>> ( http://www.linuxjournal.com/article/6799 has a good description).
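>>>>>>>>
>>>>>>>> For example, a slave could pin itself at startup with something
>>>>>>>> like this (an untested sketch of the sched_setaffinity() call that
>>>>>>>> article describes; the command-line core argument is just for
>>>>>>>> illustration):
>>>>>>>>
>>>>>>>> #define _GNU_SOURCE
>>>>>>>> #include <sched.h>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdlib.h>
>>>>>>>>
>>>>>>>> int main(int argc, char **argv)
>>>>>>>> {
>>>>>>>>     cpu_set_t set;
>>>>>>>>     int core;
>>>>>>>>
>>>>>>>>     /* pin this process to the core given on the command line,
>>>>>>>>      * e.g. ./slave 3; pid 0 means "the calling process" */
>>>>>>>>     core = (argc > 1) ? atoi(argv[1]) : 0;
>>>>>>>>     CPU_ZERO(&set);
>>>>>>>>     CPU_SET(core, &set);
>>>>>>>>     if (sched_setaffinity(0, sizeof(set), &set) != 0) {
>>>>>>>>         perror("sched_setaffinity");
>>>>>>>>         return 1;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     /* ... timing loop from the slave code above goes here ... */
>>>>>>>>     return 0;
>>>>>>>> }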
>>>>>>>>
>>>>>>>> -d
>>>>>>>>
>>>>>>>> On 09/14/2007 12:11 PM, Bob Soliday wrote:
>>>>>>>>
>>>>>>>>> Sylvain Jeaugey wrote:
>>>>>>>>>
>>>>>>>>>> That's unfortunate.
>>>>>>>>>>
>>>>>>>>>> Still, I did two programs. A master:
>>>>>>>>>> ----------------------
>>>>>>>>>> #include <sched.h>
>>>>>>>>>> int main() {
>>>>>>>>>>         while (1) {
>>>>>>>>>>             sched_yield();
>>>>>>>>>>         }
>>>>>>>>>>         return 0;
>>>>>>>>>> }
>>>>>>>>>> ----------------------
>>>>>>>>>> and a slave:
>>>>>>>>>> ----------------------
>>>>>>>>>> int main() {
>>>>>>>>>>         while (1);
>>>>>>>>>>         return 0;
>>>>>>>>>> }
>>>>>>>>>> ----------------------
>>>>>>>>>>
>>>>>>>>>> I launch 4 slaves and 1 master on a machine with two dual-core
>>>>>>>>>> CPUs. Here is the result in top:
>>>>>>>>>>
>>>>>>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>>>>>> 12361 sylvain   25   0  2376  244  188 R  100  0.0   0:18.26 slave
>>>>>>>>>> 12362 sylvain   25   0  2376  244  188 R  100  0.0   0:18.12 slave
>>>>>>>>>> 12360 sylvain   25   0  2376  244  188 R  100  0.0   0:18.23 slave
>>>>>>>>>> 12363 sylvain   25   0  2376  244  188 R  100  0.0   0:18.15 slave
>>>>>>>>>> 12364 sylvain   20   0  2376  248  192 R    0  0.0   0:00.00 master
>>>>>>>>>> 12365 sylvain   16   0  6280 1120  772 R    0  0.0   0:00.08 top
>>>>>>>>>>
>>>>>>>>>> If you are seeing 66% each, I guess that your master is not
>>>>>>>>>> sched_yield'ing as much as expected. Maybe you should look at
>>>>>>>>>> environment variables to force a yield when no message is
>>>>>>>>>> available; and maybe your master isn't so idle after all and has
>>>>>>>>>> messages to send continuously, thus never yield'ing.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On our FC5 nodes with 4 cores we get similar results. But on our
>>>>>>>>> FC7 nodes with 8 cores we don't. The kernel seems to think that
>>>>>>>>> all 9 jobs require 100% and they end up jumping from one core to
>>>>>>>>> another. Often the master job is left on its own core while two
>>>>>>>>> slaves run on another.
>>>>>>>>>
>>>>>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
>>>>>>>>> 20127 ywang25   20   0  106m  22m 4168 R   68  0.5   0:06.84 0 slave
>>>>>>>>> 20131 ywang25   20   0  106m  22m 4184 R   73  0.5   0:07.26 1 slave
>>>>>>>>> 20133 ywang25   20   0  106m  22m 4196 R   75  0.5   0:07.49 2 slave
>>>>>>>>> 20129 ywang25   20   0  106m  22m 4176 R   84  0.5   0:08.44 3 slave
>>>>>>>>> 20135 ywang25   20   0  106m  22m 4176 R   73  0.5   0:07.29 4 slave
>>>>>>>>> 20132 ywang25   20   0  106m  22m 4188 R   70  0.5   0:07.04 4 slave
>>>>>>>>> 20128 ywang25   20   0  106m  22m 4180 R   78  0.5   0:07.79 5 slave
>>>>>>>>> 20130 ywang25   20   0  106m  22m 4180 R   74  0.5   0:07.45 6 slave
>>>>>>>>> 20134 ywang25   20   0  106m  24m 6708 R   80  0.6   0:07.98 7 master
>>>>>>>>>
>>>>>>>>> 20135 ywang25   20   0  106m  22m 4176 R   75  0.5   0:14.75 0 slave
>>>>>>>>> 20132 ywang25   20   0  106m  22m 4188 R   79  0.5   0:14.96 1 slave
>>>>>>>>> 20130 ywang25   20   0  106m  22m 4180 R   99  0.5   0:17.32 2 slave
>>>>>>>>> 20129 ywang25   20   0  106m  22m 4176 R  100  0.5   0:18.44 3 slave
>>>>>>>>> 20127 ywang25   20   0  106m  22m 4168 R   75  0.5   0:14.36 4 slave
>>>>>>>>> 20133 ywang25   20   0  106m  22m 4196 R   96  0.5   0:17.09 5 slave
>>>>>>>>> 20131 ywang25   20   0  106m  22m 4184 R   78  0.5   0:15.02 6 slave
>>>>>>>>> 20128 ywang25   20   0  106m  22m 4180 R   99  0.5   0:17.70 6 slave
>>>>>>>>> 20134 ywang25   20   0  106m  24m 6708 R  100  0.6   0:17.97 7 master
>>>>>>>>>
>>>>>>>>> 20130 ywang25   20   0  106m  22m 4180 R   87  0.5   0:25.99 0 slave
>>>>>>>>> 20132 ywang25   20   0  106m  22m 4188 R   79  0.5   0:22.83 0 slave
>>>>>>>>> 20127 ywang25   20   0  106m  22m 4168 R   75  0.5   0:21.89 1 slave
>>>>>>>>> 20133 ywang25   20   0  106m  22m 4196 R   98  0.5   0:26.94 2 slave
>>>>>>>>> 20129 ywang25   20   0  106m  22m 4176 R  100  0.5   0:28.45 3 slave
>>>>>>>>> 20135 ywang25   20   0  106m  22m 4176 R   74  0.5   0:22.12 4 slave
>>>>>>>>> 20134 ywang25   20   0  106m  24m 6708 R   98  0.6   0:27.73 5 master
>>>>>>>>> 20128 ywang25   20   0  106m  22m 4180 R   90  0.5   0:26.72 6 slave
>>>>>>>>> 20131 ywang25   20   0  106m  22m 4184 R   99  0.5   0:24.96 7 slave
>>>>>>>>>
>>>>>>>>> 20133 ywang25   20   0 91440 5756 4852 R   87  0.1   0:44.20 0 slave
>>>>>>>>> 20132 ywang25   20   0 91436 5764 4860 R   80  0.1   0:39.32 0 slave
>>>>>>>>> 20129 ywang25   20   0 91440 5736 4832 R   91  0.1   0:46.84 1 slave
>>>>>>>>> 20130 ywang25   20   0 91440 5748 4844 R   83  0.1   0:43.07 3 slave
>>>>>>>>> 20131 ywang25   20   0 91432 5744 4840 R   84  0.1   0:41.20 4 slave
>>>>>>>>> 20134 ywang25   20   0  112m  36m  11m R   96  0.9   0:47.35 5 master
>>>>>>>>> 20128 ywang25   20   0 91432 5752 4844 R   93  0.1   0:45.36 5 slave
>>>>>>>>> 20127 ywang25   20   0 91440 5724 4824 R   94  0.1   0:40.56 6 slave
>>>>>>>>> 20135 ywang25   20   0 91440 5736 4832 R   92  0.1   0:39.75 7 slave
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>



