[MPICH] An idle communication process uses the same CPU as a computation process on multi-core chips

Bob Soliday soliday at aps.anl.gov
Tue Sep 18 11:13:25 CDT 2007


It turns out the problem is not related to the number of cores. Only the 
newest versions of the Fedora 7 kernel show the problem. I think it is 
related to the CFS scheduler in these kernels.

When I run one slave and one master on the same core with 
kernel-2.6.21-1.3194, using Darius's slave code, I see the slave task use 
100% of the CPU and get the same timing values as when I run the slave 
on a different core.

When I do the same test with kernel-2.6.22.4-65 or kernel-2.6.22.5-76, 
the timing values double, because the slave can only get 50% of the CPU 
time when it shares the core with the master.
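
A quick way to double-check which core each process actually ended up on 
(just a sketch, and it assumes a glibc recent enough to provide 
sched_getcpu()) is a small helper along these lines:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Print once a second which core this process is currently running on,
   so the placement reported by top can be cross-checked. */
int main(void)
{
    while (1) {
        printf("pid %d running on cpu %d\n", (int)getpid(), sched_getcpu());
        fflush(stdout);
        sleep(1);
    }
    return 0;
}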

--Bob

Darius Buntinas wrote:
> 
> I can verify that I saw the same problem Yusong did when starting the 
> master first on a dual quad-core machine.  But assigning each slave to 
> its own core (using taskset) fixed that.
> 
> Interestingly, when there are fewer than 8 slaves, top shows the master 
> at 100% usage (when top is in "irix mode"; 12.5% (1/8) when it is not).  
> When I have 8 slaves, the usage of the master process drops to 0.
> 
> Yusong, I'm betting that if you set the cpu affinity for the slaves, 
> you'll see no impact of the master on the slaves.  Can you try that?
> 
> e.g.,:
>   ./master &
>   for i in `seq 0 3` ; do taskset -c $i ./slave & done
> 
> -d
> 
> On 09/17/2007 02:31 AM, Sylvain Jeaugey wrote:
> 
>> This seems to be the key to the problem. When the master is launched 
>> before the others, it takes one CPU for itself, and this won't change 
>> until, for some scheduling reason, it ends up sharing its CPU with a 
>> slave. Its usage then falls to 0% and we're saved.
>>
>> So, to conduct your experiment, you definitely need to taskset your 
>> slaves. Just launch them with
>> taskset -c <cpu> ./slave (one process per CPU)
>> or use the -p option of taskset to set the affinity after launch, and 
>> ensure that each slave _will_ take one CPU. The master will then be 
>> obliged to share a CPU with the others, and sched_yield() will be 
>> effective.
>>
>> Sylvain
>>
>> On Sun, 16 Sep 2007, Yusong Wang wrote:
>>
>>> I did the experiments on four types of multi-core chips (two dual-core, 
>>> one quad-core and one eight-core).  All of my tests show that the idle 
>>> master process has a big impact on the slave processes, except for the 
>>> quad-core test, in which I found that the order does matter: when the 
>>> master was launched after the slave processes, there was no effect, 
>>> while if the master started first, two slave processes would end up on 
>>> the same core and slow down significantly more than the others.
>>>
>>> Yusong
>>>
>>> ----- Original Message -----
>>> From: Darius Buntinas <buntinas at mcs.anl.gov>
>>> Date: Friday, September 14, 2007 12:55 pm
>>> Subject: Re: [MPICH] An idle communication process uses the same CPU 
>>> as a computation process on multi-core chips
>>>
>>>>
>>>> It's possible that different versions of the kernel/os/top compute
>>>> %cpu differently.  "CPU utilization" is really a nebulous term.  What
>>>> you really want to know is whether the master is stealing significant
>>>> cycles from the slaves.  A test of this would be to replace Sylvain's
>>>> slave code with this:
>>>>
>>>> #include <stdio.h>
>>>> #include <sys/time.h>
>>>>
>>>> int main() {
>>>>     while (1) {
>>>>         int i;
>>>>         struct timeval t0, t1;
>>>>         double usec;
>>>>
>>>>         gettimeofday(&t0, 0);
>>>>         /* busy loop; compile without optimization so it isn't removed */
>>>>         for (i = 0; i < 100000000; ++i)
>>>>             ;
>>>>         gettimeofday(&t1, 0);
>>>>
>>>>         usec = (t1.tv_sec * 1e6 + t1.tv_usec) - (t0.tv_sec * 1e6 + t0.tv_usec);
>>>>         printf("%8.0f\n", usec);
>>>>     }
>>>>     return 0;
>>>> }
>>>>
>>>> This will repeatedly time the inner loop.  On an N core system, run N
>>>> of these, and look at the times reported.  Then start the master and
>>>> see if the timings change.  If the master does steal significant cycles
>>>> from the slaves, then you'll see the timings reported by the slaves
>>>> increase.  On my single processor laptop (fc6, 2.6.20), running one
>>>> slave, I see no impact from the master.
>>>>
>>>> Please let me know what you find.
>>>>
>>>> As for slave processes hopping around between processors, you can set
>>>> processor affinity on the slaves (http://www.linuxjournal.com/article/6799
>>>> has a good description).
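>>>>
>>>> If you'd rather set the affinity from inside the program than with
>>>> taskset, a minimal sketch (assuming Linux's sched_setaffinity(); the
>>>> CPU_SET macros need _GNU_SOURCE) would be something like:
>>>>
>>>> #define _GNU_SOURCE
>>>> #include <sched.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>>
>>>> /* Pin the calling process to the core given on the command line,
>>>>    then spin like the slave does. */
>>>> int main(int argc, char **argv)
>>>> {
>>>>     cpu_set_t mask;
>>>>     int core = (argc > 1) ? atoi(argv[1]) : 0;
>>>>
>>>>     CPU_ZERO(&mask);
>>>>     CPU_SET(core, &mask);
>>>>     if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
>>>>         perror("sched_setaffinity");
>>>>         return 1;
>>>>     }
>>>>     while (1)
>>>>         ;   /* burn CPU on the chosen core */
>>>>     return 0;
>>>> }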
>>>>
>>>> -d
>>>>
>>>> On 09/14/2007 12:11 PM, Bob Soliday wrote:
>>>>
>>>>> Sylvain Jeaugey wrote:
>>>>>
>>>>>> That's unfortunate.
>>>>>>
>>>>>> Still, I did two programs. A master :
>>>>>> ----------------------
>>>>>> #include <sched.h>
>>>>>>
>>>>>> int main() {
>>>>>>         while (1) {
>>>>>>             sched_yield();   /* hand the CPU back whenever anyone else wants it */
>>>>>>         }
>>>>>>         return 0;
>>>>>> }
>>>>>> ----------------------
>>>>>> and a slave :
>>>>>> ----------------------
>>>>>> int main() {
>>>>>>         while (1);
>>>>>>         return 0;
>>>>>> }
>>>>>> ----------------------
>>>>>>
>>>>>> I launch 4 slaves and 1 master on a machine with two dual-core 
>>>>>> processors. Here is the result in top:
>>>>>>
>>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>> 12361 sylvain   25   0  2376  244  188 R  100  0.0   0:18.26 slave
>>>>>> 12362 sylvain   25   0  2376  244  188 R  100  0.0   0:18.12 slave
>>>>>> 12360 sylvain   25   0  2376  244  188 R  100  0.0   0:18.23 slave
>>>>>> 12363 sylvain   25   0  2376  244  188 R  100  0.0   0:18.15 slave
>>>>>> 12364 sylvain   20   0  2376  248  192 R    0  0.0   0:00.00 master
>>>>>> 12365 sylvain   16   0  6280 1120  772 R    0  0.0   0:00.08 top
>>>>>>
>>>>>> If you are seeing 66% each, I guess that your master is not
>>>>>> sched_yield'ing as much as expected. Maybe you should look at
>>>>>> environment variables to force yielding when no message is available,
>>>>>> and maybe your master isn't so idle after all and has messages to
>>>>>> send continuously, thus never yielding.
>>>>>>
>>>>>
>>>>> On our FC5 nodes with 4 cores we get similar results. But on our FC7
>>>>> nodes with 8 cores we don't. The kernel seems to think that all 9 jobs
>>>>> require 100% and they end up jumping from one core to another. Often
>>>>> the master job is left on its own core while two slaves run on another.
>>>>>
>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
>>>>> 20127 ywang25   20   0  106m  22m 4168 R   68  0.5   0:06.84 0 slave
>>>>> 20131 ywang25   20   0  106m  22m 4184 R   73  0.5   0:07.26 1 slave
>>>>> 20133 ywang25   20   0  106m  22m 4196 R   75  0.5   0:07.49 2 slave
>>>>> 20129 ywang25   20   0  106m  22m 4176 R   84  0.5   0:08.44 3 slave
>>>>> 20135 ywang25   20   0  106m  22m 4176 R   73  0.5   0:07.29 4 slave
>>>>> 20132 ywang25   20   0  106m  22m 4188 R   70  0.5   0:07.04 4 slave
>>>>> 20128 ywang25   20   0  106m  22m 4180 R   78  0.5   0:07.79 5 slave
>>>>> 20130 ywang25   20   0  106m  22m 4180 R   74  0.5   0:07.45 6 slave
>>>>> 20134 ywang25   20   0  106m  24m 6708 R   80  0.6   0:07.98 7 master
>>>>>
>>>>> 20135 ywang25   20   0  106m  22m 4176 R   75  0.5   0:14.75 0 slave
>>>>> 20132 ywang25   20   0  106m  22m 4188 R   79  0.5   0:14.96 1 slave
>>>>> 20130 ywang25   20   0  106m  22m 4180 R   99  0.5   0:17.32 2 slave
>>>>> 20129 ywang25   20   0  106m  22m 4176 R  100  0.5   0:18.44 3 slave
>>>>> 20127 ywang25   20   0  106m  22m 4168 R   75  0.5   0:14.36 4 slave
>>>>> 20133 ywang25   20   0  106m  22m 4196 R   96  0.5   0:17.09 5 slave
>>>>> 20131 ywang25   20   0  106m  22m 4184 R   78  0.5   0:15.02 6 slave
>>>>> 20128 ywang25   20   0  106m  22m 4180 R   99  0.5   0:17.70 6 slave
>>>>> 20134 ywang25   20   0  106m  24m 6708 R  100  0.6   0:17.97 7 master
>>>>>
>>>>> 20130 ywang25   20   0  106m  22m 4180 R   87  0.5   0:25.99 0 slave
>>>>> 20132 ywang25   20   0  106m  22m 4188 R   79  0.5   0:22.83 0 slave
>>>>> 20127 ywang25   20   0  106m  22m 4168 R   75  0.5   0:21.89 1 slave
>>>>> 20133 ywang25   20   0  106m  22m 4196 R   98  0.5   0:26.94 2 slave
>>>>> 20129 ywang25   20   0  106m  22m 4176 R  100  0.5   0:28.45 3 slave
>>>>> 20135 ywang25   20   0  106m  22m 4176 R   74  0.5   0:22.12 4 slave
>>>>> 20134 ywang25   20   0  106m  24m 6708 R   98  0.6   0:27.73 5 master
>>>>> 20128 ywang25   20   0  106m  22m 4180 R   90  0.5   0:26.72 6 slave
>>>>> 20131 ywang25   20   0  106m  22m 4184 R   99  0.5   0:24.96 7 slave
>>>>>
>>>>> 20133 ywang25   20   0 91440 5756 4852 R   87  0.1   0:44.20 0 slave
>>>>> 20132 ywang25   20   0 91436 5764 4860 R   80  0.1   0:39.32 0 slave
>>>>> 20129 ywang25   20   0 91440 5736 4832 R   91  0.1   0:46.84 1 slave
>>>>> 20130 ywang25   20   0 91440 5748 4844 R   83  0.1   0:43.07 3 slave
>>>>> 20131 ywang25   20   0 91432 5744 4840 R   84  0.1   0:41.20 4 slave
>>>>> 20134 ywang25   20   0  112m  36m  11m R   96  0.9   0:47.35 5 master
>>>>> 20128 ywang25   20   0 91432 5752 4844 R   93  0.1   0:45.36 5 slave
>>>>> 20127 ywang25   20   0 91440 5724 4824 R   94  0.1   0:40.56 6 slave
>>>>> 20135 ywang25   20   0 91440 5736 4832 R   92  0.1   0:39.75 7 slave
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>



