Slow speed after changing from serial to parallel (with ex2f.F)

Barry Smith bsmith at mcs.anl.gov
Wed Apr 16 09:17:18 CDT 2008


On Apr 16, 2008, at 8:44 AM, Ben Tay wrote:
> Hi,
>
> Am I right to say that despite all the hype about multi-core
> processors, they can't speed up the solving of linear equations? It's
> not possible to get a 2x speedup when using 2 cores. And is this true
> for all types of linear equation solvers besides PETSc?

    It will basically be the same for any iterative solver package.

> What about parallel direct solvers (e.g. MUMPS)

    Direct solvers are a bit less memory-bandwidth limited, so scaling
will be a bit better. But for problems where iterative solvers work well,
the time spent will likely be much higher with a direct solver.

> or those which use OpenMP instead of MPICH?

    OpenMP will give no benefit; this is a hardware limitation, not a
software one.

> Well, I just can't help feeling disappointed if that's the case...

    If you are going to do parallel computing you need to get used to
disappointment. At this point in time (especially on first-generation
dual/quad-core systems) memory bandwidth, not the number of flops your
hardware can do, is the fundamental limit on speed.
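
    To see concretely what the bandwidth limit looks like, here is a
minimal standalone C sketch (not part of PETSc; the array length N is an
arbitrary assumption) of a STREAM-style triad. A sparse MatMult behaves
the same way: a couple of flops per 8-byte value fetched from memory, so
the memory system, not the FPU, sets the speed.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* pick something much larger than the caches */

int main(void)
{
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  double  s = 3.0;
  long    i;

  for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

  clock_t t0 = clock();
  for (i = 0; i < N; i++) a[i] = b[i] + s * c[i];  /* 2 flops, 24 bytes */
  double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

  printf("triad: %.2f s  ~%.2f GB/s  ~%.0f Mflop/s\n",
         sec, 24.0 * N / sec / 1e9, 2.0 * N / sec / 1e6);
  free(a); free(b); free(c);
  return 0;
}

Run one copy, then two copies at once: on a single socket the aggregate
GB/s will typically fall well short of doubling, and that is the same
wall the iterative solver hits.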

    Barry

>
>
> Also, with a smart enough LSF scheduler, I will be assured of getting
> separate processors, i.e. 1 core from each different processor instead
> of 2-4 cores from just 1 processor. In that case, if I use 1 core from
> processor A and 1 core from processor B, I should be able to get a
> decent speedup of more than 1, is that so?

    So long as your iterative solver ALGORITHM scales well, you should
see very good speedup (and most people do). Algorithmic scaling means
that if you increase the number of processes, the number of iterations
should not increase much.
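
    Here is a minimal sketch of how one could check this, modeled loosely
on the KSP tutorial examples (the 1-D Laplacian, its size, and the
3-argument KSPSetOperators of recent PETSc releases are assumptions, and
error checking is omitted). Build it against PETSc, run it under mpiexec
on 1, 2 and 4 processes, and compare the printed iteration counts.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, b;
  KSP         ksp;
  PetscInt    i, n = 1000, Istart, Iend, its;
  PetscMPIInt size;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);

  /* Assemble a 1-D Laplacian, each process filling its own rows */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &Istart, &Iend);
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  VecCreate(PETSC_COMM_WORLD, &b);
  VecSetSizes(b, PETSC_DECIDE, n);
  VecSetFromOptions(b);
  VecDuplicate(b, &x);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);   /* older PETSc takes a 4th MatStructure arg */
  KSPSetFromOptions(ksp);       /* choose solver with -ksp_type, -pc_type */
  KSPSolve(ksp, b, x);

  /* The algorithm scales if this count stays flat as 'size' grows */
  KSPGetIterationNumber(ksp, &its);
  PetscPrintf(PETSC_COMM_WORLD, "%d processes: %d iterations\n",
              (int)size, (int)its);

  KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
  PetscFinalize();
  return 0;
}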


> This option is also better than using 2 or even 4 cores from the  
> same processor.

   Two cores out of the four will likely not be so bad either; all  
four will be bad.

   Barry


>
>
> Thank you very much.
>
> Satish Balay wrote:
>> On Wed, 16 Apr 2008, Ben Tay wrote:
>>
>>
>>> Hi Satish, thank you very much for helping me run the ex2f.F code.
>>>
>>> I think I have a clearer picture now. I believe I'm running on a
>>> Dual-Core Intel Xeon 5160. The quad-core nodes are only atlas3-01 to
>>> 04, and there are only 4 of them. I guess that the lower peak is
>>> because I'm using a Xeon 5160, while you are using a Xeon X5355.
>>>
>>
>> I'm still a bit puzzled. I just ran the same binary on a machine with
>> 2 dual-core Xeon 5130s [which should be similar to your 5160 machine]
>> and got the following:
>>
>> [balay at n001 ~]$ grep MatMult log*
>> log.1:MatMult             1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   364
>> log.2:MatMult             1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   615
>> log.4:MatMult              969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   656
>> [balay at n001 ~]$
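
    [Aside: a rough way to read speedup off these MatMult lines, taking
the last column as the aggregate MatMult flop rate, is 615/364 ~ 1.7 on
2 processes and 656/364 ~ 1.8 on 4. The second core helps, but the
scaling then flattens out, which is the bandwidth wall described above.]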
>>
>>> You mentioned the speedups for MatMult and compared KSPSolve times.
>>> Are these the only things we have to look at? I see that some other
>>> events, such as VecMAXPY, also take up a sizable % of the time. To
>>> get an accurate speedup, do I just compare the time taken by KSPSolve
>>> between different numbers of processors, or do I have to look at
>>> other events such as MatMult as well?
>>>
>>
>> Sometimes we look at individual components like MatMult() and
>> VecMAXPY() to understand what's happening in each stage - and at
>> KSPSolve() to look at the aggregate performance for the whole solve
>> [which includes MatMult, VecMAXPY, etc.]. Perhaps I should have looked
>> at VecMDot() as well - at 48% of runtime it's the biggest contributor
>> to KSPSolve() for your run.
>>
>> It's easy to get lost in the details of log_summary. Looking for
>> anomalies is one thing; plotting scalability charts for the solver is
>> something else..
>>
>>
>>> In summary, due to load imbalance, my speedup is quite bad. So maybe
>>> I'll just send your results to my school's engineer and see if they
>>> can do anything. For my part, I guess I'll just have to wait?
>>>
>>
>> Yes - load imbalance at the MatMult level is bad. On the 4-process run
>> you have ratio = 3.6. This implies that one of the MPI tasks is 3.6
>> times slower than the others [so all speedup is lost here]
>>
>> You could try the latest mpich2 [1.0.7] - just for this SMP
>> experiment, and see if it makes a difference. I've built mpich2 with
>> [default gcc/gfortran and]:
>>
>> ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker
>>
>> There could be something else going on on this machine that's messing
>> up the load balance for a basic PETSc example..
>>
>> Satish
>>
>>
>>
>



