[petsc-users] Investigate parallel code to improve parallelism

Barry Smith bsmith at mcs.anl.gov
Tue Mar 1 10:56:17 CST 2016


  If you have access to a different cluster you can try there and see if the communication is any better. You would likely get better speedup on an IBM BlueGene, since it has a good network relative to its processing power, so it would be best to run on the IBM.

  Barry



> On Mar 1, 2016, at 2:03 AM, TAY wee-beng <zonexo at gmail.com> wrote:
> 
> 
> On 29/2/2016 11:21 AM, Barry Smith wrote:
>>> On Feb 28, 2016, at 8:26 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>> 
>>> 
>>> On 29/2/2016 9:41 AM, Barry Smith wrote:
>>>>> On Feb 28, 2016, at 7:08 PM, TAY Wee Beng <zonexo at gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I've attached the files for x cells running on y procs. hypre is called natively, so I'm not sure if PETSc catches it.
>>>>   So you are directly creating hypre matrices and calling the hypre solver in another piece of your code?
>>> Yes, because I'm using the simple structured (struct) grid layout for Cartesian grids. It's about twice as fast as BoomerAMG
>>    Understood
>> 
>>> . I can't create a PETSc matrix and use the hypre struct layout, right?
>>>>    In the PETSc part of the code, if you compare the 2x_y run to the x_y run, you see that doubling the problem size resulted in 2.2 times as much time for the KSPSolve. Most of this large increase is due to the increased time in the scatter, which went up by 150/54 = 2.78, while the amount of data transferred only increased by 1e5/6.4e4 = 1.5625. Normally I would not expect to see this behavior and would not expect such a large increase in the communication time.
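>>>> 
>>>>    Written out as a rough comparison (just restating the numbers above; the ideal time ratio when doubling the problem size on the same number of processes would be about 2):
>>>> 
>>>>      \frac{T_{KSPSolve}^{2x\_y}}{T_{KSPSolve}^{x\_y}} \approx 2.2, \qquad
>>>>      \frac{T_{scatter}^{2x\_y}}{T_{scatter}^{x\_y}} = \frac{150}{54} \approx 2.78, \qquad
>>>>      \frac{\text{avg msg len}^{2x\_y}}{\text{avg msg len}^{x\_y}} = \frac{1\times 10^{5}}{6.4\times 10^{4}} \approx 1.56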
>>>> 
>>>> Barry
>>>> 
>>>> 
>>>> 
>>> So ideally it should be 2 instead of 2.2, is that so?
>>   Ideally
>> 
>>> May I know where are you looking at? Because I can't find the nos.
>>   The column labeled "Avg len" gives the average message length, which increases from 6.4e4 to 1e5, while the max time increases by a factor of 2.77 (I took the sum of the VecScatterBegin and VecScatterEnd rows).
>> 
>>> So where do you think the error comes from?
>>   It is not really an error; it is just that it is taking more time than one would hope.
>>> Or how can I troubleshoot further?
>> 
>>    If you run the same problem several times, how different are the timings for each run?
> Hi,
> 
> I have re-run x_y and 2x_y and attached the files with _2 for the 2nd run. The timings are exactly the same.
> 
> Should I try running on another cluster?
> 
> I also tried running the same problem with more cells and more time steps (to reduce start-up effects) on another cluster, but I forgot to run it with -log_summary. Anyway, the results show:
> 
> 1. Using 1.5 million cells with 48 procs and 3 million with 96 procs took 65 min and 69 min. Using the weak scaling formula I attached earlier, that gives about 88% efficiency.
> 
> 2. Using 3 million cells with 48 procs and 6 million with 96 procs took 114 min and 121 min. Using the weak scaling formula I attached earlier, that gives about 88% efficiency.
> 
> 3. Using 3.75 million cells with 48 procs and 7.5 million with 96 procs took 134 min and 143 min. Using the weak scaling formula I attached earlier, that gives about 87% efficiency.
> 
> 4. Using 4.5 million cells with 48 procs and 9 million with 96 procs took 160 min and 176 min (extrapolated). Using the weak scaling formula I attached earlier, that gives about 80% efficiency.
> 
> So it seems that I should run with 3.75 million cells on 48 procs and scale along this ratio of cells per proc. Beyond that, my efficiency decreases. Is that so? Maybe I should also run with -log_summary to get a better estimate...
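> 
> As a rough check of the first case, the basic two-run weak-scaling ratio would be (the actual formula in the attached pdf extrapolates out to a larger processor count y and is not reproduced here, which is why it reports a lower number):
> 
>   E_{weak}(48 \to 96) \approx \frac{T(1.5\text{M cells},\,48\text{ procs})}{T(3\text{M cells},\,96\text{ procs})} = \frac{65}{69} \approx 0.94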
> 
> Thanks.
>> 
>> 
>>> Thanks
>>>>> Thanks
>>>>> 
>>>>> On 29/2/2016 1:11 AM, Barry Smith wrote:
>>>>>>   As I said before, send the -log_summary output for the two processor sizes and we'll look at where it is spending its time and how it could possibly be improved.
>>>>>> 
>>>>>>   Barry
>>>>>> 
>>>>>>> On Feb 28, 2016, at 10:29 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> On 27/2/2016 12:53 AM, Barry Smith wrote:
>>>>>>>>> On Feb 26, 2016, at 10:27 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 26/2/2016 11:32 PM, Barry Smith wrote:
>>>>>>>>>>> On Feb 26, 2016, at 9:28 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I have a 3D code. When I ran with 48 procs and 11 million cells, it ran for 83 min. When I ran with 96 procs and 22 million cells, it ran for 99 min.
>>>>>>>>>>    This is actually pretty good!
>>>>>>>>> But if I'm not wrong, as I increase the number of cells, the parallel efficiency will keep decreasing. I hope it scales up to maybe 300-400 procs.
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I think I may have mentioned this before: I need to submit a proposal to request computing nodes. In the proposal, I'm supposed to run some simulations to estimate the time it takes to run my code. Then an Excel file will use my input to estimate the efficiency when I run my code with more cells. They use 2 methods to estimate it:
>>>>>>> 
>>>>>>> 1. Strong scaling, whereby I run 2 cases - first with n cells and x procs, then with n cells and 2x procs. From there, they can estimate my expected efficiency when I have y procs. The formula is attached in the pdf.
>>>>>>> 
>>>>>>> 2. Weak scaling, whereby I run 2 cases - first with n cells and x procs, then with 2n cells and 2x procs. From there, they can estimate my expected efficiency when I have y procs. The formula is attached in the pdf.
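>>>>>>> 
>>>>>>> (The extrapolation formulas themselves are only in the attached pdf. For reference, the usual two-run definitions, which are presumably the starting point, are
>>>>>>> 
>>>>>>>   E_{strong} = \frac{T(n,\,x)}{2\,T(n,\,2x)}, \qquad E_{weak} = \frac{T(n,\,x)}{T(2n,\,2x)}
>>>>>>> 
>>>>>>> with the pdf then extrapolating from the x and 2x measurements out to y procs.)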
>>>>>>> 
>>>>>>> So if I use 48 and 96 procs and get maybe 80% efficiency, by the time I hit 800 procs, I get 32% efficiency for strong scaling. They expect at least 50% efficiency for my code. To reach that, I need to achieve 89% efficiency when I use 48 and 96 procs.
>>>>>>> 
>>>>>>> So now my question is: how accurate is this type of calculation, especially with respect to PETSc?
>>>>>>> 
>>>>>>> Similarly, for weak scaling, is it accurate?
>>>>>>> 
>>>>>>> Can I argue that this estimation is not suitable for PETSc or hypre?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> 
>>>>>>>>>>> So it does not scale that well. I want to find out which parts of the code I need to improve, and also whether PETSc and hypre are working well in parallel. What's the best way to do this?
>>>>>>>>>>   Run both with -log_summary and send the output for each case. This will show where the time is being spent and which parts are scaling less well.
>>>>>>>>>> 
>>>>>>>>>>    Barry
>>>>>>>>> That's only for the PETSc part, right? So for other parts of the code, including the hypre part, I will not be able to find out where the time goes. If so, what can I use to check these parts?
>>>>>>>>    You will still be able to see what percentage of the time is spent in hypre, and whether it increases with the problem size and by how much. So the information will still be useful.
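>>>>>>>> 
>>>>>>>>    One way to make the native hypre solve show up as its own line in the -log_summary output is to wrap it in a user-defined PETSc logging event. A minimal C sketch, assuming the hypre call sits behind a user routine my_hypre_struct_solve(); that routine and the names "HypreNative"/"HypreStructSolve" are just placeholders:
>>>>>>>> 
>>>>>>>>    #include <petscsys.h>
>>>>>>>> 
>>>>>>>>    static void my_hypre_struct_solve(void) { /* placeholder for the existing native hypre struct solve */ }
>>>>>>>> 
>>>>>>>>    int main(int argc, char **argv)
>>>>>>>>    {
>>>>>>>>      PetscLogEvent hypre_solve_event;
>>>>>>>>      PetscClassId  classid;
>>>>>>>> 
>>>>>>>>      PetscInitialize(&argc, &argv, NULL, NULL);            /* run the executable with -log_summary */
>>>>>>>>      PetscClassIdRegister("HypreNative", &classid);        /* class for user-defined events */
>>>>>>>>      PetscLogEventRegister("HypreStructSolve", classid, &hypre_solve_event);
>>>>>>>> 
>>>>>>>>      PetscLogEventBegin(hypre_solve_event, 0, 0, 0, 0);
>>>>>>>>      my_hypre_struct_solve();                              /* time spent here is reported under "HypreStructSolve" */
>>>>>>>>      PetscLogEventEnd(hypre_solve_event, 0, 0, 0, 0);
>>>>>>>> 
>>>>>>>>      PetscFinalize();
>>>>>>>>      return 0;
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    The same can be done with PetscLogStagePush()/PetscLogStagePop() if you would rather see the hypre time as a separate stage in the summary table.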
>>>>>>>> 
>>>>>>>>   Barry
>>>>>>>> 
>>>>>>>>>>> I thought of using a profiler, but since the code is compiled with optimization, I wonder if the profile would still be accurate.
>>>>>>>>>>> 
>>>>>>>>>>> -- 
>>>>>>>>>>> Thank you.
>>>>>>>>>>> 
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>> 
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>> 
>>>>>>> <temp.pdf>
>>>>> -- 
>>>>> Thank you
>>>>> 
>>>>> Yours sincerely,
>>>>> 
>>>>> TAY wee-beng
>>>>> 
>>>>> <2x_2y.txt><2x_y.txt><4x_2y.txt><x_y.txt>
> 
> <x_y.txt><2x_y_2.txt><x_y_2.txt><2x_y.txt>


