[petsc-dev] PETSc Meeting errata

Jakub Kruzik jakub.kruzik at vsb.cz
Sat Jun 15 04:53:47 CDT 2019


On 6/15/19 12:46 AM, Hapla Vaclav wrote:
>
>
>> On 14 Jun 2019, at 21:53, Jakub Kruzik <jakub.kruzik at vsb.cz> wrote:
>>
>> The problem is that you need to write the file with an optimal stripe 
>> count/size in the first place. An unaware user who just uses 
>> something like cp will end up with the default stripe count, which is 
>> usually 1.
>
> Sure. This is clear I guess. I should add that it can be a bit 
> challenging to "defeat" the Linux page cache. E.g. writing a file and 
> reading it right away can result in a ridiculously high read rate, as 
> it is actually read from RAM :-)
As far as I know, Lustre does not use the Linux page cache on the 
server side. Since version 2.9 it has a server-side cache, but that is 
supposed to be used for small files only. You can try lfs ladvise 
-a dontneed <file>, but there is no guarantee that the file will be 
cleared from the cache if it happens to be there. See 
http://doc.lustre.org/lustre_manual.xhtml#idm140012896072288
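
On the client side, if you just want the timing not to be polluted by the 
page cache of the node that wrote the file, something like the following 
may help. This is only a minimal sketch: posix_fadvise is purely advisory 
and I am not sure how consistently the Lustre client honours it, so treat 
it as a best-effort complement to lfs ladvise, not a guarantee.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Ask the kernel to drop the client-side page cache for one file before
   timing a read. This is advisory only; nothing guarantees eviction. */
int drop_client_cache(const char *path)
{
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return -1; }
  /* offset 0 with length 0 means "the whole file" */
  int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
  if (err) fprintf(stderr, "posix_fadvise: %d\n", err);
  close(fd);
  return err;
}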
>
> What I do to cope with both issues is the following. I always
> 1) remove data.striped.h5
> 2) set the stripe settings on the non-existing data.striped.h5, which 
> creates a new data.striped.h5 with zero size
> 3) copy the file over from the original data.h5, stored somewhere else, 
> to that data.striped.h5
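
For step 2 (presumably lfs setstripe on the not-yet-existing file), the same 
thing can also be done directly from C with liblustreapi. A minimal sketch, 
with placeholder values for the stripe size and count; check the lustreapi 
header on your system for the exact signature and link with -llustreapi:

#include <lustre/lustreapi.h>
#include <stdio.h>

/* Pre-create an empty file with an explicit layout, like
   "lfs setstripe -c <count> -S 4M <path>" on a non-existing file. */
int create_striped(const char *path, int stripe_count)
{
  /* stripe size 4 MiB, start OST -1 = let Lustre choose, pattern 0 = RAID0 */
  int rc = llapi_file_create(path, 4ULL << 20, -1, stripe_count, 0);
  if (rc)
    fprintf(stderr, "llapi_file_create(%s) failed: %d\n", path, rc);
  return rc;
}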
>
>>
>> For large files, you should just set the stripe count to the number 
>> of OSTs. Your results seem to support this.
>
> Sure. Would be cool to have some clear limit for "large" ;-) But in 
> this case it's definitely better to overshoot the number of stripes 
> rather than underestimate it.
Agreed. I would say a large file is of a size where you actually care 
how fast you are reading :)
>
>>
>> For the small mesh and 64 nodes, you are reading just 2 MiB per 
>> process. I think that collective I/O should give you a significant 
>> improvement.
>
> OK, I'm giving it another shot now that the results with 
> non-collective I/O look credible. I'm curious about that "significant" ;-)
>
> But even if you are right, it's kind of tricky to say when this toggle 
> should be turned on, or even to decide it automatically in PETSc...
Note that the default number of aggregators is usually equal to the 
number of OSTs (or the stripe count?). I would try setting cb_nodes to a 
multiple of the number of OSTs that is close to the number of nodes used.
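
If you want to play with that without a separate hints file, here is a 
minimal sketch of passing the ROMIO hints through the HDF5 file access 
property list (assuming parallel HDF5). The hint values, enabling 
collective buffering on reads and setting 144 aggregators (e.g. 4 per OST 
for 36 OSTs), are just examples to tune, not recommendations:

#include <hdf5.h>
#include <mpi.h>

/* Build an HDF5 file access property list that carries ROMIO hints
   controlling collective buffering and the number of aggregators. */
hid_t fapl_with_hints(MPI_Comm comm)
{
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_read", "enable"); /* force collective buffering on reads */
  MPI_Info_set(info, "cb_nodes", "144");         /* number of aggregator nodes */

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, info); /* HDF5 keeps its own copy of the info */
  MPI_Info_free(&info);
  return fapl;
}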
>>
>> Also, it would be interesting to know what performance you get from a 
>> single process reading from a single OST. I think you should be able 
>> to get 0.5-2.5 GiB/s, which is what you are getting in total from 36 
>> OSTs (~70 MiB/s per OST).
>
> Wait, if you look at the table, it is a bit outdated (from before 
> Atlanta), sorry for the confusion. The new graphs on slide 18 show a 
> rate of approx. 10.5/3.5 = 3 GiB/s for the 128M mesh.
>
> Here are graphs showing load time for 3 different stripe counts and 
> several different cpu counts.
> 128M elements: https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY
> 256M elements: https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz
>
> For the 256M one I got up to ~4.5 GiB/s.
>
> It's slowing down with a growing number of CPUs. I wonder whether it 
> could be further improved, but it's not a big deal for now.
>
For 12k processes, you are trying to read less than 2 MiB per process, 
and each OST has more than 340 clients. In this case, you should read 
on a subset of processes and then distribute the data, which is 
effectively what collective I/O should do if the settings are correct.
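
Just to illustrate what I mean, here is a minimal sketch of the hand-rolled 
variant, assuming each rank owns a contiguous, equally sized block of 
doubles in the file; fname, n and group_size are placeholders. One rank per 
group of group_size ranks reads the whole group's block and scatters it:

#include <mpi.h>
#include <stdlib.h>

/* Read 'n' doubles per rank, but let only one rank per group of
   'group_size' ranks touch the file; scatter the rest within the group. */
void read_on_subset(const char *fname, double *local, int n, int group_size)
{
  int rank, grank, gsize, color;
  MPI_Comm group;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  color = rank / group_size;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);
  MPI_Comm_rank(group, &grank);
  MPI_Comm_size(group, &gsize);

  double *buf = NULL;
  if (grank == 0) {
    buf = malloc((size_t)gsize * n * sizeof(double));
    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)color * group_size * n * sizeof(double);
    MPI_File_read_at(fh, off, buf, gsize * n, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
  }
  MPI_Scatter(buf, n, MPI_DOUBLE, local, n, MPI_DOUBLE, 0, group);
  free(buf);
  MPI_Comm_free(&group);
}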

>>
>> BTW, since you also used Salomon for testing, I found some old tests 
>> I did there with pure MPI I/O, and I was able to get an 18.5 GiB/s read 
>> for a 1 GiB file on 108 processes / 54 nodes, 54 OSTs, 4 MiB stripe.
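
For context, by "pure MPI I/O" I mean essentially the following pattern. 
This is a minimal sketch, not the exact benchmark; fname and count are 
placeholders, and every rank reads a same-sized contiguous chunk with one 
collective call:

#include <mpi.h>

/* Each rank reads its own contiguous chunk of the file with a single
   collective MPI-IO call. */
void collective_read(const char *fname, double *buf, int count)
{
  int rank;
  MPI_File fh;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
  MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);
  MPI_File_read_at_all(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);
}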
>
> OK, but it's probably not a good time to try to reproduce these just 
> now. The current greeting message:
>
> Planned Salomon /Scratch Maintenance From 2019-06-18 09:00 Till 
> 2019-06-21 13:00
>                             (2019-06-11 08:58:35)
>
> We plan to upgrade Lustre stack. We hope to resolve some performance 
> issues
> with SCRATCH.
>
>
> Thanks,
> Vaclav
>
>>
>> Best,
>>
>> Jakub
>>
>>
>> On 6/14/19 12:31 PM, Hapla Vaclav via petsc-dev wrote:
>>> I take back one thing I mentioned in my talk in Atlanta. I think I 
>>> said that Lustre striping does not really influence the read 
>>> performance. With my latest results in hand, I must point out this 
>>> is not true. I might have been confused by some former Piz Daint 
>>> Lustre performance issues and/or HDF5 library issues I mentioned.
>>>
>>> Here are my latest slides from PASC19.
>>> https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS
>>>
>>> On slide 18, there is some comparison for different stripe settings. 
>>> I can now see a speed-up of ~4 for 1 vs 12 stripes (which is 
>>> actually the number of cores per node) for the mesh with 128M 
>>> elements. The times are very similar for 8 and 64 computation nodes.
>>>
>>> Toby, could you maybe forward this message to the meeting attendees? 
>>> I don't want to leave anybody confused.
>>>
>>> Thanks,
>>> Vaclav
>