[petsc-dev] PETSc Meeting errata
Jakub Kruzik
jakub.kruzik at vsb.cz
Sat Jun 15 04:53:47 CDT 2019
On 6/15/19 12:46 AM, Hapla Vaclav wrote:
>
>
>> On 14 Jun 2019, at 21:53, Jakub Kruzik <jakub.kruzik at vsb.cz> wrote:
>>
>> The problem is that you need to write the file with an optimal stripe
>> count/size in the first place. An unaware user who just uses
>> something like cp will end up with the default stripe count, which is
>> usually 1.
>
> Sure. This is clear, I guess. I should add that it can be a bit
> challenging to "defeat" the Linux page cache. E.g., writing a file and
> reading it right away can result in a ridiculously high read rate, as it
> is actually read from RAM :-)
As far as I know, Lustre does not use the Linux page cache on the
server side. Since version 2.9 it has a server-side cache, but that is
supposed to be used for small files only. You can try lfs ladvise
-a dontneed <file>, but there is no guarantee that the file, if it is in
the cache, will actually be evicted. See
http://doc.lustre.org/lustre_manual.xhtml#idm140012896072288
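
On the client side, one cheap thing to try before a read benchmark is
posix_fadvise with POSIX_FADV_DONTNEED on the file; it is only advisory,
and I am not sure how much the Lustre client honors it. A minimal sketch:

/* Advisory request to drop a file's cached pages on the local node.
   The kernel (and the Lustre client) may ignore it. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int drop_file_cache(const char *path)
{
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return -1; }
  /* offset 0 and length 0 cover the whole file */
  int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
  if (err) fprintf(stderr, "posix_fadvise: error %d\n", err);
  close(fd);
  return err;
}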
>
> What I do to cope with both issues is I always
> 1) remove data.striped.h5
> 2) apply the stripe settings to the non-existing data.striped.h5, which
> creates a new, zero-size data.striped.h5
> 3) copy the original data.h5, stored somewhere else, over to that
> data.striped.h5
>
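
Another option, at least when the file is produced by an MPI code in the
first place, is to request the striping at creation time through the
MPI-IO hints striping_factor and striping_unit; ROMIO's Lustre driver
should apply them when it creates the file. A rough sketch (36 stripes
and a 4 MiB stripe size below are just placeholders, not a
recommendation):

/* Sketch: request Lustre striping when creating a file through MPI-IO.
   striping_factor = stripe count, striping_unit = stripe size in bytes. */
#include <mpi.h>

MPI_File create_striped(MPI_Comm comm, const char *path)
{
  MPI_File fh;
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "striping_factor", "36");    /* placeholder count */
  MPI_Info_set(info, "striping_unit", "4194304"); /* placeholder size  */
  MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
  MPI_Info_free(&info);
  return fh;
}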
>>
>> For large files, you should just set the stripe count to the number
>> of OSTs. Your results seem to support this.
>
> Sure. Would be cool to have some clear limit for "large" ;-) But in
> this case it's definitely better to overshoot the number of stripes
> than to underestimate it.
Agreed. I would say a large file is one big enough that you actually care
how fast you are reading it :)
>
>>
>> For the small mesh and 64 nodes, you are reading just 2 MiB per
>> process. I think that collective I/O should give you a significant
>> improvement.
>
> OK, I'm giving it another shot now that the results with
> non-collective I/O look credible. I'm curious about that "significant" ;-)
>
> But even if you are right, it's kind of tricky to say when this toggle
> should be turned on, or even to decide it automatically in PETSc...
Note that the default number of aggregators is usually equal to the
number of OSTs (or the stripe count?). I would try setting cb_nodes to a
multiple of the number of OSTs that is close to the number of nodes used.
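
For reference, one way to pass such hints to MPI-IO when the file is read
through HDF5 is the MPI_Info argument of H5Pset_fapl_mpio. A rough sketch;
cb_nodes = 144 is just a placeholder (e.g. 4 x 36 OSTs), not a
recommendation:

/* Sketch: attach ROMIO collective-buffering hints to an HDF5 file open.
   cb_nodes = number of aggregator processes. */
#include <mpi.h>
#include <hdf5.h>

hid_t open_with_hints(MPI_Comm comm, const char *fname)
{
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_read", "enable"); /* force collective buffering on reads */
  MPI_Info_set(info, "cb_nodes", "144");         /* placeholder aggregator count */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, info);
  hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);
  H5Pclose(fapl);
  MPI_Info_free(&info);
  return file;
}

If I remember correctly, ROMIO can also pick hints up from a file named by
the ROMIO_HINTS environment variable (Cray MPICH uses MPICH_MPIIO_HINTS
instead), which avoids touching the code at all.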
>>
>> Also, it would be interesting to know what performance you get from a
>> single process reading from a single OST. I think you should be able
>> to get 0.5-2.5 GiB/s, which is what you are getting in total from 36
>> OSTs (~70 MiB/s per OST).
>
> Wait, if you look at the table, it's a bit outdated (before Atlanta),
> sorry for the confusion. The new graphs on slide 18 show a rate of
> approx. 10.5/3.5 = 3 GiB/s for the 128M mesh.
>
> Here are graphs showing load time for 3 different stripe counts and
> several different cpu counts.
> 128M elements: https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY
> 256M elements: https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz
>
> For the 256M one I got up to ~4.5 GiB/s.
>
> It slows down with a growing number of CPUs. I wonder whether it
> could be further improved, but it's not a big deal for now.
>
For 12k processes, each process is trying to read less than 2 MiB, and
each OST has more than 340 clients. In this case, you should read on a
subset of processes and then distribute the data - which is effectively
what collective I/O should do, if the settings are correct.
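
Just to illustrate the pattern, a crude sketch below; it assumes a process
count divisible by the number of readers and the same contiguous chunk
size on every rank, which real code of course would not:

/* Sketch: only every g-th rank touches the file system and then scatters
   the data to the ranks in its group. */
#include <mpi.h>
#include <stdlib.h>

void read_on_subset(MPI_Comm comm, const char *path, int nreaders,
                    char *localbuf, int localsize)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  int g = size / nreaders;   /* consumers per reader              */
  int color = rank / g;      /* reader group this rank belongs to */
  MPI_Comm group;
  MPI_Comm_split(comm, color, rank, &group); /* reader becomes rank 0 */

  char *chunk = NULL;
  if (rank % g == 0) {       /* this rank reads for its whole group */
    MPI_File fh;
    chunk = malloc((size_t)localsize * g);
    MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_read_at(fh, (MPI_Offset)color * g * localsize, chunk,
                     localsize * g, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
  }
  /* hand every consumer its piece */
  MPI_Scatter(chunk, localsize, MPI_BYTE, localbuf, localsize, MPI_BYTE,
              0, group);
  free(chunk);
  MPI_Comm_free(&group);
}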
>>
>> BTW, since you also used Salomon for testing, I found some old tests
>> I did there with pure MPI I/O, and I was able to get an 18.5 GiB/s read
>> of a 1 GiB file on 108 processes / 54 nodes, 54 OSTs, 4 MiB stripe.
>
> OK, but it's probably not a good time to try to reproduce these just
> now. The current greeting message:
>
> Planned Salomon /Scratch Maintanance From 2019-06-18 09:00 Till
> 2019-06-21 13:00
> (2019-06-11 08:58:35)
>
> We plan to upgrade Lustre stack. We hope to resolve some performance
> issues
> with SCRATCH.
>
>
> Thanks,
> Vaclav
>
>>
>> Best,
>>
>> Jakub
>>
>>
>> On 6/14/19 12:31 PM, Hapla Vaclav via petsc-dev wrote:
>>> I take back one thing I mentioned in my talk in Atlanta. I think I
>>> said that Lustre striping does not really influence the read
>>> performance. With my latest results in hand, I must point out this
>>> is not true. I might have been confused by some former Piz Daint
>>> Lustre performance issues and/or HDF5 library issues I mentioned.
>>>
>>> Here are my latest slides from PASC19.
>>> https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS
>>>
>>> On slide 18, there is a comparison of different stripe settings.
>>> I can now see a speed-up of ~4x between 1 and 12 stripes (12 being
>>> the number of cores per node) for the mesh with 128M elements. The
>>> times are very similar for 8 and 64 compute nodes.
>>>
>>> Toby, could you maybe forward this message to the meeting attendees?
>>> I don't want to leave anybody confused.
>>>
>>> Thanks,
>>> Vaclav
>