[petsc-dev] PETSc Meeting errata

Hapla Vaclav vaclav.hapla at erdw.ethz.ch
Fri Jun 14 17:46:35 CDT 2019

On 14 Jun 2019, at 21:53, Jakub Kruzik <jakub.kruzik at vsb.cz> wrote:

The problem is that you need to write the file with an optimal stripe count/size in the first place. An unaware user who just uses something like cp will end up with the default stripe count, which is usually 1.

Sure. This is clear, I guess. I should add that it can be a bit challenging to "defeat" the Linux page cache. E.g. writing a file and reading it right away can result in a ridiculously high read rate, as it is actually read from RAM :-)
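
One possible way to keep the page cache out of a read benchmark, as a minimal Python sketch: flush the file and ask the kernel to drop its cached pages with posix_fadvise(POSIX_FADV_DONTNEED). This is an assumption, not what was used for the measurements in this thread. The call is only advisory, and whether a Lustre client honours it would have to be checked on the target system. The name drop_file_cache is made up for illustration.

```python
import os

def drop_file_cache(path):
    """Ask the kernel to evict cached pages of `path` (advisory, Unix only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages cannot be dropped, so flush them to storage first.
        os.fsync(fd)
        # offset=0 and length=0 together mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Calling drop_file_cache("data.striped.h5") on every node that touched the file, before timing the read, would be the intended use.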

To cope with both issues, I always
1) remove data.striped.h5
2) apply the stripe settings to the not-yet-existing data.striped.h5, which creates a new data.striped.h5 with zero size
3) copy the file over from the original data.h5, stored somewhere else, to that data.striped.h5
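
The three steps can be sketched in Python roughly as follows. This is a sketch under assumptions, not the script actually used here: it shells out to the standard `lfs setstripe` tool with `-c` (stripe count) and `-S` (stripe size), the function names are hypothetical, and the 4M default stripe size is only illustrative.

```python
import os
import shutil
import subprocess

def setstripe_command(path, stripe_count, stripe_size="4M"):
    """Build the `lfs setstripe` command line for `path`."""
    return ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, path]

def restripe_copy(src, dst, stripe_count, stripe_size="4M"):
    """Recreate `dst` with the requested striping and fill it from `src`."""
    # 1) remove the old striped copy, if any
    if os.path.exists(dst):
        os.remove(dst)
    # 2) create an empty file that already carries the stripe layout
    subprocess.run(setstripe_command(dst, stripe_count, stripe_size), check=True)
    # 3) copy the data; the existing empty file is opened and overwritten
    #    in place, so it keeps its layout
    shutil.copyfile(src, dst)
```

For example, restripe_copy("/somewhere/else/data.h5", "data.striped.h5", 12) would request 12 stripes (the source path here is a placeholder). Passing -1 as the stripe count asks Lustre to stripe over all available OSTs.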

For large files, you should just set the stripe count to the number of OSTs. Your results seem to support this.

Sure. It would be cool to have some clear limit for "large" ;-) But in this case it's definitely better to overshoot the number of stripes rather than underestimate it.

For the small mesh and 64 nodes, you are reading just 2 MiB per process. I think that collective I/O should give you a significant improvement.

OK, I'm giving it another shot now that the results with non-collective I/O look credible. I'm curious about that "significant" ;-)

But even if you are right, it's kind of tricky to say when this toggle should be turned on, or even to decide it automatically in PETSc...

Also, it would be interesting to know what performance you get from a single process reading from a single OST. I think you should be able to get 0.5-2.5 GiB/s which is what you are getting from 36 OSTs (~70 MiB/s per OST).

Wait, if you look at the table, it's a bit outdated (from before Atlanta), sorry for the confusion. The new graphs on slide 18 show a rate of approx. 10.5/3.5 = 3 GiB/s for the 128M mesh.
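
The rates quoted above are plain size-over-time arithmetic. A small Python sketch, assuming that 10.5 is the file size in GiB and 3.5 the load time in seconds (the message does not spell out the units), and using hypothetical helper names:

```python
def throughput_gib_per_s(size_gib, seconds):
    """Aggregate read rate in GiB/s."""
    return size_gib / seconds

def per_ost_mib_per_s(total_gib_per_s, n_osts):
    """Average rate each OST delivers, in MiB/s."""
    return total_gib_per_s * 1024 / n_osts

# 128M mesh: assumed 10.5 GiB read in 3.5 s
rate = throughput_gib_per_s(10.5, 3.5)
print(f"{rate:.1f} GiB/s")                                 # 3.0 GiB/s

# upper end of the older table: 2.5 GiB/s spread over 36 OSTs
print(f"{per_ost_mib_per_s(2.5, 36):.0f} MiB/s per OST")   # 71 MiB/s per OST
```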

Here are graphs showing the load time for 3 different stripe counts and several different CPU counts.
128M elements: https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY
256M elements: https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz

For the 256M one I got up to ~4.5 GiB/s.

It slows down as the number of CPUs grows. I wonder whether it could be further improved, but it's not a big deal for now.

BTW, since you also used Salomon for testing, I found some old tests I did there with pure MPI I/O, and I was able to get an 18.5 GiB/s read rate for a 1 GiB file on 108 processes / 54 nodes, 54 OSTs, 4 MiB stripe.

OK, but it's probably not a good time to try to reproduce these just now. The current greeting message:

Planned Salomon /Scratch Maintanance From 2019-06-18 09:00 Till 2019-06-21 13:00
                            (2019-06-11 08:58:35)

We plan to upgrade Lustre stack. We hope to resolve some performance issues




On 6/14/19 12:31 PM, Hapla Vaclav via petsc-dev wrote:
I take back one thing I mentioned in my talk in Atlanta. I think I said that Lustre striping does not really influence the read performance. With my latest results in hand, I must point out this is not true. I might have been confused by some former Piz Daint Lustre performance issues and/or HDF5 library issues I mentioned.

Here are my latest slides from PASC19.

On slide 18, there is a comparison of different stripe settings. I can now see a speed-up of ~4 for 12 stripes vs 1 (12 is actually the number of cores per node) for the mesh with 128M elements. The times are very similar for 8 and 64 compute nodes.

Toby, could you maybe forward this message to the meeting attendees? I don't want to leave anybody confused.

