<div class="moz-cite-prefix">On 6/15/19 12:46 AM, Hapla Vaclav
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On 14 Jun 2019, at 21:53, Jakub Kruzik <<a
href="mailto:jakub.kruzik@vsb.cz" class=""
moz-do-not-send="true">jakub.kruzik@vsb.cz</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">The problem is that you need to write the
file with an optimal stripe count/size in the first
place. An unaware user who just uses something like cp
will end up with the default stripe count which is
usually 1.<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Sure. This is clear I guess. I should add that it can be
a bit challenging to "defeat" the linux page cache. E.g.
writing a file and reading it right away can result in
ridiculously high read rate as it is actually read from RAM
:-) <br>
</div>
</div>
</div>
</blockquote>

As far as I know, Lustre does not use the Linux page cache on the
server side. Since version 2.9 it has a server-side cache, but that is
supposed to be used for small files only. You can try lfs ladvise -a
dontneed <file>, but there is no guarantee that the file will be
cleared from the cache if it is already there. See
http://doc.lustre.org/lustre_manual.xhtml#idm140012896072288
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>
<div><br class="">
</div>
<div>What I'm doing to cope with both issues, I always</div>
<div>1) remove data.striped.h5</div>
<div>2) set the stripe settings to the
non-existing data.striped.h5, which creates
new data.striped.h5 with zero size</div>
<div>3) copy the file over from original data.h5 stored
somewhere else to that data.striped.h5</div>
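
(An aside, not something from the thread above: if the striped copy is
written by an MPI code instead of being staged with cp, a ROMIO-based
MPI-IO can usually be asked for the striping at creation time through
hints, avoiding the rm/setstripe/cp dance. A rough sketch only; the
stripe count of 24 and the 4 MiB stripe size are just placeholders:)

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    /* Lustre striping hints understood by ROMIO; values are examples only. */
    MPI_Info_set(info, "striping_factor", "24");     /* stripe count       */
    MPI_Info_set(info, "striping_unit", "4194304");  /* stripe size, bytes  */

    /* Striping is fixed at creation, so the hints only take effect with
     * MPI_MODE_CREATE on a file that does not exist yet. */
    MPI_File_open(MPI_COMM_WORLD, "data.striped.h5",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... write the data collectively ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}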
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
For large files, you should just set the stripe count to
the number of OSTs. Your results seem to support this.<br
class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Sure. Would be cool to have some clear limit for "large"
;-) But in these case it's definitely better to overshoot
the number of stripes rather than underestimate.</div>
</div>
</div>
</blockquote>

Agreed. I would say a large file is one big enough that you actually
care how fast you are reading it :)
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
For the small mesh and 64 nodes, you are reading just 2
MiB per process. I think that collective I/O should give
you a significant improvement.<br class="">
</div>
</div>
>
> OK, I'm giving it another shot now that the results with
> non-collective I/O look credible. I'm curious about that
> "significant" ;-)
>
> But even if you are right, it's kind of tricky to say when this
> toggle should be turned on, or even to decide it automatically in
> PETSc...
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>Note that the default number of aggregators is usually equal
to the number of OSTs (or stripe count?). I would try setting
cb_nodes to a multiple of the number of OSTs close to the number
of nodes used.<br>
</div>
</div>
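
In case it helps, here is a rough sketch of how such hints could be
passed down to HDF5 through the MPI-IO file access property list. This
is not how PETSc wires it up internally; the hint names are the
standard ROMIO ones and the value of 48 aggregators is only an example:

#include <hdf5.h>
#include <mpi.h>

hid_t open_with_cb_hints(const char *fname, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "enable");  /* force collective buffering */
    MPI_Info_set(info, "cb_nodes", "48");           /* number of aggregator nodes */

    /* Hand the hints to HDF5 via the MPI-IO file access property list. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}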
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
Also, it would be interesting to know what performance
you get from a single process reading from a single OST.
I think you should be able to get 0.5-2.5 GiB/s which is
what you are getting from 36 OSTs (~70 MiB/s per OST).<br
class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Wait, if you look at the table, it's a bit outdated
(before Atlanta), sorry for confusion. The new graphs on
slide 18 show the rate of approx. 10.5/3.5 = 3 GiB/s for the
128M mesh.</div>
<div><br class="">
</div>
<div>Here are graphs showing load time for 3 different stripe
counts and several different cpu counts.</div>
<div>128M elements: <a
href="https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY"
class="" moz-do-not-send="true">https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY</a></div>
<div>256M elements: <a
href="https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz"
class="" moz-do-not-send="true">https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz</a></div>
<div><br class="">
</div>
<div>For the 256M one I got up to ~4.5 GiB/s.</div>
<div><br class="">
</div>
<div>It's slowing down with growing number of cpus. I wonder
whether it could be further improved, but it's not a big
deal for now.</div>
<br class="">
</div>
</div>
</blockquote>

For 12k processes, you are trying to read less than 2 MiB per process,
and each OST has more than 340 clients. In this case, you should read
on a subset of processes and then distribute the data - effectively
what collective I/O should do, if the settings are correct.
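
A rough sketch of that read-on-a-subset pattern (not PETSc's actual
code path; the group size of 256 ranks per reader and the flat
byte-offset layout are placeholder assumptions):

#include <mpi.h>
#include <stdlib.h>

/* One rank out of every 256 reads its group's contiguous share of the
 * file and scatters it; all other ranks only receive. Error handling
 * and the real data decomposition are omitted. */
void read_on_subset(const char *fname, char *mybuf, int chunk /* bytes per rank */)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int group = 256;               /* ranks served by one reader      */
    int leader = rank - rank % group;    /* lowest global rank in my group  */
    int nlocal = (leader + group <= size) ? group : size - leader;

    /* Split into groups; the leader becomes rank 0 of each sub-communicator. */
    MPI_Comm gcomm;
    MPI_Comm_split(MPI_COMM_WORLD, leader, rank, &gcomm);

    char *filebuf = NULL;
    if (rank == leader) {
        filebuf = malloc((size_t)nlocal * chunk);
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_read_at(fh, (MPI_Offset)leader * chunk, filebuf,
                         nlocal * chunk, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

    /* Hand each rank in the group its own chunk. */
    MPI_Scatter(filebuf, chunk, MPI_BYTE, mybuf, chunk, MPI_BYTE, 0, gcomm);

    free(filebuf);
    MPI_Comm_free(&gcomm);
}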
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
BTW, since you also used Salomon for testing, I found
some old tests I did there with pure MPI I/O, and I was
able to get 18.5 GiB/s read for 1 GiB file on 108
processes / 54 nodes, 54 OSTs, 4 MiB stripe.<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
OK, but it's probably not a good time to try to reproduce
these just now. The current greeting message:</div>
<div><br class="">
</div>
<div>Planned Salomon /Scratch Maintanance From 2019-06-18 09:00
Till 2019-06-21 13:00<br class="">
(2019-06-11 08:58:35)
<br class="">
<br class="">
We plan to upgrade Lustre stack. We hope to resolve some
performance issues<br class="">
with SCRATCH.</div>
<div><br class="">
</div>
<div><br class="">
</div>
<div>Thanks,</div>
<div>Vaclav</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
Best,<br class="">
</div>
</div>
</blockquote>
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
Jakub<br class="">
<br class="">
<br class="">
On 6/14/19 12:31 PM, Hapla Vaclav via petsc-dev wrote:<br
class="">
<blockquote type="cite" class="">I take back one thing I
mentioned in my talk in Atlanta. I think I said that
Lustre striping does not really influence the read
performance. With my latest results in hand, I must
point out this is not true. I might have been confused
by some former Piz Daint Lustre performance issues
and/or HDF5 library issues I mentioned.<br class="">
<br class="">
Here are my latest slides from PASC19.<br class="">
<a
href="https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS"
class="" moz-do-not-send="true">https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS</a><br
class="">
<br class="">
On slide 18, there is some comparison for different
stripe settings. I can now see a speed-up of ~4 for 1
vs 12 stripes (which is actually the number of cores
per node) for the mesh with 128M elements. The times
are very similar for 8 and 64 computation nodes.<br
class="">
<br class="">
Toby, could you maybe forward this message to the
meeting attendees? I don't want to leave anybody
confused.<br class="">
<br class="">
Thanks,<br class="">
Vaclav<br class="">
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</blockquote>
</body>
</html>