<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
<br>
Mihael Hategan wrote:
<blockquote cite="mid:1256597730.10196.49.camel@localhost" type="cite">
<pre wrap="">On Mon, 2009-10-26 at 16:36 -0500, Ioan Raicu wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">
</pre>
</blockquote>
<pre wrap="">Here were our experiences with running scripts from GPFS. The #s below
represents the throughput for invoking scripts (a bash script that
invoked a sleep 0) from GPFS on 4 workers, 256 workers, and 2048
workers.
Number of Processors    Invoke script throughput (ops/sec)
                   4                                125.214
                 256                               109.3272
                2048                               823.0374
</pre>
</blockquote>
<pre wrap=""><!---->
Looks right. What I saw was that things were getting shitty at around
10000 cores. Lower if info writing, directory making, and file copying
were involved.
</pre>
</blockquote>
Right.<br>
<blockquote cite="mid:1256597730.10196.49.camel@localhost" type="cite">
<pre wrap="">
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">[...]
</pre>
</blockquote>
<pre wrap="">In our experience with Falkon, the limit came much sooner than 64K. In
Falkon, using the C worker code (which runs on the BG/P), each worker
consumes 2 TCP/IP connections to the Falkon service.
</pre>
</blockquote>
<pre wrap=""><!---->
Well, the coaster workers use only one connection.
</pre>
</blockquote>
Is that 1 connection per core, or per node? Zhao tried to reduce it to 1
connection per node, but the worker was not stable, so we left it alone
in the interest of time. The last time I looked at it, the workers used
2 connections per core, or 8 connections per node. Quite inefficient at
scale, but not an issue given that each service only handles 256 cores.
<br>
<blockquote cite="mid:1256597730.10196.49.camel@localhost" type="cite">
<pre wrap="">
</pre>
<blockquote type="cite">
<pre wrap=""> In the centralized Falkon service version, this racks up connections
pretty quick. I don't recall at exactly what point we started having
issues, but it was somewhere in the range of 10K~20K CPU cores.
Essentially, we could establish all the connections (20K~40K TCP
connections), but when the experiment would actually start, and data
needed to flow over these connections, all sort of weird stuff started
happening, TCP connection would get reset, workers were failing (e.g.
their TCP connection was being severed and not being re-established),
etc. I want to say that 8K (maybe 16K) cores was the largest tests we
made on the BG/P with a centralized Falkon service, that were stable
and successful.
</pre>
</blockquote>
<pre wrap=""><!---->
Possible. I haven't properly tested above 12k workers. I was just
mentioning a theoretical limitation that doesn't seem possible to beat
without having things distributed.
[...]
</pre>
<blockquote type="cite">
<pre wrap="">For the BG/P specifically, I think the distribution of the Falkon
service to the I/O nodes gave us a low maintanance, robust, and
scalable solution!
</pre>
</blockquote>
<pre wrap=""><!---->
Lower than if you only had to run one service on the head node?
</pre>
</blockquote>
Yes, in fact it was for Falkon. If we ran Falkon on the head node, the
user would have to start it manually, on an available port, and then
shut it down when finished. Running things on the I/O nodes was tougher
at the beginning, but once we got it all configured and running, it was
great! The Falkon service starts up at I/O node boot time, on a
specific port (no need to check if it's available, as the I/O node is
dedicated to the user), all compute nodes can easily find their
respective I/O nodes at the same location (some 192.xxx private
address), and when the run is over, the I/O nodes terminate and the
services stop on their own. At least for Falkon, it really made the
difference between a turn-key solution that always works and one that
would require constant tinkering (starting and stopping) and
configuration (e.g. ports).<br>
<br>
Again, the downside of the distributed approach was the overhead of
implementing and testing it, and also the load balancing, which
required a bit of fine tuning in Swift to get just right.<br>
<br>
Ioan<br>
<blockquote cite="mid:1256597730.10196.49.camel@localhost" type="cite">
<pre wrap="">
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
=================================================================
Ioan Raicu, Ph.D.
NSF/CRA Computing Innovation Fellow
=================================================================
Center for Ultra-scale Computing and Information Security (CUCIS)
Department of Electrical Engineering and Computer Science
Northwestern University
2145 Sheridan Rd, Tech M384
Evanston, IL 60208-3118
=================================================================
Cel: 1-847-722-0876
Tel: 1-847-491-8163
Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@eecs.northwestern.edu">iraicu@eecs.northwestern.edu</a>
Web: <a class="moz-txt-link-freetext" href="http://www.eecs.northwestern.edu/~iraicu/">http://www.eecs.northwestern.edu/~iraicu/</a>
<a class="moz-txt-link-freetext" href="https://wiki.cucis.eecs.northwestern.edu/">https://wiki.cucis.eecs.northwestern.edu/</a>
=================================================================
=================================================================
</pre>
</body>
</html>