<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

<br>

Mihael Hategan wrote:

<blockquote cite="mid:1256795077.20508.16.camel@localhost" type="cite">

  <pre wrap="">On Wed, 2009-10-28 at 23:11 -0500, Ioan Raicu wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">Mihael,

Did you figure out why I am seeing 8K and 12K active tasks, when we

only had 4K and 6K CPU cores?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Haven't tried.

  </pre>

  <blockquote type="cite">

    <pre wrap=""> Were there really 128K tasks in the workflow?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Nope. 64K.

  </pre>

</blockquote>

OK, it would be good to find out why we see double the number of tasks.
It must be my filtering of the Swift log. Here is my filtered log:<br>

<a class="moz-txt-link-freetext" href="http://www.ece.northwestern.edu/~iraicu/scratch/logs/dc-4000-active-completed.txt">http://www.ece.northwestern.edu/~iraicu/scratch/logs/dc-4000-active-completed.txt</a><br>

<br>

This filtered log was generated by:<br>

cat dc-4000.log | grep "JOB_SUBMISSION" | grep "TaskImpl" | grep

"Active" > dc-4000-active-completed.txt<br>

cat dc-4000.log | grep "JOB_SUBMISSION" | grep "TaskImpl" | grep

"Completed" >> dc-4000-active-completed.txt<br>

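<br>
One quick check on the filter itself, as a minimal sketch rather than a diagnosis: the Python below applies the same substring tests as the two grep pipelines and tallies "Active" and "Completed" lines separately. If every task contributes one line of each kind, the combined file would hold twice as many lines as there are tasks. The sample lines are hypothetical placeholders, not real Swift log entries, and the function name is my own.<br>
<pre wrap="">
```python
from collections import Counter


def count_states(lines):
    """Tally lines that pass the same substring filters as the grep pipelines."""
    counts = Counter()
    for line in lines:
        # Same first two filters as: grep "JOB_SUBMISSION" | grep "TaskImpl"
        if "JOB_SUBMISSION" not in line or "TaskImpl" not in line:
            continue
        # A line is counted once per status keyword it contains,
        # just as it would be written once per grep pass.
        for state in ("Active", "Completed"):
            if state in line:
                counts[state] += 1
    return counts


# Hypothetical sample lines; the real dc-4000.log format may differ.
sample = [
    "t=1 JOB_SUBMISSION TaskImpl task-1 Active",
    "t=2 JOB_SUBMISSION TaskImpl task-2 Active",
    "t=61 JOB_SUBMISSION TaskImpl task-1 Completed",
    "t=62 JOB_SUBMISSION TaskImpl task-2 Completed",
    "t=63 OTHER_EVENT TaskImpl task-3 Completed",
]
counts = count_states(sample)
print(counts["Active"], counts["Completed"], sum(counts.values()))
```
</pre>
Comparing the two per-state counts against the 64K figure would show whether the doubling comes from the filter or from somewhere else.<br>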
<blockquote cite="mid:1256795077.20508.16.camel@localhost" type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap=""> Just want to make sure the log conversion worked as expected.

Also, assuming there were really 128K tasks of 60 sec each, and 8K

CPUs, the ideal time to complete the run 4K would be 960 sec.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

That's one calculation that won't be bothered by doubling everything.

But no, there were 64k tasks.

  </pre>

</blockquote>

With 64K tasks of 60 sec each and 4K CPUs, the ideal time is the
same, 960 sec.<br>
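<br>
The arithmetic behind these ideal times, as a small sketch: total work (tasks times 60 sec) divided evenly over the CPUs. The function name is my own, and I am assuming "K" means 1024; using 1000 gives the same results because the factor cancels.<br>
<pre wrap="">
```python
K = 1024  # assumption: "K" means 1024; the factor cancels in the ratio


def ideal_time(num_tasks, task_sec, num_cpus):
    """Ideal run time in seconds: total CPU-seconds of work spread over all CPUs."""
    return num_tasks * task_sec / num_cpus


print(ideal_time(128 * K, 60, 8 * K))  # 960.0
print(ideal_time(64 * K, 60, 4 * K))   # 960.0
print(ideal_time(64 * K, 60, 6 * K))   # 640.0
```
</pre>
Doubling both the task count and the CPU count leaves the ideal time unchanged, which is why 960 sec holds either way.<br>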

<blockquote cite="mid:1256795077.20508.16.camel@localhost" type="cite">

  <pre wrap=""></pre>

  <blockquote type="cite">

    <pre wrap=""> Run4K ran in 1183 sec, giving us an end-to-end efficiency of 81%.

For the run6K, the ideal time was 640 sec, so with an actual time of

884, we got an end-to-end efficiency of 72%.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

It depends whether you count from the time the partition boots or from

the time swift starts. We could count the queue/partition boot time, but

that doesn't tell us much about swift. On the other hand, if we don't

there's still some submission happening during that time, so that

counts.

  </pre>

</blockquote>

I count from where the log starts. There are about 20 seconds of
inactivity at the beginning of the log, but at around 20 sec in one
log, and 24 sec in the other, 1 job is submitted and running. At
about 120 seconds into the run, the floodgates open and many jobs
are submitted and start running. So, should we count from time 0, 20,
or 120? I guess it's all about what you are trying to measure and show.
In all cases, I think the workers were provisioned; it's just a matter
of how much of the Swift overhead you want to take into account.<br>
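<br>
To see how much the choice of starting point matters, here is a rough sketch. It assumes the 1183 sec figure for run4K is measured from the start of the log, that the ideal time stays at 960 sec, and that the 0, 20, and 120 sec offsets all apply to that run. The function name is hypothetical.<br>
<pre wrap="">
```python
def efficiency_from_offset(ideal_sec, actual_sec, start_offset_sec):
    """End-to-end efficiency when the clock starts start_offset_sec into the log."""
    return ideal_sec / (actual_sec - start_offset_sec)


# Offsets discussed above: log start, first job running, floodgates open.
for offset in (0, 20, 120):
    eff = efficiency_from_offset(960, 1183, offset)
    print(f"count from t={offset:3d} sec: {eff:.1%}")
```
</pre>
The later the clock starts, the less Swift startup overhead is charged to the run, so the reported efficiency goes up.<br>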

<blockquote cite="mid:1256795077.20508.16.camel@localhost" type="cite">

  <pre wrap="">

The numbers for Falkon, were the workers started already?

  </pre>

</blockquote>

Yes, the workers were already provisioned in that case. <br>

<blockquote cite="mid:1256795077.20508.16.camel@localhost" type="cite">

  <pre wrap=""></pre>

  <blockquote type="cite">

    <pre wrap="">Not quite the 90%+ efficiencies when looking at a per task level, but

still quite good! 

    </pre>

  </blockquote>

  <pre wrap=""><!---->

I'm not quite sure what's happening. Maybe I wasn't clear. Though I was.

Is there some misunderstanding here about the different things being

measured and how?

  </pre>

</blockquote>

No. The real way to compute efficiency is to compare the end-to-end time
of the real run with that of the ideal run. The other efficiency I
sometimes quote is the per-task efficiency, where you take the
average real run time of all tasks and compare it to the ideal time of
a task. This second measure is usually optimistic, but it
lets us compare efficiency across runs that
might be too difficult to compare using the traditional efficiency
metric.<br>
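<br>
The two metrics side by side, as a minimal sketch. The end-to-end numbers are the ones from this thread; the per-task durations are hypothetical values made up for illustration, and the function names are my own.<br>
<pre wrap="">
```python
def end_to_end_efficiency(ideal_run_sec, actual_run_sec):
    """Ideal end-to-end time of the run divided by the measured end-to-end time."""
    return ideal_run_sec / actual_run_sec


def per_task_efficiency(ideal_task_sec, task_durations_sec):
    """Ideal task time divided by the mean measured run time over all tasks."""
    mean_sec = sum(task_durations_sec) / len(task_durations_sec)
    return ideal_task_sec / mean_sec


print(round(end_to_end_efficiency(960, 1183), 2))  # run4K: 0.81
print(round(end_to_end_efficiency(640, 884), 2))   # run6K: 0.72

# Hypothetical measured durations for 60 sec tasks, for illustration only.
durations = [62.0, 65.0, 63.0, 66.0]
print(round(per_task_efficiency(60, durations), 2))  # 0.94
```
</pre>
The per-task figure ignores ramp-up and idle CPUs, which is one reason it tends to come out higher than the end-to-end figure.<br>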

<br>

Ioan<br>


<br>

<pre class="moz-signature" cols="72">-- 

=================================================================

Ioan Raicu, Ph.D.

NSF/CRA Computing Innovation Fellow

=================================================================

Center for Ultra-scale Computing and Information Security (CUCIS)

Department of Electrical Engineering and Computer Science

Northwestern University

2145 Sheridan Rd, Tech M384 

Evanston, IL 60208-3118

=================================================================

Cel:   1-847-722-0876

Tel:   1-847-491-8163

Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@eecs.northwestern.edu">iraicu@eecs.northwestern.edu</a>

Web:   <a class="moz-txt-link-freetext" href="http://www.eecs.northwestern.edu/~iraicu/">http://www.eecs.northwestern.edu/~iraicu/</a>

       <a class="moz-txt-link-freetext" href="https://wiki.cucis.eecs.northwestern.edu/">https://wiki.cucis.eecs.northwestern.edu/</a>

=================================================================


</pre>

</body>

</html>