[Swift-devel] Swift and BGP plots

Ioan Raicu iraicu at cs.uchicago.edu
Mon Oct 26 16:36:00 CDT 2009


Hi Mihael,
This is interesting stuff!

Here is what I understood from the following figures (given the summary 
of the execute2 tab):

Mihael Hategan wrote:
> 16k jobs, scratch on GPFS:
> http://www.mcs.anl.gov/~hategan/report-bgp-plain/
>   
Shortest event (s): 17.1009998321533
Longest event (s): 521.828999996185
Mean event duration (s): 358.316546020797
Standard deviation of event duration (s): 109.087010337575
> 16k jobs, scratch on compute node:
> http://www.mcs.anl.gov/~hategan/report-bgp-scratch/
>   
Shortest event (s): 11.606999874115
Longest event (s): 376.588999986649
Mean event duration (s): 176.960421203068
Standard deviation of event duration (s): 75.0991380521202
> 16k jobs, scratch on compute node, status through provider:
> http://www.mcs.anl.gov/~hategan/report-bgp-scratch-provider/
>   
Shortest event (s): 11.8900001049042
Longest event (s): 223.809000015259
Mean event duration (s): 135.653097596136
Standard deviation of event duration (s): 62.1117594571245

For all of the above stats, I don't understand why the minimum time is 
10 to 20 seconds when the jobs are sleep 60. Or perhaps you were doing 
sleep 0 here?
> 64k jobs, 4000 workers:
> http://www.mcs.anl.gov/~hategan/report-dc-4000/
>   
Shortest event (s): 106.119999885559
Longest event (s): 1246.60699987411
Mean event duration (s): 334.987874176266
Standard deviation of event duration (s): 290.212811366649

Efficiency: 18%?
> 64k jobs, 6000 workers:
> http://www.mcs.anl.gov/~hategan/report-dc-6000/
>   
Shortest event (s): 108.671000003815
Longest event (s): 940.963999986649
Mean event duration (s): 255.069875579873
Standard deviation of event duration (s): 130.231747714145

Efficiency: 23.5%?

I assume this is all with "sleep 60" jobs, right?
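
For reference, here is a rough sketch of how I arrived at the efficiency 
figures for the two 64k-job runs. It assumes the jobs really are sleep 60 
(not confirmed) and that efficiency is simply the nominal task length 
divided by the mean event duration; the function and variable names are 
just illustrative.

```python
def efficiency(nominal_task_s, mean_event_s):
    """Fraction of the mean event duration spent in the task itself."""
    if mean_event_s <= 0:
        raise ValueError("mean event duration must be positive")
    return nominal_task_s / mean_event_s


# Assumption: every job is "sleep 60"; this is not confirmed in the reports.
SLEEP_S = 60.0

# Mean event durations (s) copied from the execute2 summaries above.
runs = {
    "64k jobs, 4000 workers": 334.987874176266,
    "64k jobs, 6000 workers": 255.069875579873,
}

for name, mean_s in runs.items():
    print(f"{name}: {efficiency(SLEEP_S, mean_s):.1%}")
```

If the jobs were a different length, only SLEEP_S needs to change.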

Here are some numbers from raw Falkon for comparison:
20K jobs, 2K workers, sleep 32, single Falkon service running on login6
Mean event duration (s): 32.6798
Efficiency: 98.3%

40K jobs, 4K workers, sleep 32, distributed Falkon service running on 16 
I/O nodes
Mean event duration (s): 34.659
Efficiency: 92.3%

40K jobs, 4K workers, sleep 64, distributed Falkon service running on 16 
I/O nodes
Mean event duration (s): 67.02667
Efficiency: 95.5%

1M jobs (983,040), 160K workers, sleep 64, distributed Falkon service 
running on 640 I/O nodes
Mean event duration (s): 70.71823
Efficiency: 90.5%

I did a search through my inbox for an old email from Zhao Zhang. Here 
is the summary of a run he made:
Total number of events: 512
Shortest event (s): 30
Longest event (s): 33
Mean event duration (s): 30.986328125
Standard deviation of event duration (s): 0.784249612581342

Efficiency: 96.8%

I believe that in this run he had 512 jobs of sleep 30, running on 256 
workers via Falkon, using Swift. This small-scale run got 96.8% 
efficiency, which seemed great! He used to have some logs online at 
http://www.ci.uchicago.edu/~zzhang/report-sleep-20081016-1808-96bsfgec/, 
but they are not there anymore. Perhaps Zhao still has these plots. I 
believe he might have been using some of his CIO (collective I/O) 
optimizations in these runs. I can't seem to find larger-scale runs with 
these optimizations.

The numbers I posted above are bits and pieces from various experiments 
we ran over the last year, so comparing them directly is not apples to 
apples: Swift, Falkon, and even the GPFS file system on the BG/P have 
all evolved in that time.

I certainly think it would be useful to have a detailed comparison of 
Swift+Coaster and Swift+Falkon using the latest Swift.

Cheers,
Ioan


-- 
=================================================================
Ioan Raicu, Ph.D.
NSF/CRA Computing Innovation Fellow
=================================================================
Center for Ultra-scale Computing and Information Security (CUCIS)
Department of Electrical Engineering and Computer Science
Northwestern University
2145 Sheridan Rd, Tech M384 
Evanston, IL 60208-3118
=================================================================
Cel:   1-847-722-0876
Tel:   1-847-491-8163
Email: iraicu at eecs.northwestern.edu
Web:   http://www.eecs.northwestern.edu/~iraicu/
       https://wiki.cucis.eecs.northwestern.edu/
=================================================================
=================================================================




