From benc at hawaga.org.uk Wed Sep 10 08:33:14 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Sep 2008 13:33:14 +0000 (GMT) Subject: [Swift-devel] swift+glite In-Reply-To: <48C516EE.8000406@mcs.anl.gov> References: <48C516EE.8000406@mcs.anl.gov> Message-ID: added swift-devel because they might be interested. On Mon, 8 Sep 2008, Michael Wilde wrote: > The Swift-glite integration is of interest for Uri's work as well. Do > you know how much work that will be to do, and support? Or how much it > would take ot find out? I interacted with Emidio Giorgio at INFN about this. It sounds like mostly the existing GT2 code will work; glite sites apparently have no cluster shared fs so some fiddling will be necessary there (that I'm working on at the moment) but this is generally in a direction that I think is useful anyway. voms proxies are also apparently necessary, but the pacman packaging that I did the other month (referred to by swift bug 146) can provide that too. So I'm going to play with this for a few days and see what happens. -- From wilde at mcs.anl.gov Wed Sep 10 09:53:33 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 10 Sep 2008 09:53:33 -0500 Subject: [Swift-devel] Re: swift+glite In-Reply-To: References: <48C516EE.8000406@mcs.anl.gov> Message-ID: <48C7DF6D.6000301@mcs.anl.gov> This sounds good, Ben. Please post info on what is needed in IO changes. I think this fits in with generalizing and improving the performance of the Swift IO strategy for other environments as well. What is the data transfer strategy when there is no cluster file system? Does the compute node need to pull data down as a gridftp client? On 9/10/08 8:33 AM, Ben Clifford wrote: > added swift-devel because they might be interested. > > On Mon, 8 Sep 2008, Michael Wilde wrote: > >> The Swift-glite integration is of interest for Uri's work as well. Do >> you know how much work that will be to do, and support? Or how much it >> would take ot find out? > > I interacted with Emidio Giorgio at INFN about this. It sounds like mostly > the existing GT2 code will work; glite sites apparently have no cluster > shared fs so some fiddling will be necessary there (that I'm working on at > the moment) but this is generally in a direction that I think is useful > anyway. voms proxies are also apparently necessary, but the pacman > packaging that I did the other month (referred to by swift bug 146) can > provide that too. > > So I'm going to play with this for a few days and see what happens. Great! - Mike From benc at hawaga.org.uk Wed Sep 10 10:20:10 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Sep 2008 15:20:10 +0000 (GMT) Subject: [Swift-devel] Re: swift+glite In-Reply-To: <48C7DF6D.6000301@mcs.anl.gov> References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> Message-ID: On Wed, 10 Sep 2008, Michael Wilde wrote: > Please post info on what is needed in IO changes. I think this fits in with > generalizing and improving the performance of the Swift IO strategy for other > environments as well. > What is the data transfer strategy when there is no cluster file system? Does > the compute node need to pull data down as a gridftp client? That's what I'm trying to do at the moment, using a site-local ftp server (that apparently doesn't share fs with the compute nodes). That seems relatively straightforward. 
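As a concrete (if simplified) picture of what that implies, each job on a worker node would pull its inputs from the site-local server and push its outputs back, along these lines - this is only a sketch, not the actual mechanism being prototyped, and the hostname and paths are made up:

    # per-job staging as seen from a worker node (illustrative only;
    # gsiftp://se.example-site.org and the paths are placeholders)
    SE=gsiftp://se.example-site.org/storage/swiftwork/run0001

    # pull inputs into the job's scratch directory
    globus-url-copy "$SE/shared/input_0042.dat" "file://$PWD/input_0042.dat"

    # ... run the application ...

    # push outputs back so the submit side can collect them
    globus-url-copy "file://$PWD/output_0042.dat" "$SE/shared/output_0042.dat"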
There's a separate problem with no-shared-fs in that the wrapper script is also placed there prior to execution on the worker nodes, so a different mechanism for bootstrapping that on a worker node is necessary. At the moment, I'm concentrating specifically on glite-specific mechanisms, not making a higher abstraction. -- From smartin at mcs.anl.gov Wed Sep 10 10:37:40 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Wed, 10 Sep 2008 10:37:40 -0500 Subject: [Swift-devel] Re: swift+glite In-Reply-To: References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> Message-ID: <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> Are you going through gram for this? How will you get the delegated user proxy on the worker node without a shared FS? You'll need to get that in order to do gridftp client commands on the worker node. On Sep 10, 2008, at Sep 10, 10:20 AM, Ben Clifford wrote: > > On Wed, 10 Sep 2008, Michael Wilde wrote: > >> Please post info on what is needed in IO changes. I think this fits >> in with >> generalizing and improving the performance of the Swift IO strategy >> for other >> environments as well. > >> What is the data transfer strategy when there is no cluster file >> system? Does >> the compute node need to pull data down as a gridftp client? > > That's what I'm trying to do at the moment, using a site-local ftp > server > (that apparently doesn't share fs with the compute nodes). > > That seems relatively straightforward. > > There's a separate problem with no-shared-fs in that the wrapper > script is > also placed there prior to execution on the worker nodes, so a > different > mechanism for bootstrapping that on a worker node is necessary. > > At the moment, I'm concentrating specifically on glite-specific > mechanisms, not making a higher abstraction. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Wed Sep 10 10:59:31 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Sep 2008 15:59:31 +0000 (GMT) Subject: [Swift-devel] Re: swift+glite In-Reply-To: <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> Message-ID: On Wed, 10 Sep 2008, Stuart Martin wrote: > Are you going through gram for this? How will you get the delegated user > proxy on the worker node without a shared FS? You'll need to get that in > order to do gridftp client commands on the worker node. Its going through glite's GRAM2 fork. They seem to think its ok for me to run commands from worker nodes. Maybe they do have a secret shared fs after all; maybe they have some other mechanism for that. -- From smartin at mcs.anl.gov Wed Sep 10 11:09:28 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Wed, 10 Sep 2008 11:09:28 -0500 Subject: [Swift-devel] Re: swift+glite In-Reply-To: References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> Message-ID: <13A63339-3974-4819-804D-B85666E0D239@mcs.anl.gov> Some LRMs have file transfer directives in the job submission script. Condor for one. Maybe they add the DUP file and rely on this. On Sep 10, 2008, at Sep 10, 10:59 AM, Ben Clifford wrote: > > On Wed, 10 Sep 2008, Stuart Martin wrote: > >> Are you going through gram for this? 
How will you get the >> delegated user >> proxy on the worker node without a shared FS? You'll need to get >> that in >> order to do gridftp client commands on the worker node. > > Its going through glite's GRAM2 fork. They seem to think its ok for > me to > run commands from worker nodes. Maybe they do have a secret shared fs > after all; maybe they have some other mechanism for that. > > -- > From hategan at mcs.anl.gov Wed Sep 10 11:14:11 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 10 Sep 2008 11:14:11 -0500 Subject: [Swift-devel] Re: swift+glite In-Reply-To: <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> Message-ID: <1221063251.16815.1.camel@localhost> On Wed, 2008-09-10 at 10:37 -0500, Stuart Martin wrote: > Are you going through gram for this? How will you get the delegated > user proxy on the worker node without a shared FS? None of the problems I've seen expressed with Swift required the apps running on worker nodes to have a delegated proxy. From benc at hawaga.org.uk Wed Sep 10 11:16:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Sep 2008 16:16:07 +0000 (GMT) Subject: [Swift-devel] Re: swift+glite In-Reply-To: <1221063251.16815.1.camel@localhost> References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> <1221063251.16815.1.camel@localhost> Message-ID: On Wed, 10 Sep 2008, Mihael Hategan wrote: > None of the problems I've seen expressed with Swift required the apps > running on worker nodes to have a delegated proxy. glites intra-site data transfer mechanisms do, though, it seems. -- From bugzilla-daemon at mcs.anl.gov Thu Sep 18 12:13:29 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 18 Sep 2008 12:13:29 -0500 (CDT) Subject: [Swift-devel] [Bug 155] New: ability to specify username on remote site Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=155 Summary: ability to specify username on remote site Product: Swift Version: unspecified Platform: All OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: benc at hawaga.org.uk CC: benc at hawaga.org.uk gram allows username on remote site to be specified through rsl entries. gridftp allows that too. In swift, gram can be set so using profile entries but gridftp cannot. There should probably be a higher abstraction so that a profile key sets this for gram2, gram4 and gridftp using a single profile entry. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Thu Sep 25 08:10:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 25 Sep 2008 13:10:04 +0000 (GMT) Subject: [Swift-devel] lots of very small files vs gridftp Message-ID: Here are some notes about lots of very small files vs gridftp: The cnari workflow that skenny works on has lots of very small files, where very in this context means smaller than the GridFTP Lots Of Small Files work likes to handle. In the CNARI runs that we've been making recently, there are 65535 input files, each roughly of a kilobyte, and a corresponding number of output files each of around 10 bytes. 
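Timing this kind of transfer is straightforward to do by hand; the sketch below shows one way (it is not the exact harness behind the numbers further down, and the destination URL is a placeholder):

    # time N concurrent streams, each copying its slice of 1000 small
    # files one at a time with globus-url-copy
    N=4
    DEST=gsiftp://gridftp.example.teragrid.org/home/benc/smallfiles
    time (
      for ((s=0; s<N; s++)); do
        (
          for ((i=s; i<1000; i+=N)); do
            globus-url-copy "file://$PWD/in/f$i" "$DEST/f$i"
          done
        ) &
      done
      wait
    )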
The limiting factor in these runs at the moment is staging of files - from UC to Ranger, throughput appears to be around 7 files per second, which is quite poor.

Buzz and I did some informal measuring of very small file transfer using gt4.2 globus-url-copy between communicado and the UC/ANL TeraGrid site to get a feel for what to expect. To transfer 1000 files:

    # concurrent connections | duration of copy (seconds, multiple runs)
    16                       | 7, 16, 16
     4                       | 14, 14, 14
     2                       | 26, 25
     1                       | 48, 52

Assuming (perhaps incorrectly) that 65k files would take 65 x that, transferring 65k files would take 455 (= 7 * 65) seconds using the best result above.

To transfer a 65 MB single file between the two sites takes 9s. So from a raw transfer perspective, transferring as a single GridFTP transfer rather than as separate files is very good. However, there is some (possibly large) file system overhead at both ends, as 65000 file opens can take some time. Tarring up 65000 files of 1k each took around 60 seconds when Buzz tried it.

I also haven't investigated the Ranger filesystem performance. I'm hoping to get some wrapper logs from a run today to see what is happening there. The remote filesystem on Ranger is Lustre, which I have minimal experience with; however, the input files for the CNARI runs are laid out in a way that would almost definitely cause trouble if the shared space was GPFS (in that they are all in a single directory). Results of investigating this should be available in a day or so.

--

From benc at hawaga.org.uk Thu Sep 25 09:11:02 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 25 Sep 2008 14:11:02 +0000 (GMT)
Subject: [Swift-devel] examining the plots of a 65535 job CNARI run
Message-ID:

On Wednesday, skenny ran a 65535 run which mostly finished. The plots are here:

http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/

The rest of this email is rambling commentary on some of the things I see there.

The run mostly finishes, with some number (985 according to the totals of unfinished procedure calls, 8 according to the execute2 chart, and 11 according to the karajan statuses) of activities outstanding.

Looking at this chart, which is karajan job submission tasks,

http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/karatasks.JOB_SUBMISSION.sorted-start.png

there are strange things with karajan job duration. The majority of tasks run very quickly (a few pixels wide, which is a few seconds). That's expected. A large number, though, take what looks to be about 2000 seconds to end (and seemingly all are about the same duration, which maybe means it's a timeout on the task itself); and a few (about 9?) never finish (those are the lines that extend all the way from their respective start times to the far right of the graph).

The tasks that take about 2000 seconds look like they're going into Queued state - looking at the plot of karajan job submission tasks in queued state, they appear there too:

http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/karatasks.JOB_SUBMISSION.Queue.sorted-start.png

There are a couple of interesting things here that I haven't seen before:

1. stagein/stageout oscillation

Coasters are providing plenty of cores for running tasks, with very low scheduling latency. In this run, the execution rate is limited by the rate at which files can be staged in. There is a fixed load for file staging, which is shared between stageins and stageouts.
Once a file has been staged in, the corresponding task will be executed almost instantly, and two seconds later a stageout task will go on the queue. This seems to be causing a pretty-looking oscillation in the stageout and stagein graphs. Maybe that's a bad thing, maybe it doesn't matter. 2. Execution peaks at coaster restart time. When no coaster workers are running, stageins still happen. So when coaster workers start up when there have been none running, there are plenty of tasks to run. The coaster workers die every 1h45m (6300 seconds) (due to wall time specification) and are restarted, which then is subject to gram+sge scheduling delay. So every 6300s in the run there is a section of the active tasks graph where the nuber of active tasks drops to 0 for a bit and then shoots high up to 400 tasks active at once for a very short period of time. In the present run, I don't think this is causing any actual delay in the total runtime of the workflow because coasters are not causing any rate limit. In other runs with other applications, maybe that will have some effect that might be significant. Coasters are able to run 400 tasks at once because of what I regard as a bug in the way that multiple cores are supported in coasters - far too many (16 x too many in this case) cores are allocated which means where there is a sudden peak in job submissions there are lots of cores available. This shouldn't happen. However even if that was fixed so that it only allocated the right number of cores, rather than the wrong number of nodes, I think if there is a sudden peak in jobs as happens when the coaster workers all die around the same time due to walltime, then the worker manager will still end up trying to allocate enough workers to cover that peak, even though the peak is very unusual. So this will result in basically wasted coaster worker runs. -- From hategan at mcs.anl.gov Thu Sep 25 09:38:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 09:38:14 -0500 Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: References: Message-ID: <1222353494.7226.0.camel@localhost> On Thu, 2008-09-25 at 14:11 +0000, Ben Clifford wrote: > A large number though take what looks to be about 2000 seconds to end (and > seemingly all are about the same duration, which maybe means its a timeout > on the task itself); That's probably when no worker is allocated, so it includes the queue time. From benc at hawaga.org.uk Thu Sep 25 09:41:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 25 Sep 2008 14:41:56 +0000 (GMT) Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: <1222353494.7226.0.camel@localhost> References: <1222353494.7226.0.camel@localhost> Message-ID: On Thu, 25 Sep 2008, Mihael Hategan wrote: > > A large number though take what looks to be about 2000 seconds to end (and > > seemingly all are about the same duration, which maybe means its a timeout > > on the task itself); > > That's probably when no worker is allocated, so it includes the queue > time. My understanding of the allocation pattern for workers is that there should be plenty of spare workers most/all of the time. These tasks aren't appearing at the every-6300s restart-the-workers points - they seem scattered fairly evenly throughout the execution. The time looks like it is mostly queue time. 
Look at this plot: http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7 /karatasks.JOB_SUBMISSION.Queue.sorted-start.png Each task is represented by a red line with the left end being when it goes into Submitted state and the right end being when it leaves submitted state (eg to fail or become active). -- From hategan at mcs.anl.gov Thu Sep 25 09:58:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 09:58:50 -0500 Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: References: <1222353494.7226.0.camel@localhost> Message-ID: <1222354730.7737.1.camel@localhost> On Thu, 2008-09-25 at 14:41 +0000, Ben Clifford wrote: > > On Thu, 25 Sep 2008, Mihael Hategan wrote: > > > > A large number though take what looks to be about 2000 seconds to end (and > > > seemingly all are about the same duration, which maybe means its a timeout > > > on the task itself); > > > > That's probably when no worker is allocated, so it includes the queue > > time. > > My understanding of the allocation pattern for workers is that there > should be plenty of spare workers most/all of the time. > > These tasks aren't appearing at the every-6300s restart-the-workers points > - they seem scattered fairly evenly throughout the execution. When a task needs a new worker it becomes bound to the request for the new worker. It does not go to another worker if it becomes available while its worker is queued. From benc at hawaga.org.uk Thu Sep 25 10:06:59 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 25 Sep 2008 15:06:59 +0000 (GMT) Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: <1222354730.7737.1.camel@localhost> References: <1222353494.7226.0.camel@localhost> <1222354730.7737.1.camel@localhost> Message-ID: On Thu, 25 Sep 2008, Mihael Hategan wrote: > When a task needs a new worker it becomes bound to the request for the > new worker. It does not go to another worker if it becomes available > while its worker is queued. Could be that, but ~2000s seems a long time for that - the every 6300s trough/peak periods where the coaster workers get restarted are only a couple hundred seconds long. -- From hategan at mcs.anl.gov Thu Sep 25 10:16:27 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 10:16:27 -0500 Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: References: <1222353494.7226.0.camel@localhost> <1222354730.7737.1.camel@localhost> Message-ID: <1222355787.7965.6.camel@localhost> On Thu, 2008-09-25 at 15:06 +0000, Ben Clifford wrote: > On Thu, 25 Sep 2008, Mihael Hategan wrote: > > > When a task needs a new worker it becomes bound to the request for the > > new worker. It does not go to another worker if it becomes available > > while its worker is queued. > > Could be that, but ~2000s seems a long time for that - the every 6300s > trough/peak periods where the coaster workers get restarted are only a > couple hundred seconds long. Maybe unrelated, but there were these workers that, as far as the queuing system was concerned, were running, without having produced any logs. They kept "running" after things stopped happening, despite the fact that they should have shut down for being idle for too long. So I suspect there is a problem there. If replication was on, the long jobs may be early replicas that happened to go to such a funny worker, and which were eventually canceled after another job went through the whole pipe. 
I will add some code to cancel workers if no registration is received a certain time after the respective job goes into running state. > From hategan at mcs.anl.gov Thu Sep 25 11:22:34 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 11:22:34 -0500 Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: References: <1222353494.7226.0.camel@localhost> Message-ID: <1222359754.9096.4.camel@localhost> On Thu, 2008-09-25 at 14:41 +0000, Ben Clifford wrote: > Look at this plot: > > http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7 > /karatasks.JOB_SUBMISSION.Queue.sorted-start.png > > Each task is represented by a red line with the left end being when it > goes into Submitted state and the right end being when it leaves submitted > state (eg to fail or become active). Any chance the colors can be made different for the various conditions? From benc at hawaga.org.uk Thu Sep 25 13:06:19 2008 From: benc at hawaga.org.uk (=?ISO-8859-1?Q?Ben_Clifford?=) Date: 25 Sep 2008 19:06:19 +0100 Subject: [Swift-devel] RE: Re: examining the plots of a 65535 job CNARI run Message-ID: <200809251807.m8PI7DS3002757@dildano.hawaga.org.uk> Could be done but that graph is not 65535 pixels deep so there's lots of overlap. I'll have a play round with some other summarisation methods and see what i can come up with. ---- Message d'origine ---- De?: Mihael Hategan Envoy??: 25 Sep 2008 11:21 -05:00 A?: Ben Clifford Cc?: , Objet?: Re: examining the plots of a 65535 job CNARI run On Thu, 2008-09-25 at 14:41 +0000, Ben Clifford wrote: > Look at this plot: > > http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7 > /karatasks.JOB_SUBMISSION.Queue.sorted-start.png > > Each task is represented by a red line with the left end being when it > goes into Submitted state and the right end being when it leaves submitted > state (eg to fail or become active). Any chance the colors can be made different for the various conditions? -- From hategan at mcs.anl.gov Thu Sep 25 13:49:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 13:49:09 -0500 Subject: [Swift-devel] RE: Re: examining the plots of a 65535 job CNARI run In-Reply-To: <200809251807.m8PI7DS3002757@dildano.hawaga.org.uk> References: <200809251807.m8PI7DS3002757@dildano.hawaga.org.uk> Message-ID: <1222368549.13382.1.camel@localhost> On Thu, 2008-09-25 at 19:06 +0100, Ben Clifford wrote: > Could be done but that graph is not 65535 pixels deep so there's lots of overlap. Though if the long jobs end because of the same thing, there would be some clear visual cue there. From benc at hawaga.org.uk Fri Sep 26 14:45:25 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 26 Sep 2008 19:45:25 +0000 (GMT) Subject: [Swift-devel] status files Message-ID: For providers that return exit codes from jobs correctly, I think it is safe to not use status files and instead use the returned exit code. I think that's the case for gram4 and local execution and either is or can be the case for coasters and falkon. I'm specifically interested in two cases: one with falkon on the bg/p where status file munging seems to take some time; and the other with skenny's cnari app where file-related activity seems to dominate - it might help or might not. 
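A quick way to get a rough number for the status-file cost on a given shared filesystem, independent of Swift itself (illustrative only; the path is a placeholder):

    # approximate cost of one zero-byte status file per job, for 1000 jobs,
    # on the shared work directory filesystem
    cd /path/to/shared/swiftwork/status
    time (for i in $(seq 1 1000); do touch "job-$i-success"; done)
    time rm job-*-success    # the submit side also has to check and remove them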
-- From hategan at mcs.anl.gov Fri Sep 26 14:51:39 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 26 Sep 2008 14:51:39 -0500 Subject: [Swift-devel] status files In-Reply-To: References: Message-ID: <1222458699.31268.2.camel@localhost> On Fri, 2008-09-26 at 19:45 +0000, Ben Clifford wrote: > For providers that return exit codes from jobs correctly, I think it is > safe to not use status files and instead use the returned exit code. > > I think that's the case for gram4 and local execution and either is or can > be the case for coasters and falkon. > > I'm specifically interested in two cases: one with falkon on the bg/p > where status file munging seems to take some time; and the other with > skenny's cnari app where file-related activity seems to dominate - it > might help or might not. The exit code test is fast, since only a deletion is done in normal circumstances. I do not believe that the grain in performance is worth making the code more complex than it needs to be, but it may be worth a try. From benc at hawaga.org.uk Fri Sep 26 15:08:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 26 Sep 2008 20:08:07 +0000 (GMT) Subject: [Swift-devel] status files In-Reply-To: <1222458699.31268.2.camel@localhost> References: <1222458699.31268.2.camel@localhost> Message-ID: > I do not believe that the grain in performance is worth making the code > more complex than it needs to be, but it may be worth a try. I'm not particularly convinced either way, but hopefully I can get some numbers that show one way or the other. -- From benc at hawaga.org.uk Fri Sep 26 19:49:14 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 27 Sep 2008 00:49:14 +0000 (GMT) Subject: [Swift-devel] status files In-Reply-To: References: <1222458699.31268.2.camel@localhost> Message-ID: On Fri, 26 Sep 2008, Ben Clifford wrote: > > I do not believe that the grain in performance is worth making the code > > more complex than it needs to be, but it may be worth a try. > > I'm not particularly convinced either way, but hopefully I can get some > numbers that show one way or the other. On my rough mockup of the cnari application, it looks like this doesn't really have much effect when running 1000 jobs at the load I'm getting. It might be interesting to try on falkon+gpfs on the BG/P at Argonne, where some of the worker node wrapper logs suggest that a bunch of time is being consumed by status file munging. -- From benc at hawaga.org.uk Sun Sep 28 12:14:22 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 28 Sep 2008 17:14:22 +0000 (GMT) Subject: [Swift-devel] fakecnari on ranger without gridftp Message-ID: I have an app 'fakecnari' that behaves somewhat like the CNARI app that skenny has been working on, in order to make it easier for me to look at bottlenecks. So far, that's had similar problems to skenny's real runs where input files cannot be staged in to ranger fast enough from UC - this limits the number of cores that can be used at any one time on Ranger to around 15. So I thought in order to see what other bottlenecks might be found, I'd make a run with swift running directly on a ranger headnode, submitting through coasters and with the input and output files moved around using the local copy file provider (the same as happens when you use the default local site). This looks like it manages to use over 100 cores quite a lot. 
The speedup for the run is (including allocation time for coaster workers, which is a significant part of this run) about 50000s worth of sleep done in 800s, which is sleeping 62 times as fast as on a single core. Discounting worker allocation time, this takes about 590s which is sleeping about 84 times as fast. Even with local copies instead of ftp, file transfers (limited to 4 at once) appear to be a rate limiting factor. There are full plots here: http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1134-herl17vf/ -- From benc at hawaga.org.uk Sun Sep 28 14:24:28 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 28 Sep 2008 19:24:28 +0000 (GMT) Subject: [Swift-devel] Re: fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: On Sun, 28 Sep 2008, Ben Clifford wrote: > There are full plots here: > > http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1134-herl17vf/ A second run with same parameters looks somewhat different visually but still seems to take around the same amount of time - more cores in use at peak, for example, but longer gaps with much fewer (sometimes none) in use. http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1345-i8fzjcla/ -- From foster at mcs.anl.gov Sun Sep 28 14:30:19 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 28 Sep 2008 14:30:19 -0500 Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: Ben: I like the idea of being able to sleep faster. That could help a lot sometimes :) Zhao and others have been working a lot on BG/P, as you probably know. The challenges inherent in file transfers there are in some ways comparable. Their solution has been to develop collective I/O operations that (a) multicast replicated input files more efficiently than multiple independent reads, and (b) apply two-stage methods to bundled inputs and outputs, moving them from local disks to the global file system via an intermediate file system. I wonder whether similar methods could be applied here? Independently of that, it would be useful to develop a performance model for the whole end-to-end system to determine where the bottlenecks are, and the peak performance that could be expected for each stage (including data movement) so we can see where there are opportunities for improvement, and where we are limited by fundamental limits. Ian. On Sep 28, 2008, at 12:14 PM, Ben Clifford wrote: > > I have an app 'fakecnari' that behaves somewhat like the CNARI app > that > skenny has been working on, in order to make it easier for me to > look at > bottlenecks. > > So far, that's had similar problems to skenny's real runs where input > files cannot be staged in to ranger fast enough from UC - this > limits the > number of cores that can be used at any one time on Ranger to around > 15. > > So I thought in order to see what other bottlenecks might be found, > I'd > make a run with swift running directly on a ranger headnode, > submitting > through coasters and with the input and output files moved around > using > the local copy file provider (the same as happens when you use the > default > local site). > > This looks like it manages to use over 100 cores quite a lot. The > speedup > for the run is (including allocation time for coaster workers, which > is > a significant part of this run) about 50000s worth of sleep done in > 800s, > which is sleeping 62 times as fast as on a single core. 
Discounting > worker allocation time, this takes about 590s which is sleeping > about 84 > times as fast. > > Even with local copies instead of ftp, file transfers (limited to 4 at > once) appear to be a rate limiting factor. > > There are full plots here: > > http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1134-herl17vf/ > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Sun Sep 28 14:45:10 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 28 Sep 2008 19:45:10 +0000 (GMT) Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: On Sun, 28 Sep 2008, Ian Foster wrote: > Zhao and others have been working a lot on BG/P, as you probably know. The > challenges inherent in file transfers there are in some ways comparable. Their > solution has been to develop collective I/O operations that (a) multicast > replicated input files more efficiently than multiple independent reads, and > (b) apply two-stage methods to bundled inputs and outputs, moving them from > local disks to the global file system via an intermediate file system. I > wonder whether similar methods could be applied here? yes, I imagine methods that help in falkon+bg/p will help here too, at least in the abstract. > Independently of that, it would be useful to develop a performance model for > the whole end-to-end system to determine where the bottlenecks are, and the > peak performance that could be expected for each stage (including data > movement) so we can see where there are opportunities for improvement, and > where we are limited by fundamental limits. right. certainly I know that transferring the same amount of data as a tarball in this particular app seems much much faster (9s vs 450s) - see Subject: [Swift-devel] lots of very small files vs gridftp. >From a raw CPU perspective, there's 65535 parallel tasks each of a few seconds long each; there's a fairly obvious, if naive, target there of 65k x speedup (mmm). I still don't really have a feel for what the filesystem on Ranger (lustre) will do - its been behaving fairly well so far, I think, but I imagine thats because it hasn't been terribly heavily loaded in my testing. -- From hategan at mcs.anl.gov Sun Sep 28 15:28:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 28 Sep 2008 15:28:53 -0500 Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: <1222633733.18444.4.camel@localhost> On Sun, 2008-09-28 at 19:45 +0000, Ben Clifford wrote: > certainly I know that transferring the same amount of data as a tarball in > this particular app seems much much faster (9s vs 450s) - see Subject: > [Swift-devel] lots of very small files vs gridftp. Though one thing that is missing there is the part where lots of small files are created on the filesystem, which, on a shared FS, may be a considerable time. So I think we should probably define clearly what we mean by transferring lots of files, and whether that includes opening each, reading from each, sending data, creating each, and writing to each of the files involved. 
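One way to put a number on the create/write part alone, with no network involved (a sketch; the paths are placeholders):

    # cost of just creating and writing 1024 x 1KB files on the shared FS...
    time (for i in $(seq 0 1023); do head -c 1024 /dev/zero > /shared/scratch/small/f$i; done)
    # ...versus the same writes on node-local disk, to isolate the shared-FS part
    time (for i in $(seq 0 1023); do head -c 1024 /dev/zero > /tmp/small/f$i; done)

Whichever transfer mechanism is used, the per-file create/write cost ends up being paid somewhere once the files have to exist individually on the remote side.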
From iraicu at cs.uchicago.edu Sun Sep 28 22:00:51 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 28 Sep 2008 22:00:51 -0500 Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: <48E044E3.7080505@cs.uchicago.edu> I see the following stats for this run: Total number of events: 10002 Shortest event (s): 3.4300000667572 Longest event (s): 753.21799993515 Total duration of all events (s): 53898.3449883461 Mean event duration (s): 5.38875674748511 Standard deviation of event duration (s): 7.48318471316593 Maximum number of events at one time: 113 What inherently limits the run to 113 events at a time? Is it the fact that Coaster only allocated 113 (maybe a few more) CPU-cores? How many CPU-cores did coaster allocate? With 113 CPU-cores and 5.38 sec tasks, that means a throughput of ~21 tasks/sec. Is this the bottleneck? Its probably not the file system (in terms of the app accessing the input/output data), as if it were, task execution times would simply increase with load... but it could be the file system being slow in getting the input data in the right place for the app to start computing, as in staging it in. BTW, do the times above include wait queue times? I see the longest task being 753 sec, but below you say that the workload takes 590 sec not including queue time. Do you have a plot of the number of CPUs in relation to the number of active tasks? Are all available CPUs kept busy? The speedup is one story, 84X out of 113X possible (this 113X should really be the number of CPU-cores), but sometimes the workload characteristics limit the maximum possible speedup... and in that case, its good to look at the CPU-core utilization. Is it possible to draw a graph that has this info? # of CPU-cores, number of active tasks, and throughput of completed tasks? Ioan Ben Clifford wrote: > I have an app 'fakecnari' that behaves somewhat like the CNARI app that > skenny has been working on, in order to make it easier for me to look at > bottlenecks. > > So far, that's had similar problems to skenny's real runs where input > files cannot be staged in to ranger fast enough from UC - this limits the > number of cores that can be used at any one time on Ranger to around 15. > > So I thought in order to see what other bottlenecks might be found, I'd > make a run with swift running directly on a ranger headnode, submitting > through coasters and with the input and output files moved around using > the local copy file provider (the same as happens when you use the default > local site). > > This looks like it manages to use over 100 cores quite a lot. The speedup > for the run is (including allocation time for coaster workers, which is > a significant part of this run) about 50000s worth of sleep done in 800s, > which is sleeping 62 times as fast as on a single core. Discounting > worker allocation time, this takes about 590s which is sleeping about 84 > times as fast. > > Even with local copies instead of ftp, file transfers (limited to 4 at > once) appear to be a rate limiting factor. > > There are full plots here: > > http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1134-herl17vf/ > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Sun Sep 28 22:29:36 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 28 Sep 2008 22:29:36 -0500 Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: <48E04BA0.600@cs.uchicago.edu> Hi, Ben Clifford wrote: > On Sun, 28 Sep 2008, Ian Foster wrote: > > ... > > >From a raw CPU perspective, there's 65535 parallel tasks each of a few > seconds long each; there's a fairly obvious, if naive, target there of 65k > x speedup (mmm). > > At that point, the main bottlenecks will be scalability of the mechanisms you use to drive the execution framework (i.e. Coaster/Falkon), the time it takes to bootstrap your the execution framework, the throughput you can dispatch tasks and receive results, and the speed of the file system that it can read inputs and write outputs. Assuming the execution framework scales, the rest can be computed as long as you understand the performance of the machine and the execution framework. For example, here are two runs we made with Falkon on the BG/P recently that might be similar to the fakecnari workload. We had 32K CPU-cores, 128K tasks, and each task involved sleeping for 4 sec, and writing 1KB of data. In an ideal world with 0 costs for the execution framework, and 0 costs of I/O, the workload time would have been 16 seconds (128K/32K*4sec), which would equate to 524288 CPU seconds. Running this workload on GPFS directly took 180 seconds (2912X), and running the same workload through the collective I/O framework we have took 61 seconds (8594X). The bottleneck in the GPFS case was the rate that we could create files and write the 1KB file, in the context of 32K CPUs doing this concurrently. The bottleneck for the collective I/O was the dispatch rate of Falkon, which in this case was 2148 tasks/sec. Once you understand the performance of the file system, and execution framework, these large scale numbers can be estimated quite nicely. > I still don't really have a feel for what the filesystem on Ranger > (lustre) will do - its been behaving fairly well so far, I think, but I > imagine thats because it hasn't been terribly heavily loaded in my > testing. > And if it was built to support the entire machine at full scale (64K CPU-cores), then I'd imagine that you'll need at least 1000s, if not 10Ks of CPU-cores to saturate the file system with small files. Once of these days, we'll probably start testing some of our BG/P apps on Ranger as well, so then, we can exchange notes better on each other's experiences and problems we are each facing. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Sep 29 18:37:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 29 Sep 2008 18:37:50 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: References: Message-ID: <1222731470.23121.21.camel@localhost> On Thu, 2008-09-25 at 13:10 +0000, Ben Clifford wrote: > To transfer 1000 files: > > # concurrent conncetions | duration of copy (seconds, multiple runs) > 16 7, 16, 16 > 4 14, 14, 14 > 2 26, 25 > 1 48, 52 > I tried a similar experiment, this time with the java libraries, to see how that works. The setup was transfer 1024 files of 1024 bytes each with parallelism (at the karajan level, though this should cause corresponding gridftp connection parallelism) of 1 to 16 in powers of 2. I got this for Ranger (in ms): 1: 242030 2: 121916 4: 61787 8: 31903 16: died (probably trying to start too many connections concurrently) Then UC: 1: 212192 2: 106872 4: 54790 8: 28838 16: 18166 Then I made a quick file provider for coasters, which sends the data over the same connection (and upped the parallelism): UC-coaster 1: 102624 2: 31388 4: 18042 8: 8823 16: 5510 32: 5053 64: 6686 128: 5551 Then I ran the same, but instead of transferring to a nfs directory, things went to /dev/null: 1: 93997 2: 35694 4: 16269 8: 7349 16: 4462 32: 1865 64: 1332 128: 1304 I suppose the bad speed with coasters is because things go up on an encrypted connection, but it may be something else. So otherwise, if files are small, one can look at this as the task of sending (acknowledged) messages from one side to the other, where the communication lag is the problem and the way to solve it is by increasing parallelism (which essentially is what tarring things up does). That and whatever FS limitations the remote side has. Mihael From foster at mcs.anl.gov Mon Sep 29 20:34:18 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 29 Sep 2008 20:34:18 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: <1222731470.23121.21.camel@localhost> References: <1222731470.23121.21.camel@localhost> Message-ID: <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> Remind me again why we aren't just using TAR and GridFTP? Ian. On Sep 29, 2008, at 6:37 PM, Mihael Hategan wrote: > On Thu, 2008-09-25 at 13:10 +0000, Ben Clifford wrote: > >> To transfer 1000 files: >> >> # concurrent conncetions | duration of copy (seconds, multiple >> runs) >> 16 7, 16, 16 >> 4 14, 14, 14 >> 2 26, 25 >> 1 48, 52 >> > > I tried a similar experiment, this time with the java libraries, to > see > how that works. > > The setup was transfer 1024 files of 1024 bytes each with parallelism > (at the karajan level, though this should cause corresponding gridftp > connection parallelism) of 1 to 16 in powers of 2. 
> > I got this for Ranger (in ms): > 1: 242030 > 2: 121916 > 4: 61787 > 8: 31903 > 16: died (probably trying to start too many connections concurrently) > > Then UC: > 1: 212192 > 2: 106872 > 4: 54790 > 8: 28838 > 16: 18166 > > Then I made a quick file provider for coasters, which sends the data > over the same connection (and upped the parallelism): > UC-coaster > 1: 102624 > 2: 31388 > 4: 18042 > 8: 8823 > 16: 5510 > 32: 5053 > 64: 6686 > 128: 5551 > > Then I ran the same, but instead of transferring to a nfs directory, > things went to /dev/null: > 1: 93997 > 2: 35694 > 4: 16269 > 8: 7349 > 16: 4462 > 32: 1865 > 64: 1332 > 128: 1304 > > I suppose the bad speed with coasters is because things go up on an > encrypted connection, but it may be something else. > > So otherwise, if files are small, one can look at this as the task of > sending (acknowledged) messages from one side to the other, where the > communication lag is the problem and the way to solve it is by > increasing parallelism (which essentially is what tarring things up > does). That and whatever FS limitations the remote side has. > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Sep 30 00:23:39 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Sep 2008 00:23:39 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> Message-ID: <1222752219.29016.11.camel@localhost> On Mon, 2008-09-29 at 20:34 -0500, Ian Foster wrote: > Remind me again why we aren't just using TAR and GridFTP? I don't think we're using anything at this point, hence all the testing and exploring. I think there is some complexity in figuring out dynamically what exactly to tar up and how to untar on the remote site. So more (or more complex) code than the other choice. But I'm not otherwise opposed to anything in particular. I suppose taring/untaring could be done manually, at the expense of messing the abstractness of swift. > > Ian. > > On Sep 29, 2008, at 6:37 PM, Mihael Hategan wrote: > > > On Thu, 2008-09-25 at 13:10 +0000, Ben Clifford wrote: > > > >> To transfer 1000 files: > >> > >> # concurrent conncetions | duration of copy (seconds, multiple > >> runs) > >> 16 7, 16, 16 > >> 4 14, 14, 14 > >> 2 26, 25 > >> 1 48, 52 > >> > > > > I tried a similar experiment, this time with the java libraries, to > > see > > how that works. > > > > The setup was transfer 1024 files of 1024 bytes each with parallelism > > (at the karajan level, though this should cause corresponding gridftp > > connection parallelism) of 1 to 16 in powers of 2. 
> > > > I got this for Ranger (in ms): > > 1: 242030 > > 2: 121916 > > 4: 61787 > > 8: 31903 > > 16: died (probably trying to start too many connections concurrently) > > > > Then UC: > > 1: 212192 > > 2: 106872 > > 4: 54790 > > 8: 28838 > > 16: 18166 > > > > Then I made a quick file provider for coasters, which sends the data > > over the same connection (and upped the parallelism): > > UC-coaster > > 1: 102624 > > 2: 31388 > > 4: 18042 > > 8: 8823 > > 16: 5510 > > 32: 5053 > > 64: 6686 > > 128: 5551 > > > > Then I ran the same, but instead of transferring to a nfs directory, > > things went to /dev/null: > > 1: 93997 > > 2: 35694 > > 4: 16269 > > 8: 7349 > > 16: 4462 > > 32: 1865 > > 64: 1332 > > 128: 1304 > > > > I suppose the bad speed with coasters is because things go up on an > > encrypted connection, but it may be something else. > > > > So otherwise, if files are small, one can look at this as the task of > > sending (acknowledged) messages from one side to the other, where the > > communication lag is the problem and the way to solve it is by > > increasing parallelism (which essentially is what tarring things up > > does). That and whatever FS limitations the remote side has. > > > > Mihael > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Sep 30 19:02:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 1 Oct 2008 00:02:05 +0000 (GMT) Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: <1222752219.29016.11.camel@localhost> References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> <1222752219.29016.11.camel@localhost> Message-ID: On Tue, 30 Sep 2008, Mihael Hategan wrote: > But I'm not otherwise opposed to anything in particular. I suppose > taring/untaring could be done manually, at the expense of messing the > abstractness of swift. I played some making Swift do tar/untar of stageins automatically (so no modifications are needed to the SwiftScript code). Theres a plot here http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080930-1820-0nmtamxg/ Basically the first 600s are taken up allocating coaster workers, and the remaining time uses quite a lot of cores at once. So the total duration of run doesn't seem that different; but I think that the behaviour as number of jobs increases will be better- the 600s startup is a fixed cost (which I also think can be massively reduced in a couple of ways) and the bit that is proportional to the number of jobs is the remaining three hundred seconds. This is a fairly dirty hack - there's no clustering for stageouts; there is fairly crude decision of whether to cluster transfers or not (basically, queue file transfers for 30s and after that, if there's more than one, make a cluster). The initial startup is slow, I think, because the initial startup of coaster workers is done based on a malformed job submission caused by the low quality of this clustering code - it doesn't pass through the coastersPerNode parameter for initial jobs so the initial coaster worker is very slow. 
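For anyone trying to picture what the clustered stage-in amounts to, it is roughly this, done inside Swift rather than by hand (names and paths below are invented for illustration):

    # tar whatever stage-ins queued up during the 30s window...
    tar cf cluster-0001.tar -C stagein-queue .
    # ...move the batch as a single GridFTP transfer...
    globus-url-copy "file://$PWD/cluster-0001.tar" \
        "gsiftp://gridftp.ranger.example.org/work/run/cluster-0001.tar"
    # ...then one small remote job unpacks it into the shared work directory:
    #     tar xf cluster-0001.tar -C shared/ && rm cluster-0001.tar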
-- From hategan at mcs.anl.gov Tue Sep 30 19:45:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Sep 2008 19:45:44 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> <1222752219.29016.11.camel@localhost> Message-ID: <1222821944.6731.24.camel@localhost> On Wed, 2008-10-01 at 00:02 +0000, Ben Clifford wrote: > On Tue, 30 Sep 2008, Mihael Hategan wrote: > > > But I'm not otherwise opposed to anything in particular. I suppose > > taring/untaring could be done manually, at the expense of messing the > > abstractness of swift. > > I played some making Swift do tar/untar of stageins automatically (so no > modifications are needed to the SwiftScript code). ?This reminds me of clustering of jobs we did initially with swift versus Falkon. > > Theres a plot here > http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080930-1820-0nmtamxg/ > > Basically the first 600s are taken up allocating coaster workers, and the > remaining time uses quite a lot of cores at once. So the total duration of > run doesn't seem that different; but I think that the behaviour as number > of jobs increases will be better- the 600s startup is a fixed cost (which > I also think can be massively reduced in a couple of ways) and the bit > that is proportional to the number of jobs is the remaining three hundred > seconds. Could the untar job be done with fork straight through gram? > > > This is a fairly dirty hack - there's no clustering for stageouts; Though that could be done in a similar way, right? > there > is fairly crude decision of whether to cluster transfers or not > (basically, queue file transfers for 30s and after that, if there's more > than one, make a cluster). > > The initial startup is slow, I think, because the initial startup of > coaster workers is done based on a malformed job submission caused by the > low quality of this clustering code - it doesn't pass through the > coastersPerNode parameter for initial jobs so the initial coaster worker > is very slow. Using fork would probably solve this, too. Now, not to leave the other side of the argument on its own, running other fileops (mkdir, ls, etc.) through coasters does offer the added benefit of parallelizing the RTT. We do at least one such operation per job. In the 2^16 jobs case, and with parallelism of 2^7 vs. 2^3, one would get, in the theoretical case, an improvement of (2^16/2^3 - 2^16/2^7)RTTs. Or (8192 - 512)RTTs. Or 7680s for an RTT of 1s. But then that could probably also be done in GridFTP with pipelining. From benc at hawaga.org.uk Tue Sep 30 19:56:54 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 1 Oct 2008 00:56:54 +0000 (GMT) Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: <1222821944.6731.24.camel@localhost> References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> <1222752219.29016.11.camel@localhost> <1222821944.6731.24.camel@localhost> Message-ID: On Tue, 30 Sep 2008, Mihael Hategan wrote: > ?This reminds me of clustering of jobs we did initially with swift > versus Falkon. yes. its half a cut-and-paste job of that code. > Could the untar job be done with fork straight through gram? yes. though my feeling at the moment when there is something like coasters around with spare nodes is that this wouldn't change much. 
In the case of non-coaster runs using something like gram4 that can still take a reasonable number of jobs, getting the untar pushed ahead of queued 'work' jobs is probably good. -- From hategan at mcs.anl.gov Tue Sep 30 20:03:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Sep 2008 20:03:50 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> <1222752219.29016.11.camel@localhost> <1222821944.6731.24.camel@localhost> Message-ID: <1222823030.7845.0.camel@localhost> On Wed, 2008-10-01 at 00:56 +0000, Ben Clifford wrote: > On Tue, 30 Sep 2008, Mihael Hategan wrote: > > > ?This reminds me of clustering of jobs we did initially with swift > > versus Falkon. > > yes. its half a cut-and-paste job of that code. I was mostly thinking about the conceptual part and the implications. At least for small jobs, Falkon turned out to be the better option. > > > Could the untar job be done with fork straight through gram? > > yes. though my feeling at the moment when there is something like coasters > around with spare nodes is that this wouldn't change much. In the case of > non-coaster runs using something like gram4 that can still take a > reasonable number of jobs, getting the untar pushed ahead of queued 'work' > jobs is probably good. > From zhaozhang at uchicago.edu Tue Sep 30 21:40:20 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 30 Sep 2008 21:40:20 -0500 Subject: [Swift-devel] could swift use return code from falkon as a success notification? Message-ID: <48E2E314.6040809@uchicago.edu> Hi, All I am trying to optimize the swift performance on BGP, I finished it for the input phase, but suffering the poor performance at the output phase, which is exactly the status file creation process, as you could tell from the following picture. In this test, I ran sleep_30 jobs, which is expected to finish in 30 seconds. I am wondering if we could use falkon return code instead of the status file? Thanks. zhao -------------- next part -------------- A non-text attachment was scrubbed... Name: wrapper.JPG Type: image/jpeg Size: 33697 bytes Desc: not available URL: From hategan at mcs.anl.gov Tue Sep 30 22:01:21 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Sep 2008 22:01:21 -0500 Subject: [Swift-devel] could swift use return code from falkon as a success notification? In-Reply-To: <48E2E314.6040809@uchicago.edu> References: <48E2E314.6040809@uchicago.edu> Message-ID: <1222830081.9463.4.camel@localhost> On Tue, 2008-09-30 at 21:40 -0500, Zhao Zhang wrote: > Hi, All > > I am trying to optimize the swift performance on BGP, I finished it for > the input phase, > but suffering the poor performance at the output phase, which is exactly > the status file > creation process, as you could tell from the following picture. In this > test, I ran sleep_30 > jobs, which is expected to finish in 30 seconds. > > I am wondering if we could use falkon return code instead of the status > file? Thanks. Yes you could. You would have to do the following: 1. Remove the relevant part from the wrapper (touching of the success file and sticking failure info in the failure file) 2. Comment out the checkJobStatus() call in vdl-int.k (around line 415) 3. Make the deef provider set a fault on the task (should be a JobException) when the exit code is not 0 4. 
Make the wrapper exit with a non-zero exit code when there is a problem.

If this is too brief, let me know, and I'll give you more details.
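As a rough illustration of what steps 1 and 4 mean on the wrapper side (this is not the actual wrapper.sh - names are invented, and steps 2 and 3 are changes in vdl-int.k and the deef provider on the Java side, not shown):

    # run the application as before
    "$EXECUTABLE" "$@" > stdout.txt 2> stderr.txt
    RC=$?

    # step 1: no touch of the success file / no writing of the failure file here

    # step 4: propagate the application's exit code so Falkon can report it
    exit $RC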