From benc at hawaga.org.uk Wed Sep 10 08:33:14 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Sep 2008 13:33:14 +0000 (GMT) Subject: [Swift-devel] swift+glite In-Reply-To: <48C516EE.8000406@mcs.anl.gov> References: <48C516EE.8000406@mcs.anl.gov> Message-ID: added swift-devel because they might be interested. On Mon, 8 Sep 2008, Michael Wilde wrote: > The Swift-glite integration is of interest for Uri's work as well. Do > you know how much work that will be to do, and support? Or how much it > would take ot find out? I interacted with Emidio Giorgio at INFN about this. It sounds like mostly the existing GT2 code will work; glite sites apparently have no cluster shared fs so some fiddling will be necessary there (that I'm working on at the moment) but this is generally in a direction that I think is useful anyway. voms proxies are also apparently necessary, but the pacman packaging that I did the other month (referred to by swift bug 146) can provide that too. So I'm going to play with this for a few days and see what happens. -- From wilde at mcs.anl.gov Wed Sep 10 09:53:33 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 10 Sep 2008 09:53:33 -0500 Subject: [Swift-devel] Re: swift+glite In-Reply-To: References: <48C516EE.8000406@mcs.anl.gov> Message-ID: <48C7DF6D.6000301@mcs.anl.gov> This sounds good, Ben. Please post info on what is needed in IO changes. I think this fits in with generalizing and improving the performance of the Swift IO strategy for other environments as well. What is the data transfer strategy when there is no cluster file system? Does the compute node need to pull data down as a gridftp client? On 9/10/08 8:33 AM, Ben Clifford wrote: > added swift-devel because they might be interested. > > On Mon, 8 Sep 2008, Michael Wilde wrote: > >> The Swift-glite integration is of interest for Uri's work as well. Do >> you know how much work that will be to do, and support? Or how much it >> would take ot find out? > > I interacted with Emidio Giorgio at INFN about this. It sounds like mostly > the existing GT2 code will work; glite sites apparently have no cluster > shared fs so some fiddling will be necessary there (that I'm working on at > the moment) but this is generally in a direction that I think is useful > anyway. voms proxies are also apparently necessary, but the pacman > packaging that I did the other month (referred to by swift bug 146) can > provide that too. > > So I'm going to play with this for a few days and see what happens. Great! - Mike From benc at hawaga.org.uk Wed Sep 10 10:20:10 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Sep 2008 15:20:10 +0000 (GMT) Subject: [Swift-devel] Re: swift+glite In-Reply-To: <48C7DF6D.6000301@mcs.anl.gov> References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> Message-ID: On Wed, 10 Sep 2008, Michael Wilde wrote: > Please post info on what is needed in IO changes. I think this fits in with > generalizing and improving the performance of the Swift IO strategy for other > environments as well. > What is the data transfer strategy when there is no cluster file system? Does > the compute node need to pull data down as a gridftp client? That's what I'm trying to do at the moment, using a site-local ftp server (that apparently doesn't share fs with the compute nodes). That seems relatively straightforward. 
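As a concrete (if simplified) picture of what that implies, each job on a worker node would pull its inputs from the site-local server and push its outputs back, along these lines - this is only a sketch, not the actual mechanism being prototyped, and the hostname and paths are made up:

    # per-job staging as seen from a worker node (illustrative only;
    # gsiftp://se.example-site.org and the paths are placeholders)
    SE=gsiftp://se.example-site.org/storage/swiftwork/run0001

    # pull inputs into the job's scratch directory
    globus-url-copy "$SE/shared/input_0042.dat" "file://$PWD/input_0042.dat"

    # ... run the application ...

    # push outputs back so the submit side can collect them
    globus-url-copy "file://$PWD/output_0042.dat" "$SE/shared/output_0042.dat"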
There's a separate problem with no-shared-fs in that the wrapper script is also placed there prior to execution on the worker nodes, so a different mechanism for bootstrapping that on a worker node is necessary. At the moment, I'm concentrating specifically on glite-specific mechanisms, not making a higher abstraction. -- From smartin at mcs.anl.gov Wed Sep 10 10:37:40 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Wed, 10 Sep 2008 10:37:40 -0500 Subject: [Swift-devel] Re: swift+glite In-Reply-To: References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> Message-ID: <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> Are you going through gram for this? How will you get the delegated user proxy on the worker node without a shared FS? You'll need to get that in order to do gridftp client commands on the worker node. On Sep 10, 2008, at Sep 10, 10:20 AM, Ben Clifford wrote: > > On Wed, 10 Sep 2008, Michael Wilde wrote: > >> Please post info on what is needed in IO changes. I think this fits >> in with >> generalizing and improving the performance of the Swift IO strategy >> for other >> environments as well. > >> What is the data transfer strategy when there is no cluster file >> system? Does >> the compute node need to pull data down as a gridftp client? > > That's what I'm trying to do at the moment, using a site-local ftp > server > (that apparently doesn't share fs with the compute nodes). > > That seems relatively straightforward. > > There's a separate problem with no-shared-fs in that the wrapper > script is > also placed there prior to execution on the worker nodes, so a > different > mechanism for bootstrapping that on a worker node is necessary. > > At the moment, I'm concentrating specifically on glite-specific > mechanisms, not making a higher abstraction. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Wed Sep 10 10:59:31 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Sep 2008 15:59:31 +0000 (GMT) Subject: [Swift-devel] Re: swift+glite In-Reply-To: <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> Message-ID: On Wed, 10 Sep 2008, Stuart Martin wrote: > Are you going through gram for this? How will you get the delegated user > proxy on the worker node without a shared FS? You'll need to get that in > order to do gridftp client commands on the worker node. Its going through glite's GRAM2 fork. They seem to think its ok for me to run commands from worker nodes. Maybe they do have a secret shared fs after all; maybe they have some other mechanism for that. -- From smartin at mcs.anl.gov Wed Sep 10 11:09:28 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Wed, 10 Sep 2008 11:09:28 -0500 Subject: [Swift-devel] Re: swift+glite In-Reply-To: References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> Message-ID: <13A63339-3974-4819-804D-B85666E0D239@mcs.anl.gov> Some LRMs have file transfer directives in the job submission script. Condor for one. Maybe they add the DUP file and rely on this. On Sep 10, 2008, at Sep 10, 10:59 AM, Ben Clifford wrote: > > On Wed, 10 Sep 2008, Stuart Martin wrote: > >> Are you going through gram for this? 
How will you get the >> delegated user >> proxy on the worker node without a shared FS? You'll need to get >> that in >> order to do gridftp client commands on the worker node. > > Its going through glite's GRAM2 fork. They seem to think its ok for > me to > run commands from worker nodes. Maybe they do have a secret shared fs > after all; maybe they have some other mechanism for that. > > -- > From hategan at mcs.anl.gov Wed Sep 10 11:14:11 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 10 Sep 2008 11:14:11 -0500 Subject: [Swift-devel] Re: swift+glite In-Reply-To: <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> Message-ID: <1221063251.16815.1.camel@localhost> On Wed, 2008-09-10 at 10:37 -0500, Stuart Martin wrote: > Are you going through gram for this? How will you get the delegated > user proxy on the worker node without a shared FS? None of the problems I've seen expressed with Swift required the apps running on worker nodes to have a delegated proxy. From benc at hawaga.org.uk Wed Sep 10 11:16:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Sep 2008 16:16:07 +0000 (GMT) Subject: [Swift-devel] Re: swift+glite In-Reply-To: <1221063251.16815.1.camel@localhost> References: <48C516EE.8000406@mcs.anl.gov> <48C7DF6D.6000301@mcs.anl.gov> <82E68256-4B1A-47C6-8918-B362EDCD3688@mcs.anl.gov> <1221063251.16815.1.camel@localhost> Message-ID: On Wed, 10 Sep 2008, Mihael Hategan wrote: > None of the problems I've seen expressed with Swift required the apps > running on worker nodes to have a delegated proxy. glites intra-site data transfer mechanisms do, though, it seems. -- From bugzilla-daemon at mcs.anl.gov Thu Sep 18 12:13:29 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 18 Sep 2008 12:13:29 -0500 (CDT) Subject: [Swift-devel] [Bug 155] New: ability to specify username on remote site Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=155 Summary: ability to specify username on remote site Product: Swift Version: unspecified Platform: All OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: benc at hawaga.org.uk CC: benc at hawaga.org.uk gram allows username on remote site to be specified through rsl entries. gridftp allows that too. In swift, gram can be set so using profile entries but gridftp cannot. There should probably be a higher abstraction so that a profile key sets this for gram2, gram4 and gridftp using a single profile entry. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Thu Sep 25 08:10:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 25 Sep 2008 13:10:04 +0000 (GMT) Subject: [Swift-devel] lots of very small files vs gridftp Message-ID: Here are some notes about lots of very small files vs gridftp: The cnari workflow that skenny works on has lots of very small files, where very in this context means smaller than the GridFTP Lots Of Small Files work likes to handle. In the CNARI runs that we've been making recently, there are 65535 input files, each roughly of a kilobyte, and a corresponding number of output files each of around 10 bytes. 
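Timing this kind of transfer is straightforward to do by hand; the sketch below shows one way (it is not the exact harness behind the numbers further down, and the destination URL is a placeholder):

    # time N concurrent streams, each copying its slice of 1000 small
    # files one at a time with globus-url-copy
    N=4
    DEST=gsiftp://gridftp.example.teragrid.org/home/benc/smallfiles
    time (
      for ((s=0; s<N; s++)); do
        (
          for ((i=s; i<1000; i+=N)); do
            globus-url-copy "file://$PWD/in/f$i" "$DEST/f$i"
          done
        ) &
      done
      wait
    )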
The limiting factor in these runs at the moment is staging of files - from UC to Ranger, throughput appears to be around 7 files per second, which is quite poor.

Buzz and I did some informal measuring of very small file transfer using gt4.2 globus-url-copy between communicado and the UC/ANL TeraGrid site to get a feel for what to expect. To transfer 1000 files:

    # concurrent connections | duration of copy (seconds, multiple runs)
    16                       | 7, 16, 16
     4                       | 14, 14, 14
     2                       | 26, 25
     1                       | 48, 52

Assuming (perhaps incorrectly) that 65k files would take 65 x that, transferring 65k files would take 455 (= 7 * 65) seconds using the best result above.

To transfer a 65 MB single file between the two sites takes 9s. So from a raw transfer perspective, transferring as a single GridFTP transfer rather than as separate files is very good. However, there is some (possibly large) file system overhead at both ends, as 65000 file opens can take some time. Tarring up 65000 files of 1k each took around 60 seconds when Buzz tried it.

I also haven't investigated the Ranger filesystem performance. I'm hoping to get some wrapper logs from a run today to see what is happening there. The remote filesystem on Ranger is Lustre, which I have minimal experience with; however, the input files for the CNARI runs are laid out in a way that would almost definitely cause trouble if the shared space was GPFS (in that they are all in a single directory). Results of investigating this should be available in a day or so.

--

From benc at hawaga.org.uk Thu Sep 25 09:11:02 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 25 Sep 2008 14:11:02 +0000 (GMT)
Subject: [Swift-devel] examining the plots of a 65535 job CNARI run
Message-ID:

On Wednesday, skenny ran a 65535 run which mostly finished. The plots are here:

http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/

The rest of this email is rambling commentary on some of the things I see there.

The run mostly finishes, with some number (985 according to the totals of unfinished procedure calls, 8 according to the execute2 chart, and 11 according to the karajan statuses) of activities outstanding.

Looking at this chart, which is karajan job submission tasks,

http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/karatasks.JOB_SUBMISSION.sorted-start.png

there are strange things with karajan job duration. The majority of tasks run very quickly (a few pixels wide, which is a few seconds). That's expected. A large number, though, take what looks to be about 2000 seconds to end (and seemingly all are about the same duration, which maybe means it's a timeout on the task itself); and a few (about 9?) never finish (those are the lines that extend all the way from their respective start times to the far right of the graph).

The tasks that take about 2000 seconds look like they're going into Queued state - looking at the plot of karajan job submission tasks in queued state, they appear there too:

http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/karatasks.JOB_SUBMISSION.Queue.sorted-start.png

There are a couple of interesting things here that I haven't seen before:

1. stagein/stageout oscillation

Coasters are providing plenty of cores for running tasks, with very low scheduling latency. In this run, the execution rate is limited by the rate at which files can be staged in. There is a fixed load for file staging, which is shared between stageins and stageouts.
Once a file has been staged in, the corresponding task will be executed almost instantly, and two seconds later a stageout task will go on the queue. This seems to be causing a pretty-looking oscillation in the stageout and stagein graphs. Maybe that's a bad thing, maybe it doesn't matter. 2. Execution peaks at coaster restart time. When no coaster workers are running, stageins still happen. So when coaster workers start up when there have been none running, there are plenty of tasks to run. The coaster workers die every 1h45m (6300 seconds) (due to wall time specification) and are restarted, which then is subject to gram+sge scheduling delay. So every 6300s in the run there is a section of the active tasks graph where the nuber of active tasks drops to 0 for a bit and then shoots high up to 400 tasks active at once for a very short period of time. In the present run, I don't think this is causing any actual delay in the total runtime of the workflow because coasters are not causing any rate limit. In other runs with other applications, maybe that will have some effect that might be significant. Coasters are able to run 400 tasks at once because of what I regard as a bug in the way that multiple cores are supported in coasters - far too many (16 x too many in this case) cores are allocated which means where there is a sudden peak in job submissions there are lots of cores available. This shouldn't happen. However even if that was fixed so that it only allocated the right number of cores, rather than the wrong number of nodes, I think if there is a sudden peak in jobs as happens when the coaster workers all die around the same time due to walltime, then the worker manager will still end up trying to allocate enough workers to cover that peak, even though the peak is very unusual. So this will result in basically wasted coaster worker runs. -- From hategan at mcs.anl.gov Thu Sep 25 09:38:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 09:38:14 -0500 Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: References: Message-ID: <1222353494.7226.0.camel@localhost> On Thu, 2008-09-25 at 14:11 +0000, Ben Clifford wrote: > A large number though take what looks to be about 2000 seconds to end (and > seemingly all are about the same duration, which maybe means its a timeout > on the task itself); That's probably when no worker is allocated, so it includes the queue time. From benc at hawaga.org.uk Thu Sep 25 09:41:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 25 Sep 2008 14:41:56 +0000 (GMT) Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: <1222353494.7226.0.camel@localhost> References: <1222353494.7226.0.camel@localhost> Message-ID: On Thu, 25 Sep 2008, Mihael Hategan wrote: > > A large number though take what looks to be about 2000 seconds to end (and > > seemingly all are about the same duration, which maybe means its a timeout > > on the task itself); > > That's probably when no worker is allocated, so it includes the queue > time. My understanding of the allocation pattern for workers is that there should be plenty of spare workers most/all of the time. These tasks aren't appearing at the every-6300s restart-the-workers points - they seem scattered fairly evenly throughout the execution. The time looks like it is mostly queue time. 
Look at this plot: http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7 /karatasks.JOB_SUBMISSION.Queue.sorted-start.png Each task is represented by a red line with the left end being when it goes into Submitted state and the right end being when it leaves submitted state (eg to fail or become active). -- From hategan at mcs.anl.gov Thu Sep 25 09:58:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 09:58:50 -0500 Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: References: <1222353494.7226.0.camel@localhost> Message-ID: <1222354730.7737.1.camel@localhost> On Thu, 2008-09-25 at 14:41 +0000, Ben Clifford wrote: > > On Thu, 25 Sep 2008, Mihael Hategan wrote: > > > > A large number though take what looks to be about 2000 seconds to end (and > > > seemingly all are about the same duration, which maybe means its a timeout > > > on the task itself); > > > > That's probably when no worker is allocated, so it includes the queue > > time. > > My understanding of the allocation pattern for workers is that there > should be plenty of spare workers most/all of the time. > > These tasks aren't appearing at the every-6300s restart-the-workers points > - they seem scattered fairly evenly throughout the execution. When a task needs a new worker it becomes bound to the request for the new worker. It does not go to another worker if it becomes available while its worker is queued. From benc at hawaga.org.uk Thu Sep 25 10:06:59 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 25 Sep 2008 15:06:59 +0000 (GMT) Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: <1222354730.7737.1.camel@localhost> References: <1222353494.7226.0.camel@localhost> <1222354730.7737.1.camel@localhost> Message-ID: On Thu, 25 Sep 2008, Mihael Hategan wrote: > When a task needs a new worker it becomes bound to the request for the > new worker. It does not go to another worker if it becomes available > while its worker is queued. Could be that, but ~2000s seems a long time for that - the every 6300s trough/peak periods where the coaster workers get restarted are only a couple hundred seconds long. -- From hategan at mcs.anl.gov Thu Sep 25 10:16:27 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 10:16:27 -0500 Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: References: <1222353494.7226.0.camel@localhost> <1222354730.7737.1.camel@localhost> Message-ID: <1222355787.7965.6.camel@localhost> On Thu, 2008-09-25 at 15:06 +0000, Ben Clifford wrote: > On Thu, 25 Sep 2008, Mihael Hategan wrote: > > > When a task needs a new worker it becomes bound to the request for the > > new worker. It does not go to another worker if it becomes available > > while its worker is queued. > > Could be that, but ~2000s seems a long time for that - the every 6300s > trough/peak periods where the coaster workers get restarted are only a > couple hundred seconds long. Maybe unrelated, but there were these workers that, as far as the queuing system was concerned, were running, without having produced any logs. They kept "running" after things stopped happening, despite the fact that they should have shut down for being idle for too long. So I suspect there is a problem there. If replication was on, the long jobs may be early replicas that happened to go to such a funny worker, and which were eventually canceled after another job went through the whole pipe. 
I will add some code to cancel workers if no registration is received a certain time after the respective job goes into running state. > From hategan at mcs.anl.gov Thu Sep 25 11:22:34 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 11:22:34 -0500 Subject: [Swift-devel] Re: examining the plots of a 65535 job CNARI run In-Reply-To: References: <1222353494.7226.0.camel@localhost> Message-ID: <1222359754.9096.4.camel@localhost> On Thu, 2008-09-25 at 14:41 +0000, Ben Clifford wrote: > Look at this plot: > > http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7 > /karatasks.JOB_SUBMISSION.Queue.sorted-start.png > > Each task is represented by a red line with the left end being when it > goes into Submitted state and the right end being when it leaves submitted > state (eg to fail or become active). Any chance the colors can be made different for the various conditions? From benc at hawaga.org.uk Thu Sep 25 13:06:19 2008 From: benc at hawaga.org.uk (=?ISO-8859-1?Q?Ben_Clifford?=) Date: 25 Sep 2008 19:06:19 +0100 Subject: [Swift-devel] RE: Re: examining the plots of a 65535 job CNARI run Message-ID: <200809251807.m8PI7DS3002757@dildano.hawaga.org.uk> Could be done but that graph is not 65535 pixels deep so there's lots of overlap. I'll have a play round with some other summarisation methods and see what i can come up with. ---- Message d'origine ---- De?: Mihael Hategan Envoy??: 25 Sep 2008 11:21 -05:00 A?: Ben Clifford Cc?: , Objet?: Re: examining the plots of a 65535 job CNARI run On Thu, 2008-09-25 at 14:41 +0000, Ben Clifford wrote: > Look at this plot: > > http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7 > /karatasks.JOB_SUBMISSION.Queue.sorted-start.png > > Each task is represented by a red line with the left end being when it > goes into Submitted state and the right end being when it leaves submitted > state (eg to fail or become active). Any chance the colors can be made different for the various conditions? -- From hategan at mcs.anl.gov Thu Sep 25 13:49:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Sep 2008 13:49:09 -0500 Subject: [Swift-devel] RE: Re: examining the plots of a 65535 job CNARI run In-Reply-To: <200809251807.m8PI7DS3002757@dildano.hawaga.org.uk> References: <200809251807.m8PI7DS3002757@dildano.hawaga.org.uk> Message-ID: <1222368549.13382.1.camel@localhost> On Thu, 2008-09-25 at 19:06 +0100, Ben Clifford wrote: > Could be done but that graph is not 65535 pixels deep so there's lots of overlap. Though if the long jobs end because of the same thing, there would be some clear visual cue there. From benc at hawaga.org.uk Fri Sep 26 14:45:25 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 26 Sep 2008 19:45:25 +0000 (GMT) Subject: [Swift-devel] status files Message-ID: For providers that return exit codes from jobs correctly, I think it is safe to not use status files and instead use the returned exit code. I think that's the case for gram4 and local execution and either is or can be the case for coasters and falkon. I'm specifically interested in two cases: one with falkon on the bg/p where status file munging seems to take some time; and the other with skenny's cnari app where file-related activity seems to dominate - it might help or might not. 
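A quick way to get a rough number for the status-file cost on a given shared filesystem, independent of Swift itself (illustrative only; the path is a placeholder):

    # approximate cost of one zero-byte status file per job, for 1000 jobs,
    # on the shared work directory filesystem
    cd /path/to/shared/swiftwork/status
    time (for i in $(seq 1 1000); do touch "job-$i-success"; done)
    time rm job-*-success    # the submit side also has to check and remove them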
-- From hategan at mcs.anl.gov Fri Sep 26 14:51:39 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 26 Sep 2008 14:51:39 -0500 Subject: [Swift-devel] status files In-Reply-To: References: Message-ID: <1222458699.31268.2.camel@localhost> On Fri, 2008-09-26 at 19:45 +0000, Ben Clifford wrote: > For providers that return exit codes from jobs correctly, I think it is > safe to not use status files and instead use the returned exit code. > > I think that's the case for gram4 and local execution and either is or can > be the case for coasters and falkon. > > I'm specifically interested in two cases: one with falkon on the bg/p > where status file munging seems to take some time; and the other with > skenny's cnari app where file-related activity seems to dominate - it > might help or might not. The exit code test is fast, since only a deletion is done in normal circumstances. I do not believe that the grain in performance is worth making the code more complex than it needs to be, but it may be worth a try. From benc at hawaga.org.uk Fri Sep 26 15:08:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 26 Sep 2008 20:08:07 +0000 (GMT) Subject: [Swift-devel] status files In-Reply-To: <1222458699.31268.2.camel@localhost> References: <1222458699.31268.2.camel@localhost> Message-ID: > I do not believe that the grain in performance is worth making the code > more complex than it needs to be, but it may be worth a try. I'm not particularly convinced either way, but hopefully I can get some numbers that show one way or the other. -- From benc at hawaga.org.uk Fri Sep 26 19:49:14 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 27 Sep 2008 00:49:14 +0000 (GMT) Subject: [Swift-devel] status files In-Reply-To: References: <1222458699.31268.2.camel@localhost> Message-ID: On Fri, 26 Sep 2008, Ben Clifford wrote: > > I do not believe that the grain in performance is worth making the code > > more complex than it needs to be, but it may be worth a try. > > I'm not particularly convinced either way, but hopefully I can get some > numbers that show one way or the other. On my rough mockup of the cnari application, it looks like this doesn't really have much effect when running 1000 jobs at the load I'm getting. It might be interesting to try on falkon+gpfs on the BG/P at Argonne, where some of the worker node wrapper logs suggest that a bunch of time is being consumed by status file munging. -- From benc at hawaga.org.uk Sun Sep 28 12:14:22 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 28 Sep 2008 17:14:22 +0000 (GMT) Subject: [Swift-devel] fakecnari on ranger without gridftp Message-ID: I have an app 'fakecnari' that behaves somewhat like the CNARI app that skenny has been working on, in order to make it easier for me to look at bottlenecks. So far, that's had similar problems to skenny's real runs where input files cannot be staged in to ranger fast enough from UC - this limits the number of cores that can be used at any one time on Ranger to around 15. So I thought in order to see what other bottlenecks might be found, I'd make a run with swift running directly on a ranger headnode, submitting through coasters and with the input and output files moved around using the local copy file provider (the same as happens when you use the default local site). This looks like it manages to use over 100 cores quite a lot. 
The speedup for the run is (including allocation time for coaster workers, which is a significant part of this run) about 50000s worth of sleep done in 800s, which is sleeping 62 times as fast as on a single core. Discounting worker allocation time, this takes about 590s which is sleeping about 84 times as fast. Even with local copies instead of ftp, file transfers (limited to 4 at once) appear to be a rate limiting factor. There are full plots here: http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1134-herl17vf/ -- From benc at hawaga.org.uk Sun Sep 28 14:24:28 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 28 Sep 2008 19:24:28 +0000 (GMT) Subject: [Swift-devel] Re: fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: On Sun, 28 Sep 2008, Ben Clifford wrote: > There are full plots here: > > http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1134-herl17vf/ A second run with same parameters looks somewhat different visually but still seems to take around the same amount of time - more cores in use at peak, for example, but longer gaps with much fewer (sometimes none) in use. http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1345-i8fzjcla/ -- From foster at mcs.anl.gov Sun Sep 28 14:30:19 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 28 Sep 2008 14:30:19 -0500 Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: Ben: I like the idea of being able to sleep faster. That could help a lot sometimes :) Zhao and others have been working a lot on BG/P, as you probably know. The challenges inherent in file transfers there are in some ways comparable. Their solution has been to develop collective I/O operations that (a) multicast replicated input files more efficiently than multiple independent reads, and (b) apply two-stage methods to bundled inputs and outputs, moving them from local disks to the global file system via an intermediate file system. I wonder whether similar methods could be applied here? Independently of that, it would be useful to develop a performance model for the whole end-to-end system to determine where the bottlenecks are, and the peak performance that could be expected for each stage (including data movement) so we can see where there are opportunities for improvement, and where we are limited by fundamental limits. Ian. On Sep 28, 2008, at 12:14 PM, Ben Clifford wrote: > > I have an app 'fakecnari' that behaves somewhat like the CNARI app > that > skenny has been working on, in order to make it easier for me to > look at > bottlenecks. > > So far, that's had similar problems to skenny's real runs where input > files cannot be staged in to ranger fast enough from UC - this > limits the > number of cores that can be used at any one time on Ranger to around > 15. > > So I thought in order to see what other bottlenecks might be found, > I'd > make a run with swift running directly on a ranger headnode, > submitting > through coasters and with the input and output files moved around > using > the local copy file provider (the same as happens when you use the > default > local site). > > This looks like it manages to use over 100 cores quite a lot. The > speedup > for the run is (including allocation time for coaster workers, which > is > a significant part of this run) about 50000s worth of sleep done in > 800s, > which is sleeping 62 times as fast as on a single core. 
Discounting > worker allocation time, this takes about 590s which is sleeping > about 84 > times as fast. > > Even with local copies instead of ftp, file transfers (limited to 4 at > once) appear to be a rate limiting factor. > > There are full plots here: > > http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1134-herl17vf/ > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Sun Sep 28 14:45:10 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 28 Sep 2008 19:45:10 +0000 (GMT) Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: On Sun, 28 Sep 2008, Ian Foster wrote: > Zhao and others have been working a lot on BG/P, as you probably know. The > challenges inherent in file transfers there are in some ways comparable. Their > solution has been to develop collective I/O operations that (a) multicast > replicated input files more efficiently than multiple independent reads, and > (b) apply two-stage methods to bundled inputs and outputs, moving them from > local disks to the global file system via an intermediate file system. I > wonder whether similar methods could be applied here? yes, I imagine methods that help in falkon+bg/p will help here too, at least in the abstract. > Independently of that, it would be useful to develop a performance model for > the whole end-to-end system to determine where the bottlenecks are, and the > peak performance that could be expected for each stage (including data > movement) so we can see where there are opportunities for improvement, and > where we are limited by fundamental limits. right. certainly I know that transferring the same amount of data as a tarball in this particular app seems much much faster (9s vs 450s) - see Subject: [Swift-devel] lots of very small files vs gridftp. >From a raw CPU perspective, there's 65535 parallel tasks each of a few seconds long each; there's a fairly obvious, if naive, target there of 65k x speedup (mmm). I still don't really have a feel for what the filesystem on Ranger (lustre) will do - its been behaving fairly well so far, I think, but I imagine thats because it hasn't been terribly heavily loaded in my testing. -- From hategan at mcs.anl.gov Sun Sep 28 15:28:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 28 Sep 2008 15:28:53 -0500 Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: <1222633733.18444.4.camel@localhost> On Sun, 2008-09-28 at 19:45 +0000, Ben Clifford wrote: > certainly I know that transferring the same amount of data as a tarball in > this particular app seems much much faster (9s vs 450s) - see Subject: > [Swift-devel] lots of very small files vs gridftp. Though one thing that is missing there is the part where lots of small files are created on the filesystem, which, on a shared FS, may be a considerable time. So I think we should probably define clearly what we mean by transferring lots of files, and whether that includes opening each, reading from each, sending data, creating each, and writing to each of the files involved. 
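One way to put a number on the create/write part alone, with no network involved (a sketch; the paths are placeholders):

    # cost of just creating and writing 1024 x 1KB files on the shared FS...
    time (for i in $(seq 0 1023); do head -c 1024 /dev/zero > /shared/scratch/small/f$i; done)
    # ...versus the same writes on node-local disk, to isolate the shared-FS part
    time (for i in $(seq 0 1023); do head -c 1024 /dev/zero > /tmp/small/f$i; done)

Whichever transfer mechanism is used, the per-file create/write cost ends up being paid somewhere once the files have to exist individually on the remote side.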
From iraicu at cs.uchicago.edu Sun Sep 28 22:00:51 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 28 Sep 2008 22:00:51 -0500 Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: <48E044E3.7080505@cs.uchicago.edu> I see the following stats for this run: Total number of events: 10002 Shortest event (s): 3.4300000667572 Longest event (s): 753.21799993515 Total duration of all events (s): 53898.3449883461 Mean event duration (s): 5.38875674748511 Standard deviation of event duration (s): 7.48318471316593 Maximum number of events at one time: 113 What inherently limits the run to 113 events at a time? Is it the fact that Coaster only allocated 113 (maybe a few more) CPU-cores? How many CPU-cores did coaster allocate? With 113 CPU-cores and 5.38 sec tasks, that means a throughput of ~21 tasks/sec. Is this the bottleneck? Its probably not the file system (in terms of the app accessing the input/output data), as if it were, task execution times would simply increase with load... but it could be the file system being slow in getting the input data in the right place for the app to start computing, as in staging it in. BTW, do the times above include wait queue times? I see the longest task being 753 sec, but below you say that the workload takes 590 sec not including queue time. Do you have a plot of the number of CPUs in relation to the number of active tasks? Are all available CPUs kept busy? The speedup is one story, 84X out of 113X possible (this 113X should really be the number of CPU-cores), but sometimes the workload characteristics limit the maximum possible speedup... and in that case, its good to look at the CPU-core utilization. Is it possible to draw a graph that has this info? # of CPU-cores, number of active tasks, and throughput of completed tasks? Ioan Ben Clifford wrote: > I have an app 'fakecnari' that behaves somewhat like the CNARI app that > skenny has been working on, in order to make it easier for me to look at > bottlenecks. > > So far, that's had similar problems to skenny's real runs where input > files cannot be staged in to ranger fast enough from UC - this limits the > number of cores that can be used at any one time on Ranger to around 15. > > So I thought in order to see what other bottlenecks might be found, I'd > make a run with swift running directly on a ranger headnode, submitting > through coasters and with the input and output files moved around using > the local copy file provider (the same as happens when you use the default > local site). > > This looks like it manages to use over 100 cores quite a lot. The speedup > for the run is (including allocation time for coaster workers, which is > a significant part of this run) about 50000s worth of sleep done in 800s, > which is sleeping 62 times as fast as on a single core. Discounting > worker allocation time, this takes about 590s which is sleeping about 84 > times as fast. > > Even with local copies instead of ftp, file transfers (limited to 4 at > once) appear to be a rate limiting factor. > > There are full plots here: > > http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080928-1134-herl17vf/ > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Sun Sep 28 22:29:36 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 28 Sep 2008 22:29:36 -0500 Subject: [Swift-devel] fakecnari on ranger without gridftp In-Reply-To: References: Message-ID: <48E04BA0.600@cs.uchicago.edu> Hi, Ben Clifford wrote: > On Sun, 28 Sep 2008, Ian Foster wrote: > > ... > > >From a raw CPU perspective, there's 65535 parallel tasks each of a few > seconds long each; there's a fairly obvious, if naive, target there of 65k > x speedup (mmm). > > At that point, the main bottlenecks will be scalability of the mechanisms you use to drive the execution framework (i.e. Coaster/Falkon), the time it takes to bootstrap your the execution framework, the throughput you can dispatch tasks and receive results, and the speed of the file system that it can read inputs and write outputs. Assuming the execution framework scales, the rest can be computed as long as you understand the performance of the machine and the execution framework. For example, here are two runs we made with Falkon on the BG/P recently that might be similar to the fakecnari workload. We had 32K CPU-cores, 128K tasks, and each task involved sleeping for 4 sec, and writing 1KB of data. In an ideal world with 0 costs for the execution framework, and 0 costs of I/O, the workload time would have been 16 seconds (128K/32K*4sec), which would equate to 524288 CPU seconds. Running this workload on GPFS directly took 180 seconds (2912X), and running the same workload through the collective I/O framework we have took 61 seconds (8594X). The bottleneck in the GPFS case was the rate that we could create files and write the 1KB file, in the context of 32K CPUs doing this concurrently. The bottleneck for the collective I/O was the dispatch rate of Falkon, which in this case was 2148 tasks/sec. Once you understand the performance of the file system, and execution framework, these large scale numbers can be estimated quite nicely. > I still don't really have a feel for what the filesystem on Ranger > (lustre) will do - its been behaving fairly well so far, I think, but I > imagine thats because it hasn't been terribly heavily loaded in my > testing. > And if it was built to support the entire machine at full scale (64K CPU-cores), then I'd imagine that you'll need at least 1000s, if not 10Ks of CPU-cores to saturate the file system with small files. Once of these days, we'll probably start testing some of our BG/P apps on Ranger as well, so then, we can exchange notes better on each other's experiences and problems we are each facing. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Sep 29 18:37:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 29 Sep 2008 18:37:50 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: References: Message-ID: <1222731470.23121.21.camel@localhost> On Thu, 2008-09-25 at 13:10 +0000, Ben Clifford wrote: > To transfer 1000 files: > > # concurrent conncetions | duration of copy (seconds, multiple runs) > 16 7, 16, 16 > 4 14, 14, 14 > 2 26, 25 > 1 48, 52 > I tried a similar experiment, this time with the java libraries, to see how that works. The setup was transfer 1024 files of 1024 bytes each with parallelism (at the karajan level, though this should cause corresponding gridftp connection parallelism) of 1 to 16 in powers of 2. I got this for Ranger (in ms): 1: 242030 2: 121916 4: 61787 8: 31903 16: died (probably trying to start too many connections concurrently) Then UC: 1: 212192 2: 106872 4: 54790 8: 28838 16: 18166 Then I made a quick file provider for coasters, which sends the data over the same connection (and upped the parallelism): UC-coaster 1: 102624 2: 31388 4: 18042 8: 8823 16: 5510 32: 5053 64: 6686 128: 5551 Then I ran the same, but instead of transferring to a nfs directory, things went to /dev/null: 1: 93997 2: 35694 4: 16269 8: 7349 16: 4462 32: 1865 64: 1332 128: 1304 I suppose the bad speed with coasters is because things go up on an encrypted connection, but it may be something else. So otherwise, if files are small, one can look at this as the task of sending (acknowledged) messages from one side to the other, where the communication lag is the problem and the way to solve it is by increasing parallelism (which essentially is what tarring things up does). That and whatever FS limitations the remote side has. Mihael From foster at mcs.anl.gov Mon Sep 29 20:34:18 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 29 Sep 2008 20:34:18 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: <1222731470.23121.21.camel@localhost> References: <1222731470.23121.21.camel@localhost> Message-ID: <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> Remind me again why we aren't just using TAR and GridFTP? Ian. On Sep 29, 2008, at 6:37 PM, Mihael Hategan wrote: > On Thu, 2008-09-25 at 13:10 +0000, Ben Clifford wrote: > >> To transfer 1000 files: >> >> # concurrent conncetions | duration of copy (seconds, multiple >> runs) >> 16 7, 16, 16 >> 4 14, 14, 14 >> 2 26, 25 >> 1 48, 52 >> > > I tried a similar experiment, this time with the java libraries, to > see > how that works. > > The setup was transfer 1024 files of 1024 bytes each with parallelism > (at the karajan level, though this should cause corresponding gridftp > connection parallelism) of 1 to 16 in powers of 2. 
> > I got this for Ranger (in ms): > 1: 242030 > 2: 121916 > 4: 61787 > 8: 31903 > 16: died (probably trying to start too many connections concurrently) > > Then UC: > 1: 212192 > 2: 106872 > 4: 54790 > 8: 28838 > 16: 18166 > > Then I made a quick file provider for coasters, which sends the data > over the same connection (and upped the parallelism): > UC-coaster > 1: 102624 > 2: 31388 > 4: 18042 > 8: 8823 > 16: 5510 > 32: 5053 > 64: 6686 > 128: 5551 > > Then I ran the same, but instead of transferring to a nfs directory, > things went to /dev/null: > 1: 93997 > 2: 35694 > 4: 16269 > 8: 7349 > 16: 4462 > 32: 1865 > 64: 1332 > 128: 1304 > > I suppose the bad speed with coasters is because things go up on an > encrypted connection, but it may be something else. > > So otherwise, if files are small, one can look at this as the task of > sending (acknowledged) messages from one side to the other, where the > communication lag is the problem and the way to solve it is by > increasing parallelism (which essentially is what tarring things up > does). That and whatever FS limitations the remote side has. > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Sep 30 00:23:39 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Sep 2008 00:23:39 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> Message-ID: <1222752219.29016.11.camel@localhost> On Mon, 2008-09-29 at 20:34 -0500, Ian Foster wrote: > Remind me again why we aren't just using TAR and GridFTP? I don't think we're using anything at this point, hence all the testing and exploring. I think there is some complexity in figuring out dynamically what exactly to tar up and how to untar on the remote site. So more (or more complex) code than the other choice. But I'm not otherwise opposed to anything in particular. I suppose taring/untaring could be done manually, at the expense of messing the abstractness of swift. > > Ian. > > On Sep 29, 2008, at 6:37 PM, Mihael Hategan wrote: > > > On Thu, 2008-09-25 at 13:10 +0000, Ben Clifford wrote: > > > >> To transfer 1000 files: > >> > >> # concurrent conncetions | duration of copy (seconds, multiple > >> runs) > >> 16 7, 16, 16 > >> 4 14, 14, 14 > >> 2 26, 25 > >> 1 48, 52 > >> > > > > I tried a similar experiment, this time with the java libraries, to > > see > > how that works. > > > > The setup was transfer 1024 files of 1024 bytes each with parallelism > > (at the karajan level, though this should cause corresponding gridftp > > connection parallelism) of 1 to 16 in powers of 2. 
> > > > I got this for Ranger (in ms): > > 1: 242030 > > 2: 121916 > > 4: 61787 > > 8: 31903 > > 16: died (probably trying to start too many connections concurrently) > > > > Then UC: > > 1: 212192 > > 2: 106872 > > 4: 54790 > > 8: 28838 > > 16: 18166 > > > > Then I made a quick file provider for coasters, which sends the data > > over the same connection (and upped the parallelism): > > UC-coaster > > 1: 102624 > > 2: 31388 > > 4: 18042 > > 8: 8823 > > 16: 5510 > > 32: 5053 > > 64: 6686 > > 128: 5551 > > > > Then I ran the same, but instead of transferring to a nfs directory, > > things went to /dev/null: > > 1: 93997 > > 2: 35694 > > 4: 16269 > > 8: 7349 > > 16: 4462 > > 32: 1865 > > 64: 1332 > > 128: 1304 > > > > I suppose the bad speed with coasters is because things go up on an > > encrypted connection, but it may be something else. > > > > So otherwise, if files are small, one can look at this as the task of > > sending (acknowledged) messages from one side to the other, where the > > communication lag is the problem and the way to solve it is by > > increasing parallelism (which essentially is what tarring things up > > does). That and whatever FS limitations the remote side has. > > > > Mihael > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Sep 30 19:02:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 1 Oct 2008 00:02:05 +0000 (GMT) Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: <1222752219.29016.11.camel@localhost> References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> <1222752219.29016.11.camel@localhost> Message-ID: On Tue, 30 Sep 2008, Mihael Hategan wrote: > But I'm not otherwise opposed to anything in particular. I suppose > taring/untaring could be done manually, at the expense of messing the > abstractness of swift. I played some making Swift do tar/untar of stageins automatically (so no modifications are needed to the SwiftScript code). Theres a plot here http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080930-1820-0nmtamxg/ Basically the first 600s are taken up allocating coaster workers, and the remaining time uses quite a lot of cores at once. So the total duration of run doesn't seem that different; but I think that the behaviour as number of jobs increases will be better- the 600s startup is a fixed cost (which I also think can be massively reduced in a couple of ways) and the bit that is proportional to the number of jobs is the remaining three hundred seconds. This is a fairly dirty hack - there's no clustering for stageouts; there is fairly crude decision of whether to cluster transfers or not (basically, queue file transfers for 30s and after that, if there's more than one, make a cluster). The initial startup is slow, I think, because the initial startup of coaster workers is done based on a malformed job submission caused by the low quality of this clustering code - it doesn't pass through the coastersPerNode parameter for initial jobs so the initial coaster worker is very slow. 
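For anyone trying to picture what the clustered stage-in amounts to, it is roughly this, done inside Swift rather than by hand (names and paths below are invented for illustration):

    # tar whatever stage-ins queued up during the 30s window...
    tar cf cluster-0001.tar -C stagein-queue .
    # ...move the batch as a single GridFTP transfer...
    globus-url-copy "file://$PWD/cluster-0001.tar" \
        "gsiftp://gridftp.ranger.example.org/work/run/cluster-0001.tar"
    # ...then one small remote job unpacks it into the shared work directory:
    #     tar xf cluster-0001.tar -C shared/ && rm cluster-0001.tar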
-- From hategan at mcs.anl.gov Tue Sep 30 19:45:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Sep 2008 19:45:44 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> <1222752219.29016.11.camel@localhost> Message-ID: <1222821944.6731.24.camel@localhost> On Wed, 2008-10-01 at 00:02 +0000, Ben Clifford wrote: > On Tue, 30 Sep 2008, Mihael Hategan wrote: > > > But I'm not otherwise opposed to anything in particular. I suppose > > taring/untaring could be done manually, at the expense of messing the > > abstractness of swift. > > I played some making Swift do tar/untar of stageins automatically (so no > modifications are needed to the SwiftScript code). ?This reminds me of clustering of jobs we did initially with swift versus Falkon. > > Theres a plot here > http://www.ci.uchicago.edu/~benc/tmp/report-fakecnari-20080930-1820-0nmtamxg/ > > Basically the first 600s are taken up allocating coaster workers, and the > remaining time uses quite a lot of cores at once. So the total duration of > run doesn't seem that different; but I think that the behaviour as number > of jobs increases will be better- the 600s startup is a fixed cost (which > I also think can be massively reduced in a couple of ways) and the bit > that is proportional to the number of jobs is the remaining three hundred > seconds. Could the untar job be done with fork straight through gram? > > > This is a fairly dirty hack - there's no clustering for stageouts; Though that could be done in a similar way, right? > there > is fairly crude decision of whether to cluster transfers or not > (basically, queue file transfers for 30s and after that, if there's more > than one, make a cluster). > > The initial startup is slow, I think, because the initial startup of > coaster workers is done based on a malformed job submission caused by the > low quality of this clustering code - it doesn't pass through the > coastersPerNode parameter for initial jobs so the initial coaster worker > is very slow. Using fork would probably solve this, too. Now, not to leave the other side of the argument on its own, running other fileops (mkdir, ls, etc.) through coasters does offer the added benefit of parallelizing the RTT. We do at least one such operation per job. In the 2^16 jobs case, and with parallelism of 2^7 vs. 2^3, one would get, in the theoretical case, an improvement of (2^16/2^3 - 2^16/2^7)RTTs. Or (8192 - 512)RTTs. Or 7680s for an RTT of 1s. But then that could probably also be done in GridFTP with pipelining. From benc at hawaga.org.uk Tue Sep 30 19:56:54 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 1 Oct 2008 00:56:54 +0000 (GMT) Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: <1222821944.6731.24.camel@localhost> References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> <1222752219.29016.11.camel@localhost> <1222821944.6731.24.camel@localhost> Message-ID: On Tue, 30 Sep 2008, Mihael Hategan wrote: > ?This reminds me of clustering of jobs we did initially with swift > versus Falkon. yes. its half a cut-and-paste job of that code. > Could the untar job be done with fork straight through gram? yes. though my feeling at the moment when there is something like coasters around with spare nodes is that this wouldn't change much. 
In the case of non-coaster runs using something like gram4 that can still take a reasonable number of jobs, getting the untar pushed ahead of queued 'work' jobs is probably good. -- From hategan at mcs.anl.gov Tue Sep 30 20:03:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Sep 2008 20:03:50 -0500 Subject: [Swift-devel] lots of very small files vs gridftp In-Reply-To: References: <1222731470.23121.21.camel@localhost> <2C47C9EB-948D-4E3E-BC83-402111B6C45E@mcs.anl.gov> <1222752219.29016.11.camel@localhost> <1222821944.6731.24.camel@localhost> Message-ID: <1222823030.7845.0.camel@localhost> On Wed, 2008-10-01 at 00:56 +0000, Ben Clifford wrote: > On Tue, 30 Sep 2008, Mihael Hategan wrote: > > > ?This reminds me of clustering of jobs we did initially with swift > > versus Falkon. > > yes. its half a cut-and-paste job of that code. I was mostly thinking about the conceptual part and the implications. At least for small jobs, Falkon turned out to be the better option. > > > Could the untar job be done with fork straight through gram? > > yes. though my feeling at the moment when there is something like coasters > around with spare nodes is that this wouldn't change much. In the case of > non-coaster runs using something like gram4 that can still take a > reasonable number of jobs, getting the untar pushed ahead of queued 'work' > jobs is probably good. > From zhaozhang at uchicago.edu Tue Sep 30 21:40:20 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 30 Sep 2008 21:40:20 -0500 Subject: [Swift-devel] could swift use return code from falkon as a success notification? Message-ID: <48E2E314.6040809@uchicago.edu> Hi, All I am trying to optimize the swift performance on BGP, I finished it for the input phase, but suffering the poor performance at the output phase, which is exactly the status file creation process, as you could tell from the following picture. In this test, I ran sleep_30 jobs, which is expected to finish in 30 seconds. I am wondering if we could use falkon return code instead of the status file? Thanks. zhao -------------- next part -------------- A non-text attachment was scrubbed... Name: wrapper.JPG Type: image/jpeg Size: 33697 bytes Desc: not available URL: From hategan at mcs.anl.gov Tue Sep 30 22:01:21 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Sep 2008 22:01:21 -0500 Subject: [Swift-devel] could swift use return code from falkon as a success notification? In-Reply-To: <48E2E314.6040809@uchicago.edu> References: <48E2E314.6040809@uchicago.edu> Message-ID: <1222830081.9463.4.camel@localhost> On Tue, 2008-09-30 at 21:40 -0500, Zhao Zhang wrote: > Hi, All > > I am trying to optimize the swift performance on BGP, I finished it for > the input phase, > but suffering the poor performance at the output phase, which is exactly > the status file > creation process, as you could tell from the following picture. In this > test, I ran sleep_30 > jobs, which is expected to finish in 30 seconds. > > I am wondering if we could use falkon return code instead of the status > file? Thanks. Yes you could. You would have to do the following: 1. Remove the relevant part from the wrapper (touching of the success file and sticking failure info in the failure file) 2. Comment out the checkJobStatus() call in vdl-int.k (around line 415) 3. Make the deef provider set a fault on the task (should be a JobException) when the exit code is not 0 4. 
Make the wrapper exit with a non-zero exit code when there is a problem.

If this is too brief, let me know, and I'll give you more details.
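As a rough illustration of what steps 1 and 4 mean on the wrapper side (this is not the actual wrapper.sh - names are invented, and steps 2 and 3 are changes in vdl-int.k and the deef provider on the Java side, not shown):

    # run the application as before
    "$EXECUTABLE" "$@" > stdout.txt 2> stderr.txt
    RC=$?

    # step 1: no touch of the success file / no writing of the failure file here

    # step 4: propagate the application's exit code so Falkon can report it
    exit $RC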