From benc at hawaga.org.uk Tue Apr 1 03:58:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 1 Apr 2008 08:58:52 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E88D8C.4090207@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > With this fixed, the total time in wrapper.sh including the app is now about > 15 seconds, with 3 being in the app-wrapper itself. The time seems about > evenly spread over the several wrapper.sh operations, which is not surprising > when 500 wrappers hit NFS all at once. Does this machine have a higher (/different) performance shared file system such as PVFS or GPFS? We spent some time in november layout out the filesystem to be sympathetic to GPFS to help avoid bottlenecks like you are seeing here. It would be kinda sad if either it isn't available or you aren't using it even though it is available. -- From benc at hawaga.org.uk Tue Apr 1 05:05:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 1 Apr 2008 10:05:46 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: On Tue, 1 Apr 2008, Ben Clifford wrote: > > With this fixed, the total time in wrapper.sh including the app is now about > > 15 seconds, with 3 being in the app-wrapper itself. The time seems about > > evenly spread over the several wrapper.sh operations, which is not surprising > > when 500 wrappers hit NFS all at once. > > Does this machine have a higher (/different) performance shared file > system such as PVFS or GPFS? We spent some time in november layout out the > filesystem to be sympathetic to GPFS to help avoid bottlenecks like you > are seeing here. It would be kinda sad if either it isn't available or you > aren't using it even though it is available. >From what I can tell from the web, PVFS and/or GPFS are available on all of the Argonne Blue Gene machines. Is this true? I don't want to provide more scalability support for NFS-on-bluegene if it is. -- From wilde at mcs.anl.gov Tue Apr 1 08:04:04 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 01 Apr 2008 08:04:04 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: <47F232C4.3080607@mcs.anl.gov> We're only working on the BG/P system, and GPFS is the only shared filesystem there. GPFS access, however, remains a big scalabiity issue. Frequent small accesses to GPFS in our measurements really slow down the workflow. We did a lot of micro-benchmark tests. Zhao, can you gather a set of these tests into a small suite and post numbers so the Swift developers can get an understanding of the system's GPFS access performance? 
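A minimal sketch of what such a suite might look like is below (bash; the GPFS path, the operation count, and the do-nothing script are all placeholders, and it assumes GNU date and bc are available). It only times the kinds of frequent small accesses described above, so the absolute numbers matter less than comparing one filesystem against another:

    #!/bin/bash
    # Hedged micro-benchmark sketch (all paths and counts are placeholders):
    # times the small, metadata-heavy operations discussed in this thread --
    # mkdir, 1-byte writes/reads, and invoking a trivial script -- on
    # whatever filesystem it is pointed at. Assumes GNU date and bc.
    FS=${1:-/gpfs/home/$USER/bench}   # filesystem to test (GPFS, NFS, /dev/shm, ...)
    N=${2:-500}                       # e.g. "500 wrappers hit it at once"
    mkdir -p "$FS" && cd "$FS" || exit 1
    printf '#!/bin/sh\nexit 0\n' > trivial.sh && chmod +x trivial.sh

    run() {   # run <label> <shell command using $i>
        local label=$1 cmd=$2 t0 t1
        t0=$(date +%s.%N)
        for i in $(seq 1 "$N"); do eval "$cmd" >/dev/null 2>&1; done
        t1=$(date +%s.%N)
        echo "$label: $(echo "$t1 - $t0" | bc) s for $N operations"
    }

    run "mkdir         " 'mkdir dir.$i'
    run "1-byte write  " 'echo x > file.$i'
    run "1-byte read   " 'cat file.$i'
    run "trivial script" './trivial.sh'

Running it once against GPFS and once against the RAM disk would put a number on the gap being described here.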
Also note: the only local filesystem is RAM disk on /tmp or /dev/shm. (Ioan and Zhao should confirm if they verified that /tmp is on RAM). - Mike On 4/1/08 5:05 AM, Ben Clifford wrote: > On Tue, 1 Apr 2008, Ben Clifford wrote: > >>> With this fixed, the total time in wrapper.sh including the app is now about >>> 15 seconds, with 3 being in the app-wrapper itself. The time seems about >>> evenly spread over the several wrapper.sh operations, which is not surprising >>> when 500 wrappers hit NFS all at once. >> Does this machine have a higher (/different) performance shared file >> system such as PVFS or GPFS? We spent some time in november layout out the >> filesystem to be sympathetic to GPFS to help avoid bottlenecks like you >> are seeing here. It would be kinda sad if either it isn't available or you >> aren't using it even though it is available. > >>From what I can tell from the web, PVFS and/or GPFS are available on all > of the Argonne Blue Gene machines. Is this true? I don't want to provide > more scalability support for NFS-on-bluegene if it is. > From iraicu at cs.uchicago.edu Tue Apr 1 08:37:52 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 01 Apr 2008 08:37:52 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: <47F23AB0.8010109@cs.uchicago.edu> Ben, The #s below were from the SiCortex, which only has NFS. We are using the latest Swift from SVN, so if the Swift improvements to avoid these bottlenecks are enabled by default in Swift, then we are using them already! On the BG/P, we have GPFS and PVFS, but we found GPFS to handle meta-data better, so we are using GPFS for all our tests. Ioan Ben Clifford wrote: > On Tue, 25 Mar 2008, Michael Wilde wrote: > > >> With this fixed, the total time in wrapper.sh including the app is now about >> 15 seconds, with 3 being in the app-wrapper itself. The time seems about >> evenly spread over the several wrapper.sh operations, which is not surprising >> when 500 wrappers hit NFS all at once. >> > > Does this machine have a higher (/different) performance shared file > system such as PVFS or GPFS? We spent some time in november layout out the > filesystem to be sympathetic to GPFS to help avoid bottlenecks like you > are seeing here. It would be kinda sad if either it isn't available or you > aren't using it even though it is available. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Tue Apr 1 08:39:04 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 01 Apr 2008 08:39:04 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... 
plots to explain plateaus...] In-Reply-To: References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: <47F23AF8.2000508@cs.uchicago.edu> NFS on SiCortex ( in the future, there will be PVFS, but that is not today) GPFS on BG/P Ioan Ben Clifford wrote: > On Tue, 1 Apr 2008, Ben Clifford wrote: > > >>> With this fixed, the total time in wrapper.sh including the app is now about >>> 15 seconds, with 3 being in the app-wrapper itself. The time seems about >>> evenly spread over the several wrapper.sh operations, which is not surprising >>> when 500 wrappers hit NFS all at once. >>> >> Does this machine have a higher (/different) performance shared file >> system such as PVFS or GPFS? We spent some time in november layout out the >> filesystem to be sympathetic to GPFS to help avoid bottlenecks like you >> are seeing here. It would be kinda sad if either it isn't available or you >> aren't using it even though it is available. >> > > >From what I can tell from the web, PVFS and/or GPFS are available on all > of the Argonne Blue Gene machines. Is this true? I don't want to provide > more scalability support for NFS-on-bluegene if it is. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Tue Apr 1 10:26:45 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 01 Apr 2008 10:26:45 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47F232C4.3080607@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> Message-ID: <47F25435.8080105@cs.uchicago.edu> Michael Wilde wrote: > We're only working on the BG/P system, and GPFS is the only shared > filesystem there. There is PVFS, but that performed even worse in our tests. > > GPFS access, however, remains a big scalabiity issue. Frequent small > accesses to GPFS in our measurements really slow down the workflow. We > did a lot of micro-benchmark tests. Yes! The BG/P's GPFS probably performs the worst out of all GPFSes I have worked on, in terms of small granular accesses. For example, reading 1 byte files, invoking a trivial script (i.e. exit 0), etc... all perform extremely poor, to the point that we need to move away from GPFS almost completely. For example, the things that we eventually need to avoid on GPFS for the BG/P are: invoking wrapper.sh mkdir any logging to GPFS There are probably others. 
> > Zhao, can you gather a set of these tests into a small suite and post > numbers so the Swift developers can get an understanding of the > system's GPFS access performance? > > Also note: the only local filesystem is RAM disk on /tmp or /dev/shm. > (Ioan and Zhao should confirm if they verified that /tmp is on RAM). Yes, there are no local disks on either BG/P or SiCortex. Both machines have /tmp and dev/shm mounted as ram disks. Ioan > > - Mike > > On 4/1/08 5:05 AM, Ben Clifford wrote: >> On Tue, 1 Apr 2008, Ben Clifford wrote: >> >>>> With this fixed, the total time in wrapper.sh including the app is >>>> now about >>>> 15 seconds, with 3 being in the app-wrapper itself. The time seems >>>> about >>>> evenly spread over the several wrapper.sh operations, which is not >>>> surprising >>>> when 500 wrappers hit NFS all at once. >>> Does this machine have a higher (/different) performance shared file >>> system such as PVFS or GPFS? We spent some time in november layout >>> out the filesystem to be sympathetic to GPFS to help avoid >>> bottlenecks like you are seeing here. It would be kinda sad if >>> either it isn't available or you aren't using it even though it is >>> available. >> >>> From what I can tell from the web, PVFS and/or GPFS are available on >>> all >> of the Argonne Blue Gene machines. Is this true? I don't want to >> provide more scalability support for NFS-on-bluegene if it is. >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Tue Apr 1 10:32:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 01 Apr 2008 10:32:03 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47F25435.8080105@cs.uchicago.edu> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu> Message-ID: <1207063923.30798.0.camel@blabla.mcs.anl.gov> On Tue, 2008-04-01 at 10:26 -0500, Ioan Raicu wrote: > > Michael Wilde wrote: > > We're only working on the BG/P system, and GPFS is the only shared > > filesystem there. > There is PVFS, but that performed even worse in our tests. > > > > GPFS access, however, remains a big scalabiity issue. Frequent small > > accesses to GPFS in our measurements really slow down the workflow. We > > did a lot of micro-benchmark tests. > Yes! The BG/P's GPFS probably performs the worst out of all GPFSes I > have worked on, in terms of small granular accesses. For example, > reading 1 byte files, invoking a trivial script (i.e. exit 0), etc... 
> all perform extremely poor, to the point that we need to move away from > GPFS almost completely. For example, the things that we eventually need > to avoid on GPFS for the BG/P are: > invoking wrapper.sh > mkdir > any logging to GPFS Doing nothing can be incredibly fast. > > There are probably others. > > > > Zhao, can you gather a set of these tests into a small suite and post > > numbers so the Swift developers can get an understanding of the > > system's GPFS access performance? > > > > Also note: the only local filesystem is RAM disk on /tmp or /dev/shm. > > (Ioan and Zhao should confirm if they verified that /tmp is on RAM). > Yes, there are no local disks on either BG/P or SiCortex. Both machines > have /tmp and dev/shm mounted as ram disks. > > Ioan > > > > - Mike > > > > On 4/1/08 5:05 AM, Ben Clifford wrote: > >> On Tue, 1 Apr 2008, Ben Clifford wrote: > >> > >>>> With this fixed, the total time in wrapper.sh including the app is > >>>> now about > >>>> 15 seconds, with 3 being in the app-wrapper itself. The time seems > >>>> about > >>>> evenly spread over the several wrapper.sh operations, which is not > >>>> surprising > >>>> when 500 wrappers hit NFS all at once. > >>> Does this machine have a higher (/different) performance shared file > >>> system such as PVFS or GPFS? We spent some time in november layout > >>> out the filesystem to be sympathetic to GPFS to help avoid > >>> bottlenecks like you are seeing here. It would be kinda sad if > >>> either it isn't available or you aren't using it even though it is > >>> available. > >> > >>> From what I can tell from the web, PVFS and/or GPFS are available on > >>> all > >> of the Argonne Blue Gene machines. Is this true? I don't want to > >> provide more scalability support for NFS-on-bluegene if it is. > >> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From iraicu at cs.uchicago.edu Tue Apr 1 10:43:16 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 01 Apr 2008 10:43:16 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1207063923.30798.0.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu> <1207063923.30798.0.camel@blabla.mcs.anl.gov> Message-ID: <47F25814.8050301@cs.uchicago.edu> Mihael Hategan wrote: > On Tue, 2008-04-01 at 10:26 -0500, Ioan Raicu wrote: > >> Michael Wilde wrote: >> >>> We're only working on the BG/P system, and GPFS is the only shared >>> filesystem there. >>> >> There is PVFS, but that performed even worse in our tests. >> >>> GPFS access, however, remains a big scalabiity issue. Frequent small >>> accesses to GPFS in our measurements really slow down the workflow. We >>> did a lot of micro-benchmark tests. >>> >> Yes! The BG/P's GPFS probably performs the worst out of all GPFSes I >> have worked on, in terms of small granular accesses. For example, >> reading 1 byte files, invoking a trivial script (i.e. exit 0), etc... >> all perform extremely poor, to the point that we need to move away from >> GPFS almost completely. 
For example, the things that we eventually need >> to avoid on GPFS for the BG/P are: >> invoking wrapper.sh >> mkdir >> any logging to GPFS >> > > Doing nothing can be incredibly fast. > What I meant is that we need to move these operations to the local file system, i.e. RAM. We have run applications on BG/P via Falkon only, and implemented a caching strategy that caches all scripts, binaries, and input data, to RAM... once the task execution (all from RAM) completes, and has written its output to RAM, then there is a single copy operation of the output data from RAM to GPFS. We control how frequently this copy operation occurs, so we can essentially scale quite nicely and linearly with this approach. The hope is that we can eventually work this kind of functionality in the wrapper.sh, or in Swift itself. So, a reply to your statement, we would like to preserve the functionality of the wrapper.sh, but move as much as possible of that functionality from a shared file system to a local disk. Ioan > >> There are probably others. >> >>> Zhao, can you gather a set of these tests into a small suite and post >>> numbers so the Swift developers can get an understanding of the >>> system's GPFS access performance? >>> >>> Also note: the only local filesystem is RAM disk on /tmp or /dev/shm. >>> (Ioan and Zhao should confirm if they verified that /tmp is on RAM). >>> >> Yes, there are no local disks on either BG/P or SiCortex. Both machines >> have /tmp and dev/shm mounted as ram disks. >> >> Ioan >> >>> - Mike >>> >>> On 4/1/08 5:05 AM, Ben Clifford wrote: >>> >>>> On Tue, 1 Apr 2008, Ben Clifford wrote: >>>> >>>> >>>>>> With this fixed, the total time in wrapper.sh including the app is >>>>>> now about >>>>>> 15 seconds, with 3 being in the app-wrapper itself. The time seems >>>>>> about >>>>>> evenly spread over the several wrapper.sh operations, which is not >>>>>> surprising >>>>>> when 500 wrappers hit NFS all at once. >>>>>> >>>>> Does this machine have a higher (/different) performance shared file >>>>> system such as PVFS or GPFS? We spent some time in november layout >>>>> out the filesystem to be sympathetic to GPFS to help avoid >>>>> bottlenecks like you are seeing here. It would be kinda sad if >>>>> either it isn't available or you aren't using it even though it is >>>>> available. >>>>> >>>>> From what I can tell from the web, PVFS and/or GPFS are available on >>>>> all >>>>> >>>> of the Argonne Blue Gene machines. Is this true? I don't want to >>>> provide more scalability support for NFS-on-bluegene if it is. >>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Tue Apr 1 11:08:24 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 01 Apr 2008 11:08:24 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47F25814.8050301@cs.uchicago.edu> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu> <1207063923.30798.0.camel@blabla.mcs.anl.gov> <47F25814.8050301@cs.uchicago.edu> Message-ID: <1207066105.434.0.camel@blabla.mcs.anl.gov> > > > > > > > Doing nothing can be incredibly fast. > > > What I meant is that we need to move these operations to the local > file system, i.e. RAM. We have run applications on BG/P via Falkon > only, and implemented a caching strategy that caches all scripts, > binaries, and input data, to RAM... once the task execution (all from > RAM) completes, and has written its output to RAM, then there is a > single copy operation of the output data from RAM to GPFS. We control > how frequently this copy operation occurs, so we can essentially scale > quite nicely and linearly with this approach. The hope is that we can > eventually work this kind of functionality in the wrapper.sh, or in > Swift itself. So, a reply to your statement, we would like to > preserve the functionality of the wrapper.sh, but move as much as > possible of that functionality from a shared file system to a local > disk. > Having optimized wrappers for different architectures is a perfectly valid option. Mihael From wilde at mcs.anl.gov Tue Apr 1 11:25:28 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 01 Apr 2008 11:25:28 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1207066105.434.0.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu> <1207063923.30798.0.camel@blabla.mcs.anl.gov> <47F25814.8050301@cs.uchicago.edu> <1207066105.434.0.camel@blabla.mcs.anl.gov> Message-ID: <47F261F8.1050207@mcs.anl.gov> On 4/1/08 11:08 AM, Mihael Hategan wrote: > ... > Having optimized wrappers for different architectures is a perfectly > valid option. I agree. Also to consider is having the wrappers behave differently (e.g. use local vs shared filesystem) based on knowledge of the app's size and I/O volume vs available space and transfer rates. I'm in favor of heading to an approach where we have good fast default configurations for all our locally used systems (TG, OSG and the supercomputers) that work well for most apps, and some well documented guidelines tell users under what conditions they need to change the settings. From benc at hawaga.org.uk Tue Apr 1 11:34:03 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 1 Apr 2008 16:34:03 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] 
In-Reply-To: <47F261F8.1050207@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu> <1207063923.30798.0.camel@blabla.mcs.anl.gov> <47F25814.8050301@cs.uchicago.edu> <1207066105.434.0.camel@blabla.mcs.anl.gov> <47F261F8.1050207@mcs.anl.gov> Message-ID: On Tue, 1 Apr 2008, Michael Wilde wrote: > I'm in favor of heading to an approach where we have good fast default > configurations for all our locally used systems (TG, OSG and the good != fast for example, debuggable and non-desturctive-to-target-resource are other desirable characteristics. the debuggable one is especially important. Crippling the logging system to achieve faster execution is something that should be turned on, not off - that moves error reporting back to a boolean WRONG! style of reporting rather than the (I think) more useful stuff that we have at the moment. Likewise, pushing stuff up to the limit of what a site can handle (especially using GRAM2) is something that should be approached with caution and not by default. -- From iraicu at cs.uchicago.edu Tue Apr 1 21:14:54 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 01 Apr 2008 21:14:54 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu> <1207063923.30798.0.camel@blabla.mcs.anl.gov> <47F25814.8050301@cs.uchicago.edu> <1207066105.434.0.camel@blabla.mcs.anl.gov> <47F261F8.1050207@mcs.anl.gov> Message-ID: <47F2EC1E.4030301@cs.uchicago.edu> What I think would be nice to have, is a "high performance" option, which would disable all logging everywhere in Swift, except for the bare essential for Swift to be operational, in order to allow Swift to get the best performance possible. This doesn't have to be the default, but could allow a user to simply toggle a parameter and go from fast performance mode to slow debug mode. I think what we are trying to say with our recent experience with the BG/P is that we (as the users of Swift on BG/P) would be willing to live with a boolean error code if it meant that we could get significantly better performance, which in turn would give us higher resource utilization. Ioan Ben Clifford wrote: > On Tue, 1 Apr 2008, Michael Wilde wrote: > > >> I'm in favor of heading to an approach where we have good fast default >> configurations for all our locally used systems (TG, OSG and the >> > > good != fast > > for example, debuggable and non-desturctive-to-target-resource are other > desirable characteristics. the debuggable one is especially important. > Crippling the logging system to achieve faster execution is something that > should be turned on, not off - that moves error reporting back to a > boolean WRONG! style of reporting rather than the (I think) more useful > stuff that we have at the moment. Likewise, pushing stuff up to the limit > of what a site can handle (especially using GRAM2) is something that > should be approached with caution and not by default. 
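One way to reconcile the two positions -- a fast mode for the BG/P runs versus keeping error reporting useful -- is to keep the per-job log but write it to the RAM disk both machines provide, copying it to GPFS only when it is actually needed. A hedged sketch follows; it is not the real wrapper.sh, and every name in it (JOBID, SCRATCH, SHARED, SWIFT_DEBUG, job.info) is a placeholder:

    #!/bin/bash
    # Hedged sketch only -- not the current wrapper.sh. Idea: log locally on
    # the RAM disk for speed, and touch the shared filesystem only when the
    # job fails or full debugging is explicitly requested.
    JOBID=$1; shift                                 # remaining args = app command line
    SCRATCH=/dev/shm/$USER/$JOBID                   # RAM-backed on BG/P and SiCortex
    SHARED=${SHARED:-/gpfs/home/$USER/swift-logs}   # illustrative GPFS location
    mkdir -p "$SCRATCH"
    INFO=$SCRATCH/job.info

    echo "start $(date +%s)" >> "$INFO"
    "$@" >> "$INFO" 2>&1                            # run the app, logging locally
    rc=$?
    echo "end $(date +%s) rc=$rc" >> "$INFO"

    # Pay the GPFS cost only when the log is likely to be read.
    if [ "$rc" -ne 0 ] || [ -n "$SWIFT_DEBUG" ]; then
        mkdir -p "$SHARED" && cp "$INFO" "$SHARED/$JOBID.info"
    fi
    exit "$rc"

Whether a switch like that belongs in the wrapper, in swift.properties, or in the execution provider is exactly the open question in this thread.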
> > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Apr 1 23:11:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Apr 2008 04:11:23 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1207066105.434.0.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu> <1207063923.30798.0.camel@blabla.mcs.anl.gov> <47F25814.8050301@cs.uchicago.edu> <1207066105.434.0.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 1 Apr 2008, Mihael Hategan wrote: > Having optimized wrappers for different architectures is a perfectly > valid option. it might also be possible to replace the entire execute2 layer of stagein-execute-stageout if falkon wants to do its own worker-node data placement - Swift would be the same down to calling execute2 but the execute2-replacement would be falkon specific rather than using the present model which assumes a shared filesystem for stageins. I think that might fit in better with what Falkon is trying to do, letting it know which files are required by which job, rather than assuming a cluster-wide shared filesystem to fetch data from. (The same might apply for using condor with no shared filesystem, which isthe situation in many campus workstation labs that I've seen - an execute2 layer that submits a bundled up stagein/stageout/execute as a single condor submission - I have been mulling over that for a few months and maybe will get someone to play with that in google summer of code) -- From wilde at mcs.anl.gov Wed Apr 2 07:21:35 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 02 Apr 2008 07:21:35 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu> <1207063923.30798.0.camel@blabla.mcs.anl.gov> <47F25814.8050301@cs.uchicago.edu> <1207066105.434.0.camel@blabla.mcs.anl.gov> Message-ID: <47F37A4F.4000606@mcs.anl.gov> Im very much in favor of this approach. - Mike On 4/1/08 11:11 PM, Ben Clifford wrote: > On Tue, 1 Apr 2008, Mihael Hategan wrote: > >> Having optimized wrappers for different architectures is a perfectly >> valid option. 
> > it might also be possible to replace the entire execute2 layer of > stagein-execute-stageout if falkon wants to do its own worker-node data > placement - Swift would be the same down to calling execute2 but the > execute2-replacement would be falkon specific rather than using the > present model which assumes a shared filesystem for stageins. > > I think that might fit in better with what Falkon is trying to do, letting > it know which files are required by which job, rather than assuming a > cluster-wide shared filesystem to fetch data from. > > (The same might apply for using condor with no shared filesystem, which > isthe situation in many campus workstation labs that I've seen - an > execute2 layer that submits a bundled up stagein/stageout/execute as a > single condor submission - I have been mulling over that for a few months > and maybe will get someone to play with that in google summer of code) > From bugzilla-daemon at mcs.anl.gov Thu Apr 3 00:18:12 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 3 Apr 2008 00:18:12 -0500 (CDT) Subject: [Swift-devel] [Bug 110] move OPTIONS out of swift executable In-Reply-To: Message-ID: <20080403051812.96936164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=110 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|hategan at mcs.anl.gov |benc at hawaga.org.uk ------- Comment #1 from benc at hawaga.org.uk 2008-04-03 00:18 ------- There is a COG_OPTS environment variable that can be used for this. Probably should be documented in the user guide. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Thu Apr 3 00:07:09 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 3 Apr 2008 05:07:09 +0000 (GMT) Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: References: Message-ID: On Thu, 20 Mar 2008, Ben Clifford wrote: > There was a long long pause between swift 0.3 and swift 0.4; and > consequently a bunch of bugs have been discovered. so I'd like to put out > a 0.5 sometime in the next couple weeks to release those bugfixes. I would like to do that this week as 0.4 got a bunch of fairly big bugfixes right after release. However, I don't like that data channel caching doesn't work for a bunch of sites; one straightforward thing to do there is disable data channel caching entirely (that's what I have done in my personal development codebase). Opinion? -- From bugzilla-daemon at mcs.anl.gov Thu Apr 3 00:26:57 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 3 Apr 2008 00:26:57 -0500 (CDT) Subject: [Swift-devel] [Bug 128] New: out of memory situations sometimes cause silent hangs Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=128 Summary: out of memory situations sometimes cause silent hangs Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: benc at hawaga.org.uk Sometimes when Swift gets low on or runs out of memory (as indicated by the heap size log lines), the execution hangs doing nothing without reporting an error, rather than cleanly exiting. This is a poor user experience. 
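One interim mitigation worth noting here, since the hang is tied to heap exhaustion: the COG_OPTS environment variable mentioned in the bug 110 comment above can hand a larger heap to the JVM before a run. A hedged example -- it assumes COG_OPTS is forwarded to the JVM by the swift launcher, as that comment suggests, and the 1 GB figure is arbitrary:

    # Assumes COG_OPTS reaches the JVM via the swift launcher (per bug 110);
    # the heap size below is only an example, not a recommendation.
    export COG_OPTS="-Xmx1024m"
    swift my-workflow.swift    # hypothetical workflow name

This does not fix the silent hang, it only postpones it; the bug itself is about failing loudly instead.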
-- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Thu Apr 3 00:44:13 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 3 Apr 2008 00:44:13 -0500 (CDT) Subject: [Swift-devel] [Bug 129] New: ENV profiles using GRAM2 cause console output of environment variable value Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=129 Summary: ENV profiles using GRAM2 cause console output of environment variable value Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: minor Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk When profile entries in the ENV namespace are used with the GRAM2 provider, there is spurious console output of the value of those ENV profile entries. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Thu Apr 3 00:57:44 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 3 Apr 2008 00:57:44 -0500 (CDT) Subject: [Swift-devel] [Bug 130] New: submitting to TG NCSA Mercury PBS with PATH env profile set causes job to hang Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=130 Summary: submitting to TG NCSA Mercury PBS with PATH env profile set causes job to hang Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: nobody at mcs.anl.gov ReportedBy: benc at hawaga.org.uk CC: skenny at uchicago.edu Submititng to TG NCSA Mercury PBS with PATH env profile set causes the job to hang on the worker node. Submitting without that PATH set does not cause the hang. Submitting to jobmanager-fork on that machine (instead of PBS) does not cause the hang. Submitting with PATH env profile to teraport PBS does not cause hang. This happens with every swift script i have tried. skenny has also seen similar behaviour (and is the instigator of this investigation) Here is a sites.xml entry that causes hangs for me: . /home/ac/benc TG-CCR080002N /:$PATH Removing that PATH profile entry makes things work again. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From foster at mcs.anl.gov Thu Apr 3 01:23:25 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 03 Apr 2008 01:23:25 -0500 Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: References: Message-ID: <47F477DD.6040001@mcs.anl.gov> Ben: Can you explain a bit more about the "data channel doesn't work for a bunch of sites" problem? Ian. Ben Clifford wrote: > On Thu, 20 Mar 2008, Ben Clifford wrote: > > >> There was a long long pause between swift 0.3 and swift 0.4; and >> consequently a bunch of bugs have been discovered. so I'd like to put out >> a 0.5 sometime in the next couple weeks to release those bugfixes. >> > > I would like to do that this week as 0.4 got a bunch of fairly big > bugfixes right after release. 
> > However, I don't like that data channel caching doesn't work for a bunch > of sites; one straightforward thing to do there is disable data channel > caching entirely (that's what I have done in my personal development > codebase). Opinion? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Thu Apr 3 01:27:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 3 Apr 2008 06:27:23 +0000 (GMT) Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: <47F477DD.6040001@mcs.anl.gov> References: <47F477DD.6040001@mcs.anl.gov> Message-ID: > Can you explain a bit more about the "data channel doesn't work for a bunch of > sites" problem? There's some channel reuse code that went into cog in the past few months. It gets enabled when it detects that it has been pointed at a specific version of the GridFTP server (which is in itself a bug as it should really work for lots of versions). The code appears to not work. So when Swift is pointed at a gridftp server of that version, it cannot stage in or out files. When it is pointed at a different version gridftp server, the two bugs cancel each other out - data channel reuse is not used, and so Swift can stage files in and out. -- From hategan at mcs.anl.gov Thu Apr 3 04:22:51 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 03 Apr 2008 04:22:51 -0500 Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: References: <47F477DD.6040001@mcs.anl.gov> Message-ID: <1207214571.14147.5.camel@blabla.mcs.anl.gov> On Thu, 2008-04-03 at 06:27 +0000, Ben Clifford wrote: > > > Can you explain a bit more about the "data channel doesn't work for a bunch of > > sites" problem? > > There's some channel reuse code that went into cog in the past few months. > It gets enabled when it detects that it has been pointed at a specific > version of the GridFTP server (which is in itself a bug as it should > really work for lots of versions). Not quite. I've put that in precisely because some versions didn't work. > The code appears to not work. So when > Swift is pointed at a gridftp server of that version, it cannot stage in > or out files. When it is pointed at a different version gridftp server, > the two bugs cancel each other out - data channel reuse is not used, and > so Swift can stage files in and out. > From bugzilla-daemon at mcs.anl.gov Thu Apr 3 04:26:17 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 3 Apr 2008 04:26:17 -0500 (CDT) Subject: [Swift-devel] [Bug 130] submitting to TG NCSA Mercury PBS with PATH env profile set causes job to hang In-Reply-To: Message-ID: <20080403092617.146E31650A@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=130 ------- Comment #1 from hategan at mcs.anl.gov 2008-04-03 04:26 ------- /:$PATH I don't think that trick works (i.e. that the existing $PATH will be substituted). What probably happens is that the job runs without /bin and /usr/bin in the path. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
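Mihael's diagnosis is easy to reproduce outside Swift: if the profile value reaches the job environment verbatim, the embedded $PATH is never expanded, so /bin and /usr/bin disappear from the search path. A quick illustration (plain shell, nothing Swift-specific):

    # Simulate a job whose environment received the profile value verbatim.
    env PATH='/:$PATH' /bin/sh -c 'echo "PATH=$PATH"; ls'
    # Prints the literal string  PATH=/:$PATH  and then fails with something
    # like "ls: command not found", since neither /bin nor /usr/bin is
    # searched -- consistent with the job never getting going on the worker.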
From wilde at mcs.anl.gov Thu Apr 3 07:14:13 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 03 Apr 2008 07:14:13 -0500 Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: <1207214571.14147.5.camel@blabla.mcs.anl.gov> References: <47F477DD.6040001@mcs.anl.gov> <1207214571.14147.5.camel@blabla.mcs.anl.gov> Message-ID: <47F4CA15.9090303@mcs.anl.gov> Mihael, Ben, How well do you understand the problem and whats your confidence in being able to reliably fix it? It seems best to disable it (or adjust things) to get more reliable operation across all or more sites, at the expense of performance on some sites, while its being fixed. Mihael, is this bug on your plate? Whats your estimate of effort involved? - Mike On 4/3/08 4:22 AM, Mihael Hategan wrote: > On Thu, 2008-04-03 at 06:27 +0000, Ben Clifford wrote: >>> Can you explain a bit more about the "data channel doesn't work for a bunch of >>> sites" problem? >> There's some channel reuse code that went into cog in the past few months. >> It gets enabled when it detects that it has been pointed at a specific >> version of the GridFTP server (which is in itself a bug as it should >> really work for lots of versions). > > Not quite. I've put that in precisely because some versions didn't work. > >> The code appears to not work. So when >> Swift is pointed at a gridftp server of that version, it cannot stage in >> or out files. When it is pointed at a different version gridftp server, >> the two bugs cancel each other out - data channel reuse is not used, and >> so Swift can stage files in and out. >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Thu Apr 3 07:19:27 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 03 Apr 2008 07:19:27 -0500 Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: <47F4CA15.9090303@mcs.anl.gov> References: <47F477DD.6040001@mcs.anl.gov> <1207214571.14147.5.camel@blabla.mcs.anl.gov> <47F4CA15.9090303@mcs.anl.gov> Message-ID: <1207225168.27151.1.camel@blabla.mcs.anl.gov> On Thu, 2008-04-03 at 07:14 -0500, Michael Wilde wrote: > Mihael, Ben, > > How well do you understand the problem and whats your confidence in > being able to reliably fix it? > > It seems best to disable it (or adjust things) to get more reliable > operation across all or more sites, at the expense of performance on > some sites, while its being fixed. > > Mihael, is this bug on your plate? Yes. > Whats your estimate of effort involved? This was discussed before. In the bigger context of the small-file-optimization there is one week of real time left. > > - Mike > > > > > On 4/3/08 4:22 AM, Mihael Hategan wrote: > > On Thu, 2008-04-03 at 06:27 +0000, Ben Clifford wrote: > >>> Can you explain a bit more about the "data channel doesn't work for a bunch of > >>> sites" problem? > >> There's some channel reuse code that went into cog in the past few months. > >> It gets enabled when it detects that it has been pointed at a specific > >> version of the GridFTP server (which is in itself a bug as it should > >> really work for lots of versions). > > > > Not quite. I've put that in precisely because some versions didn't work. > > > >> The code appears to not work. So when > >> Swift is pointed at a gridftp server of that version, it cannot stage in > >> or out files. 
When it is pointed at a different version gridftp server, > >> the two bugs cancel each other out - data channel reuse is not used, and > >> so Swift can stage files in and out. > >> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From benc at hawaga.org.uk Thu Apr 3 13:40:29 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 3 Apr 2008 18:40:29 +0000 (GMT) Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: <47F4CA15.9090303@mcs.anl.gov> References: <47F477DD.6040001@mcs.anl.gov> <1207214571.14147.5.camel@blabla.mcs.anl.gov> <47F4CA15.9090303@mcs.anl.gov> Message-ID: > How well do you understand the problem and whats your confidence in being able > to reliably fix it? for a this-week 0.5 release, disabling this is a one liner that I'm pretty confident doesn't break things (in as much as it passes the site tests that I have). that satisfies my urge to get something out fairly quickly that is less shitty than 0.4. -- From wilde at mcs.anl.gov Thu Apr 3 14:03:49 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 03 Apr 2008 14:03:49 -0500 Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: References: <47F477DD.6040001@mcs.anl.gov> <1207214571.14147.5.camel@blabla.mcs.anl.gov> <47F4CA15.9090303@mcs.anl.gov> Message-ID: <47F52A15.6060209@mcs.anl.gov> OK. Hopefully it can be fixed in the next few weeks but no need to delay 0.5 for it. On 4/3/08 1:40 PM, Ben Clifford wrote: >> How well do you understand the problem and whats your confidence in being able >> to reliably fix it? > > for a this-week 0.5 release, disabling this is a one liner that I'm pretty > confident doesn't break things (in as much as it passes the site tests > that I have). that satisfies my urge to get something out fairly quickly > that is less shitty than 0.4. > From foster at mcs.anl.gov Thu Apr 3 14:31:35 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 03 Apr 2008 14:31:35 -0500 Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: <47F52A15.6060209@mcs.anl.gov> References: <47F477DD.6040001@mcs.anl.gov> <1207214571.14147.5.camel@blabla.mcs.anl.gov> <47F4CA15.9090303@mcs.anl.gov> <47F52A15.6060209@mcs.anl.gov> Message-ID: <47F53097.3000909@mcs.anl.gov> could we have a flag so that if people want to turn it on, they can? (assuming it can work in some settings.) Michael Wilde wrote: > OK. Hopefully it can be fixed in the next few weeks but no need to > delay 0.5 for it. > > On 4/3/08 1:40 PM, Ben Clifford wrote: >>> How well do you understand the problem and whats your confidence in >>> being able >>> to reliably fix it? >> >> for a this-week 0.5 release, disabling this is a one liner that I'm >> pretty confident doesn't break things (in as much as it passes the >> site tests that I have). that satisfies my urge to get something out >> fairly quickly that is less shitty than 0.4. 
>> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Thu Apr 3 14:40:26 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 3 Apr 2008 19:40:26 +0000 (GMT) Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: <47F53097.3000909@mcs.anl.gov> References: <47F477DD.6040001@mcs.anl.gov> <1207214571.14147.5.camel@blabla.mcs.anl.gov> <47F4CA15.9090303@mcs.anl.gov> <47F52A15.6060209@mcs.anl.gov> <47F53097.3000909@mcs.anl.gov> Message-ID: > could we have a flag so that if people want to turn it on, they can? (assuming > it can work in some settings.) it doesn't work in any situation i've tried at the moment. when mihael has done his stuff it will work just fine. until then what I'm looking for is a quick non-damaging fix to cover from now until that point that lies somewhere (hopefully in the next six months) in the future. -- From bugzilla-daemon at mcs.anl.gov Fri Apr 4 01:12:44 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 4 Apr 2008 01:12:44 -0500 (CDT) Subject: [Swift-devel] [Bug 111] stage out -info and cluster logs in the same fashion as kickstart records. In-Reply-To: Message-ID: <20080404061244.0EE51164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=111 ------- Comment #1 from benc at hawaga.org.uk 2008-04-04 01:12 ------- This is done for info logs. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Fri Apr 4 04:32:47 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 4 Apr 2008 04:32:47 -0500 (CDT) Subject: [Swift-devel] [Bug 41] Deadlock in atomic procedures In-Reply-To: Message-ID: <20080404093247.44E5B164EC@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=41 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|hategan at mcs.anl.gov |benc at hawaga.org.uk Component|General |SwiftScript language ------- Comment #1 from benc at hawaga.org.uk 2008-04-04 04:32 ------- probably the compiler should catch this. for example: i) it should detect that the parameter is of an inappropriate type (we should only allow simple types here) ii) it should detect that a variable that is know to be an output is being used in an input context. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From hategan at mcs.anl.gov Fri Apr 4 04:39:37 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 Apr 2008 04:39:37 -0500 Subject: [Swift-devel] coaster status summary Message-ID: <1207301977.1658.14.camel@blabla.mcs.anl.gov> I've been asked for a summary of the status of the coaster prototype, so here it is: - It's a prototype so bugs are plenty - It's self deployed (you don't need to start a service on the target cluster) - You can also use it while starting a service on the target cluster - There is a worker written in Perl - It uses encryption between client and coaster service - It uses UDP between the service and the workers (this may prove to be better or worse choice than TCP) - A preliminary test done locally shows an amortized throughput of around 180 jobs/s (/bin/date). This was done with encryption and with 10 workers. Pretty picture attached (total time vs. # of jobs) To do: - The scheduling algorithm in the service needs a bit more work - When worker messages are lost, some jobs may get lost (i.e. needs more fault tolerance) - Start testing it on actual clusters - Do some memory consumption benchmarks - Better allocation strategy for workers Mihael -------------- next part -------------- A non-text attachment was scrubbed... Name: speed.pdf Type: application/pdf Size: 18168 bytes Desc: not available URL: From benc at hawaga.org.uk Fri Apr 4 04:41:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 Apr 2008 09:41:20 +0000 (GMT) Subject: [Swift-devel] coaster status summary In-Reply-To: <1207301977.1658.14.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> Message-ID: are you going to put the source somewhere visible? -- From wilde at mcs.anl.gov Fri Apr 4 06:59:28 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 04 Apr 2008 06:59:28 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207301977.1658.14.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> Message-ID: <47F61820.3090705@mcs.anl.gov> Mihael, this is great progress - very exciting. Some questions (dont need answers right away): How would the end user use it? Manually start a service? Is the service a separate process, or in the swift jvm? How are the number of workers set or adjusted? Does a service manage workers on one cluster or many? At 180 jobs/sec with 10 workers, what were the CPU loads on swift, worker and service? Do you want to try this on the workflows we're running on Falkon on the BGP and SiCortex? Im eager to try it when you feel its ready for others to test. Nice work! - Mike On 4/4/08 4:39 AM, Mihael Hategan wrote: > I've been asked for a summary of the status of the coaster prototype, so > here it is: > - It's a prototype so bugs are plenty > - It's self deployed (you don't need to start a service on the target > cluster) > - You can also use it while starting a service on the target cluster > - There is a worker written in Perl > - It uses encryption between client and coaster service > - It uses UDP between the service and the workers (this may prove to be > better or worse choice than TCP) > - A preliminary test done locally shows an amortized throughput of > around 180 jobs/s (/bin/date). This was done with encryption and with 10 > workers. Pretty picture attached (total time vs. # of jobs) > > To do: > - The scheduling algorithm in the service needs a bit more work > - When worker messages are lost, some jobs may get lost (i.e. 
needs more > fault tolerance) > - Start testing it on actual clusters > - Do some memory consumption benchmarks > - Better allocation strategy for workers > > Mihael > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri Apr 4 07:02:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 Apr 2008 07:02:22 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> Message-ID: <1207310542.7171.0.camel@blabla.mcs.anl.gov> Of course. Just haven't done it yet. On Fri, 2008-04-04 at 09:41 +0000, Ben Clifford wrote: > are you going to put the source somewhere visible? From hategan at mcs.anl.gov Fri Apr 4 07:12:47 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 Apr 2008 07:12:47 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <47F61820.3090705@mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> Message-ID: <1207311167.7171.12.camel@blabla.mcs.anl.gov> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote: > Mihael, this is great progress - very exciting. > Some questions (dont need answers right away): > > How would the end user use it? Manually start a service? > Is the service a separate process, or in the swift jvm? I though the lines below answered some of these. A user would specify the coaster provider in sites.xml. The provider will then automatically deploy a service on the target machine without the user having to do so. Given that the service is on a different machine than the client, they can't be in the same JVM. > How are the number of workers set or adjusted? Currently workers are requested as much as needed, up to a maximum. This is preliminary hence "Better allocation strategy for workers". > Does a service manage workers on one cluster or many? One service per cluster. > At 180 jobs/sec with 10 workers, what were the CPU loads on swift, > worker and service? I faintly recall them being at less than 50% for some reason I don't understand. > > Do you want to try this on the workflows we're running on Falkon on the > BGP and SiCortex? Let me repeat "prototype" and "more testing". In no way do I want to do preliminary testing with an application that is shaky on an architecture that is also shaky. Mihael > > Im eager to try it when you feel its ready for others to test. > > Nice work! > > - Mike > > > > On 4/4/08 4:39 AM, Mihael Hategan wrote: > > I've been asked for a summary of the status of the coaster prototype, so > > here it is: > > - It's a prototype so bugs are plenty > > - It's self deployed (you don't need to start a service on the target > > cluster) > > - You can also use it while starting a service on the target cluster > > - There is a worker written in Perl > > - It uses encryption between client and coaster service > > - It uses UDP between the service and the workers (this may prove to be > > better or worse choice than TCP) > > - A preliminary test done locally shows an amortized throughput of > > around 180 jobs/s (/bin/date). This was done with encryption and with 10 > > workers. Pretty picture attached (total time vs. # of jobs) > > > > To do: > > - The scheduling algorithm in the service needs a bit more work > > - When worker messages are lost, some jobs may get lost (i.e. 
needs more > > fault tolerance) > > - Start testing it on actual clusters > > - Do some memory consumption benchmarks > > - Better allocation strategy for workers > > > > Mihael > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From bugzilla-daemon at mcs.anl.gov Fri Apr 4 07:36:50 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 4 Apr 2008 07:36:50 -0500 (CDT) Subject: [Swift-devel] [Bug 131] New: Clarify documentation on profiles Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=131 Summary: Clarify documentation on profiles Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Documentation AssignedTo: benc at hawaga.org.uk ReportedBy: wilde at mcs.anl.gov clarify what profiles are for (few sentences) explain that profiles are associated with sites, in sites.xml, or with apps, in tc.data move profile section down after sites & tc description. where its at now, its confusing what profiles are for when you first encounter the section. list the set of recognized profiles, by namespace refer to other docs for parameter values that are beyond the scope of this doc (but show the common examples, which is mostly OK, but scattered throughout the UG at the moment) for Globus, explain more about how queue and maxwalltime interact to determine how your job is queued, and how to find the queue information (eg a few pointers to UC/TG and Teraport info in the local users section) in "local users" section list profile info relevant to uc/osg/tg environment - cputype is the main missing one I think for UC/TG only. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From iraicu at cs.uchicago.edu Fri Apr 4 19:02:44 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 04 Apr 2008 19:02:44 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207311167.7171.12.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> Message-ID: <47F6C1A4.5030200@cs.uchicago.edu> You say that you use UDP on the workers. This might be more light weight, but might also pose practical issues. Some of those are: - might not work well on any network other than a LAN - won't be friendly to firewalls or NATs, no matter if you the service pushes jobs, or workers pull jobs; the logic is that you need 2 way communication, and using UDP (being a connectionless protocol), its like having a server socket and a client socket on both ends of the communication at the same time. This might not matter if the service and the worker are on the same LAN with no NATs or firewalls in the middle, but, it would matter on a machine such as the BG/P, as there is a NAT inbetween the login nodes and the compute nodes. In essence, for this to work on the BG/P, you'll need to avoid having server side sockets on the compute nodes (workers), and you'll probably only be able to do that via a connection oriented protocol (i.e. TCP). Is switching to TCP a relatively straight forward option? 
If not, it might be worth implementing to make the implementation more flexible - loosing messages and recovering from them will likely be harder than anticipated; I have a UDP version of the notification engine that Falkon uses, and after much debugging, I gave up and switched over to TCP. It worked most of the time, but the occasional lost message (1 in 1000s, maybe even more rare) made Falkon unreliable, and hence I stopped using it. Is the 180 tasks/sec the overall throughput measured from Swift's point of view, including overhead of wrapper.sh? Or is that a micro-benchmark measuring just the coaster performance? Ioan Mihael Hategan wrote: > On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote: > >> Mihael, this is great progress - very exciting. >> Some questions (dont need answers right away): >> >> How would the end user use it? Manually start a service? >> Is the service a separate process, or in the swift jvm? >> > > I though the lines below answered some of these. > > A user would specify the coaster provider in sites.xml. The provider > will then automatically deploy a service on the target machine without > the user having to do so. Given that the service is on a different > machine than the client, they can't be in the same JVM. > > >> How are the number of workers set or adjusted? >> > > Currently workers are requested as much as needed, up to a maximum. This > is preliminary hence "Better allocation strategy for workers". > > >> Does a service manage workers on one cluster or many? >> > > One service per cluster. > > >> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, >> worker and service? >> > > I faintly recall them being at less than 50% for some reason I don't > understand. > > >> Do you want to try this on the workflows we're running on Falkon on the >> BGP and SiCortex? >> > > Let me repeat "prototype" and "more testing". In no way do I want to do > preliminary testing with an application that is shaky on an architecture > that is also shaky. > > Mihael > > >> Im eager to try it when you feel its ready for others to test. >> >> Nice work! >> >> - Mike >> >> >> >> On 4/4/08 4:39 AM, Mihael Hategan wrote: >> >>> I've been asked for a summary of the status of the coaster prototype, so >>> here it is: >>> - It's a prototype so bugs are plenty >>> - It's self deployed (you don't need to start a service on the target >>> cluster) >>> - You can also use it while starting a service on the target cluster >>> - There is a worker written in Perl >>> - It uses encryption between client and coaster service >>> - It uses UDP between the service and the workers (this may prove to be >>> better or worse choice than TCP) >>> - A preliminary test done locally shows an amortized throughput of >>> around 180 jobs/s (/bin/date). This was done with encryption and with 10 >>> workers. Pretty picture attached (total time vs. # of jobs) >>> >>> To do: >>> - The scheduling algorithm in the service needs a bit more work >>> - When worker messages are lost, some jobs may get lost (i.e. 
needs more >>> fault tolerance) >>> - Start testing it on actual clusters >>> - Do some memory consumption benchmarks >>> - Better allocation strategy for workers >>> >>> Mihael >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Sat Apr 5 04:30:38 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 5 Apr 2008 09:30:38 +0000 (GMT) Subject: [Swift-devel] coaster status summary In-Reply-To: <47F6C1A4.5030200@cs.uchicago.edu> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> Message-ID: > it would matter on a machine such as > the BG/P, as there is a NAT inbetween the login nodes and the compute nodes. wierd. is there a description of that somewhere? -- From hategan at mcs.anl.gov Sat Apr 5 04:45:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 05 Apr 2008 04:45:54 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <47F6C1A4.5030200@cs.uchicago.edu> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> Message-ID: <1207388755.10629.12.camel@blabla.mcs.anl.gov> On Fri, 2008-04-04 at 19:02 -0500, Ioan Raicu wrote: > You say that you use UDP on the workers. This might be more light > weight, but might also pose practical issues. Of course. That is the trade-off. > > Some of those are: > - might not work well on any network other than a LAN It works exactly as it's supposed to: no guarantee of uniqueness, no guarantee of order, no guarantee of integrity, and no guarantee of reliability. One has to drop duplicates, do checksums, re-order, have time-outs. > - won't be friendly to firewalls or NATs, no matter if you the service > pushes jobs, or workers pull jobs; the logic is that you need 2 way > communication, and using UDP (being a connectionless protocol), its > like having a server socket and a client socket on both ends of the > communication at the same time. Precisely so. In Java you can use one UDP socket as both client and server. Perl seems to be nastier as it won't let you send and receive on the same socket (at least in the implementation I've seen). 
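As a minimal sketch of that single-socket pattern (illustrative only, not taken from the coaster code; the port number and the "ack" payload are made up), one DatagramSocket is bound once and then used both to receive requests and to send replies:

import java.net.DatagramPacket;
import java.net.DatagramSocket;

public class UdpEchoSketch {
    public static void main(String[] args) throws Exception {
        // One socket, bound once to a local port, handles both directions.
        DatagramSocket sock = new DatagramSocket(7070);   // port number is arbitrary
        byte[] buf = new byte[4096];
        while (true) {
            DatagramPacket req = new DatagramPacket(buf, buf.length);
            sock.receive(req);                            // "server" role: wait for a datagram
            byte[] ack = "ack".getBytes("US-ASCII");
            // "client" role: reply to the sender over the very same socket
            sock.send(new DatagramPacket(ack, ack.length, req.getSocketAddress()));
        }
    }
}

Dropping duplicates, checksumming, re-ordering and timing out, as described above, would all have to be layered on top of something like this.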
> This might not matter if the service and the worker are on the same > LAN with no NATs or firewalls in the middle, but, it would matter on a > machine such as the BG/P, as there is a NAT inbetween the login nodes > and the compute nodes. That's odd. Do you have anything to back that up? > In essence, for this to work on the BG/P, you'll need to avoid > having server side sockets on the compute nodes (workers), and you'll > probably only be able to do that via a connection oriented protocol > (i.e. TCP). Is switching to TCP a relatively straight forward option? > If not, it might be worth implementing to make the implementation more > flexible > - loosing messages and recovering from them will likely be harder than > anticipated; I have a UDP version of the notification engine that > Falkon uses, and after much debugging, I gave up and switched over to > TCP. It worked most of the time, but the occasional lost message (1 > in 1000s, maybe even more rare) made Falkon unreliable, and hence I > stopped using it. Of course it's unreliable unless you deal with the reliability issues as outlined above. > > Is the 180 tasks/sec the overall throughput measured from Swift's > point of view, including overhead of wrapper.sh? Or is that a > micro-benchmark measuring just the coaster performance? It's at the provider level. No wrapper.sh. > > Ioan > > > Mihael Hategan wrote: > > On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote: > > > > > Mihael, this is great progress - very exciting. > > > Some questions (dont need answers right away): > > > > > > How would the end user use it? Manually start a service? > > > Is the service a separate process, or in the swift jvm? > > > > > > > I though the lines below answered some of these. > > > > A user would specify the coaster provider in sites.xml. The provider > > will then automatically deploy a service on the target machine without > > the user having to do so. Given that the service is on a different > > machine than the client, they can't be in the same JVM. > > > > > > > How are the number of workers set or adjusted? > > > > > > > Currently workers are requested as much as needed, up to a maximum. This > > is preliminary hence "Better allocation strategy for workers". > > > > > > > Does a service manage workers on one cluster or many? > > > > > > > One service per cluster. > > > > > > > At 180 jobs/sec with 10 workers, what were the CPU loads on swift, > > > worker and service? > > > > > > > I faintly recall them being at less than 50% for some reason I don't > > understand. > > > > > > > Do you want to try this on the workflows we're running on Falkon on the > > > BGP and SiCortex? > > > > > > > Let me repeat "prototype" and "more testing". In no way do I want to do > > preliminary testing with an application that is shaky on an architecture > > that is also shaky. > > > > Mihael > > > > > > > Im eager to try it when you feel its ready for others to test. > > > > > > Nice work! 
> > > > > > - Mike > > > > > > > > > > > > On 4/4/08 4:39 AM, Mihael Hategan wrote: > > > > > > > I've been asked for a summary of the status of the coaster prototype, so > > > > here it is: > > > > - It's a prototype so bugs are plenty > > > > - It's self deployed (you don't need to start a service on the target > > > > cluster) > > > > - You can also use it while starting a service on the target cluster > > > > - There is a worker written in Perl > > > > - It uses encryption between client and coaster service > > > > - It uses UDP between the service and the workers (this may prove to be > > > > better or worse choice than TCP) > > > > - A preliminary test done locally shows an amortized throughput of > > > > around 180 jobs/s (/bin/date). This was done with encryption and with 10 > > > > workers. Pretty picture attached (total time vs. # of jobs) > > > > > > > > To do: > > > > - The scheduling algorithm in the service needs a bit more work > > > > - When worker messages are lost, some jobs may get lost (i.e. needs more > > > > fault tolerance) > > > > - Start testing it on actual clusters > > > > - Do some memory consumption benchmarks > > > > - Better allocation strategy for workers > > > > > > > > Mihael > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > From hategan at mcs.anl.gov Sat Apr 5 04:54:46 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 05 Apr 2008 04:54:46 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207388755.10629.12.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> <1207388755.10629.12.camel@blabla.mcs.anl.gov> Message-ID: <1207389286.10629.16.camel@blabla.mcs.anl.gov> > > This might not matter if the service and the worker are on the same > > LAN with no NATs or firewalls in the middle, but, it would matter on a > > machine such as the BG/P, as there is a NAT inbetween the login nodes > > and the compute nodes. > > That's odd. Do you have anything to back that up? > Really really odd. I mean MPI has to work between any two worker nodes. If they are on separate networks with NAT in-between, this would be rather difficult. 
From benc at hawaga.org.uk Sat Apr 5 07:07:39 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 5 Apr 2008 12:07:39 +0000 (GMT) Subject: [Swift-devel] swift 0.5rc1 Message-ID: Swift 0.5 release candidate 1 is at http://www.ci.uchicago.edu/~benc/vdsk-0.5rc1.tar.gz This is primarily bugfixes for bugs that were found around the time of the 0.4 release - syntax error handling that was poorly tested before 0.4; and data channel caching problems. There shouldn't be many (if any) new features here. Please test. If there are no significant problems, I'll put this out as 0.5 on tuesday. -- From iraicu at cs.uchicago.edu Sat Apr 5 08:16:12 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 05 Apr 2008 08:16:12 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <47F61820.3090705@mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> Message-ID: <47F77B9C.2060200@cs.uchicago.edu> I meant to send this before, but somehow it seems to have gotten stuck in my draft folder ;( We are running out of time on the papers we are writing now, but it would certainly been a good comparison of implementations, assumptions, trade-offs, performance, etc... for a future paper! I am eager to learn more about it! Ioan Michael Wilde wrote: > Mihael, this is great progress - very exciting. > Some questions (dont need answers right away): > > How would the end user use it? Manually start a service? > Is the service a separate process, or in the swift jvm? > How are the number of workers set or adjusted? > Does a service manage workers on one cluster or many? > At 180 jobs/sec with 10 workers, what were the CPU loads on swift, > worker and service? > > Do you want to try this on the workflows we're running on Falkon on > the BGP and SiCortex? > > Im eager to try it when you feel its ready for others to test. > > Nice work! > > - Mike > > > > On 4/4/08 4:39 AM, Mihael Hategan wrote: >> I've been asked for a summary of the status of the coaster prototype, so >> here it is: >> - It's a prototype so bugs are plenty >> - It's self deployed (you don't need to start a service on the target >> cluster) >> - You can also use it while starting a service on the target cluster >> - There is a worker written in Perl >> - It uses encryption between client and coaster service >> - It uses UDP between the service and the workers (this may prove to be >> better or worse choice than TCP) >> - A preliminary test done locally shows an amortized throughput of >> around 180 jobs/s (/bin/date). This was done with encryption and with 10 >> workers. Pretty picture attached (total time vs. # of jobs) >> >> To do: >> - The scheduling algorithm in the service needs a bit more work >> - When worker messages are lost, some jobs may get lost (i.e. needs more >> fault tolerance) >> - Start testing it on actual clusters >> - Do some memory consumption benchmarks >> - Better allocation strategy for workers >> >> Mihael >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu Ph.D. 
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Sat Apr 5 08:25:36 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 05 Apr 2008 08:25:36 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> Message-ID: <47F77DD0.8040302@cs.uchicago.edu> I looked around for some docs on the networking structure, but couldn't find anything. There are several networks available on the BG/P: Torus, Tree, Barrier, RAS, 10Gig Ethernet. Of all these, we are only using the Ethernet network, which allows us to communicate via TCP/IP (or potentially UDP/IP) between compute nodes and I/O nodes, or between compute nodes and login nodes. For the rest of the discussion, we assume only Ethernet communication. There is 1 I/O node per 64 compute nodes (what we call a P-SET), and the I/O node can only communicate with compute nodes that it manages within the same P-SET (the 64 nodes). A compute node from one P-SET cannot directly communicate with another compute from a different P-SET. This is primarily because compute nodes have private addresses (192.168.x.x), I/O nodes are the NAT between the public IP and the private IP, and the login nodes only have a public IP. So, the compute nodes all have the same IP addresses, 192.168.x.x, and they repeat for every P-SET, and the I/O nodes handle their traffic in and out. Zhao, if you have any docs on the Ethernet network and the NAT that sits on the I/O node, can you please send it to the mailing list? Ioan Ben Clifford wrote: >> it would matter on a machine such as >> the BG/P, as there is a NAT inbetween the login nodes and the compute nodes. >> > > wierd. is there a description of that somewhere? > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Sat Apr 5 08:36:18 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 05 Apr 2008 08:36:18 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207388755.10629.12.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> <1207388755.10629.12.camel@blabla.mcs.anl.gov> Message-ID: <47F78052.5020702@cs.uchicago.edu> Mihael Hategan wrote: > On Fri, 2008-04-04 at 19:02 -0500, Ioan Raicu wrote: > >> You say that you use UDP on the workers. This might be more light >> weight, but might also pose practical issues. >> > > Of course. That is the trade-off. > > Right, there will be, the key is to be able to switch between TCP and UDP easily. >> Some of those are: >> - might not work well on any network other than a LAN >> > > It works exactly as it's supposed to: no guarantee of uniqueness, no > guarantee of order, no guarantee of integrity, and no guarantee of > reliability. One has to drop duplicates, do checksums, re-order, have > time-outs. > > >> - won't be friendly to firewalls or NATs, no matter if you the service >> pushes jobs, or workers pull jobs; the logic is that you need 2 way >> communication, and using UDP (being a connectionless protocol), its >> like having a server socket and a client socket on both ends of the >> communication at the same time. >> > > Precisely so. In Java you can use one UDP socket as both client and > server. But even if the abstraction is OK and allows you to use the same socket for both reads and writes, that doesn't mean that the NAT will actually set up the coresponding entries for you to have 2-way communication. With TCP, given the connection oriented protocol, NATs are fine as long as one initiates the connection from the inside the NAT, but with UDP, you will only be able to have outgoing messages, but incoming messages to the NAT will not have the rules setup. The only way I could see UDP working through the NAT is to have static rules setup ahead of time, that map between some PORTs on the NAT and IP:PORT on the compute nodes.... > Perl seems to be nastier as it won't let you send and receive on > the same socket (at least in the implementation I've seen). > > >> This might not matter if the service and the worker are on the same >> LAN with no NATs or firewalls in the middle, but, it would matter on a >> machine such as the BG/P, as there is a NAT inbetween the login nodes >> and the compute nodes. >> > > That's odd. Do you have anything to back that up? > > Compute nodes have a private address per P-SET (64 nodes), and there are 16 P-SETs in the current machine we use, and there will be 640 P-SETs in the final machine. The I/O nodes (1 per P-SET) act as a NAT and have network connectivity on both public and private networks, and the login nodes only have access to the public network. This has been our experience for the past 2 months in using TCP/IP on the BG/P. Zhao, if you have anything else to add (especially links to docs confirming what I just said), please do so. 
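To make the NAT constraint concrete, here is a hypothetical Java sketch of the worker-initiates pattern: the worker behind the I/O-node NAT opens an outbound TCP connection to the service, and the service then pushes work back over that same connection, so no inbound connection or static port mapping is needed. The host name, port, and line-oriented "protocol" are invented for illustration and are not the Falkon or coaster wire protocol.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class OutboundWorkerSketch {
    public static void main(String[] args) throws Exception {
        // The worker, sitting behind the I/O-node NAT, dials out to the service;
        // the NAT only has to track this single outbound TCP connection.
        Socket s = new Socket("login1.example.org", 50001);   // hypothetical service address
        PrintWriter out = new PrintWriter(s.getOutputStream(), true);
        BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));

        out.println("REGISTER worker-42");        // announce ourselves to the service
        String task;
        while ((task = in.readLine()) != null) {  // the service pushes work back on the same socket
            out.println("DONE " + task);          // results flow back out through the NAT
        }
        s.close();
    }
}

This is the "initiate the connection from the inside the NAT" case described above.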
>> In essence, for this to work on the BG/P, you'll need to avoid >> having server side sockets on the compute nodes (workers), and you'll >> probably only be able to do that via a connection oriented protocol >> (i.e. TCP). Is switching to TCP a relatively straight forward option? >> If not, it might be worth implementing to make the implementation more >> flexible >> - loosing messages and recovering from them will likely be harder than >> anticipated; I have a UDP version of the notification engine that >> Falkon uses, and after much debugging, I gave up and switched over to >> TCP. It worked most of the time, but the occasional lost message (1 >> in 1000s, maybe even more rare) made Falkon unreliable, and hence I >> stopped using it. >> > > Of course it's unreliable unless you deal with the reliability issues as > outlined above. > I did deal with them, duplicates, out of order, retries, timeouts, etc... yet, I still couldn't get a 100% reliable implementation, and I gave up... in theory, UDP should work given that you deal with all the reliability issues you outlined. I am just pointing out that after lots of debugging, I gave in and swapped UDP for TCP to avoid the unexplained lost message once in a while. I am positive it was a bug in my code, so perhaps you'll have better luck! > >> Is the 180 tasks/sec the overall throughput measured from Swift's >> point of view, including overhead of wrapper.sh? Or is that a >> micro-benchmark measuring just the coaster performance? >> > > It's at the provider level. No wrapper.sh. > OK, great! Ioan > >> Ioan >> >> >> Mihael Hategan wrote: >> >>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote: >>> >>> >>>> Mihael, this is great progress - very exciting. >>>> Some questions (dont need answers right away): >>>> >>>> How would the end user use it? Manually start a service? >>>> Is the service a separate process, or in the swift jvm? >>>> >>>> >>> I though the lines below answered some of these. >>> >>> A user would specify the coaster provider in sites.xml. The provider >>> will then automatically deploy a service on the target machine without >>> the user having to do so. Given that the service is on a different >>> machine than the client, they can't be in the same JVM. >>> >>> >>> >>>> How are the number of workers set or adjusted? >>>> >>>> >>> Currently workers are requested as much as needed, up to a maximum. This >>> is preliminary hence "Better allocation strategy for workers". >>> >>> >>> >>>> Does a service manage workers on one cluster or many? >>>> >>>> >>> One service per cluster. >>> >>> >>> >>>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, >>>> worker and service? >>>> >>>> >>> I faintly recall them being at less than 50% for some reason I don't >>> understand. >>> >>> >>> >>>> Do you want to try this on the workflows we're running on Falkon on the >>>> BGP and SiCortex? >>>> >>>> >>> Let me repeat "prototype" and "more testing". In no way do I want to do >>> preliminary testing with an application that is shaky on an architecture >>> that is also shaky. >>> >>> Mihael >>> >>> >>> >>>> Im eager to try it when you feel its ready for others to test. >>>> >>>> Nice work! 
>>>> >>>> - Mike >>>> >>>> >>>> >>>> On 4/4/08 4:39 AM, Mihael Hategan wrote: >>>> >>>> >>>>> I've been asked for a summary of the status of the coaster prototype, so >>>>> here it is: >>>>> - It's a prototype so bugs are plenty >>>>> - It's self deployed (you don't need to start a service on the target >>>>> cluster) >>>>> - You can also use it while starting a service on the target cluster >>>>> - There is a worker written in Perl >>>>> - It uses encryption between client and coaster service >>>>> - It uses UDP between the service and the workers (this may prove to be >>>>> better or worse choice than TCP) >>>>> - A preliminary test done locally shows an amortized throughput of >>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10 >>>>> workers. Pretty picture attached (total time vs. # of jobs) >>>>> >>>>> To do: >>>>> - The scheduling algorithm in the service needs a bit more work >>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more >>>>> fault tolerance) >>>>> - Start testing it on actual clusters >>>>> - Do some memory consumption benchmarks >>>>> - Better allocation strategy for workers >>>>> >>>>> Mihael >>>>> >>>>> >>>>> ------------------------------------------------------------------------ >>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Sat Apr 5 08:47:09 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 05 Apr 2008 08:47:09 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207389286.10629.16.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> <1207388755.10629.12.camel@blabla.mcs.anl.gov> <1207389286.10629.16.camel@blabla.mcs.anl.gov> Message-ID: <47F782DD.5070805@cs.uchicago.edu> Mihael Hategan wrote: >>> This might not matter if the service and the worker are on the same >>> LAN with no NATs or firewalls in the middle, but, it would matter on a >>> machine such as the BG/P, as there is a NAT inbetween the login nodes >>> and the compute nodes. >>> >> That's odd. Do you have anything to back that up? >> >> > > Really really odd. I mean MPI has to work between any two worker nodes. > If they are on separate networks with NAT in-between, this would be > rather difficult. > MPI doesn't use the Ethernet network. There are 5 networks to choose from (Torus, Tree, Barrier, RAS, 10Gig Ethernet), and I bet the NAT is only on one of them. However, the Ethernet network is important, because we want to use TCP/UDP/IP so we can leverage code and systems that work in a typical Linux environment that traditionally only has Ethernet networks. So, if you are willing to use MPI to communicate between service and workers, then you will likely not have to deal with a NAT. However, then this might limit the generality of the implementation, as some Linux clusters might not have the necessary MPI packages installed. The middle ground that we found useful, use TCP, and initiate all communication from the workers; this approach has worked for us great so far! We have been able to scale on the BG/P to 4K workers, and on the SiCortex with 5.8K workers. I expect our current TCP-based implementation to scale to at least 10K workers per service, maybe more. More testing is needed to find the upper bound of how many workers we can manage with the current login nodes memory capacity (4GB) and the quad-cpu systems we have. Ioan > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Sat Apr 5 13:06:51 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 05 Apr 2008 13:06:51 -0500 Subject: [Swift-devel] Re: plan for 0.5 release In-Reply-To: References: <47F477DD.6040001@mcs.anl.gov> <1207214571.14147.5.camel@blabla.mcs.anl.gov> <47F4CA15.9090303@mcs.anl.gov> <47F52A15.6060209@mcs.anl.gov> <47F53097.3000909@mcs.anl.gov> Message-ID: <1207418811.23834.0.camel@blabla.mcs.anl.gov> Issue should be fixed in cog r1956. On Thu, 2008-04-03 at 19:40 +0000, Ben Clifford wrote: > > could we have a flag so that if people want to turn it on, they can? (assuming > > it can work in some settings.) > > it doesn't work in any situation i've tried at the moment. when mihael has > done his stuff it will work just fine. until then what I'm looking for is > a quick non-damaging fix to cover from now until that point that lies > somewhere (hopefully in the next six months) in the future. > From hategan at mcs.anl.gov Sun Apr 6 04:14:11 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 06 Apr 2008 04:14:11 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <47F77DD0.8040302@cs.uchicago.edu> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> <47F77DD0.8040302@cs.uchicago.edu> Message-ID: <1207473251.10063.1.camel@blabla.mcs.anl.gov> On Sat, 2008-04-05 at 08:25 -0500, Ioan Raicu wrote: > I looked around for some docs on the networking structure, but couldn't > find anything. > > There are several networks available on the BG/P: Torus, Tree, Barrier, > RAS, 10Gig Ethernet. > > Of all these, we are only using the Ethernet network, which allows us to > communicate via TCP/IP (or potentially UDP/IP) between compute nodes and > I/O nodes, or between compute nodes and login nodes. For the rest of > the discussion, we assume only Ethernet communication. There is 1 I/O > node per 64 compute nodes (what we call a P-SET), and the I/O node can > only communicate with compute nodes that it manages within the same > P-SET (the 64 nodes). A compute node from one P-SET cannot directly > communicate with another compute from a different P-SET. This is > primarily because compute nodes have private addresses (192.168.x.x), > I/O nodes are the NAT between the public IP and the private IP, and the > login nodes only have a public IP. So, the compute nodes all have the > same IP addresses, 192.168.x.x, and they repeat for every P-SET, and the > I/O nodes handle their traffic in and out. You are describing NAT. I understand what NAT is. I was looking for an independent source confirming this. > > > Zhao, if you have any docs on the Ethernet network and the NAT that sits > on the I/O node, can you please send it to the mailing list? > > Ioan > > Ben Clifford wrote: > >> it would matter on a machine such as > >> the BG/P, as there is a NAT inbetween the login nodes and the compute nodes. > >> > > > > wierd. is there a description of that somewhere? 
> > > > > From hategan at mcs.anl.gov Sun Apr 6 04:17:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 06 Apr 2008 04:17:22 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <47F78052.5020702@cs.uchicago.edu> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> <1207388755.10629.12.camel@blabla.mcs.anl.gov> <47F78052.5020702@cs.uchicago.edu> Message-ID: <1207473442.10063.3.camel@blabla.mcs.anl.gov> > > > > Of course it's unreliable unless you deal with the reliability issues as > > outlined above. > > > I did deal with them, duplicates, out of order, retries, timeouts, > etc... yet, I still couldn't get a 100% reliable implementation, Of course you couldn't. It's impossible. > and I > gave up... in theory, UDP should work given that you deal with all the > reliability issues you outlined. I am just pointing out that after lots > of debugging, I gave in and swapped UDP for TCP to avoid the unexplained > lost message once in a while. I am positive it was a bug in my code, so > perhaps you'll have better luck! > > > >> Is the 180 tasks/sec the overall throughput measured from Swift's > >> point of view, including overhead of wrapper.sh? Or is that a > >> micro-benchmark measuring just the coaster performance? > >> > > > > It's at the provider level. No wrapper.sh. > > > OK, great! > > Ioan > > > >> Ioan > >> > >> > >> Mihael Hategan wrote: > >> > >>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote: > >>> > >>> > >>>> Mihael, this is great progress - very exciting. > >>>> Some questions (dont need answers right away): > >>>> > >>>> How would the end user use it? Manually start a service? > >>>> Is the service a separate process, or in the swift jvm? > >>>> > >>>> > >>> I though the lines below answered some of these. > >>> > >>> A user would specify the coaster provider in sites.xml. The provider > >>> will then automatically deploy a service on the target machine without > >>> the user having to do so. Given that the service is on a different > >>> machine than the client, they can't be in the same JVM. > >>> > >>> > >>> > >>>> How are the number of workers set or adjusted? > >>>> > >>>> > >>> Currently workers are requested as much as needed, up to a maximum. This > >>> is preliminary hence "Better allocation strategy for workers". > >>> > >>> > >>> > >>>> Does a service manage workers on one cluster or many? > >>>> > >>>> > >>> One service per cluster. > >>> > >>> > >>> > >>>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, > >>>> worker and service? > >>>> > >>>> > >>> I faintly recall them being at less than 50% for some reason I don't > >>> understand. > >>> > >>> > >>> > >>>> Do you want to try this on the workflows we're running on Falkon on the > >>>> BGP and SiCortex? > >>>> > >>>> > >>> Let me repeat "prototype" and "more testing". In no way do I want to do > >>> preliminary testing with an application that is shaky on an architecture > >>> that is also shaky. > >>> > >>> Mihael > >>> > >>> > >>> > >>>> Im eager to try it when you feel its ready for others to test. > >>>> > >>>> Nice work! 
> >>>> > >>>> - Mike > >>>> > >>>> > >>>> > >>>> On 4/4/08 4:39 AM, Mihael Hategan wrote: > >>>> > >>>> > >>>>> I've been asked for a summary of the status of the coaster prototype, so > >>>>> here it is: > >>>>> - It's a prototype so bugs are plenty > >>>>> - It's self deployed (you don't need to start a service on the target > >>>>> cluster) > >>>>> - You can also use it while starting a service on the target cluster > >>>>> - There is a worker written in Perl > >>>>> - It uses encryption between client and coaster service > >>>>> - It uses UDP between the service and the workers (this may prove to be > >>>>> better or worse choice than TCP) > >>>>> - A preliminary test done locally shows an amortized throughput of > >>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10 > >>>>> workers. Pretty picture attached (total time vs. # of jobs) > >>>>> > >>>>> To do: > >>>>> - The scheduling algorithm in the service needs a bit more work > >>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more > >>>>> fault tolerance) > >>>>> - Start testing it on actual clusters > >>>>> - Do some memory consumption benchmarks > >>>>> - Better allocation strategy for workers > >>>>> > >>>>> Mihael > >>>>> > >>>>> > >>>>> ------------------------------------------------------------------------ > >>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>> > >>>>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > >>> > >> -- > >> =================================================== > >> Ioan Raicu > >> Ph.D. Candidate > >> =================================================== > >> Distributed Systems Laboratory > >> Computer Science Department > >> University of Chicago > >> 1100 E. 58th Street, Ryerson Hall > >> Chicago, IL 60637 > >> =================================================== > >> Email: iraicu at cs.uchicago.edu > >> Web: http://www.cs.uchicago.edu/~iraicu > >> http://dev.globus.org/wiki/Incubator/Falkon > >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > >> =================================================== > >> =================================================== > >> > >> > > > > > > > From wilde at mcs.anl.gov Sun Apr 6 20:20:27 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 06 Apr 2008 20:20:27 -0500 Subject: [Swift-devel] Problem calling atomic procedure with multiple args via Falkon Message-ID: <47F976DB.2060500@mcs.anl.gov> I'm debugging a problem where I changed an atomic proc from one input arg to two. Im using the patched wrapper.sh (Ben's 3 patches to run in /tmp). Seems to work for local execution. 
With Falkon execution on BGP I'm getting this error:

bg$ cat ./status/h/dockwrap1-hpaczuqi-error
Missing -of argument

It looks like Falkon is getting the following command from Swift and sending it to its BGP worker:

Sent task to worker 172.16.3.12:33161: 426 urn:0-1-1-1207529458407#/bin/bash#shared/wrapper.sh dockwrap1-hpaczuqi -jobdir h -e /home/wilde/dock/bin/dockwrap1.cn -out stdout.txt -err stderr.txt -i -d mol-1M/8269 -if mol-1M/8269/058269.in|mol-1M/8269/058269.mol2 -of mol-1M/8269/058269.out -k -a /home/wilde/dock/DOCK5_bgp_ram.tgz dock_bgp_login mol-1M/8269/058269.in mol-1M/8269/058269.mol2 mol-1M/8269/058269.out # #/home/wilde/swiftwork/dock2-20080406-1950-krt29l04#

The problem seems to stem from this arg:

-if mol-1M/8269/058269.in|mol-1M/8269/058269.mol2

The value after the -if arg needs to be in quotes to shield it from the shell, as Falkon takes the string and invokes wrapper.sh using system(). The shell metacharacter "|" is causing the command to end there.

It's not clear if the arg is not quoted because in other providers it's somehow shielded from shell evaluation, or if Falkon or the deef provider is pulling off the quotes.

Can anyone spot where the problem is?

From zhoujianghua1017 at 163.com Sun Apr 6 20:15:39 2008
From: zhoujianghua1017 at 163.com (jezhee)
Date: Mon, 7 Apr 2008 09:15:39 +0800
Subject: [Swift-devel] The method getting result back
Message-ID: <200804070915377736963@163.com>

After the tasks are dispatched to the computing nodes, what should Swift do? Right now it simply blocks and waits until the tasks are completed and the results are sent back. That is suitable for lightweight tasks, or for situations where Swift handles only a few applications. When Swift runs as a server for several applications, simply blocking leads to waste and inconvenience.

In a future version, I think this work should be done by an independent thread running as a daemon. Once the tasks have been transferred to the computing grid, the main thread hands the work over to this thread, which does the remaining work. Even for a simple application, there is no harm in this.

Jezhee
2008-04-07

From benc at hawaga.org.uk Mon Apr 7 01:18:49 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 7 Apr 2008 06:18:49 +0000 (GMT)
Subject: [Swift-devel] Problem calling atomic procedure with multiple args via Falkon
In-Reply-To: <47F976DB.2060500@mcs.anl.gov>
References: <47F976DB.2060500@mcs.anl.gov>
Message-ID:

On Sun, 6 Apr 2008, Michael Wilde wrote:

> invokes wrapper.sh using system(). The shell metacharacter "|" is causing
> the command to end there.

[...]

> Can anyone spot where the problem is?

Using system to invoke the command is perhaps a bad thing to do - the other layers in the stack (including in Falkon in the Java worker) keep the arguments in array-like data structures to help avoid the need for quoting.

The C worker isn't portable enough to build on my laptop so I can't easily play there, but you might try yourself replacing the system call with execve or something like that.
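To illustrate the array-style invocation used in the Java worker (this is not the actual Falkon worker code, and the argument list is trimmed to a few of the values from the failing task), the following sketch launches wrapper.sh with its arguments kept as a list, so the value containing "|" reaches the child process as a single argument and never passes through a shell:

import java.util.Arrays;
import java.util.List;

public class ArgArraySketch {
    public static void main(String[] args) throws Exception {
        // Each element is exactly one argv entry handed to the child process;
        // no shell is involved, so the "|" inside the -if value needs no quoting.
        List<String> cmd = Arrays.asList(
                "/bin/bash", "shared/wrapper.sh", "dockwrap1-hpaczuqi",
                "-if", "mol-1M/8269/058269.in|mol-1M/8269/058269.mol2",
                "-of", "mol-1M/8269/058269.out");
        Process p = new ProcessBuilder(cmd).start();
        System.exit(p.waitFor());
    }
}

The C worker would get the same effect by building an argv[] array and calling execvp() or execve() instead of system().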
-- From benc at hawaga.org.uk Mon Apr 7 02:03:27 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 7 Apr 2008 07:03:27 +0000 (GMT) Subject: [Swift-devel] coaster status summary In-Reply-To: References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> Message-ID: there's an interesting issue here of how much we should (in our general codebase, rather than in special customisations such as plugging in falkon) be supporting 'wierd' systems such as the BG/P which have, apparently, neither a decent shared filesystem or IP layer network fabric, vs support for more traditional clusters which seem to have both IP layer interconnect between nodes and decent shared filesystems. -- From hategan at mcs.anl.gov Mon Apr 7 02:50:11 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Apr 2008 02:50:11 -0500 Subject: [Swift-devel] The method getting result back In-Reply-To: <200804070915377736963@163.com> References: <200804070915377736963@163.com> Message-ID: <1207554611.22686.0.camel@blabla.mcs.anl.gov> Swift already uses lightweight threading. There is a limited number of OS threads that do the work. On Mon, 2008-04-07 at 09:15 +0800, jezhee wrote: > After the tasks are dispatcjed to the computing nodes, what should swift do? Now, it simply blocks and waits until the tasks are completed and send the result back. It's suitable for the light-weighted tasks or the situation swift handles a few applications. When Swift runs as a server for several applications, simple block will lead to waste and inconvenience. > In the future version, I think this work should be done by a independent thread running as a demon. When the tasks has been transfered to the computing grid, the main thread switched the work to this thread. This thread will do the left stuff. Even for the simple application, there is no harm too. > ?Jezhee > 2008-04-07 > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Apr 7 02:59:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Apr 2008 02:59:52 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> Message-ID: <1207555192.22686.6.camel@blabla.mcs.anl.gov> On Mon, 2008-04-07 at 07:03 +0000, Ben Clifford wrote: > there's an interesting issue here of how much we should (in our general > codebase, rather than in special customisations such as plugging in > falkon) be supporting 'wierd' systems such as the BG/P which have, > apparently, neither a decent shared filesystem or IP layer network fabric, > vs support for more traditional clusters which seem to have both IP layer > interconnect between nodes and decent shared filesystems. That is a good point. However, UDP was chosen to support systems with a very large number of CPUs in the first place. In other words if UDP won't work on BG/P, I don't see much reason for going with it. > From hategan at mcs.anl.gov Mon Apr 7 03:04:20 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Apr 2008 03:04:20 -0500 Subject: [Swift-devel] swift 0.5rc1 In-Reply-To: References: Message-ID: <1207555460.22686.8.camel@blabla.mcs.anl.gov> cog r1961 fixes some issues that would prevent the gridftp connection caching mechanism from caching things. I think it's worth an rc2. 
On Sat, 2008-04-05 at 12:07 +0000, Ben Clifford wrote: > Swift 0.5 release candidate 1 is at > http://www.ci.uchicago.edu/~benc/vdsk-0.5rc1.tar.gz > > This is primarily bugfixes for bugs that were found around the time of the > 0.4 release - syntax error handling that was poorly tested before 0.4; and > data channel caching problems. There shouldn't be many (if any) new > features here. > > Please test. > > If there are no significant problems, I'll put this out as 0.5 on tuesday. > From hategan at mcs.anl.gov Mon Apr 7 03:14:23 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Apr 2008 03:14:23 -0500 Subject: [Swift-devel] Problem calling atomic procedure with multiple args via Falkon In-Reply-To: <47F976DB.2060500@mcs.anl.gov> References: <47F976DB.2060500@mcs.anl.gov> Message-ID: <1207556063.22686.14.camel@blabla.mcs.anl.gov> > The problem seems to stem from this arg: > > -if mol-1M/8269/058269.in|mol-1M/8269/058269.mol2 I'd say the problem stems from improper passing of the arguments by some layer somewhere. > > The value after the -if arg needs to be in quotes to shield it from the > shell, as falkon takes the string and involves wrapper.sh using > system(). ...presumably by concatenating the arguments into a single string and hoping system() will split them correctly. Falkon should use execve(). > The IFS char "|" is causing the cmd to end there. > > Its not clear if the arg is not quoted because in other providers its > somehow shielded from shell evaluation, or if Falkon or the deef > provider is pulling off the quotes. > > Can anyone spot where the problem is? > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From foster at mcs.anl.gov Mon Apr 7 07:35:41 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 07 Apr 2008 07:35:41 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207555192.22686.6.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> Message-ID: <47FA151D.1080504@mcs.anl.gov> I wonder whether we should be making use of MPI on the BG/P where we can ... I suspect that is what is optimized, rather than the IP stack. Mihael Hategan wrote: > On Mon, 2008-04-07 at 07:03 +0000, Ben Clifford wrote: > >> there's an interesting issue here of how much we should (in our general >> codebase, rather than in special customisations such as plugging in >> falkon) be supporting 'wierd' systems such as the BG/P which have, >> apparently, neither a decent shared filesystem or IP layer network fabric, >> vs support for more traditional clusters which seem to have both IP layer >> interconnect between nodes and decent shared filesystems. >> > > That is a good point. However, UDP was chosen to support systems with a > very large number of CPUs in the first place. In other words if UDP > won't work on BG/P, I don't see much reason for going with it. > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Mon Apr 7 07:45:35 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Apr 2008 07:45:35 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <47FA151D.1080504@mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> <47FA151D.1080504@mcs.anl.gov> Message-ID: <1207572335.27797.5.camel@blabla.mcs.anl.gov> One unfortunate part there is the lack of a decent Java MPI implementation. Another unfortunate part may be that MPI may not work very well with processes that come and go dynamically, but I guess that can be addressed in a way or another. On Mon, 2008-04-07 at 07:35 -0500, Ian Foster wrote: > I wonder whether we should be making use of MPI on the BG/P where we > can ... I suspect that is what is optimized, rather than the IP stack. > > Mihael Hategan wrote: > > On Mon, 2008-04-07 at 07:03 +0000, Ben Clifford wrote: > > > > > there's an interesting issue here of how much we should (in our general > > > codebase, rather than in special customisations such as plugging in > > > falkon) be supporting 'wierd' systems such as the BG/P which have, > > > apparently, neither a decent shared filesystem or IP layer network fabric, > > > vs support for more traditional clusters which seem to have both IP layer > > > interconnect between nodes and decent shared filesystems. > > > > > > > That is a good point. However, UDP was chosen to support systems with a > > very large number of CPUs in the first place. In other words if UDP > > won't work on BG/P, I don't see much reason for going with it. > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From benc at hawaga.org.uk Mon Apr 7 07:49:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 7 Apr 2008 12:49:40 +0000 (GMT) Subject: [Swift-devel] coaster status summary In-Reply-To: <1207572335.27797.5.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> <47FA151D.1080504@mcs.anl.gov> <1207572335.27797.5.camel@blabla.mcs.anl.gov> Message-ID: Wary of excessive optimisation of job completion notification speed in order to get high 'trivial/useless job' numbers, when there also seem to be problems getting shared filesystem access fast enough for non-useless jobs. Getting a ridiculously high trivial job throughput is not (in my eyes) a design goal of this coaster work. -- From foster at mcs.anl.gov Mon Apr 7 07:59:25 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 07 Apr 2008 07:59:25 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> <47FA151D.1080504@mcs.anl.gov> <1207572335.27797.5.camel@blabla.mcs.anl.gov> Message-ID: <47FA1AAD.5080009@mcs.anl.gov> YES! I agree absolutely. Ben Clifford wrote: > Wary of excessive optimisation of job completion notification speed in > order to get high 'trivial/useless job' numbers, when there also seem to > be problems getting shared filesystem access fast enough for non-useless > jobs. Getting a ridiculously high trivial job throughput is not (in my > eyes) a design goal of this coaster work. 
> > From hategan at mcs.anl.gov Mon Apr 7 08:08:29 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Apr 2008 08:08:29 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> <47FA151D.1080504@mcs.anl.gov> <1207572335.27797.5.camel@blabla.mcs.anl.gov> Message-ID: <1207573709.27797.16.camel@blabla.mcs.anl.gov> On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote: > Wary of excessive optimisation of job completion notification speed in > order to get high 'trivial/useless job' numbers, when there also seem to > be problems getting shared filesystem access fast enough for non-useless > jobs. Getting a ridiculously high trivial job throughput is not (in my > eyes) a design goal of this coaster work. 200 j/s should be enough for anybody. Joking aside, the issue was ability to scale to large number of jobs rather than speed. But it looks like the issue is only an issue for monsters such as the BG/P. > From benc at hawaga.org.uk Mon Apr 7 03:40:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 7 Apr 2008 08:40:34 +0000 (GMT) Subject: [Swift-devel] swift 0.5rc1 In-Reply-To: <1207555460.22686.8.camel@blabla.mcs.anl.gov> References: <1207555460.22686.8.camel@blabla.mcs.anl.gov> Message-ID: ok. I will put one out later. On Mon, 7 Apr 2008, Mihael Hategan wrote: > cog r1961 fixes some issues that would prevent the gridftp connection > caching mechanism from caching things. I think it's worth an rc2. -- From wilde at mcs.anl.gov Mon Apr 7 12:18:26 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 07 Apr 2008 12:18:26 -0500 Subject: [Swift-devel] Problem calling atomic procedure with multiple args via Falkon In-Reply-To: References: <47F976DB.2060500@mcs.anl.gov> Message-ID: <47FA5762.2070408@mcs.anl.gov> On 4/7/08 1:18 AM, Ben Clifford wrote: > On Sun, 6 Apr 2008, Michael Wilde wrote: > >> involves wrapper.sh using system(). The IFS char "|" is causing the cmd >> to end there. > > [...] > >> Can anyone spot where the problem is? > > Using system to invoke the command is perhaps a bad thing to do - the > other layers in the stack (including in Falkon in the Java worker) keep > the arguments in array-like data structures to help avoid need for > quoting. That makes sense. Can you start some documentation fragments in the users guide on how quoting and tokenization works for atomic procedures from the swift declaration down to the actual invocation? something like: - each token in the app {} declaration becomes one arg to execve() - strings must be enclosed in quotes - quotes and other special chars within the strings can be represented as... - quotes are expected to pass through the provider interfaces(GRAMs, PBS, Falkon etc) without further processing or alteration...??? -etc > The C worker isn't portable enough to build on my laptop so I can't easily > play there, but you might try yourself replacing the system call with > execve or something like that. That sounds reasonable - we can do that. I was able to work around the problem for the moment by changing | to "," just before the system() call, but I agree that execve() is the right way to do things. 
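To make the execve() suggestion concrete, here is a minimal C sketch of the difference; it is not the actual Falkon worker code and the wrapper path is a made-up example. system() hands one concatenated command string to /bin/sh, so an argument value containing "|" is re-parsed as a pipe, whereas fork() plus execv() passes each already-separated argument through literally, with no shell and no quoting needed.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Roughly what happens today: the whole string goes through /bin/sh,
       so the "|" inside the second value truncates the command.
    system("./wrapper.sh -if mol-1M/8269/058269.in|mol-1M/8269/058269.mol2");
    */

    /* With execv(), each element reaches wrapper.sh as one literal argv
       entry, shell metacharacters included. */
    char *args[] = {
        "./wrapper.sh",                                    /* made-up path */
        "-if",
        "mol-1M/8269/058269.in|mol-1M/8269/058269.mol2",   /* arrives intact */
        NULL
    };

    pid_t pid = fork();
    if (pid == 0) {
        execv(args[0], args);
        perror("execv");              /* reached only if exec itself fails */
        _exit(127);
    } else if (pid < 0) {
        perror("fork");
        return 1;
    }

    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}

The same principle is why the Java layers keep arguments in array form: the argument list should stay a list all the way down to exec.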
- Mike From iraicu at cs.uchicago.edu Mon Apr 7 13:16:25 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 07 Apr 2008 13:16:25 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207473442.10063.3.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> <1207388755.10629.12.camel@blabla.mcs.anl.gov> <47F78052.5020702@cs.uchicago.edu> <1207473442.10063.3.camel@blabla.mcs.anl.gov> Message-ID: <47FA64F9.8060005@cs.uchicago.edu> Although, when switching to TCP, most of my problems magically went away... obviously TCP's error recovery mechanisms are more robust than what I implemented. The moral of the story is from my experience, have a UDP option for potentially better performance and scalability, but have TCP as a configurable option for potentially better reliability and robustness. Ioan Mihael Hategan wrote: >>> Of course it's unreliable unless you deal with the reliability issues as >>> outlined above. >>> >>> >> I did deal with them, duplicates, out of order, retries, timeouts, >> etc... yet, I still couldn't get a 100% reliable implementation, >> > > Of course you couldn't. It's impossible. > > >> and I >> gave up... in theory, UDP should work given that you deal with all the >> reliability issues you outlined. I am just pointing out that after lots >> of debugging, I gave in and swapped UDP for TCP to avoid the unexplained >> lost message once in a while. I am positive it was a bug in my code, so >> perhaps you'll have better luck! >> >>> >>> >>>> Is the 180 tasks/sec the overall throughput measured from Swift's >>>> point of view, including overhead of wrapper.sh? Or is that a >>>> micro-benchmark measuring just the coaster performance? >>>> >>>> >>> It's at the provider level. No wrapper.sh. >>> >>> >> OK, great! >> >> Ioan >> >>> >>> >>>> Ioan >>>> >>>> >>>> Mihael Hategan wrote: >>>> >>>> >>>>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote: >>>>> >>>>> >>>>> >>>>>> Mihael, this is great progress - very exciting. >>>>>> Some questions (dont need answers right away): >>>>>> >>>>>> How would the end user use it? Manually start a service? >>>>>> Is the service a separate process, or in the swift jvm? >>>>>> >>>>>> >>>>>> >>>>> I though the lines below answered some of these. >>>>> >>>>> A user would specify the coaster provider in sites.xml. The provider >>>>> will then automatically deploy a service on the target machine without >>>>> the user having to do so. Given that the service is on a different >>>>> machine than the client, they can't be in the same JVM. >>>>> >>>>> >>>>> >>>>> >>>>>> How are the number of workers set or adjusted? >>>>>> >>>>>> >>>>>> >>>>> Currently workers are requested as much as needed, up to a maximum. This >>>>> is preliminary hence "Better allocation strategy for workers". >>>>> >>>>> >>>>> >>>>> >>>>>> Does a service manage workers on one cluster or many? >>>>>> >>>>>> >>>>>> >>>>> One service per cluster. >>>>> >>>>> >>>>> >>>>> >>>>>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, >>>>>> worker and service? >>>>>> >>>>>> >>>>>> >>>>> I faintly recall them being at less than 50% for some reason I don't >>>>> understand. >>>>> >>>>> >>>>> >>>>> >>>>>> Do you want to try this on the workflows we're running on Falkon on the >>>>>> BGP and SiCortex? >>>>>> >>>>>> >>>>>> >>>>> Let me repeat "prototype" and "more testing". 
In no way do I want to do >>>>> preliminary testing with an application that is shaky on an architecture >>>>> that is also shaky. >>>>> >>>>> Mihael >>>>> >>>>> >>>>> >>>>> >>>>>> Im eager to try it when you feel its ready for others to test. >>>>>> >>>>>> Nice work! >>>>>> >>>>>> - Mike >>>>>> >>>>>> >>>>>> >>>>>> On 4/4/08 4:39 AM, Mihael Hategan wrote: >>>>>> >>>>>> >>>>>> >>>>>>> I've been asked for a summary of the status of the coaster prototype, so >>>>>>> here it is: >>>>>>> - It's a prototype so bugs are plenty >>>>>>> - It's self deployed (you don't need to start a service on the target >>>>>>> cluster) >>>>>>> - You can also use it while starting a service on the target cluster >>>>>>> - There is a worker written in Perl >>>>>>> - It uses encryption between client and coaster service >>>>>>> - It uses UDP between the service and the workers (this may prove to be >>>>>>> better or worse choice than TCP) >>>>>>> - A preliminary test done locally shows an amortized throughput of >>>>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10 >>>>>>> workers. Pretty picture attached (total time vs. # of jobs) >>>>>>> >>>>>>> To do: >>>>>>> - The scheduling algorithm in the service needs a bit more work >>>>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more >>>>>>> fault tolerance) >>>>>>> - Start testing it on actual clusters >>>>>>> - Do some memory consumption benchmarks >>>>>>> - Better allocation strategy for workers >>>>>>> >>>>>>> Mihael >>>>>>> >>>>>>> >>>>>>> ------------------------------------------------------------------------ >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>> >>>>>>> >>>>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>>> >>>>> >>>>> >>>> -- >>>> =================================================== >>>> Ioan Raicu >>>> Ph.D. Candidate >>>> =================================================== >>>> Distributed Systems Laboratory >>>> Computer Science Department >>>> University of Chicago >>>> 1100 E. 58th Street, Ryerson Hall >>>> Chicago, IL 60637 >>>> =================================================== >>>> Email: iraicu at cs.uchicago.edu >>>> Web: http://www.cs.uchicago.edu/~iraicu >>>> http://dev.globus.org/wiki/Incubator/Falkon >>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >>>> =================================================== >>>> =================================================== >>>> >>>> >>>> >>> >>> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.uchicago.edu Mon Apr 7 13:16:53 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 07 Apr 2008 13:16:53 -0500 Subject: [Swift-devel] Problem calling atomic procedure with multiple args via Falkon In-Reply-To: References: <47F976DB.2060500@mcs.anl.gov> Message-ID: <47FA6515.3060000@cs.uchicago.edu> Ben Clifford wrote: > On Sun, 6 Apr 2008, Michael Wilde wrote: > > >> involves wrapper.sh using system(). The IFS char "|" is causing the cmd >> to end there. >> > > [...] > > >> Can anyone spot where the problem is? >> > > Using system to invoke the command is perhaps a bad thing to do - the > other layers in the stack (including in Falkon in the Java worker) keep > the arguments in array-like data structures to help avoid need for > quoting. > I agree, and one of these days (maybe sooner rather than later), we'll switch to fork() and exec(), rather than system. > The C worker isn't portable enough to build on my laptop so I can't easily > play there, The C worker is quite basic, what error do you get that it doesn't compile? It has compiled for me on numerous platforms as is, so if its something we need to fix in general to help it be more portable, let us know. Ioan > but you might try yourself replacing the system call with > execve or something like that. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Mon Apr 7 13:18:12 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 07 Apr 2008 13:18:12 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207573709.27797.16.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> <47FA151D.1080504@mcs.anl.gov> <1207572335.27797.5.camel@blabla.mcs.anl.gov> <1207573709.27797.16.camel@blabla.mcs.anl.gov> Message-ID: <47FA6564.60008@cs.uchicago.edu> I agree that the BG/P is the only system I can think of right now that won't work with the UDP scheme you currently have, assuming that you will run the service on a login node that has access to both compute nodes and external world (i.e. Swift). The compute nodes don't support Java, so you'd have to have some C/Fortran code, or maybe some scripting language (which I don't know what kind of support there is). If you use C or Fortran, MPI becomes a viable alternative. TCP has always been an alternative. Anyways, if UDP doesn't work on the BG/P, and the BG/P is the only scale large enough (today) that warrants a connectionless protocol, then I suggest you switch to TCP (which has worked for us well on the BG/P, and is general enough to work in most environments) or even MPI (but you loose the generality of TCP, but might gain performance). 
Ioan Mihael Hategan wrote: > On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote: > >> Wary of excessive optimisation of job completion notification speed in >> order to get high 'trivial/useless job' numbers, when there also seem to >> be problems getting shared filesystem access fast enough for non-useless >> jobs. Getting a ridiculously high trivial job throughput is not (in my >> eyes) a design goal of this coaster work. >> > > 200 j/s should be enough for anybody. > > Joking aside, the issue was ability to scale to large number of jobs > rather than speed. But it looks like the issue is only an issue for > monsters such as the BG/P. > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Mon Apr 7 13:19:25 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 07 Apr 2008 13:19:25 -0500 Subject: [Swift-devel] Problem calling atomic procedure with multiple args via Falkon In-Reply-To: <1207556063.22686.14.camel@blabla.mcs.anl.gov> References: <47F976DB.2060500@mcs.anl.gov> <1207556063.22686.14.camel@blabla.mcs.anl.gov> Message-ID: <47FA65AD.1060509@cs.uchicago.edu> Mihael Hategan wrote: >> The problem seems to stem from this arg: >> >> -if mol-1M/8269/058269.in|mol-1M/8269/058269.mol2 >> > > I'd say the problem stems from improper passing of the arguments by some > layer somewhere. > > >> The value after the -if arg needs to be in quotes to shield it from the >> shell, as falkon takes the string and involves wrapper.sh using >> system(). >> > > ...presumably by concatenating the arguments into a single string and > hoping system() will split them correctly. Falkon should use execve(). > > Right... its on the to-do list! http://bugzilla.globus.org/globus/show_bug.cgi?id=5987 >> The IFS char "|" is causing the cmd to end there. >> >> Its not clear if the arg is not quoted because in other providers its >> somehow shielded from shell evaluation, or if Falkon or the deef >> provider is pulling off the quotes. >> >> Can anyone spot where the problem is? >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Mon Apr 7 15:48:47 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 07 Apr 2008 15:48:47 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207473251.10063.1.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <47F61820.3090705@mcs.anl.gov> <1207311167.7171.12.camel@blabla.mcs.anl.gov> <47F6C1A4.5030200@cs.uchicago.edu> <47F77DD0.8040302@cs.uchicago.edu> <1207473251.10063.1.camel@blabla.mcs.anl.gov> Message-ID: <47FA88AF.7050302@cs.uchicago.edu> If you won't take my word for it, when I have been on the machine and have seen what I described first hand, then feel free to write tech support for the BG/P! Here is their email address: ALCF Support Cheers, Ioan Mihael Hategan wrote: > On Sat, 2008-04-05 at 08:25 -0500, Ioan Raicu wrote: > >> I looked around for some docs on the networking structure, but couldn't >> find anything. >> >> There are several networks available on the BG/P: Torus, Tree, Barrier, >> RAS, 10Gig Ethernet. >> >> Of all these, we are only using the Ethernet network, which allows us to >> communicate via TCP/IP (or potentially UDP/IP) between compute nodes and >> I/O nodes, or between compute nodes and login nodes. For the rest of >> the discussion, we assume only Ethernet communication. There is 1 I/O >> node per 64 compute nodes (what we call a P-SET), and the I/O node can >> only communicate with compute nodes that it manages within the same >> P-SET (the 64 nodes). A compute node from one P-SET cannot directly >> communicate with another compute from a different P-SET. This is >> primarily because compute nodes have private addresses (192.168.x.x), >> I/O nodes are the NAT between the public IP and the private IP, and the >> login nodes only have a public IP. So, the compute nodes all have the >> same IP addresses, 192.168.x.x, and they repeat for every P-SET, and the >> I/O nodes handle their traffic in and out. >> > > You are describing NAT. I understand what NAT is. I was looking for an > independent source confirming this. > > >> >> >> Zhao, if you have any docs on the Ethernet network and the NAT that sits >> on the I/O node, can you please send it to the mailing list? >> >> Ioan >> >> Ben Clifford wrote: >> >>>> it would matter on a machine such as >>>> the BG/P, as there is a NAT inbetween the login nodes and the compute nodes. >>>> >>>> >>> wierd. is there a description of that somewhere? >>> >>> >>> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From duxu at mcs.anl.gov Mon Apr 7 09:57:55 2008 From: duxu at mcs.anl.gov (Xu Du) Date: Mon, 7 Apr 2008 09:57:55 -0500 Subject: [Swift-devel] Swift Innovation for BOINC : Design Spec References: <00ec01c89361$e21c5ec0$6a08dd8c@karen> <47F22DCE.5070405@mcs.anl.gov> Message-ID: <004601c898bf$c6335ed0$9a01a8c0@karen> Dear Mike, I have updated the design spec of "Swift Innovation for BOINC"; please find it in the attachment. I have also copied it to the swift-devel members; any suggestions are welcome. Thanks, Du, Xu -------------- next part -------------- A non-text attachment was scrubbed... Name: SWIFT-SDS-0000-D0.2_080406.pdf Type: application/pdf Size: 80731 bytes Desc: not available URL: From duxu at mcs.anl.gov Mon Apr 7 10:01:01 2008 From: duxu at mcs.anl.gov (Xu Du) Date: Mon, 7 Apr 2008 10:01:01 -0500 Subject: [Swift-devel] SWIFT INNOVATION FOR BOINC: Weekly Report Mar.31-Apr.6 References: <00ec01c89361$e21c5ec0$6a08dd8c@karen> <47F22DCE.5070405@mcs.anl.gov> Message-ID: <005301c898c0$32052580$9a01a8c0@karen> Dear Mike, The following is the weekly report for Mar.31-Apr.6. Any suggestions and comments are welcome. Regards, Xu -------------------------------------------------------------------------------------- Weekly Report Mar.31-Apr.6 Done: 1. Traced the source code of the CoG Kit and Swift and worked out how Swift operates. The 'Boinc provider' can now be found by Swift, but it still does not work well with Swift. 2. The Swift adaptor can now handle jobs submitted simultaneously, even when they ask the same application to execute with different input files or arguments. Issues: 1. Regarding the 'Boinc provider' not working well with Swift: since the 'Boinc provider' uses the same mechanism as the 'ssh provider', we tried the 'ssh provider' and hit the same problem. The user cannot log in to the SSH server via Swift; the SSH server receives only the header of the authentication message, and then the connection is lost. To Do: 1. Solve the 'ssh' problem, finish the prototype, and then test the whole system. 2. Make the program able to handle job state query requests from the BOINC provider. From hategan at mcs.anl.gov Mon Apr 7 19:22:33 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Apr 2008 19:22:33 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <47FA6564.60008@cs.uchicago.edu> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> <47FA151D.1080504@mcs.anl.gov> <1207572335.27797.5.camel@blabla.mcs.anl.gov> <1207573709.27797.16.camel@blabla.mcs.anl.gov> <47FA6564.60008@cs.uchicago.edu> Message-ID: <1207614154.6864.0.camel@blabla.mcs.anl.gov> Do you tweak the TCP window size or do you use the default? On Mon, 2008-04-07 at 13:18 -0500, Ioan Raicu wrote: > I agree that the BG/P is the only system I can think of right now that > won't work with the UDP scheme you currently have, assuming that you > will run the service on a login node that has access to both compute > nodes and external world (i.e. Swift).
The compute nodes don't > support Java, so you'd have to have some C/Fortran code, or maybe some > scripting language (which I don't know what kind of support there is). > If you use C or Fortran, MPI becomes a viable alternative. TCP has > always been an alternative. Anyways, if UDP doesn't work on the BG/P, > and the BG/P is the only scale large enough (today) that warrants a > connectionless protocol, then I suggest you switch to TCP (which has > worked for us well on the BG/P, and is general enough to work in most > environments) or even MPI (but you loose the generality of TCP, but > might gain performance). > > Ioan > > Mihael Hategan wrote: > > On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote: > > > > > Wary of excessive optimisation of job completion notification speed in > > > order to get high 'trivial/useless job' numbers, when there also seem to > > > be problems getting shared filesystem access fast enough for non-useless > > > jobs. Getting a ridiculously high trivial job throughput is not (in my > > > eyes) a design goal of this coaster work. > > > > > > > 200 j/s should be enough for anybody. > > > > Joking aside, the issue was ability to scale to large number of jobs > > rather than speed. But it looks like the issue is only an issue for > > monsters such as the BG/P. > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > From benc at hawaga.org.uk Tue Apr 8 02:54:55 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 8 Apr 2008 07:54:55 +0000 (GMT) Subject: [Swift-devel] Swift Innovation for BOINC : Design Spec In-Reply-To: <004601c898bf$c6335ed0$9a01a8c0@karen> References: <00ec01c89361$e21c5ec0$6a08dd8c@karen> <47F22DCE.5070405@mcs.anl.gov> <004601c898bf$c6335ed0$9a01a8c0@karen> Message-ID: Hi. When running a program in Swift, there is a requirement that the input files for a job are placed in the current working directory that the unix process runs on. Usually, that is achieved by using a shared filesystem between every worker node. But, I think that in a BOINC deployment, there will not be a shared filesystem that is shared between every worker node. I see that you intend to do 'file transfer' with BOINC, but it is not clear to me how those files will be connected with the jobs that want to use them. Is this addressed in your design? 
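As a concrete illustration of the requirement described above, here is a sketch of what "inputs in the job's working directory" means mechanically when there is no shared filesystem. This is not Swift's or BOINC's actual mechanism; stage_in(), the jobs/ directory layout, and the file names are invented for the example, and in a real deployment the transfer step would be GridFTP, BOINC's own file distribution, or whatever the provider offers.

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Placeholder: a real provider would fetch the file from wherever the
   submit side staged it. */
static void stage_in(const char *logical_name, const char *jobdir)
{
    printf("would fetch %s into %s/\n", logical_name, jobdir);
}

static int prepare_job_dir(const char *jobid, const char *inputs[], int ninputs)
{
    char jobdir[256];
    snprintf(jobdir, sizeof(jobdir), "jobs/%s", jobid);

    if (mkdir("jobs", 0755) != 0 && errno != EEXIST)
        return -1;
    if (mkdir(jobdir, 0755) != 0 && errno != EEXIST)
        return -1;

    /* The inputs must end up here, next to where the application will run,
       because there is no shared filesystem for the job to find them on. */
    for (int i = 0; i < ninputs; i++)
        stage_in(inputs[i], jobdir);

    /* The wrapper would then run the application with jobdir as its
       working directory and stage output files back out afterwards. */
    return 0;
}

int main(void)
{
    const char *inputs[] = { "058269.in", "058269.mol2" };   /* example names */
    return prepare_job_dir("job-0001", inputs, 2) == 0 ? 0 : 1;
}

The open design question is therefore how the per-job list of input files gets attached to the BOINC task so that this staging can happen on the client side.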
-- From benc at hawaga.org.uk Tue Apr 8 03:03:55 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 8 Apr 2008 08:03:55 +0000 (GMT) Subject: [Swift-devel] Problem calling atomic procedure with multiple args via Falkon In-Reply-To: <47FA6515.3060000@cs.uchicago.edu> References: <47F976DB.2060500@mcs.anl.gov> <47FA6515.3060000@cs.uchicago.edu> Message-ID: On Mon, 7 Apr 2008, Ioan Raicu wrote: > > The C worker isn't portable enough to build on my laptop so I can't easily > > play there, > The C worker is quite basic, what error do you get that it doesn't compile? > It has compiled for me on numerous platforms as is, so if its something we > need to fix in general to help it be more portable, let us know. $ ./make.worker-c.sh Compiling C Executor BGexec.c: In function 'set_sockopt': BGexec.c:48: error: 'SOL_TCP' undeclared (first use in this function) BGexec.c:48: error: (Each undeclared identifier is reported only once BGexec.c:48: error: for each function it appears in.) BGexec.c:48: error: 'TCP_KEEPCNT' undeclared (first use in this function) BGexec.c:53: error: 'TCP_KEEPIDLE' undeclared (first use in this function) BGexec.c:58: error: 'TCP_KEEPINTVL' undeclared (first use in this function) $ uname -a Darwin soju.hawaga.org.uk 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386 If/when you rearrange the source code so it can be easily checked out, you can have multi-platform testing of this on a bunch of platforms in NMI build-and-test. -- From benc at hawaga.org.uk Tue Apr 8 05:01:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 8 Apr 2008 10:01:04 +0000 (GMT) Subject: [Swift-devel] cog r1957 breaks swift ftp usage (when port not specified?) Message-ID: CoG r1957 appears to break handling of gsiftp URLs specified in the Swift site catalog. All of the site tests in https://svn.ci.uchicago.edu/svn/vdl2/trunk/tests/sites are configured that way and are broken by r1957. 
When I apply svn diff -r 1957:1956 to my checkout, things work better (I ran the site tests on a few of the sites, but not all as I got tired of waiting) First part of stack trace is: Using sites file: ../sites/tp-fork-gram2-ftpport.xml Using tc.data: ../sites/tc.data For input string: "" For input string: "" task:service @ vdl-sc.k, line: 23 sys:if @ vdl-sc.k, line: 21 gridftp @ tp-fork-gram2-ftpport.xml, line: 4 pool @ tp-fork-gram2-ftpport.xml, line: 4 pool @ tp-fork-gram2-ftpport.xml, line: 4 org.globus.cog.karajan.workflow.nodes.Sequential @ tp-fork-gram2-ftpport.xml sys:executefile @ vdl-sc.k, line: 59 task:resources @ vdl-sc.k, line: 59 vdl:sitecatalog @ scheduler.xml, line: 42 task:scheduler @ scheduler.xml, line: 27 kernel:import @ scheduler.xml, line: 3 kernel:project @ 061-cattwo.kml, line: 2 061-cattwo-20080408-1100-eibujbf8 Caused by: java.lang.NumberFormatException: For input string: "" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Integer.parseInt(Integer.java:468) at java.lang.Integer.parseInt(Integer.java:497) at org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.parse(ServiceContactImpl.java:81) at org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.(ServiceContactImpl.java:26) at org.globus.cog.karajan.workflow.nodes.grid.ServiceNode.function(ServiceNode.java:123) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) -- From hategan at mcs.anl.gov Tue Apr 8 06:10:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 08 Apr 2008 06:10:03 -0500 Subject: [Swift-devel] Re: cog r1957 breaks swift ftp usage (when port not specified?) In-Reply-To: References: Message-ID: <1207653003.7262.0.camel@blabla.mcs.anl.gov> Grr! Fix coming up. On Tue, 2008-04-08 at 10:01 +0000, Ben Clifford wrote: > CoG r1957 appears to break handling of gsiftp URLs specified in the Swift > site catalog. > > All of the site tests in > https://svn.ci.uchicago.edu/svn/vdl2/trunk/tests/sites are configured that > way and are broken by r1957. 
> > When I apply svn diff -r 1957:1956 to my checkout, things work better (I > ran the site tests on a few of the sites, but not all as I got tired of > waiting) > > First part of stack trace is: > > Using sites file: ../sites/tp-fork-gram2-ftpport.xml > Using tc.data: ../sites/tc.data > For input string: "" > For input string: "" > task:service @ vdl-sc.k, line: 23 > sys:if @ vdl-sc.k, line: 21 > gridftp @ tp-fork-gram2-ftpport.xml, line: 4 > pool @ tp-fork-gram2-ftpport.xml, line: 4 > pool @ tp-fork-gram2-ftpport.xml, line: 4 > org.globus.cog.karajan.workflow.nodes.Sequential @ > tp-fork-gram2-ftpport.xml > sys:executefile @ vdl-sc.k, line: 59 > task:resources @ vdl-sc.k, line: 59 > vdl:sitecatalog @ scheduler.xml, line: 42 > task:scheduler @ scheduler.xml, line: 27 > kernel:import @ scheduler.xml, line: 3 > kernel:project @ 061-cattwo.kml, line: 2 > 061-cattwo-20080408-1100-eibujbf8 > Caused by: java.lang.NumberFormatException: For input string: "" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) > at java.lang.Integer.parseInt(Integer.java:468) > at java.lang.Integer.parseInt(Integer.java:497) > at > org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.parse(ServiceContactImpl.java:81) > at > org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.(ServiceContactImpl.java:26) > at > org.globus.cog.karajan.workflow.nodes.grid.ServiceNode.function(ServiceNode.java:123) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > > From hategan at mcs.anl.gov Tue Apr 8 06:21:15 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 08 Apr 2008 06:21:15 -0500 Subject: [Swift-devel] Re: cog r1957 breaks swift ftp usage (when port not specified?) In-Reply-To: <1207653003.7262.0.camel@blabla.mcs.anl.gov> References: <1207653003.7262.0.camel@blabla.mcs.anl.gov> Message-ID: <1207653675.7262.2.camel@blabla.mcs.anl.gov> Ok. Try r1962. On Tue, 2008-04-08 at 06:10 -0500, Mihael Hategan wrote: > Grr! > Fix coming up. > > On Tue, 2008-04-08 at 10:01 +0000, Ben Clifford wrote: > > CoG r1957 appears to break handling of gsiftp URLs specified in the Swift > > site catalog. > > > > All of the site tests in > > https://svn.ci.uchicago.edu/svn/vdl2/trunk/tests/sites are configured that > > way and are broken by r1957. 
> > > > When I apply svn diff -r 1957:1956 to my checkout, things work better (I > > ran the site tests on a few of the sites, but not all as I got tired of > > waiting) > > > > First part of stack trace is: > > > > Using sites file: ../sites/tp-fork-gram2-ftpport.xml > > Using tc.data: ../sites/tc.data > > For input string: "" > > For input string: "" > > task:service @ vdl-sc.k, line: 23 > > sys:if @ vdl-sc.k, line: 21 > > gridftp @ tp-fork-gram2-ftpport.xml, line: 4 > > pool @ tp-fork-gram2-ftpport.xml, line: 4 > > pool @ tp-fork-gram2-ftpport.xml, line: 4 > > org.globus.cog.karajan.workflow.nodes.Sequential @ > > tp-fork-gram2-ftpport.xml > > sys:executefile @ vdl-sc.k, line: 59 > > task:resources @ vdl-sc.k, line: 59 > > vdl:sitecatalog @ scheduler.xml, line: 42 > > task:scheduler @ scheduler.xml, line: 27 > > kernel:import @ scheduler.xml, line: 3 > > kernel:project @ 061-cattwo.kml, line: 2 > > 061-cattwo-20080408-1100-eibujbf8 > > Caused by: java.lang.NumberFormatException: For input string: "" > > at > > java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) > > at java.lang.Integer.parseInt(Integer.java:468) > > at java.lang.Integer.parseInt(Integer.java:497) > > at > > org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.parse(ServiceContactImpl.java:81) > > at > > org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.(ServiceContactImpl.java:26) > > at > > org.globus.cog.karajan.workflow.nodes.grid.ServiceNode.function(ServiceNode.java:123) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > > at > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Apr 8 06:40:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 8 Apr 2008 11:40:52 +0000 (GMT) Subject: [Swift-devel] Re: cog r1957 breaks swift ftp usage (when port not specified?) In-Reply-To: <1207653675.7262.2.camel@blabla.mcs.anl.gov> References: <1207653003.7262.0.camel@blabla.mcs.anl.gov> <1207653675.7262.2.camel@blabla.mcs.anl.gov> Message-ID: Seems to work better. On Tue, 8 Apr 2008, Mihael Hategan wrote: > Ok. Try r1962. > > On Tue, 2008-04-08 at 06:10 -0500, Mihael Hategan wrote: > > Grr! > > Fix coming up. > > > > On Tue, 2008-04-08 at 10:01 +0000, Ben Clifford wrote: > > > CoG r1957 appears to break handling of gsiftp URLs specified in the Swift > > > site catalog. -- From iraicu at cs.uchicago.edu Tue Apr 8 08:29:35 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 08 Apr 2008 08:29:35 -0500 Subject: [Swift-devel] Problem calling atomic procedure with multiple args via Falkon In-Reply-To: References: <47F976DB.2060500@mcs.anl.gov> <47FA6515.3060000@cs.uchicago.edu> Message-ID: <47FB733F.2060200@cs.uchicago.edu> Thanks for the error output.... they seem to be TCP related variables that are not found, so I assume that we have to include additional header files for your platform to ensure that it finds these variables. We'll track it down and fix it! 
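For reference, a hedged sketch of what the extra guarding might look like: SOL_TCP is a Linux-only alias for IPPROTO_TCP, and TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are not defined on every platform (older OS X, for example, typically exposes only TCP_KEEPALIVE for the idle time). The function below is illustrative, not the actual BGexec.c fix, and the option values are placeholders.

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Sketch: portable keepalive setup for an already-connected TCP socket fd.
   Each option is guarded because availability differs by platform. */
static int set_keepalive(int fd, int idle_secs, int interval_secs, int count)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
#ifdef TCP_KEEPIDLE                /* Linux: idle seconds before the first probe */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs));
#elif defined(TCP_KEEPALIVE)       /* some BSD-derived systems: same idea, different name */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPALIVE, &idle_secs, sizeof(idle_secs));
#endif
#ifdef TCP_KEEPINTVL               /* seconds between probes, where available */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs));
#endif
#ifdef TCP_KEEPCNT                 /* failed probes before the connection drops, where available */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
#endif
    return 0;
}

With something like this, picking an idle time well under a site firewall's idle-connection cutoff keeps long-idle worker connections from being silently dropped, while the file still compiles on platforms that lack the finer-grained options.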
http://bugzilla.globus.org/globus/show_bug.cgi?id=5990 Ioan Ben Clifford wrote: > On Mon, 7 Apr 2008, Ioan Raicu wrote: > > >>> The C worker isn't portable enough to build on my laptop so I can't easily >>> play there, >>> > > >> The C worker is quite basic, what error do you get that it doesn't compile? >> It has compiled for me on numerous platforms as is, so if its something we >> need to fix in general to help it be more portable, let us know. >> > > $ ./make.worker-c.sh > Compiling C Executor > BGexec.c: In function 'set_sockopt': > BGexec.c:48: error: 'SOL_TCP' undeclared (first use in this function) > BGexec.c:48: error: (Each undeclared identifier is reported only once > BGexec.c:48: error: for each function it appears in.) > BGexec.c:48: error: 'TCP_KEEPCNT' undeclared (first use in this function) > BGexec.c:53: error: 'TCP_KEEPIDLE' undeclared (first use in this function) > BGexec.c:58: error: 'TCP_KEEPINTVL' undeclared (first use in this > function) > > $ uname -a > Darwin soju.hawaga.org.uk 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 > 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386 > > > If/when you rearrange the source code so it can be easily checked out, you > can have multi-platform testing of this on a bunch of platforms in NMI > build-and-test. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Tue Apr 8 08:31:59 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 08 Apr 2008 08:31:59 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <1207614154.6864.0.camel@blabla.mcs.anl.gov> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> <47FA151D.1080504@mcs.anl.gov> <1207572335.27797.5.camel@blabla.mcs.anl.gov> <1207573709.27797.16.camel@blabla.mcs.anl.gov> <47FA6564.60008@cs.uchicago.edu> <1207614154.6864.0.camel@blabla.mcs.anl.gov> Message-ID: <47FB73CF.4090509@cs.uchicago.edu> We use the default. For the SiCortex, we had to tweak the TCP keepalives to ensure that the TCP connections were not getting disconnected by the firewall on the SiCortex, which only allowed 180 seconds of inactivity before it disconnected connections. This meant that any job that took more than 180 seconds, or any Falkon idleness for more than 180 seconds resulted in TCP connection terminations. BTW, we did not experience this kind of firewall rules when running in other environments, so it took us a week to debug and find the root of the problem. This also happens because the Falkon service was running outside the SiCortex home network, but we had to do this as the SiCortex doesn't support Java, and at the time, didn't have access to any system within the internal network that supported Java. Ioan Mihael Hategan wrote: > Do you tweak the TCP window size or do you use the default? 
> > On Mon, 2008-04-07 at 13:18 -0500, Ioan Raicu wrote: > >> I agree that the BG/P is the only system I can think of right now that >> won't work with the UDP scheme you currently have, assuming that you >> will run the service on a login node that has access to both compute >> nodes and external world (i.e. Swift). The compute nodes don't >> support Java, so you'd have to have some C/Fortran code, or maybe some >> scripting language (which I don't know what kind of support there is). >> If you use C or Fortran, MPI becomes a viable alternative. TCP has >> always been an alternative. Anyways, if UDP doesn't work on the BG/P, >> and the BG/P is the only scale large enough (today) that warrants a >> connectionless protocol, then I suggest you switch to TCP (which has >> worked for us well on the BG/P, and is general enough to work in most >> environments) or even MPI (but you loose the generality of TCP, but >> might gain performance). >> >> Ioan >> >> Mihael Hategan wrote: >> >>> On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote: >>> >>> >>>> Wary of excessive optimisation of job completion notification speed in >>>> order to get high 'trivial/useless job' numbers, when there also seem to >>>> be problems getting shared filesystem access fast enough for non-useless >>>> jobs. Getting a ridiculously high trivial job throughput is not (in my >>>> eyes) a design goal of this coaster work. >>>> >>>> >>> 200 j/s should be enough for anybody. >>> >>> Joking aside, the issue was ability to scale to large number of jobs >>> rather than speed. But it looks like the issue is only an issue for >>> monsters such as the BG/P. >>> >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From bugzilla-daemon at mcs.anl.gov Tue Apr 8 09:09:34 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 8 Apr 2008 09:09:34 -0500 (CDT) Subject: [Swift-devel] [Bug 106] Improve error messages for double-set and un-set variables In-Reply-To: Message-ID: <20080408140934.9EDB6164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=106 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #1 from benc at hawaga.org.uk 2008-04-08 09:09 ------- t10 is fairly easy to fix (I have a patch that I think does this) t7 is harder. the variable m is marked as an input (because it is never assigned to). in the case of a file-mapped variable, that would mean the rest of execution would assume that the backing file exists at the start. In the case of a variable that is being used as an in-memory unmapped variable like m, then the present behaviour doesn't work. There perhaps need to be tighter constraints on when it is permissible to extract a value from a closed dataset in this situation. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Tue Apr 8 09:13:44 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 8 Apr 2008 09:13:44 -0500 (CDT) Subject: [Swift-devel] [Bug 76] disable intermediate stageout of data In-Reply-To: Message-ID: <20080408141344.5E10D164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=76 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |DUPLICATE ------- Comment #2 from benc at hawaga.org.uk 2008-04-08 09:13 ------- *** This bug has been marked as a duplicate of 29 *** -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Tue Apr 8 09:13:44 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 8 Apr 2008 09:13:44 -0500 (CDT) Subject: [Swift-devel] [Bug 29] Staging out of temporary files In-Reply-To: Message-ID: <20080408141344.A6EF416532@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |nefedova at mcs.anl.gov ------- Comment #2 from benc at hawaga.org.uk 2008-04-08 09:13 ------- *** Bug 76 has been marked as a duplicate of this bug. *** -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From bugzilla-daemon at mcs.anl.gov Tue Apr 8 09:13:45 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 8 Apr 2008 09:13:45 -0500 (CDT) Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: Message-ID: <20080408141345.09D5516562@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 Bug 72 depends on bug 76, which changed state. Bug 76 Summary: disable intermediate stageout of data http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=76 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |DUPLICATE -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Tue Apr 8 09:27:57 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 8 Apr 2008 09:27:57 -0500 (CDT) Subject: [Swift-devel] [Bug 32] Hello world gone wild In-Reply-To: Message-ID: <20080408142757.C9E211650A@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=32 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from benc at hawaga.org.uk 2008-04-08 09:27 ------- the mentioned jobmanager attribute was added in Swift 0.4 -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Tue Apr 8 09:29:15 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 8 Apr 2008 09:29:15 -0500 (CDT) Subject: [Swift-devel] [Bug 9] Limitation when abusing the submission rate In-Reply-To: Message-ID: <20080408142915.3A8D1164EC@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=9 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Tue Apr 8 09:38:48 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 8 Apr 2008 09:38:48 -0500 (CDT) Subject: [Swift-devel] [Bug 56] multiple variable definitions in a row hide previous ones rather than causing a syntax error. In-Reply-To: Message-ID: <20080408143848.8DABC164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=56 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from benc at hawaga.org.uk 2008-04-08 09:38 ------- A compile-time error for this was introduced in Swift 0.4. r1786 introduces a test for this bug based on the below code. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. 
From hategan at mcs.anl.gov Tue Apr 8 10:00:48 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 08 Apr 2008 10:00:48 -0500 Subject: [Swift-devel] coaster status summary In-Reply-To: <47FB73CF.4090509@cs.uchicago.edu> References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> <1207555192.22686.6.camel@blabla.mcs.anl.gov> <47FA151D.1080504@mcs.anl.gov> <1207572335.27797.5.camel@blabla.mcs.anl.gov> <1207573709.27797.16.camel@blabla.mcs.anl.gov> <47FA6564.60008@cs.uchicago.edu> <1207614154.6864.0.camel@blabla.mcs.anl.gov> <47FB73CF.4090509@cs.uchicago.edu> Message-ID: <1207666848.12977.5.camel@blabla.mcs.anl.gov> You may want to try lowering the window size. The default is in the order of 100K (as far as I understand from various sources). That may be quite a bit if you have many connections. It may also be fairly useless for local LAN connections used to send short messages (i.e. less than the MTU/MSS). On Tue, 2008-04-08 at 08:31 -0500, Ioan Raicu wrote: > We use the default. For the SiCortex, we had to tweak the TCP > keepalives to ensure that the TCP connections were not getting > disconnected by the firewall on the SiCortex, which only allowed 180 > seconds of inactivity before it disconnected connections. This meant > that any job that took more than 180 seconds, or any Falkon idleness for > more than 180 seconds resulted in TCP connection terminations. BTW, we > did not experience this kind of firewall rules when running in other > environments, so it took us a week to debug and find the root of the > problem. This also happens because the Falkon service was running > outside the SiCortex home network, but we had to do this as the SiCortex > doesn't support Java, and at the time, didn't have access to any system > within the internal network that supported Java. > > Ioan > > Mihael Hategan wrote: > > Do you tweak the TCP window size or do you use the default? > > > > On Mon, 2008-04-07 at 13:18 -0500, Ioan Raicu wrote: > > > >> I agree that the BG/P is the only system I can think of right now that > >> won't work with the UDP scheme you currently have, assuming that you > >> will run the service on a login node that has access to both compute > >> nodes and external world (i.e. Swift). The compute nodes don't > >> support Java, so you'd have to have some C/Fortran code, or maybe some > >> scripting language (which I don't know what kind of support there is). > >> If you use C or Fortran, MPI becomes a viable alternative. TCP has > >> always been an alternative. Anyways, if UDP doesn't work on the BG/P, > >> and the BG/P is the only scale large enough (today) that warrants a > >> connectionless protocol, then I suggest you switch to TCP (which has > >> worked for us well on the BG/P, and is general enough to work in most > >> environments) or even MPI (but you loose the generality of TCP, but > >> might gain performance). > >> > >> Ioan > >> > >> Mihael Hategan wrote: > >> > >>> On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote: > >>> > >>> > >>>> Wary of excessive optimisation of job completion notification speed in > >>>> order to get high 'trivial/useless job' numbers, when there also seem to > >>>> be problems getting shared filesystem access fast enough for non-useless > >>>> jobs. Getting a ridiculously high trivial job throughput is not (in my > >>>> eyes) a design goal of this coaster work. > >>>> > >>>> > >>> 200 j/s should be enough for anybody. > >>> > >>> Joking aside, the issue was ability to scale to large number of jobs > >>> rather than speed. 
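To make the window-size suggestion concrete on the C worker side, a sketch follows; the 16 KB figure is purely illustrative and the right value would need measurement. SO_SNDBUF/SO_RCVBUF are normally set before connect() or listen() so that the advertised window reflects them; for hundreds of mostly idle control connections exchanging sub-MSS messages, the default buffers mainly cost memory on the service host.

#include <sys/socket.h>
#include <netinet/in.h>

/* Sketch: shrink per-connection socket buffers for many small control
   messages.  Call on the socket before connect()/listen(). */
static int set_small_buffers(int fd)
{
    int bufsize = 16 * 1024;   /* illustrative, well below the ~100K defaults discussed */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
        return -1;
    return 0;
}

The trade-off is throughput on any connection that does move bulk data, so this only makes sense for the control channel, not for data staging.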
But it looks like the issue is only an issue for > >>> monsters such as the BG/P. > >>> > >>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > >>> > >> -- > >> =================================================== > >> Ioan Raicu > >> Ph.D. Candidate > >> =================================================== > >> Distributed Systems Laboratory > >> Computer Science Department > >> University of Chicago > >> 1100 E. 58th Street, Ryerson Hall > >> Chicago, IL 60637 > >> =================================================== > >> Email: iraicu at cs.uchicago.edu > >> Web: http://www.cs.uchicago.edu/~iraicu > >> http://dev.globus.org/wiki/Incubator/Falkon > >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > >> =================================================== > >> =================================================== > >> > >> > > > > > > > From duxu at mcs.anl.gov Tue Apr 8 11:37:29 2008 From: duxu at mcs.anl.gov (Xu Du) Date: Tue, 8 Apr 2008 11:37:29 -0500 Subject: [Swift-devel] Swift Innovation for BOINC : Design Spec References: <00ec01c89361$e21c5ec0$6a08dd8c@karen> <47F22DCE.5070405@mcs.anl.gov> <004601c898bf$c6335ed0$9a01a8c0@karen> Message-ID: <002401c89996$d600cd20$6a08dd8c@karen> Hi Ben, The objective of the project is to enable Swift to dispatch jobs to BOINC. Currently, BOINC and Swift are independent. BOINC has its own mechanism to manage jobs (tasks), and it does not use a shared file system. Simply speaking, when we submit a job to the BOINC server, a task is created and the related data (input files) are put into the BOINC database. In fact, the files (including the executable program and the data) are not changed during the so-called "file transfer"; all that happens is that the files are put into the BOINC DB and a new task is registered on the BOINC server. After a task is submitted, the BOINC server handles it; all BOINC clients connect to the BOINC server, and the way BOINC processes the task is transparent to Swift. Thanks, Xu ----- Original Message ----- From: "Ben Clifford" To: "Xu Du" Cc: "Michael Wilde" ; Sent: Tuesday, April 08, 2008 2:54 AM Subject: Re: [Swift-devel] Swift Innovation for BOINC : Design Spec > > Hi. > > When running a program in Swift, there is a requirement that the input > files for a job are placed in the current working directory that the unix > process runs on. > > Usually, that is achieved by using a shared filesystem between every > worker node. But, I think that in a BOINC deployment, there will not be a > shared filesystem that is shared between every worker node. > > I see that you intend to do 'file transfer' with BOINC, but it is not > clear to me how those files will be connected with the jobs that want to > use them. > > Is this addressed in your design? > > -- > > > From benc at hawaga.org.uk Wed Apr 9 04:58:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 9 Apr 2008 09:58:30 +0000 (GMT) Subject: [Swift-devel] swift 0.5rc2 Message-ID: release candidate 2 for swift 0.5 is available here: http://www.ci.uchicago.edu/~benc/vdsk-0.5rc2.tar.gz Please test. If no significant fixes required, I'll put it out at the weekend as final release.
-- From bugzilla-daemon at mcs.anl.gov Wed Apr 9 08:58:41 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 9 Apr 2008 08:58:41 -0500 (CDT) Subject: [Swift-devel] [Bug 106] Improve error messages for double-set and un-set variables In-Reply-To: Message-ID: <20080409135841.0975E164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=106 ------- Comment #2 from benc at hawaga.org.uk 2008-04-09 08:58 ------- r1785 adds multiple assignment detection (which I thought I'd put in previously, but apparently not); this addresses the first part of the bug report. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Wed Apr 9 10:08:24 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 9 Apr 2008 15:08:24 +0000 (GMT) Subject: [Swift-devel] how to put different wrapper behaviour into production Message-ID: Last week or so I made some patches that change wrapper.sh to copy lots of stuff around to a(n assumed) worker-local filesystem rather than using the shared filesystem. I don't particularly like this for general use - it means doing more steps, and more stuff to go wrong. Most especially, the worker-node-local info log files mean that if something goes wrong during execution (as often happens) there is a much greater level of difficulty in getting hold of those logs to debug. There are two paths that I see: i) add a swift runtime option that is passed to the wrapper, to select more-worker-node-local or less-worker-node-local behaviour; with one wrapper script able to function in both modes. or ii) allow the wrapper script to be specified as a runtime option; supply the standard wrapper script and the worker-node local script. Option i leads down a path of perhaps having lots of different options passed to the worker. This might be a good thing or might not. Option ii allows more open ended customisation of the wrapper scripts, but is likely to result in people keeping their own versions of the wrapper script around which will quickly stagnate and cause problems when they try to use. I'm somewhat inclined towards option ii. -- From bugzilla-daemon at mcs.anl.gov Wed Apr 9 10:26:05 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 9 Apr 2008 10:26:05 -0500 (CDT) Subject: [Swift-devel] [Bug 40] source location indication in execution-time error messages In-Reply-To: Message-ID: <20080409152605.228AD164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=40 ------- Comment #2 from benc at hawaga.org.uk 2008-04-09 10:26 ------- In Swift 0.4, better compile-time line number handling was added, as was better compile time error checking. Whilst not directly addressing this bug, many situations where this was a problem before are now caught by the compile time changes. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at mcs.anl.gov Wed Apr 9 10:29:38 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 9 Apr 2008 10:29:38 -0500 (CDT) Subject: [Swift-devel] [Bug 42] paper(s) on Swift web not externally readable In-Reply-To: Message-ID: <20080409152938.8B393164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=42 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from benc at hawaga.org.uk 2008-04-09 10:29 ------- both of the mentioned papers are in the CI Swift WWW space and are accessible (to me). -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Wed Apr 9 10:46:45 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 9 Apr 2008 10:46:45 -0500 (CDT) Subject: [Swift-devel] [Bug 132] New: order of stdin and stdout on app commandline can cause XML validation exceptions Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=132 Summary: order of stdin and stdout on app commandline can cause XML validation exceptions Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk The order in which stdin and stdout (and presumably stderr) elements are placed in the intermediate XML appears to be the same as the order in which they appear in source text. However, only one order is valid according to the XML schema. This validates: echo "hello" stdin=@filename(t) stdout=@filename(q); This does not: echo "hello" stdout=@filename(q) stdin=@filename(t); -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Wed Apr 9 12:13:42 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 9 Apr 2008 12:13:42 -0500 (CDT) Subject: [Swift-devel] [Bug 132] order of stdin and stdout on app commandline can cause XML validation exceptions In-Reply-To: Message-ID: <20080409171342.15F27164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=132 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from benc at hawaga.org.uk 2008-04-09 12:13 ------- this should be fixed in r1787 -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Wed Apr 9 16:19:04 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 9 Apr 2008 16:19:04 -0500 (CDT) Subject: [Swift-devel] [Bug 101] failure in site initialisation appears to cause job to fail rather than be retried elsewhere. 
In-Reply-To: Message-ID: <20080409211904.29AFE164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101 ------- Comment #2 from benc at hawaga.org.uk 2008-04-09 16:19 ------- There's another example of this in ccf-perm-wf-20080409-1511-kz872673.log Looks like permission error on one site means that it becomes available for use again rapidly, whilst the other sites (3 of them) are occupied running jobs successfully. So a failed job is retried on the only free resource, the broken one, over and over until it fails. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Thu Apr 10 03:24:18 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 10 Apr 2008 03:24:18 -0500 (CDT) Subject: [Swift-devel] [Bug 101] failure in site initialisation appears to cause job to fail rather than be retried elsewhere. In-Reply-To: Message-ID: <20080410082418.15FFA164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|failure in site |failure in site |initialisation appears to |initialisation appears to |cause job to fail rather |cause job to fail rather |than be retried elsewhere. |than be retried elsewhere. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Thu Apr 10 03:25:24 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 10 Apr 2008 03:25:24 -0500 (CDT) Subject: [Swift-devel] [Bug 101] failure in site initialisation appears to cause job to fail rather than be retried elsewhere. In-Reply-To: Message-ID: <20080410082524.7A30E164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|failure in site |failure in site |initialisation appears to |initialisation appears to |cause job to fail rather |cause job to fail rather |than be retried elsewhere. |than be retried elsewhere. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From wilde at mcs.anl.gov Thu Apr 10 07:31:51 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 10 Apr 2008 07:31:51 -0500 Subject: [Swift-devel] SSH support In-Reply-To: <1195854124.12780.7.camel@blabla.mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> Message-ID: <47FE08B7.9070606@mcs.anl.gov> I just tried this for the first time and I cant get it to work, Mihael. Can you take a look? 
I get these errors: 2008-04-09 23:04:47,221-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1207800286502) setting status to Failed org.globus.cog.abstraction.impl.file.FileResourceException: Error while communi\ cating with the SSH server on login.ci.uchicago.edu:22 Could not initialize shared directory on login.ci ... Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error while communicating with the SSH server on login.ci.uchicago.edu:22 ... Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: org.globus.cog.abstraction.impl.file.FileResourceException: Error while communicating with the SSH server on login.ci.uchica\ go.edu:22 Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error while communicating with the SSH server on login.ci.uchicago.edu:22 Caused by: java.lang.NullPointerException All the related files and logs are in ~benc/swift/logs/wilde/run354 Im running swift on terminable, with a 1-job test workflow to login.ci. I created a new rsa key, with a passphrase, and added it to authorized-keys. I tested the key and can manually ssh to login.ci from terminable with it, and verified the passphrase. (see file keytest) Also, once we have this working, can I eliminate the passphrase from auth.defaults if I use an agent? Thanks, - Mike On 11/23/07 3:42 PM, Mihael Hategan wrote: > I've updated the SSH provider in cog to do a few things: > - make better use of connections (cache them). SSH has this nifty thing: > On one connection you can configure multiple independent channels > (OpenSSH servers seem to support up to 10 such channels per connection). > With this you get up to 10 independent shells without authenticating > again. > - access remote filesystems (a file op provider) with SFTP > - get default authentication information from a file > (~/.ssh/auth.defaults). I attached a sample. I need to document this. > > I also added a filesystem element in the site catalog, which works in a > similar way to the execution element: > > storage="/homes/hategan/tmp" /> > > /homes/hategan/tmp > > > That basically allows Swift to work with SSH. > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Thu Apr 10 07:55:06 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 10 Apr 2008 07:55:06 -0500 Subject: [Swift-devel] SSH support In-Reply-To: <47FE08B7.9070606@mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <47FE08B7.9070606@mcs.anl.gov> Message-ID: <47FE0E2A.4000404@mcs.anl.gov> Ooops, I have a typo in my sites file - I fixed it but must have saved into wrong place. Let me re-test before you look into this. Sorry. - Mike On 4/10/08 7:31 AM, Michael Wilde wrote: > I just tried this for the first time and I cant get it to work, Mihael. > Can you take a look? > > I get these errors: > > 2008-04-09 23:04:47,221-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1207800286502) setting status to Failed > org.globus.cog.abstraction.impl.file.FileResourceException: Error while > communi\ > cating with the SSH server on login.ci.uchicago.edu:22 > Could not initialize shared directory on login.ci > ... > Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: > Error while communicating with the SSH server on login.ci.uchicago.edu:22 > ... 
> Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > org.globus.cog.abstraction.impl.file.FileResourceException: Error while > communicating with the SSH server on login.ci.uchica\ > go.edu:22 > Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: > Error while communicating with the SSH server on login.ci.uchicago.edu:22 > Caused by: java.lang.NullPointerException > > > All the related files and logs are in ~benc/swift/logs/wilde/run354 > > Im running swift on terminable, with a 1-job test workflow to login.ci. > > I created a new rsa key, with a passphrase, and added it to > authorized-keys. I tested the key and can manually ssh to login.ci from > terminable with it, and verified the passphrase. (see file keytest) > > Also, once we have this working, can I eliminate the passphrase from > auth.defaults if I use an agent? > > Thanks, > > - Mike > > > > On 11/23/07 3:42 PM, Mihael Hategan wrote: >> I've updated the SSH provider in cog to do a few things: >> - make better use of connections (cache them). SSH has this nifty thing: >> On one connection you can configure multiple independent channels >> (OpenSSH servers seem to support up to 10 such channels per connection). >> With this you get up to 10 independent shells without authenticating >> again. >> - access remote filesystems (a file op provider) with SFTP >> - get default authentication information from a file >> (~/.ssh/auth.defaults). I attached a sample. I need to document this. >> >> I also added a filesystem element in the site catalog, which works in a >> similar way to the execution element: >> >> > storage="/homes/hategan/tmp" /> >> >> /homes/hategan/tmp >> >> >> That basically allows Swift to work with SSH. >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Thu Apr 10 08:02:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 10 Apr 2008 08:02:53 -0500 Subject: [Swift-devel] SSH support In-Reply-To: <47FE08B7.9070606@mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <47FE08B7.9070606@mcs.anl.gov> Message-ID: <1207832573.26566.13.camel@blabla.mcs.anl.gov> On Thu, 2008-04-10 at 07:31 -0500, Michael Wilde wrote: > I just tried this for the first time and I cant get it to work, Mihael. > Can you take a look? > > I get these errors: > > 2008-04-09 23:04:47,221-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1207800286502) setting status to Failed > org.globus.cog.abstraction.impl.file.FileResourceException: Error while > communi\ > cating with the SSH server on login.ci.uchicago.edu:22 > Could not initialize shared directory on login.ci > ... > Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: > Error while communicating with the SSH server on login.ci.uchicago.edu:22 > ... > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > org.globus.cog.abstraction.impl.file.FileResourceException: Error while > communicating with the SSH server on login.ci.uchica\ > go.edu:22 > Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: > Error while communicating with the SSH server on login.ci.uchicago.edu:22 > Caused by: java.lang.NullPointerException > > > All the related files and logs are in ~benc/swift/logs/wilde/run354 I'll take a look. 
> > Im running swift on terminable, with a 1-job test workflow to login.ci. > > I created a new rsa key, with a passphrase, and added it to > authorized-keys. I tested the key and can manually ssh to login.ci from > terminable with it, and verified the passphrase. (see file keytest) > > Also, once we have this working, can I eliminate the passphrase from > auth.defaults if I use an agent? No. The agent is not supported. > > Thanks, > > - Mike > > > > On 11/23/07 3:42 PM, Mihael Hategan wrote: > > I've updated the SSH provider in cog to do a few things: > > - make better use of connections (cache them). SSH has this nifty thing: > > On one connection you can configure multiple independent channels > > (OpenSSH servers seem to support up to 10 such channels per connection). > > With this you get up to 10 independent shells without authenticating > > again. > > - access remote filesystems (a file op provider) with SFTP > > - get default authentication information from a file > > (~/.ssh/auth.defaults). I attached a sample. I need to document this. > > > > I also added a filesystem element in the site catalog, which works in a > > similar way to the execution element: > > > > > storage="/homes/hategan/tmp" /> > > > > /homes/hategan/tmp > > > > > > That basically allows Swift to work with SSH. > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Thu Apr 10 08:10:41 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 10 Apr 2008 08:10:41 -0500 Subject: [Swift-devel] SSH support In-Reply-To: <47FE0E2A.4000404@mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <47FE08B7.9070606@mcs.anl.gov> <47FE0E2A.4000404@mcs.anl.gov> Message-ID: <47FE11D1.8050408@mcs.anl.gov> Found and fixed a typo, and it works. Very nice! I'll use this to access the SiCortex. - Mike On 4/10/08 7:55 AM, Michael Wilde wrote: > Ooops, I have a typo in my sites file - I fixed it but must have saved > into wrong place. Let me re-test before you look into this. Sorry. > > - Mike > > > On 4/10/08 7:31 AM, Michael Wilde wrote: >> I just tried this for the first time and I cant get it to work, >> Mihael. Can you take a look? >> >> I get these errors: >> >> 2008-04-09 23:04:47,221-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, >> identity=urn:0-1207800286502) setting status to Failed >> org.globus.cog.abstraction.impl.file.FileResourceException: Error >> while communi\ >> cating with the SSH server on login.ci.uchicago.edu:22 >> Could not initialize shared directory on login.ci >> ... >> Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: >> Error while communicating with the SSH server on login.ci.uchicago.edu:22 >> ... >> Caused by: >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> org.globus.cog.abstraction.impl.file.FileResourceException: Error >> while communicating with the SSH server on login.ci.uchica\ >> go.edu:22 >> Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: >> Error while communicating with the SSH server on login.ci.uchicago.edu:22 >> Caused by: java.lang.NullPointerException >> >> >> All the related files and logs are in ~benc/swift/logs/wilde/run354 >> >> Im running swift on terminable, with a 1-job test workflow to login.ci. 
>> >> I created a new rsa key, with a passphrase, and added it to >> authorized-keys. I tested the key and can manually ssh to login.ci >> from terminable with it, and verified the passphrase. (see file keytest) >> >> Also, once we have this working, can I eliminate the passphrase from >> auth.defaults if I use an agent? >> >> Thanks, >> >> - Mike >> >> >> >> On 11/23/07 3:42 PM, Mihael Hategan wrote: >>> I've updated the SSH provider in cog to do a few things: >>> - make better use of connections (cache them). SSH has this nifty thing: >>> On one connection you can configure multiple independent channels >>> (OpenSSH servers seem to support up to 10 such channels per connection). >>> With this you get up to 10 independent shells without authenticating >>> again. >>> - access remote filesystems (a file op provider) with SFTP >>> - get default authentication information from a file >>> (~/.ssh/auth.defaults). I attached a sample. I need to document this. >>> >>> I also added a filesystem element in the site catalog, which works in a >>> similar way to the execution element: >>> >>> >> storage="/homes/hategan/tmp" /> >>> >>> /homes/hategan/tmp >>> >>> >>> That basically allows Swift to work with SSH. >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > From benc at hawaga.org.uk Fri Apr 11 08:43:28 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 11 Apr 2008 13:43:28 +0000 (GMT) Subject: [Swift-devel] fast-failing jobs Message-ID: bug 101 discusses a class of site-selection failures that look like this: two (or more) sites: site G works site F fails all jobs submitted to it, very rapidly. Submit 10 non-trivial jobs for scheduling. At present, the minimum number of simultaneous jobs that will be sent to a site is 2. Two jobs go to site G, and occupy it (for eg 20 minutes); two jobs go to site F and fail (within eg. 10 seconds). two more jobs go to site F and fail (within eg 10 seconds). All jobs apart from the two jobs that went to site G are repeatedly submitted to site F and fail, exhausting all their retries and causing a workflow failure. One approach to stopping this is to slow down submission to poorly scoring sites. However, in this case, the delay between submissions would need to be on the scale of minutes .. tens of minutes to avoid this. However, the delay needs to be on roughly the same scale as the length of a job, which varies widely depending on usage (some people are putting through half hour jobs, some people put through jobs that are a few seconds long). That seems difficult to determine at startup. It seems undesirable to block a site from execution entirely based on poor performance because much can change over the duration of a long run (working sites break and non-working sites unbreak). Related to the need for job execution length information here is stuff we've talked about in the past where jobs should be unselected/relaunched at a different site if they take 'too long', where 'too long' is determined based perhaps on some statistical analysis of other jobs that have executed successfully. 
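For reference, the knobs that exist today on the submit side are the throttle properties in swift.properties; manually reining in a known-flaky site might look roughly like the fragment below. The values are illustrative guesses rather than recommendations, and none of this solves the general problem described above, since the scheduler still sends a minimum of two jobs to every site regardless of score.

    # illustrative swift.properties fragment; values are guesses, not tuned recommendations
    throttle.score.job.factor=1   # a site's falling score now buys it fewer concurrent jobs
    throttle.host.submit=2        # at most 2 concurrent job submissions per site
    throttle.submit=4             # overall cap on concurrent job submissions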
-- From iraicu at cs.uchicago.edu Fri Apr 11 09:11:42 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 11 Apr 2008 09:11:42 -0500 Subject: [Swift-devel] fast-failing jobs In-Reply-To: References: Message-ID: <47FF719E.2040404@cs.uchicago.edu> We addressed this in Falkon by suspending bad nodes (within Falkon). About trying to solve the problem in general, here is an idea. The retry counter is on a per site basis. Lets assume the max retry is set to 3, and we have 4 sites, of which 3 are broken (fail fast, seconds per job), and only 1 site is good (computes for minutes per job). Assuming we have 10 jobs in total to do, within 1 minute, all 10 jobs will have failed 3 times per site, and the only site left that could potentially run these 10 jobs is the 4th site that is working at a few minutes per job. Now, the 3 sites that are bad aren't penalized in any way, if there are jobs that have not run there yet and failed, then they will be tried... This sounds like it would fix your problem, but, I am not sure how easy it is to keep track of the retry per site, and only fail a job if it has failed the max number of times at all sites! Ioan Ben Clifford wrote: > bug 101 discusses a class of site-selection failures that look like this: > > two (or more) sites: > site G works > site F fails all jobs submitted to it, very rapidly. > > Submit 10 non-trivial jobs for scheduling. At present, the minimum number > of simultaneous jobs that will be sent to a site is 2. Two jobs go to site > G, and occupy it (for eg 20 minutes); two jobs go to site F and fail > (within eg. 10 seconds). two more jobs go to site F and fail (within eg 10 > seconds). All jobs apart from the two jobs that went to site G are > repeatedly submitted to site F and fail, exhausting all their retries and > causing a workflow failure. > > One approach to stopping this is to slow down submission to poorly scoring > sites. However, in this case, the delay between submissions would need to > be on the scale of minutes .. tens of minutes to avoid this. > > However, the delay needs to be on roughly the same scale as the length of > a job, which varies widely depending on usage (some people are putting > through half hour jobs, some people put through jobs that are a few > seconds long). That seems difficult to determine at startup. > > It seems undesirable to block a site from execution entirely based on poor > performance because much can change over the duration of a long run > (working sites break and non-working sites unbreak). > > Related to the need for job execution length information here is stuff > we've talked about in the past where jobs should be unselected/relaunched > at a different site if they take 'too long', where 'too long' is > determined based perhaps on some statistical analysis of other jobs that > have executed successfully. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Sat Apr 12 08:45:19 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 12 Apr 2008 08:45:19 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> Message-ID: <4800BCEF.5040206@mcs.anl.gov> Ben, can you confirm this: to turn off all job submission throttling (for Falkon), the correct setting for each of the following props is "off"? # remove all limits on job submit rates throttle.submit=off throttle.host.submit=off throttle.score.job.factor=off Long ago (circa Nov) it seemed "off" didn't give the the wide-open throttle effect I was looking for, but "off" is a more clear setting than "big numbers" if we know it to be working as expected. - Mike On 4/12/08 5:22 AM, Ben Clifford wrote: > On Sat, 12 Apr 2008, Zhao Zhang wrote: > >> i) what files are shared (so swift will only stage-in once) >> how many files per job and how big in total per job? >> >> >> I am not familiar with definition of "stage-in". Mike, could you help to >> explain this? > > When a job runs in Swift, it looks like this: > > stage in the input files - copy the input files from where they are > stored to where they are needed for execution > execute (using falkon in this case) > stage out the output files > > The execute stage is what falkon is involved in; there are other > mechanisms to move files around into the appropraite > > So really I want to know what are the input files to your jobs. > >> on GPFS of Blue Gene > > If everything is on the same filesystem on bluegene, an interesting idea > springs to mind of potentially only symlinking the input files rather than > copying them around. > > I can have a play with that next week perhaps. > From hategan at mcs.anl.gov Sat Apr 12 13:08:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 12 Apr 2008 13:08:43 -0500 Subject: [Swift-devel] fast-failing jobs In-Reply-To: References: Message-ID: <1208023723.2963.13.camel@blabla.mcs.anl.gov> On Fri, 2008-04-11 at 13:43 +0000, Ben Clifford wrote: > bug 101 discusses a class of site-selection failures that look like this: > > two (or more) sites: > site G works > site F fails all jobs submitted to it, very rapidly. > > Submit 10 non-trivial jobs for scheduling. At present, the minimum number > of simultaneous jobs that will be sent to a site is 2. Two jobs go to site > G, and occupy it (for eg 20 minutes); two jobs go to site F and fail > (within eg. 10 seconds). two more jobs go to site F and fail (within eg 10 > seconds). 
All jobs apart from the two jobs that went to site G are > repeatedly submitted to site F and fail, exhausting all their retries and > causing a workflow failure. > > One approach to stopping this is to slow down submission to poorly scoring > sites. However, in this case, the delay between submissions would need to > be on the scale of minutes .. tens of minutes to avoid this. > > However, the delay needs to be on roughly the same scale as the length of > a job, which varies widely depending on usage (some people are putting > through half hour jobs, some people put through jobs that are a few > seconds long). That's pretty much what a low score does if there's throttling based on score. Perhaps our solution is to have a low job throttle and a higher score range (i.e. T=1000 instead of 100). That or we could enforce a submission rate (j/s) based on score. > That seems difficult to determine at startup. That, again, is in the nature of things. A good score approximation is difficult to determine at startup. From hategan at mcs.anl.gov Sat Apr 12 13:11:38 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 12 Apr 2008 13:11:38 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <4800BCEF.5040206@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> Message-ID: <1208023898.2963.17.camel@blabla.mcs.anl.gov> On Sat, 2008-04-12 at 08:45 -0500, Michael Wilde wrote: > Ben, can you confirm this: to turn off all job submission throttling > (for Falkon), the correct setting for each of the following props is "off"? > > # remove all limits on job submit rates > throttle.submit=off > throttle.host.submit=off > throttle.score.job.factor=off > > Long ago (circa Nov) it seemed "off" didn't give the the wide-open > throttle effect I was looking for, but "off" is a more clear setting > than "big numbers" if we know it to be working as expected. If "off" doesn't work as expected we should figure out why, not invent another "off". So please use "off" and report any problems. > > - Mike > > > > On 4/12/08 5:22 AM, Ben Clifford wrote: > > On Sat, 12 Apr 2008, Zhao Zhang wrote: > > > >> i) what files are shared (so swift will only stage-in once) > >> how many files per job and how big in total per job? > >> > >> > >> I am not familiar with definition of "stage-in". Mike, could you help to > >> explain this? > > > > When a job runs in Swift, it looks like this: > > > > stage in the input files - copy the input files from where they are > > stored to where they are needed for execution > > execute (using falkon in this case) > > stage out the output files > > > > The execute stage is what falkon is involved in; there are other > > mechanisms to move files around into the appropraite > > > > So really I want to know what are the input files to your jobs. 
> > > >> on GPFS of Blue Gene > > > > If everything is on the same filesystem on bluegene, an interesting idea > > springs to mind of potentially only symlinking the input files rather than > > copying them around. > > > > I can have a play with that next week perhaps. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From zhaozhang at uchicago.edu Sat Apr 12 16:46:18 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sat, 12 Apr 2008 16:46:18 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <4800BCEF.5040206@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> Message-ID: <48012DAA.2000308@uchicago.edu> Hi, Ioan Check the log file at BGP, /home/falkon/DOCK_swift+falkon/DOCK_swift+falkon_4x512_6084_2008.04.12_15.38.31 I ran 6084 DOCK tasks, and it indeed runs on 2048 cores. zhao Michael Wilde wrote: > Ben, can you confirm this: to turn off all job submission throttling > (for Falkon), the correct setting for each of the following props is > "off"? > > # remove all limits on job submit rates > throttle.submit=off > throttle.host.submit=off > throttle.score.job.factor=off > > Long ago (circa Nov) it seemed "off" didn't give the the wide-open > throttle effect I was looking for, but "off" is a more clear setting > than "big numbers" if we know it to be working as expected. > > - Mike > > > > On 4/12/08 5:22 AM, Ben Clifford wrote: >> On Sat, 12 Apr 2008, Zhao Zhang wrote: >> >>> i) what files are shared (so swift will only stage-in once) >>> how many files per job and how big in total per job? >>> >>> I am not familiar with definition of "stage-in". Mike, could you >>> help to >>> explain this? >> >> When a job runs in Swift, it looks like this: >> >> stage in the input files - copy the input files from where they are >> stored to where they are needed for execution >> execute (using falkon in this case) >> stage out the output files >> >> The execute stage is what falkon is involved in; there are other >> mechanisms to move files around into the appropraite >> >> So really I want to know what are the input files to your jobs. >> >>> on GPFS of Blue Gene >> >> If everything is on the same filesystem on bluegene, an interesting >> idea springs to mind of potentially only symlinking the input files >> rather than copying them around. >> >> I can have a play with that next week perhaps. 
>> > From zhaozhang at uchicago.edu Sat Apr 12 16:50:33 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sat, 12 Apr 2008 16:50:33 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <4800BCEF.5040206@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> Message-ID: <48012EA9.1060102@uchicago.edu> Hi, Ben I got a log file of 6084 successful runs on BGP. Check it here, terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log zhao Michael Wilde wrote: > Ben, can you confirm this: to turn off all job submission throttling > (for Falkon), the correct setting for each of the following props is > "off"? > > # remove all limits on job submit rates > throttle.submit=off > throttle.host.submit=off > throttle.score.job.factor=off > > Long ago (circa Nov) it seemed "off" didn't give the the wide-open > throttle effect I was looking for, but "off" is a more clear setting > than "big numbers" if we know it to be working as expected. > > - Mike > > > > On 4/12/08 5:22 AM, Ben Clifford wrote: >> On Sat, 12 Apr 2008, Zhao Zhang wrote: >> >>> i) what files are shared (so swift will only stage-in once) >>> how many files per job and how big in total per job? >>> >>> I am not familiar with definition of "stage-in". Mike, could you >>> help to >>> explain this? >> >> When a job runs in Swift, it looks like this: >> >> stage in the input files - copy the input files from where they are >> stored to where they are needed for execution >> execute (using falkon in this case) >> stage out the output files >> >> The execute stage is what falkon is involved in; there are other >> mechanisms to move files around into the appropraite >> >> So really I want to know what are the input files to your jobs. >> >>> on GPFS of Blue Gene >> >> If everything is on the same filesystem on bluegene, an interesting >> idea springs to mind of potentially only symlinking the input files >> rather than copying them around. >> >> I can have a play with that next week perhaps. 
>> > From benc at hawaga.org.uk Sun Apr 13 09:52:15 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 13 Apr 2008 14:52:15 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <4800BCEF.5040206@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> Message-ID: On Sat, 12 Apr 2008, Michael Wilde wrote: > Long ago (circa Nov) it seemed "off" didn't give the the wide-open throttle > effect I was looking for, but "off" is a more clear setting than "big numbers" > if we know it to be working as expected. If that was the angle stuff that we were doing for SC, a lot of the time (or all of the time?) that was constrained by stagein speed, not by job submission speed. -- From benc at hawaga.org.uk Sun Apr 13 11:19:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 13 Apr 2008 16:19:04 +0000 (GMT) Subject: [Swift-devel] fast-failing jobs In-Reply-To: <1208023723.2963.13.camel@blabla.mcs.anl.gov> References: <1208023723.2963.13.camel@blabla.mcs.anl.gov> Message-ID: On Sat, 12 Apr 2008, Mihael Hategan wrote: > That's pretty much what a low score does if there's throttling based on > score. Perhaps our solution is to have a low job throttle and a higher > score range (i.e. T=1000 instead of 100). The present scoring system won't ever go below 2 jobs per site, so pretty much whatever the parameters are tweaked to, a fast-fail site will eat 2 jobs per fast-fail cycle. > That or we could enforce a submission rate (j/s) based on score. That is perhaps better. It would make lower scores more punitive than at the moment, which may be a problem given the way that in certain other failure modes the score gets reduced catastrophically. (eg a transient problem where all jobs fail that are in progress, with a large number of jobs in progress - this was why I put that lower bound in on the scores - to prevent the score actually getting really low) -- From hategan at mcs.anl.gov Sun Apr 13 11:22:40 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 13 Apr 2008 11:22:40 -0500 Subject: [Swift-devel] fast-failing jobs In-Reply-To: References: <1208023723.2963.13.camel@blabla.mcs.anl.gov> Message-ID: <1208103760.15803.1.camel@blabla.mcs.anl.gov> On Sun, 2008-04-13 at 16:19 +0000, Ben Clifford wrote: > On Sat, 12 Apr 2008, Mihael Hategan wrote: > > > That's pretty much what a low score does if there's throttling based on > > score. Perhaps our solution is to have a low job throttle and a higher > > score range (i.e. T=1000 instead of 100). > > The present scoring system won't ever go below 2 jobs per site, so pretty > much whatever the parameters are tweaked to, a fast-fail site will eat 2 > jobs per fast-fail cycle. That being one thing that probably should be changed. > > > That or we could enforce a submission rate (j/s) based on score. > > That is perhaps better. 
> > It would make lower scores more punitive than at the moment, which may be > a problem given the way that in certain other failure modes the score gets > reduced catastrophically. (eg a transient problem where all jobs fail that > are in progress, with a large number of jobs in progress - this was why I > put that lower bound in on the scores - to prevent the score actually > getting really low) > From benc at hawaga.org.uk Sun Apr 13 11:37:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 13 Apr 2008 16:37:40 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48012EA9.1060102@uchicago.edu> References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> Message-ID: On Sat, 12 Apr 2008, Zhao Zhang wrote: > Hi, Ben > > I got a log file of 6084 successful runs on BGP. Check it here, > terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log This one runs better - it gets up to a peak of 5000 jobs submitted into Falkon simultaneously, and spends a considerable amount of time over the 2048 level that I suppose is what you need to be over to get all 2048 CPUs used. There's a lot of stage-in activity that probably could be eliminated / changed for the single-filesytem case. -- From wilde at mcs.anl.gov Sun Apr 13 14:17:56 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 13 Apr 2008 14:17:56 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> Message-ID: <48025C64.9020502@mcs.anl.gov> Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log) Here's a high-level summary of this run: Swift end 16:42:17 Swift start 16:09:07 Runtime 33:10 = 1990 seconds 2048 cores Total app wall time = 1190260 seconds 1190260 / ( 1990 * 2048 ) = .29 efficiency Once stage-ins start to complete, are the corresponding jobs initiated quickly, or is Swift doing mostly stage-ins for some period? Zhao indicated he saw data indicating there was about a 700 second lag from workflow start time till the first Falkon jobs started, if I understood correctly. Do the graphs confirm this or say something different? 
If the 700-second delay figure is true, and stage-in was eliminated by copying input files right to the /tmp workdir rather than first to /shared, then we'd have: 1190260 / ( 1290 * 2048 ) = .45 efficiency A good gain, but only partway to a number that looks good. I assume we're paying the same staging price on the output side? What I think we learned from the MARS app run, which had no input data and only tiny output data files (10 bytes vs 10K bytes), was that the optimized wrapper achieved somewhere between .7 to .8 efficiency. I'd like to look at whatever data we can get from this or similar subsequent runs to learn what steps we could take next to increase the efficiency metric. Guidance welcome. Thanks, Mike On 4/13/08 11:37 AM, Ben Clifford wrote: > On Sat, 12 Apr 2008, Zhao Zhang wrote: > >> Hi, Ben >> >> I got a log file of 6084 successful runs on BGP. Check it here, >> terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log > > This one runs better - it gets up to a peak of 5000 jobs submitted into > Falkon simultaneously, and spends a considerable amount of time over the > 2048 level that I suppose is what you need to be over to get all 2048 CPUs > used. > > There's a lot of stage-in activity that probably could be eliminated / > changed for the single-filesytem case. > - Mike From benc at hawaga.org.uk Sun Apr 13 14:43:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 13 Apr 2008 19:43:45 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48012EA9.1060102@uchicago.edu> References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> Message-ID: > I got a log file of 6084 successful runs on BGP. Check it here, > terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log ok. It is useful if you also supply the -info files if there is doubt about how time is being spent on the worker nodes (which I think there is some of). -- From benc at hawaga.org.uk Sun Apr 13 14:57:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 13 Apr 2008 19:57:06 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48025C64.9020502@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> Message-ID: > Ben, can you point me to the graphs for this run? 
(Zhao's *99cy0z4g.log) http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g > Once stage-ins start to complete, are the corresponding jobs initiated > quickly, or is Swift doing mostly stage-ins for some period? In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to falkon) pretty much right as the corresponding stagein completes. I have no deeper information about when the worker actually starts to run. > Zhao indicated he saw data indicating there was about a 700 second lag from > workflow start time till the first Falkon jobs started, if I understood > correctly. Do the graphs confirm this or say something different? There is a period of about 500s or so until stuff starts to happen; I haven't looked at it. That is before stage-ins start too, though, which means that i think this... > If the 700-second delay figure is true, and stage-in was eliminated by copying > input files right to the /tmp workdir rather than first to /shared, then we'd > have: > > 1190260 / ( 1290 * 2048 ) = .45 efficiency calculation is not meaningful. I have not looked at what is going on during that 500s startup time, but I plan to. > I assume we're paying the same staging price on the output side? not really - the output stageouts go very fast, and also because job ending is staggered, they don't happen all at once. This is the same with most of the large runs I've seen (of any application) - stageout tends not to be a problem (or at least, no where near the problems of stagein). All stageins happen over a period t=400 to t=1100 fairly smoothly. There's rate limiting still on file operations (100 max) and file transfers (2000 max) which is being hit still. I think there's two directions to proceed in here that make sense for actual use on single clusters running falkon (rather than trying to cut out stuff randomly to push up numbers): i) use some of the data placement features in falkon, rather than Swift's relatively simple data management that was designed more for running on the grid. ii) do stage-ins using symlinks rather than file copying. this makes sense when everything is living in a single filesystem, which again is not what Swift's data management was originally optimised for. I think option ii) is substantially easier to implement (on the order of days) and is generally useful in the single-cluster, local-source-data situation that appears to be what people want to do for running on the BG/P and scicortex (that is, pretty much ignoring anything grid-like at all). Option i) is much harder (on the order of months), needing a very different interface between Swift and Falkon than exists at the moment. 
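A very rough sketch of what option ii could look like on the worker side, assuming GNU coreutils and that the job's shared directory and the source data sit on the same mount; the function name and the example paths are placeholders, not the actual wrapper.sh code:

    # link instead of copy when source and destination share a filesystem
    stagein() {
        local src="$1" dst="$2"
        if [ "$(stat -c %d "$src")" = "$(stat -c %d "$(dirname "$dst")")" ]; then
            ln -sf "$src" "$dst"    # same device number: a symlink is enough
        else
            cp "$src" "$dst"        # different filesystems: fall back to copying
        fi
    }

    # e.g. stagein /gpfs/home/someuser/input/mol0001.mol2 shared/mol0001.mol2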
-- From zhaozhang at uchicago.edu Sun Apr 13 15:03:19 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 13 Apr 2008 15:03:19 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> Message-ID: <48026707.4070709@uchicago.edu> An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Sun Apr 13 15:23:39 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 13 Apr 2008 15:23:39 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48025C64.9020502@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu> <47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> Message-ID: <48026BCB.8060108@cs.uchicago.edu> Sorry for being late to the party, putting out other fires :) Here is what Falkon logs say for this run: 2544.996 0 0 35 2048 2048 0 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 1536 1331 1536 2545.996 1 1 35 2048 2008 0 40 0 0 40 0 0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 1 0 99 1536 1322 1536 ... 3814.999 1 1 35 2048 2047 0 1 0 0 1 0 6083 0.0 6083 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 1536 1291 1536 3815.999 0 1 35 2048 2048 0 0 0 0 0 0 6084 1.0 6084 0 0 0 0 0 0 0 0 0.0 0.0 1 1 98 1536 1291 1536 At 2545.996, it was the first time that Swift sent anything... and at 3815, it was the time that the last job exit code was reported to Swift. So, runtime of 1270 seconds. BTW, time 0 in the log maps back to //0-time is 1208032721191ms Also, the total CPU time from Falkon's point of view (accurate to the ms), is 1914115.25 CPU seconds, not 1190260. So, by my numbers, I get: 1914115.25 / (1270 * 2048 ) = 0.735926446 This is already looking OK, isn't it? Now, this doesn't actually look at the efficiency of the app, as it scaled up, which we would have to do by either repeating the same workload on 1 node, or taking a small sample of the workload and running on 1 node to compare against. Ioan Michael Wilde wrote: > Ben, can you point me to the graphs for this run? 
(Zhao's *99cy0z4g.log) > > Here's a high-level summary of this run: > > Swift end 16:42:17 > Swift start 16:09:07 > Runtime 33:10 = 1990 seconds > > 2048 cores > > Total app wall time = 1190260 seconds > > 1190260 / ( 1990 * 2048 ) = .29 efficiency > > Once stage-ins start to complete, are the corresponding jobs initiated > quickly, or is Swift doing mostly stage-ins for some period? > > Zhao indicated he saw data indicating there was about a 700 second lag > from workflow start time till the first Falkon jobs started, if I > understood correctly. Do the graphs confirm this or say something > different? > > If the 700-second delay figure is true, and stage-in was eliminated by > copying input files right to the /tmp workdir rather than first to > /shared, then we'd have: > > 1190260 / ( 1290 * 2048 ) = .45 efficiency > > A good gain, but only partway to a number that looks good. > > I assume we're paying the same staging price on the output side? > > What I think we learned from the MARS app run, which had no input data > and only tiny output data files (10 bytes vs 10K bytes), was that the > optimized wrapper achieved somewhere between .7 to .8 efficiency. > > I'd like to look at whatever data we can get from this or similar > subsequent runs to learn what steps we could take next to increase the > efficiency metric. Guidance welcome. > > Thanks, > > Mike > > > On 4/13/08 11:37 AM, Ben Clifford wrote: >> On Sat, 12 Apr 2008, Zhao Zhang wrote: >> >>> Hi, Ben >>> >>> I got a log file of 6084 successful runs on BGP. Check it here, >>> terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log >> >> This one runs better - it gets up to a peak of 5000 jobs submitted >> into Falkon simultaneously, and spends a considerable amount of time >> over the 2048 level that I suppose is what you need to be over to get >> all 2048 CPUs used. >> >> There's a lot of stage-in activity that probably could be eliminated >> / changed for the single-filesytem case. >> > > > - Mike > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Sun Apr 13 15:30:16 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 13 Apr 2008 20:30:16 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48026BCB.8060108@cs.uchicago.edu> References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026BCB.8060108@cs.uchicago.edu> Message-ID: > At 2545.996, it was the first time that Swift sent anything... [...] > So, runtime of 1270 seconds. There is a period of about 500s before Swift sends anything to falkon, though. -- From wilde at mcs.anl.gov Sun Apr 13 15:35:23 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 13 Apr 2008 15:35:23 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> Message-ID: <48026E8B.501@mcs.anl.gov> Ben, your analysis sounds very good. Some notes below, including questions for Zhao. On 4/13/08 2:57 PM, Ben Clifford wrote: > >> Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log) > > http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g > >> Once stage-ins start to complete, are the corresponding jobs initiated >> quickly, or is Swift doing mostly stage-ins for some period? > > In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to falkon) > pretty much right as the corresponding stagein completes. I have no deeper > information about when the worker actually starts to run. > >> Zhao indicated he saw data indicating there was about a 700 second lag from >> workflow start time till the first Falkon jobs started, if I understood >> correctly. Do the graphs confirm this or say something different? > > There is a period of about 500s or so until stuff starts to happen; I > haven't looked at it. That is before stage-ins start too, though, which > means that i think this... 
> >> If the 700-second delay figure is true, and stage-in was eliminated by copying >> input files right to the /tmp workdir rather than first to /shared, then we'd >> have: >> >> 1190260 / ( 1290 * 2048 ) = .45 efficiency > > calculation is not meaningful. > > I have not looked at what is going on during that 500s startup time, but I > plan to. Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper logging problem a few weeks ago. Could that cause such a delay, Ben? It would be very obvious in the swift log. > >> I assume we're paying the same staging price on the output side? > > not really - the output stageouts go very fast, and also because job > ending is staggered, they don't happen all at once. > > This is the same with most of the large runs I've seen (of any > application) - stageout tends not to be a problem (or at least, no where > near the problems of stagein). > > All stageins happen over a period t=400 to t=1100 fairly smoothly. There's > rate limiting still on file operations (100 max) and file transfers (2000 > max) which is being hit still. I thought Zhao set file operations throttle to 2000 as well. Sounds like we can test with the latter higher, and find out what's limiting the former. Zhao, what are your settings for property throttle.file.operations? I assume you have throttle.transfers set to 2000. If its set right, any chance that Swift or Karajan is limiting it somewhere? > > I think there's two directions to proceed in here that make sense for > actual use on single clusters running falkon (rather than trying to cut > out stuff randomly to push up numbers): > > i) use some of the data placement features in falkon, rather than Swift's > relatively simple data management that was designed more for running > on the grid. Long term: we should consider how the Coaster implementation could eventually do a similar data placement approach. In the meantime (mid term) examining what interface changes are needed for Falkon data placement might help prepare for that. Need to discuss if that would be a good step or not. > > ii) do stage-ins using symlinks rather than file copying. this makes > sense when everything is living in a single filesystem, which again > is not what Swift's data management was originally optimised for. I assume you mean symlinks from shared/ back to the user's input files? That sounds worth testing: find out if symlink creation is fast on NFS and GPFS. Is another approach to copy direct from the user's files to the /tmp workdir (ie wrapper.sh pulls the data in)? Measurement will tell if symlinks alone get adequate performance. Symlinks do seem an easier first step. > I think option ii) is substantially easier to implement (on the order of > days) and is generally useful in the single-cluster, local-source-data > situation that appears to be what people want to do for running on the > BG/P and scicortex (that is, pretty much ignoring anything grid-like at > all). Grid-like might mean pulling data to the /tmp workdir directly by the wrapper - but that seems like a harder step, and would need measurement and prototyping of such code before attempting. Data transfer clients that the wrapper script can count on might be an obstacle. > > Option i) is much harder (on the order of months), needing a very > different interface between Swift and Falkon than exists at the moment. 
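A quick way to settle the symlink question before touching wrapper.sh is to time a batch of plain copies against a batch of symlinks on the same shared filesystem. A rough sketch follows; the mount point, file size and count are placeholders rather than values taken from these runs.

    #!/bin/bash
    # Compare cp vs ln -s stage-in cost on a shared filesystem (GPFS or NFS).
    SHARED=${SHARED:-/path/on/gpfs}   # placeholder mount point
    N=${N:-1000}                      # number of stage-in operations to time
    mkdir -p "$SHARED/bench" && cd "$SHARED/bench" || exit 1
    dd if=/dev/zero of=sample.in bs=1K count=10 2>/dev/null   # small test file

    echo "== $N copies =="
    time for i in $(seq 1 "$N"); do cp sample.in "copy.$i"; done

    echo "== $N symlinks =="
    time for i in $(seq 1 "$N"); do ln -s "$PWD/sample.in" "link.$i"; done

    rm -f copy.* link.* sample.in

If symlink creation comes out no cheaper than a small copy on GPFS, option ii) would buy little here and the startup delay becomes the better target.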
> > > From zhaozhang at uchicago.edu Sun Apr 13 15:39:07 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 13 Apr 2008 15:39:07 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48026E8B.501@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> Message-ID: <48026F6B.9060300@uchicago.edu> An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Sun Apr 13 15:39:31 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 13 Apr 2008 15:39:31 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> Message-ID: <48026F83.4060507@cs.uchicago.edu> Ben Clifford wrote: > > I think there's two directions to proceed in here that make sense for > actual use on single clusters running falkon (rather than trying to cut > out stuff randomly to push up numbers): > > i) use some of the data placement features in falkon, rather than Swift's > relatively simple data management that was designed more for running > on the grid. > > ii) do stage-ins using symlinks rather than file copying. this makes > sense when everything is living in a single filesystem, which again > is not what Swift's data management was originally optimised for. > > I think option ii) is substantially easier to implement (on the order of > days) and is generally useful in the single-cluster, local-source-data > situation that appears to be what people want to do for running on the > BG/P and scicortex (that is, pretty much ignoring anything grid-like at > all). > I think this is worth a try, although it will probably be post SC (tomorrow night at midnight EST). > Option i) is much harder (on the order of months), needing a very > different interface between Swift and Falkon than exists at the moment. > I agree. I have another deadline on May 8th, but I think we can start the discussions in May, and hope to have some Swift apps running over the data diffusion mechanism in Falkon over the summer months. Ioan > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Sun Apr 13 15:43:46 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 13 Apr 2008 15:43:46 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026BCB.8060108@cs.uchicago.edu> Message-ID: <48027082.1050609@cs.uchicago.edu> There are 6K jobs... how long would it take Swift to unroll a for loop with 6K iterations, and prep things to start sending these jobs to the falkon provider? For example, doing a for loop with 6K sleeps in a for loop, how long does that take to start? In this case, its more like 6K jobs, but 12K input files, so perhaps the 12K input files and checking dependencies between them (although there are none to be found) is what takes some time. I vaguely remember doing 32K or 64K sleep jobs from Swift, and having it take at least minutes, maybe more to start up and have activity showing up in Falkon... Ioan Ben Clifford wrote: >> At 2545.996, it was the first time that Swift sent anything... >> > [...] > >> So, runtime of 1270 seconds. >> > > There is a period of about 500s before Swift sends anything to falkon, > though. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Sun Apr 13 15:51:21 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 13 Apr 2008 15:51:21 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48026F6B.9060300@uchicago.edu> References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! 
gov> <48026F6B.9060300@uchicago.ed u> Message-ID: <48027249.2070208@cs.uchicago.edu> But we have 2X input files as opposed to number of jobs and CPUs. We have 2048 CPUs, shouldn't we set all file I/O operations to at least 4096... and that means that files won't be ready for the next jobs once the first ones start completing... so we should really set things to twice that, so 8192 is the number I'd set on all file operations for this app on 2K CPUs. Ioan Zhao Zhang wrote: > Hi, Mike > > Michael Wilde wrote: >> Ben, your analysis sounds very good. Some notes below, including >> questions for Zhao. >> >> On 4/13/08 2:57 PM, Ben Clifford wrote: >>> >>>> Ben, can you point me to the graphs for this run? (Zhao's >>>> *99cy0z4g.log) >>> >>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g >>> >>>> Once stage-ins start to complete, are the corresponding jobs >>>> initiated quickly, or is Swift doing mostly stage-ins for some period? >>> >>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to >>> falkon) pretty much right as the corresponding stagein completes. I >>> have no deeper information about when the worker actually starts to >>> run. >>> >>>> Zhao indicated he saw data indicating there was about a 700 second >>>> lag from >>>> workflow start time till the first Falkon jobs started, if I >>>> understood >>>> correctly. Do the graphs confirm this or say something different? >>> >>> There is a period of about 500s or so until stuff starts to happen; >>> I haven't looked at it. That is before stage-ins start too, though, >>> which means that i think this... >>> >>>> If the 700-second delay figure is true, and stage-in was eliminated >>>> by copying >>>> input files right to the /tmp workdir rather than first to /shared, >>>> then we'd >>>> have: >>>> >>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency >>> >>> calculation is not meaningful. >>> >>> I have not looked at what is going on during that 500s startup time, >>> but I plan to. >> >> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper logging >> problem a few weeks ago. Could that cause such a delay, Ben? It would >> be very obvious in the swift log. > The version is Swift svn swift-r1780 cog-r1956 >> >>> >>>> I assume we're paying the same staging price on the output side? >>> >>> not really - the output stageouts go very fast, and also because job >>> ending is staggered, they don't happen all at once. >>> >>> This is the same with most of the large runs I've seen (of any >>> application) - stageout tends not to be a problem (or at least, no >>> where near the problems of stagein). >>> >>> All stageins happen over a period t=400 to t=1100 fairly smoothly. >>> There's rate limiting still on file operations (100 max) and file >>> transfers (2000 max) which is being hit still. >> >> I thought Zhao set file operations throttle to 2000 as well. Sounds >> like we can test with the latter higher, and find out what's limiting >> the former. >> >> Zhao, what are your settings for property throttle.file.operations? >> I assume you have throttle.transfers set to 2000. >> >> If its set right, any chance that Swift or Karajan is limiting it >> somewhere? 
> 2000 for sure, > throttle.submit=off > throttle.host.submit=off > throttle.score.job.factor=off > throttle.transfers=2000 > throttle.file.operation=2000 >>> >>> I think there's two directions to proceed in here that make sense >>> for actual use on single clusters running falkon (rather than trying >>> to cut out stuff randomly to push up numbers): >>> >>> i) use some of the data placement features in falkon, rather than >>> Swift's >>> relatively simple data management that was designed more for >>> running >>> on the grid. >> >> Long term: we should consider how the Coaster implementation could >> eventually do a similar data placement approach. In the meantime (mid >> term) examining what interface changes are needed for Falkon data >> placement might help prepare for that. Need to discuss if that would >> be a good step or not. >> >>> >>> ii) do stage-ins using symlinks rather than file copying. this makes >>> sense when everything is living in a single filesystem, which >>> again >>> is not what Swift's data management was originally optimised for. >> >> I assume you mean symlinks from shared/ back to the user's input files? >> >> That sounds worth testing: find out if symlink creation is fast on >> NFS and GPFS. >> >> Is another approach to copy direct from the user's files to the /tmp >> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if >> symlinks alone get adequate performance. Symlinks do seem an easier >> first step. >> >>> I think option ii) is substantially easier to implement (on the >>> order of days) and is generally useful in the single-cluster, >>> local-source-data situation that appears to be what people want to >>> do for running on the BG/P and scicortex (that is, pretty much >>> ignoring anything grid-like at all). >> >> Grid-like might mean pulling data to the /tmp workdir directly by the >> wrapper - but that seems like a harder step, and would need >> measurement and prototyping of such code before attempting. Data >> transfer clients that the wrapper script can count on might be an >> obstacle. >> >>> >>> Option i) is much harder (on the order of months), needing a very >>> different interface between Swift and Falkon than exists at the moment. >>> >>> >>> >> -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Sun Apr 13 15:58:12 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 13 Apr 2008 20:58:12 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> Message-ID: On Sun, 13 Apr 2008, Ben Clifford wrote: > There is a period of about 500s or so until stuff starts to happen; I > haven't looked at it. What happens in the log file for this period is lots of DSHandle creation (vdl:new) - it creates 115596 datasets in 451 seconds, which is about 256 per second. That seems quite a low rate. -- From wilde at mcs.anl.gov Sun Apr 13 16:52:56 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 13 Apr 2008 16:52:56 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48026F6B.9060300@uchicago.edu> References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du> Message-ID: <480280B8.9040605@mcs.anl.gov> >> If its set right, any chance that Swift or Karajan is limiting it >> somewhere? > 2000 for sure, > throttle.submit=off > throttle.host.submit=off > throttle.score.job.factor=off > throttle.transfers=2000 > throttle.file.operation=2000 Looks like a typo in your properties, Zhao - if the text above came from your swift.properties directly: throttle.file.operation=2000 vs operations with an s as per the properties doc: throttle.file.operations=8 #throttle.file.operations=off Which doesnt explain why we're seeing 100 when the default is 8 ??? - Mike On 4/13/08 3:39 PM, Zhao Zhang wrote: > Hi, Mike > > Michael Wilde wrote: >> Ben, your analysis sounds very good. Some notes below, including >> questions for Zhao. >> >> On 4/13/08 2:57 PM, Ben Clifford wrote: >>> >>>> Ben, can you point me to the graphs for this run? (Zhao's >>>> *99cy0z4g.log) >>> >>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g >>> >>>> Once stage-ins start to complete, are the corresponding jobs >>>> initiated quickly, or is Swift doing mostly stage-ins for some period? 
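One way to see how that burst of vdl:new events is spread over the 451 seconds is to bucket the log by second. This is only a sketch: it assumes each dataset-creation line contains the string vdl:new and starts with a log4j-style "date time,millis" prefix, so that the second whitespace-separated field is the time of day; adjust the field handling to whatever the log actually looks like.

    #!/bin/bash
    # Per-second count of dataset (DSHandle) creation events in a Swift log.
    LOG=${1:-dock2-20080412-1609-99cy0z4g.log}
    grep 'vdl:new' "$LOG" \
      | awk '{ split($2, t, ","); count[t[1]]++ }
             END { for (s in count) print s, count[s] }' \
      | sort

Summing the second column should land near the 115596 total; whether the per-second figure sits flat around 256 or spikes and stalls gives a hint as to where that startup time goes.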
>>> >>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to >>> falkon) pretty much right as the corresponding stagein completes. I >>> have no deeper information about when the worker actually starts to run. >>> >>>> Zhao indicated he saw data indicating there was about a 700 second >>>> lag from >>>> workflow start time till the first Falkon jobs started, if I understood >>>> correctly. Do the graphs confirm this or say something different? >>> >>> There is a period of about 500s or so until stuff starts to happen; I >>> haven't looked at it. That is before stage-ins start too, though, >>> which means that i think this... >>> >>>> If the 700-second delay figure is true, and stage-in was eliminated >>>> by copying >>>> input files right to the /tmp workdir rather than first to /shared, >>>> then we'd >>>> have: >>>> >>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency >>> >>> calculation is not meaningful. >>> >>> I have not looked at what is going on during that 500s startup time, >>> but I plan to. >> >> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper logging >> problem a few weeks ago. Could that cause such a delay, Ben? It would >> be very obvious in the swift log. > The version is Swift svn swift-r1780 cog-r1956 >> >>> >>>> I assume we're paying the same staging price on the output side? >>> >>> not really - the output stageouts go very fast, and also because job >>> ending is staggered, they don't happen all at once. >>> >>> This is the same with most of the large runs I've seen (of any >>> application) - stageout tends not to be a problem (or at least, no >>> where near the problems of stagein). >>> >>> All stageins happen over a period t=400 to t=1100 fairly smoothly. >>> There's rate limiting still on file operations (100 max) and file >>> transfers (2000 max) which is being hit still. >> >> I thought Zhao set file operations throttle to 2000 as well. Sounds >> like we can test with the latter higher, and find out what's limiting >> the former. >> >> Zhao, what are your settings for property throttle.file.operations? >> I assume you have throttle.transfers set to 2000. >> >> If its set right, any chance that Swift or Karajan is limiting it >> somewhere? > 2000 for sure, > throttle.submit=off > throttle.host.submit=off > throttle.score.job.factor=off > throttle.transfers=2000 > throttle.file.operation=2000 >>> >>> I think there's two directions to proceed in here that make sense for >>> actual use on single clusters running falkon (rather than trying to >>> cut out stuff randomly to push up numbers): >>> >>> i) use some of the data placement features in falkon, rather than >>> Swift's >>> relatively simple data management that was designed more for running >>> on the grid. >> >> Long term: we should consider how the Coaster implementation could >> eventually do a similar data placement approach. In the meantime (mid >> term) examining what interface changes are needed for Falkon data >> placement might help prepare for that. Need to discuss if that would >> be a good step or not. >> >>> >>> ii) do stage-ins using symlinks rather than file copying. this makes >>> sense when everything is living in a single filesystem, which again >>> is not what Swift's data management was originally optimised for. >> >> I assume you mean symlinks from shared/ back to the user's input files? >> >> That sounds worth testing: find out if symlink creation is fast on NFS >> and GPFS. 
>> >> Is another approach to copy direct from the user's files to the /tmp >> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if >> symlinks alone get adequate performance. Symlinks do seem an easier >> first step. >> >>> I think option ii) is substantially easier to implement (on the order >>> of days) and is generally useful in the single-cluster, >>> local-source-data situation that appears to be what people want to do >>> for running on the BG/P and scicortex (that is, pretty much ignoring >>> anything grid-like at all). >> >> Grid-like might mean pulling data to the /tmp workdir directly by the >> wrapper - but that seems like a harder step, and would need >> measurement and prototyping of such code before attempting. Data >> transfer clients that the wrapper script can count on might be an >> obstacle. >> >>> >>> Option i) is much harder (on the order of months), needing a very >>> different interface between Swift and Falkon than exists at the moment. >>> >>> >>> >> From wilde at mcs.anl.gov Sun Apr 13 17:31:19 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 13 Apr 2008 17:31:19 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> Message-ID: <480289B7.4050207@mcs.anl.gov> That might be a low rate, but its also not clear why its creating so many handles: I thought we only have about 6K jobs here, with 2 files in and 1 file out per job: doall(params pset[]) { foreach p in pset { DockIn ifile ; Mol2 mfile ; DockOut ofile ; ofile = dock(ifile, mfile); } } I would have expected more like 18,000 datasets. Are we calling the mapper incorrectly in this script? On 4/13/08 3:58 PM, Ben Clifford wrote: > On Sun, 13 Apr 2008, Ben Clifford wrote: > >> There is a period of about 500s or so until stuff starts to happen; I >> haven't looked at it. > > What happens in the log file for this period is lots of DSHandle creation > (vdl:new) - it creates 115596 datasets in 451 seconds, which is about 256 > per second. That seems quite a low rate. > From wilde at mcs.anl.gov Sun Apr 13 17:50:01 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 13 Apr 2008 17:50:01 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48027249.2070208@cs.uchicago.edu> References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.ed u> <48027249.2070208@cs.uchica! 
go.edu> Message-ID: <48028E19.7020400@mcs.anl.gov> Its not clear to me whats best here, for 3 reasons: 1) We should set file.transfers and file.operations to a value that prevents Swift from adversely impacting performance on shared resources. Since Swift must run on the login node and hits the shared cluster networks, we should test carefully. 2) Its not clear to me how many concurrent operations the login hosts can sustain before topping out, and how this number depends on file size. Do you know this from the GPFS benchmarks? And did you measure impact on system response during those benchmarks? I think that the overall system would top out well before 2000 concurrent transfers, but I could be wrong. Going much higher than the point where concurrency increases the data rate, it seems, would cause the rate to drop due to contention and context switching. 3) If I/O operations are fast compared to the job length and completion rate, you dont have to set these values to the same as the max number of input files that can be demanded at once. I think we want to set the I/O operation concurrency to a value that achieves the highest operation rate we can sustain while keeping overall system performance at some acceptable level (tbd). So first we need to find the concurrency level that maximizes ops/sec, (which may be filesize dependent) and then possibly back that off to reduce system impact. It seems to me that finding the right I/O concurrency setting is complex and non-obvious, and I'm interested in what Ben and Mihael suggest here. - Mike On 4/13/08 3:51 PM, Ioan Raicu wrote: > But we have 2X input files as opposed to number of jobs and CPUs. We > have 2048 CPUs, shouldn't we set all file I/O operations to at least > 4096... and that means that files won't be ready for the next jobs once > the first ones start completing... so we should really set things to > twice that, so 8192 is the number I'd set on all file operations for > this app on 2K CPUs. > Ioan > > Zhao Zhang wrote: >> Hi, Mike >> >> Michael Wilde wrote: >>> Ben, your analysis sounds very good. Some notes below, including >>> questions for Zhao. >>> >>> On 4/13/08 2:57 PM, Ben Clifford wrote: >>>> >>>>> Ben, can you point me to the graphs for this run? (Zhao's >>>>> *99cy0z4g.log) >>>> >>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g >>>> >>>>> Once stage-ins start to complete, are the corresponding jobs >>>>> initiated quickly, or is Swift doing mostly stage-ins for some period? >>>> >>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to >>>> falkon) pretty much right as the corresponding stagein completes. I >>>> have no deeper information about when the worker actually starts to >>>> run. >>>> >>>>> Zhao indicated he saw data indicating there was about a 700 second >>>>> lag from >>>>> workflow start time till the first Falkon jobs started, if I >>>>> understood >>>>> correctly. Do the graphs confirm this or say something different? >>>> >>>> There is a period of about 500s or so until stuff starts to happen; >>>> I haven't looked at it. That is before stage-ins start too, though, >>>> which means that i think this... >>>> >>>>> If the 700-second delay figure is true, and stage-in was eliminated >>>>> by copying >>>>> input files right to the /tmp workdir rather than first to /shared, >>>>> then we'd >>>>> have: >>>>> >>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency >>>> >>>> calculation is not meaningful. 
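The sweep described in point 3) can be approximated with a few lines of shell before involving Swift at all: push a fixed batch of copies through the shared filesystem at several concurrency levels and record the achieved rate. A sketch only; the path and file size are placeholders, each level deserves a few repetitions for stable numbers, and the highest settings are themselves a stress test of the login node.

    #!/bin/bash
    # Find roughly where ops/sec stops improving as concurrency grows.
    DIR=${DIR:-/path/on/gpfs/bench}; mkdir -p "$DIR"
    dd if=/dev/zero of="$DIR/input" bs=1K count=10 2>/dev/null
    TOTAL=2000                        # copies per concurrency level

    for conc in 8 16 32 64 100 256 512 1024 2000; do
      start=$(date +%s)
      seq 1 "$TOTAL" | xargs -P "$conc" -I{} cp "$DIR/input" "$DIR/out.{}"
      elapsed=$(( $(date +%s) - start )); [ "$elapsed" -eq 0 ] && elapsed=1
      echo "concurrency=$conc  ops/sec=$(( TOTAL / elapsed ))"
      rm -f "$DIR"/out.*
    done

Wherever the curve flattens is the natural candidate for throttle.file.operations and throttle.transfers, backed off if the login node becomes sluggish while the test runs.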
>>>> >>>> I have not looked at what is going on during that 500s startup time, >>>> but I plan to. >>> >>> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper logging >>> problem a few weeks ago. Could that cause such a delay, Ben? It would >>> be very obvious in the swift log. >> The version is Swift svn swift-r1780 cog-r1956 >>> >>>> >>>>> I assume we're paying the same staging price on the output side? >>>> >>>> not really - the output stageouts go very fast, and also because job >>>> ending is staggered, they don't happen all at once. >>>> >>>> This is the same with most of the large runs I've seen (of any >>>> application) - stageout tends not to be a problem (or at least, no >>>> where near the problems of stagein). >>>> >>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. >>>> There's rate limiting still on file operations (100 max) and file >>>> transfers (2000 max) which is being hit still. >>> >>> I thought Zhao set file operations throttle to 2000 as well. Sounds >>> like we can test with the latter higher, and find out what's limiting >>> the former. >>> >>> Zhao, what are your settings for property throttle.file.operations? >>> I assume you have throttle.transfers set to 2000. >>> >>> If its set right, any chance that Swift or Karajan is limiting it >>> somewhere? >> 2000 for sure, >> throttle.submit=off >> throttle.host.submit=off >> throttle.score.job.factor=off >> throttle.transfers=2000 >> throttle.file.operation=2000 >>>> >>>> I think there's two directions to proceed in here that make sense >>>> for actual use on single clusters running falkon (rather than trying >>>> to cut out stuff randomly to push up numbers): >>>> >>>> i) use some of the data placement features in falkon, rather than >>>> Swift's >>>> relatively simple data management that was designed more for >>>> running >>>> on the grid. >>> >>> Long term: we should consider how the Coaster implementation could >>> eventually do a similar data placement approach. In the meantime (mid >>> term) examining what interface changes are needed for Falkon data >>> placement might help prepare for that. Need to discuss if that would >>> be a good step or not. >>> >>>> >>>> ii) do stage-ins using symlinks rather than file copying. this makes >>>> sense when everything is living in a single filesystem, which >>>> again >>>> is not what Swift's data management was originally optimised for. >>> >>> I assume you mean symlinks from shared/ back to the user's input files? >>> >>> That sounds worth testing: find out if symlink creation is fast on >>> NFS and GPFS. >>> >>> Is another approach to copy direct from the user's files to the /tmp >>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if >>> symlinks alone get adequate performance. Symlinks do seem an easier >>> first step. >>> >>>> I think option ii) is substantially easier to implement (on the >>>> order of days) and is generally useful in the single-cluster, >>>> local-source-data situation that appears to be what people want to >>>> do for running on the BG/P and scicortex (that is, pretty much >>>> ignoring anything grid-like at all). >>> >>> Grid-like might mean pulling data to the /tmp workdir directly by the >>> wrapper - but that seems like a harder step, and would need >>> measurement and prototyping of such code before attempting. Data >>> transfer clients that the wrapper script can count on might be an >>> obstacle. 
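On a machine where the user's files and the compute node's /tmp are both visible to the wrapper, the "wrapper pulls its own inputs" variant needs no separate transfer client at all; plain cp is enough. A hypothetical sketch of that wrapper-side step follows; the variable names are illustrative and not Swift's actual wrapper.sh interface.

    #!/bin/bash
    # Hypothetical direct stage-in: copy inputs straight from the user's
    # source locations into the per-job /tmp workdir, bypassing shared/.
    WORKDIR=/tmp/$USER/job.$$
    mkdir -p "$WORKDIR" || exit 1
    for f in "$@"; do                 # absolute paths of this job's input files
      cp "$f" "$WORKDIR/" || exit 1
    done
    cd "$WORKDIR"
    # ... run the application here, then copy outputs back out ...

On a single-filesystem machine this would replace the two-step path described above (into shared/, then into the /tmp workdir) with a single copy, at the cost of doing that copy from the compute node rather than the submit side.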
>>> >>>> >>>> Option i) is much harder (on the order of months), needing a very >>>> different interface between Swift and Falkon than exists at the moment. >>>> >>>> >>>> >>> > From zhaozhang at uchicago.edu Sun Apr 13 17:58:55 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 13 Apr 2008 17:58:55 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <480280B8.9040605@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du> <480280B8.9040605@mcs.anl.! gov> Message-ID: <4802902F.7050704@uchicago.edu> Hi, Mike It is just a typo in the email. I my property file, it is "throttle.file.operations=2000". Thanks. zhao Michael Wilde wrote: > >> If its set right, any chance that Swift or Karajan is limiting it > >> somewhere? > > 2000 for sure, > > throttle.submit=off > > throttle.host.submit=off > > throttle.score.job.factor=off > > throttle.transfers=2000 > > throttle.file.operation=2000 > > > Looks like a typo in your properties, Zhao - if the text above came > from your swift.properties directly: > > throttle.file.operation=2000 > > vs operations with an s as per the properties doc: > > throttle.file.operations=8 > #throttle.file.operations=off > > Which doesnt explain why we're seeing 100 when the default is 8 ??? > > - Mike > > > On 4/13/08 3:39 PM, Zhao Zhang wrote: >> Hi, Mike >> >> Michael Wilde wrote: >>> Ben, your analysis sounds very good. Some notes below, including >>> questions for Zhao. >>> >>> On 4/13/08 2:57 PM, Ben Clifford wrote: >>>> >>>>> Ben, can you point me to the graphs for this run? (Zhao's >>>>> *99cy0z4g.log) >>>> >>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g >>>> >>>>> Once stage-ins start to complete, are the corresponding jobs >>>>> initiated quickly, or is Swift doing mostly stage-ins for some >>>>> period? >>>> >>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to >>>> falkon) pretty much right as the corresponding stagein completes. I >>>> have no deeper information about when the worker actually starts to >>>> run. >>>> >>>>> Zhao indicated he saw data indicating there was about a 700 second >>>>> lag from >>>>> workflow start time till the first Falkon jobs started, if I >>>>> understood >>>>> correctly. Do the graphs confirm this or say something different? >>>> >>>> There is a period of about 500s or so until stuff starts to happen; >>>> I haven't looked at it. That is before stage-ins start too, though, >>>> which means that i think this... >>>> >>>>> If the 700-second delay figure is true, and stage-in was >>>>> eliminated by copying >>>>> input files right to the /tmp workdir rather than first to >>>>> /shared, then we'd >>>>> have: >>>>> >>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency >>>> >>>> calculation is not meaningful. >>>> >>>> I have not looked at what is going on during that 500s startup >>>> time, but I plan to. >>> >>> Zhao, what SVN rev is your Swift at? 
Ben fixed an N^2 mapper >>> logging problem a few weeks ago. Could that cause such a delay, Ben? >>> It would be very obvious in the swift log. >> The version is Swift svn swift-r1780 cog-r1956 >>> >>>> >>>>> I assume we're paying the same staging price on the output side? >>>> >>>> not really - the output stageouts go very fast, and also because >>>> job ending is staggered, they don't happen all at once. >>>> >>>> This is the same with most of the large runs I've seen (of any >>>> application) - stageout tends not to be a problem (or at least, no >>>> where near the problems of stagein). >>>> >>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. >>>> There's rate limiting still on file operations (100 max) and file >>>> transfers (2000 max) which is being hit still. >>> >>> I thought Zhao set file operations throttle to 2000 as well. Sounds >>> like we can test with the latter higher, and find out what's >>> limiting the former. >>> >>> Zhao, what are your settings for property throttle.file.operations? >>> I assume you have throttle.transfers set to 2000. >>> >>> If its set right, any chance that Swift or Karajan is limiting it >>> somewhere? >> 2000 for sure, >> throttle.submit=off >> throttle.host.submit=off >> throttle.score.job.factor=off >> throttle.transfers=2000 >> throttle.file.operation=2000 >>>> >>>> I think there's two directions to proceed in here that make sense >>>> for actual use on single clusters running falkon (rather than >>>> trying to cut out stuff randomly to push up numbers): >>>> >>>> i) use some of the data placement features in falkon, rather than >>>> Swift's >>>> relatively simple data management that was designed more for >>>> running >>>> on the grid. >>> >>> Long term: we should consider how the Coaster implementation could >>> eventually do a similar data placement approach. In the meantime >>> (mid term) examining what interface changes are needed for Falkon >>> data placement might help prepare for that. Need to discuss if that >>> would be a good step or not. >>> >>>> >>>> ii) do stage-ins using symlinks rather than file copying. this makes >>>> sense when everything is living in a single filesystem, which >>>> again >>>> is not what Swift's data management was originally optimised for. >>> >>> I assume you mean symlinks from shared/ back to the user's input files? >>> >>> That sounds worth testing: find out if symlink creation is fast on >>> NFS and GPFS. >>> >>> Is another approach to copy direct from the user's files to the /tmp >>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if >>> symlinks alone get adequate performance. Symlinks do seem an easier >>> first step. >>> >>>> I think option ii) is substantially easier to implement (on the >>>> order of days) and is generally useful in the single-cluster, >>>> local-source-data situation that appears to be what people want to >>>> do for running on the BG/P and scicortex (that is, pretty much >>>> ignoring anything grid-like at all). >>> >>> Grid-like might mean pulling data to the /tmp workdir directly by >>> the wrapper - but that seems like a harder step, and would need >>> measurement and prototyping of such code before attempting. Data >>> transfer clients that the wrapper script can count on might be an >>> obstacle. >>> >>>> >>>> Option i) is much harder (on the order of months), needing a very >>>> different interface between Swift and Falkon than exists at the >>>> moment. 
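Since a misspelled property key is easy to miss and presumably just falls back to the built-in default, a quick check of what is actually set in the properties file the run reads avoids tuning the wrong knob. The key names below are the ones quoted earlier in the thread; the check itself is just grep.

    # List the throttle.* keys actually present (throttle.file.operation,
    # without the trailing "s", would not match the documented key).
    grep -n '^throttle\.' swift.properties

    # Flag any documented throttle key that is absent or misspelled.
    for key in throttle.submit throttle.host.submit throttle.score.job.factor \
               throttle.transfers throttle.file.operations; do
      grep -q "^$key=" swift.properties || echo "missing or misspelled: $key"
    done

That still leaves Mike's question open of why the effective limit looks like 100 when the documented default is 8.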
>>>> >>>> >>>> >>> > From hategan at mcs.anl.gov Sun Apr 13 17:06:11 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 13 Apr 2008 17:06:11 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48028E19.7020400@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.ed u> <48027249.2070208@cs.uchica! go.edu> <48028E19.7020400@mcs.anl.gov> Message-ID: <1208124371.3191.4.camel@blabla.mcs.anl.gov> On Sun, 2008-04-13 at 17:50 -0500, Michael Wilde wrote: > Its not clear to me whats best here, for 3 reasons: > > 1) We should set file.transfers and file.operations to a value that > prevents Swift from adversely impacting performance on shared resources. > > Since Swift must run on the login node and hits the shared cluster > networks, we should test carefully. > > 2) Its not clear to me how many concurrent operations the login hosts > can sustain before topping out, and how this number depends on file > size. Do you know this from the GPFS benchmarks? And did you measure > impact on system response during those benchmarks? > > I think that the overall system would top out well before 2000 > concurrent transfers, but I could be wrong. Going much higher than the > point where concurrency increases the data rate, it seems, would cause > the rate to drop due to contention and context switching. The number is probably in the 10-100 range. With 2000 it's somewhat likely that the transfers are long done before all the gridftp connections can be started. Mihael > > 3) If I/O operations are fast compared to the job length and completion > rate, you dont have to set these values to the same as the max number of > input files that can be demanded at once. > > I think we want to set the I/O operation concurrency to a value that > achieves the highest operation rate we can sustain while keeping overall > system performance at some acceptable level (tbd). > > So first we need to find the concurrency level that maximizes ops/sec, > (which may be filesize dependent) and then possibly back that off to > reduce system impact. > > It seems to me that finding the right I/O concurrency setting is complex > and non-obvious, and I'm interested in what Ben and Mihael suggest here. > > - Mike > > On 4/13/08 3:51 PM, Ioan Raicu wrote: > > But we have 2X input files as opposed to number of jobs and CPUs. We > > have 2048 CPUs, shouldn't we set all file I/O operations to at least > > 4096... and that means that files won't be ready for the next jobs once > > the first ones start completing... so we should really set things to > > twice that, so 8192 is the number I'd set on all file operations for > > this app on 2K CPUs. > > Ioan > > > > Zhao Zhang wrote: > >> Hi, Mike > >> > >> Michael Wilde wrote: > >>> Ben, your analysis sounds very good. Some notes below, including > >>> questions for Zhao. > >>> > >>> On 4/13/08 2:57 PM, Ben Clifford wrote: > >>>> > >>>>> Ben, can you point me to the graphs for this run? 
(Zhao's > >>>>> *99cy0z4g.log) > >>>> > >>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g > >>>> > >>>>> Once stage-ins start to complete, are the corresponding jobs > >>>>> initiated quickly, or is Swift doing mostly stage-ins for some period? > >>>> > >>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to > >>>> falkon) pretty much right as the corresponding stagein completes. I > >>>> have no deeper information about when the worker actually starts to > >>>> run. > >>>> > >>>>> Zhao indicated he saw data indicating there was about a 700 second > >>>>> lag from > >>>>> workflow start time till the first Falkon jobs started, if I > >>>>> understood > >>>>> correctly. Do the graphs confirm this or say something different? > >>>> > >>>> There is a period of about 500s or so until stuff starts to happen; > >>>> I haven't looked at it. That is before stage-ins start too, though, > >>>> which means that i think this... > >>>> > >>>>> If the 700-second delay figure is true, and stage-in was eliminated > >>>>> by copying > >>>>> input files right to the /tmp workdir rather than first to /shared, > >>>>> then we'd > >>>>> have: > >>>>> > >>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency > >>>> > >>>> calculation is not meaningful. > >>>> > >>>> I have not looked at what is going on during that 500s startup time, > >>>> but I plan to. > >>> > >>> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper logging > >>> problem a few weeks ago. Could that cause such a delay, Ben? It would > >>> be very obvious in the swift log. > >> The version is Swift svn swift-r1780 cog-r1956 > >>> > >>>> > >>>>> I assume we're paying the same staging price on the output side? > >>>> > >>>> not really - the output stageouts go very fast, and also because job > >>>> ending is staggered, they don't happen all at once. > >>>> > >>>> This is the same with most of the large runs I've seen (of any > >>>> application) - stageout tends not to be a problem (or at least, no > >>>> where near the problems of stagein). > >>>> > >>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. > >>>> There's rate limiting still on file operations (100 max) and file > >>>> transfers (2000 max) which is being hit still. > >>> > >>> I thought Zhao set file operations throttle to 2000 as well. Sounds > >>> like we can test with the latter higher, and find out what's limiting > >>> the former. > >>> > >>> Zhao, what are your settings for property throttle.file.operations? > >>> I assume you have throttle.transfers set to 2000. > >>> > >>> If its set right, any chance that Swift or Karajan is limiting it > >>> somewhere? > >> 2000 for sure, > >> throttle.submit=off > >> throttle.host.submit=off > >> throttle.score.job.factor=off > >> throttle.transfers=2000 > >> throttle.file.operation=2000 > >>>> > >>>> I think there's two directions to proceed in here that make sense > >>>> for actual use on single clusters running falkon (rather than trying > >>>> to cut out stuff randomly to push up numbers): > >>>> > >>>> i) use some of the data placement features in falkon, rather than > >>>> Swift's > >>>> relatively simple data management that was designed more for > >>>> running > >>>> on the grid. > >>> > >>> Long term: we should consider how the Coaster implementation could > >>> eventually do a similar data placement approach. In the meantime (mid > >>> term) examining what interface changes are needed for Falkon data > >>> placement might help prepare for that. 
Need to discuss if that would > >>> be a good step or not. > >>> > >>>> > >>>> ii) do stage-ins using symlinks rather than file copying. this makes > >>>> sense when everything is living in a single filesystem, which > >>>> again > >>>> is not what Swift's data management was originally optimised for. > >>> > >>> I assume you mean symlinks from shared/ back to the user's input files? > >>> > >>> That sounds worth testing: find out if symlink creation is fast on > >>> NFS and GPFS. > >>> > >>> Is another approach to copy direct from the user's files to the /tmp > >>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if > >>> symlinks alone get adequate performance. Symlinks do seem an easier > >>> first step. > >>> > >>>> I think option ii) is substantially easier to implement (on the > >>>> order of days) and is generally useful in the single-cluster, > >>>> local-source-data situation that appears to be what people want to > >>>> do for running on the BG/P and scicortex (that is, pretty much > >>>> ignoring anything grid-like at all). > >>> > >>> Grid-like might mean pulling data to the /tmp workdir directly by the > >>> wrapper - but that seems like a harder step, and would need > >>> measurement and prototyping of such code before attempting. Data > >>> transfer clients that the wrapper script can count on might be an > >>> obstacle. > >>> > >>>> > >>>> Option i) is much harder (on the order of months), needing a very > >>>> different interface between Swift and Falkon than exists at the moment. > >>>> > >>>> > >>>> > >>> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Sun Apr 13 17:09:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 13 Apr 2008 17:09:03 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <4802902F.7050704@uchicago.edu> References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du> <480280B8.9040605@mcs.anl.! gov> <4802902F.7050704@uchicago.edu> Message-ID: <1208124543.3191.7.camel@blabla.mcs.anl.gov> Then my guess is that the system itself (swift + server + FS) cannot sustain a much higher rate than 100 things per second. In principle setting those throttles to 2000 pretty much means that you're trying to start 2000 gridftp connections and hence 2000 gridftp processes on the server. On Sun, 2008-04-13 at 17:58 -0500, Zhao Zhang wrote: > Hi, Mike > > It is just a typo in the email. I my property file, it is > "throttle.file.operations=2000". Thanks. > > zhao > > Michael Wilde wrote: > > >> If its set right, any chance that Swift or Karajan is limiting it > > >> somewhere? 
> > > 2000 for sure, > > > throttle.submit=off > > > throttle.host.submit=off > > > throttle.score.job.factor=off > > > throttle.transfers=2000 > > > throttle.file.operation=2000 > > > > > > Looks like a typo in your properties, Zhao - if the text above came > > from your swift.properties directly: > > > > throttle.file.operation=2000 > > > > vs operations with an s as per the properties doc: > > > > throttle.file.operations=8 > > #throttle.file.operations=off > > > > Which doesnt explain why we're seeing 100 when the default is 8 ??? > > > > - Mike > > > > > > On 4/13/08 3:39 PM, Zhao Zhang wrote: > >> Hi, Mike > >> > >> Michael Wilde wrote: > >>> Ben, your analysis sounds very good. Some notes below, including > >>> questions for Zhao. > >>> > >>> On 4/13/08 2:57 PM, Ben Clifford wrote: > >>>> > >>>>> Ben, can you point me to the graphs for this run? (Zhao's > >>>>> *99cy0z4g.log) > >>>> > >>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g > >>>> > >>>>> Once stage-ins start to complete, are the corresponding jobs > >>>>> initiated quickly, or is Swift doing mostly stage-ins for some > >>>>> period? > >>>> > >>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to > >>>> falkon) pretty much right as the corresponding stagein completes. I > >>>> have no deeper information about when the worker actually starts to > >>>> run. > >>>> > >>>>> Zhao indicated he saw data indicating there was about a 700 second > >>>>> lag from > >>>>> workflow start time till the first Falkon jobs started, if I > >>>>> understood > >>>>> correctly. Do the graphs confirm this or say something different? > >>>> > >>>> There is a period of about 500s or so until stuff starts to happen; > >>>> I haven't looked at it. That is before stage-ins start too, though, > >>>> which means that i think this... > >>>> > >>>>> If the 700-second delay figure is true, and stage-in was > >>>>> eliminated by copying > >>>>> input files right to the /tmp workdir rather than first to > >>>>> /shared, then we'd > >>>>> have: > >>>>> > >>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency > >>>> > >>>> calculation is not meaningful. > >>>> > >>>> I have not looked at what is going on during that 500s startup > >>>> time, but I plan to. > >>> > >>> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper > >>> logging problem a few weeks ago. Could that cause such a delay, Ben? > >>> It would be very obvious in the swift log. > >> The version is Swift svn swift-r1780 cog-r1956 > >>> > >>>> > >>>>> I assume we're paying the same staging price on the output side? > >>>> > >>>> not really - the output stageouts go very fast, and also because > >>>> job ending is staggered, they don't happen all at once. > >>>> > >>>> This is the same with most of the large runs I've seen (of any > >>>> application) - stageout tends not to be a problem (or at least, no > >>>> where near the problems of stagein). > >>>> > >>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. > >>>> There's rate limiting still on file operations (100 max) and file > >>>> transfers (2000 max) which is being hit still. > >>> > >>> I thought Zhao set file operations throttle to 2000 as well. Sounds > >>> like we can test with the latter higher, and find out what's > >>> limiting the former. > >>> > >>> Zhao, what are your settings for property throttle.file.operations? > >>> I assume you have throttle.transfers set to 2000. > >>> > >>> If its set right, any chance that Swift or Karajan is limiting it > >>> somewhere? 
> >> 2000 for sure, > >> throttle.submit=off > >> throttle.host.submit=off > >> throttle.score.job.factor=off > >> throttle.transfers=2000 > >> throttle.file.operation=2000 > >>>> > >>>> I think there's two directions to proceed in here that make sense > >>>> for actual use on single clusters running falkon (rather than > >>>> trying to cut out stuff randomly to push up numbers): > >>>> > >>>> i) use some of the data placement features in falkon, rather than > >>>> Swift's > >>>> relatively simple data management that was designed more for > >>>> running > >>>> on the grid. > >>> > >>> Long term: we should consider how the Coaster implementation could > >>> eventually do a similar data placement approach. In the meantime > >>> (mid term) examining what interface changes are needed for Falkon > >>> data placement might help prepare for that. Need to discuss if that > >>> would be a good step or not. > >>> > >>>> > >>>> ii) do stage-ins using symlinks rather than file copying. this makes > >>>> sense when everything is living in a single filesystem, which > >>>> again > >>>> is not what Swift's data management was originally optimised for. > >>> > >>> I assume you mean symlinks from shared/ back to the user's input files? > >>> > >>> That sounds worth testing: find out if symlink creation is fast on > >>> NFS and GPFS. > >>> > >>> Is another approach to copy direct from the user's files to the /tmp > >>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if > >>> symlinks alone get adequate performance. Symlinks do seem an easier > >>> first step. > >>> > >>>> I think option ii) is substantially easier to implement (on the > >>>> order of days) and is generally useful in the single-cluster, > >>>> local-source-data situation that appears to be what people want to > >>>> do for running on the BG/P and scicortex (that is, pretty much > >>>> ignoring anything grid-like at all). > >>> > >>> Grid-like might mean pulling data to the /tmp workdir directly by > >>> the wrapper - but that seems like a harder step, and would need > >>> measurement and prototyping of such code before attempting. Data > >>> transfer clients that the wrapper script can count on might be an > >>> obstacle. > >>> > >>>> > >>>> Option i) is much harder (on the order of months), needing a very > >>>> different interface between Swift and Falkon than exists at the > >>>> moment. > >>>> > >>>> > >>>> > >>> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From iraicu at cs.uchicago.edu Sun Apr 13 18:22:51 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 13 Apr 2008 18:22:51 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <1208124543.3191.7.camel@blabla.mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du> <480280B8.9040605@mcs.anl.! 
gov> <4802902F.7050704@uchicago.edu> <1208124543.3191.7.camel@blabla.mcs.anl.gov> Message-ID: <480295CB.3010806@cs.uchicago.edu> We are not using GridFTP on the BG/P, where this test was done. Files are already on GPFS, so the stageins are probably just cp (or ln -s) from one place to another on GPFS. Is your suggestion still to set that 2000 back down to 100? Ioan Mihael Hategan wrote: > Then my guess is that the system itself (swift + server + FS) cannot > sustain a much higher rate than 100 things per second. In principle > setting those throttles to 2000 pretty much means that you're trying to > start 2000 gridftp connections and hence 2000 gridftp processes on the > server. > > On Sun, 2008-04-13 at 17:58 -0500, Zhao Zhang wrote: > >> Hi, Mike >> >> It is just a typo in the email. I my property file, it is >> "throttle.file.operations=2000". Thanks. >> >> zhao >> >> Michael Wilde wrote: >> >>>>> If its set right, any chance that Swift or Karajan is limiting it >>>>> somewhere? >>>>> >>>> 2000 for sure, >>>> throttle.submit=off >>>> throttle.host.submit=off >>>> throttle.score.job.factor=off >>>> throttle.transfers=2000 >>>> throttle.file.operation=2000 >>>> >>> Looks like a typo in your properties, Zhao - if the text above came >>> from your swift.properties directly: >>> >>> throttle.file.operation=2000 >>> >>> vs operations with an s as per the properties doc: >>> >>> throttle.file.operations=8 >>> #throttle.file.operations=off >>> >>> Which doesnt explain why we're seeing 100 when the default is 8 ??? >>> >>> - Mike >>> >>> >>> On 4/13/08 3:39 PM, Zhao Zhang wrote: >>> >>>> Hi, Mike >>>> >>>> Michael Wilde wrote: >>>> >>>>> Ben, your analysis sounds very good. Some notes below, including >>>>> questions for Zhao. >>>>> >>>>> On 4/13/08 2:57 PM, Ben Clifford wrote: >>>>> >>>>>>> Ben, can you point me to the graphs for this run? (Zhao's >>>>>>> *99cy0z4g.log) >>>>>>> >>>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g >>>>>> >>>>>> >>>>>>> Once stage-ins start to complete, are the corresponding jobs >>>>>>> initiated quickly, or is Swift doing mostly stage-ins for some >>>>>>> period? >>>>>>> >>>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to >>>>>> falkon) pretty much right as the corresponding stagein completes. I >>>>>> have no deeper information about when the worker actually starts to >>>>>> run. >>>>>> >>>>>> >>>>>>> Zhao indicated he saw data indicating there was about a 700 second >>>>>>> lag from >>>>>>> workflow start time till the first Falkon jobs started, if I >>>>>>> understood >>>>>>> correctly. Do the graphs confirm this or say something different? >>>>>>> >>>>>> There is a period of about 500s or so until stuff starts to happen; >>>>>> I haven't looked at it. That is before stage-ins start too, though, >>>>>> which means that i think this... >>>>>> >>>>>> >>>>>>> If the 700-second delay figure is true, and stage-in was >>>>>>> eliminated by copying >>>>>>> input files right to the /tmp workdir rather than first to >>>>>>> /shared, then we'd >>>>>>> have: >>>>>>> >>>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency >>>>>>> >>>>>> calculation is not meaningful. >>>>>> >>>>>> I have not looked at what is going on during that 500s startup >>>>>> time, but I plan to. >>>>>> >>>>> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper >>>>> logging problem a few weeks ago. Could that cause such a delay, Ben? >>>>> It would be very obvious in the swift log. 
>>>>> >>>> The version is Swift svn swift-r1780 cog-r1956 >>>> >>>>>>> I assume we're paying the same staging price on the output side? >>>>>>> >>>>>> not really - the output stageouts go very fast, and also because >>>>>> job ending is staggered, they don't happen all at once. >>>>>> >>>>>> This is the same with most of the large runs I've seen (of any >>>>>> application) - stageout tends not to be a problem (or at least, no >>>>>> where near the problems of stagein). >>>>>> >>>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. >>>>>> There's rate limiting still on file operations (100 max) and file >>>>>> transfers (2000 max) which is being hit still. >>>>>> >>>>> I thought Zhao set file operations throttle to 2000 as well. Sounds >>>>> like we can test with the latter higher, and find out what's >>>>> limiting the former. >>>>> >>>>> Zhao, what are your settings for property throttle.file.operations? >>>>> I assume you have throttle.transfers set to 2000. >>>>> >>>>> If its set right, any chance that Swift or Karajan is limiting it >>>>> somewhere? >>>>> >>>> 2000 for sure, >>>> throttle.submit=off >>>> throttle.host.submit=off >>>> throttle.score.job.factor=off >>>> throttle.transfers=2000 >>>> throttle.file.operation=2000 >>>> >>>>>> I think there's two directions to proceed in here that make sense >>>>>> for actual use on single clusters running falkon (rather than >>>>>> trying to cut out stuff randomly to push up numbers): >>>>>> >>>>>> i) use some of the data placement features in falkon, rather than >>>>>> Swift's >>>>>> relatively simple data management that was designed more for >>>>>> running >>>>>> on the grid. >>>>>> >>>>> Long term: we should consider how the Coaster implementation could >>>>> eventually do a similar data placement approach. In the meantime >>>>> (mid term) examining what interface changes are needed for Falkon >>>>> data placement might help prepare for that. Need to discuss if that >>>>> would be a good step or not. >>>>> >>>>> >>>>>> ii) do stage-ins using symlinks rather than file copying. this makes >>>>>> sense when everything is living in a single filesystem, which >>>>>> again >>>>>> is not what Swift's data management was originally optimised for. >>>>>> >>>>> I assume you mean symlinks from shared/ back to the user's input files? >>>>> >>>>> That sounds worth testing: find out if symlink creation is fast on >>>>> NFS and GPFS. >>>>> >>>>> Is another approach to copy direct from the user's files to the /tmp >>>>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if >>>>> symlinks alone get adequate performance. Symlinks do seem an easier >>>>> first step. >>>>> >>>>> >>>>>> I think option ii) is substantially easier to implement (on the >>>>>> order of days) and is generally useful in the single-cluster, >>>>>> local-source-data situation that appears to be what people want to >>>>>> do for running on the BG/P and scicortex (that is, pretty much >>>>>> ignoring anything grid-like at all). >>>>>> >>>>> Grid-like might mean pulling data to the /tmp workdir directly by >>>>> the wrapper - but that seems like a harder step, and would need >>>>> measurement and prototyping of such code before attempting. Data >>>>> transfer clients that the wrapper script can count on might be an >>>>> obstacle. >>>>> >>>>> >>>>>> Option i) is much harder (on the order of months), needing a very >>>>>> different interface between Swift and Falkon than exists at the >>>>>> moment. 
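For readers following the throttle discussion quoted above: Zhao's clarification is that the properties file itself uses the plural spelling from the documentation, so the set of throttles under discussion, written out as swift.properties lines, would look like the fragment below. The "off" values are reproduced from the quoted settings rather than re-checked against the user guide.

    throttle.submit=off
    throttle.host.submit=off
    throttle.score.job.factor=off
    throttle.transfers=2000
    # plural "operations" -- a key spelled "throttle.file.operation" is not the
    # documented property name, which is why the email text looked like a typo
    throttle.file.operations=2000

Even with these values, the observed ceiling of roughly 100 concurrent file operations is not explained by the settings alone, which is the open question in this thread.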
>>>>>> >>>>>> >>>>>> >>>>>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Sun Apr 13 18:31:42 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 13 Apr 2008 18:31:42 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <48028E19.7020400@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.ed u> <48027249.2070208@cs.uchica! go.edu> <48028E19.7020400@mcs.an l.gov> Message-ID: <480297DE.3010501@cs.uchicago.edu> Michael Wilde wrote: > Its not clear to me whats best here, for 3 reasons: > > 1) We should set file.transfers and file.operations to a value that > prevents Swift from adversely impacting performance on shared resources. > > Since Swift must run on the login node and hits the shared cluster > networks, we should test carefully. > > 2) Its not clear to me how many concurrent operations the login hosts > can sustain before topping out, and how this number depends on file > size. Do you know this from the GPFS benchmarks? 1 node can sustain about 71 file reads/sec (1B each), and 512 nodes can sustain 362 reads/sec. 10KB files are similar, 66/sec and 315/sec. Read+write for 1B is 31/sec and 79/sec, and for 10KB is 31/sec and 81/sec. Does this explain anything? How fast were stage-ins going? Does a stage-in mean copy a file from one place to another? Then we are looking at the 1 node performance of 10KB read+write, which would be 31 ops/sec. Would all the stage-in be happening during the 500 second idle time? If yes, then that is about 24 files/sec, which is awfully close to the 31 files/sec from our benchmark. If not, then its pure coincidence. > And did you measure impact on system response during those benchmarks? No. Ioan > > I think that the overall system would top out well before 2000 > concurrent transfers, but I could be wrong. Going much higher than the > point where concurrency increases the data rate, it seems, would cause > the rate to drop due to contention and context switching. 
> > 3) If I/O operations are fast compared to the job length and > completion rate, you dont have to set these values to the same as the > max number of input files that can be demanded at once. > > I think we want to set the I/O operation concurrency to a value that > achieves the highest operation rate we can sustain while keeping > overall system performance at some acceptable level (tbd). > > So first we need to find the concurrency level that maximizes ops/sec, > (which may be filesize dependent) and then possibly back that off to > reduce system impact. > > It seems to me that finding the right I/O concurrency setting is > complex and non-obvious, and I'm interested in what Ben and Mihael > suggest here. > > - Mike > > On 4/13/08 3:51 PM, Ioan Raicu wrote: >> But we have 2X input files as opposed to number of jobs and CPUs. We >> have 2048 CPUs, shouldn't we set all file I/O operations to at least >> 4096... and that means that files won't be ready for the next jobs >> once the first ones start completing... so we should really set >> things to twice that, so 8192 is the number I'd set on all file >> operations for this app on 2K CPUs. >> Ioan >> >> Zhao Zhang wrote: >>> Hi, Mike >>> >>> Michael Wilde wrote: >>>> Ben, your analysis sounds very good. Some notes below, including >>>> questions for Zhao. >>>> >>>> On 4/13/08 2:57 PM, Ben Clifford wrote: >>>>> >>>>>> Ben, can you point me to the graphs for this run? (Zhao's >>>>>> *99cy0z4g.log) >>>>> >>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g >>>>> >>>>>> Once stage-ins start to complete, are the corresponding jobs >>>>>> initiated quickly, or is Swift doing mostly stage-ins for some >>>>>> period? >>>>> >>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to >>>>> falkon) pretty much right as the corresponding stagein completes. >>>>> I have no deeper information about when the worker actually starts >>>>> to run. >>>>> >>>>>> Zhao indicated he saw data indicating there was about a 700 >>>>>> second lag from >>>>>> workflow start time till the first Falkon jobs started, if I >>>>>> understood >>>>>> correctly. Do the graphs confirm this or say something different? >>>>> >>>>> There is a period of about 500s or so until stuff starts to >>>>> happen; I haven't looked at it. That is before stage-ins start >>>>> too, though, which means that i think this... >>>>> >>>>>> If the 700-second delay figure is true, and stage-in was >>>>>> eliminated by copying >>>>>> input files right to the /tmp workdir rather than first to >>>>>> /shared, then we'd >>>>>> have: >>>>>> >>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency >>>>> >>>>> calculation is not meaningful. >>>>> >>>>> I have not looked at what is going on during that 500s startup >>>>> time, but I plan to. >>>> >>>> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper >>>> logging problem a few weeks ago. Could that cause such a delay, >>>> Ben? It would be very obvious in the swift log. >>> The version is Swift svn swift-r1780 cog-r1956 >>>> >>>>> >>>>>> I assume we're paying the same staging price on the output side? >>>>> >>>>> not really - the output stageouts go very fast, and also because >>>>> job ending is staggered, they don't happen all at once. >>>>> >>>>> This is the same with most of the large runs I've seen (of any >>>>> application) - stageout tends not to be a problem (or at least, no >>>>> where near the problems of stagein). >>>>> >>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>>>> There's rate limiting still on file operations (100 max) and file >>>>> transfers (2000 max) which is being hit still. >>>> >>>> I thought Zhao set file operations throttle to 2000 as well. >>>> Sounds like we can test with the latter higher, and find out what's >>>> limiting the former. >>>> >>>> Zhao, what are your settings for property throttle.file.operations? >>>> I assume you have throttle.transfers set to 2000. >>>> >>>> If its set right, any chance that Swift or Karajan is limiting it >>>> somewhere? >>> 2000 for sure, >>> throttle.submit=off >>> throttle.host.submit=off >>> throttle.score.job.factor=off >>> throttle.transfers=2000 >>> throttle.file.operation=2000 >>>>> >>>>> I think there's two directions to proceed in here that make sense >>>>> for actual use on single clusters running falkon (rather than >>>>> trying to cut out stuff randomly to push up numbers): >>>>> >>>>> i) use some of the data placement features in falkon, rather than >>>>> Swift's >>>>> relatively simple data management that was designed more for >>>>> running >>>>> on the grid. >>>> >>>> Long term: we should consider how the Coaster implementation could >>>> eventually do a similar data placement approach. In the meantime >>>> (mid term) examining what interface changes are needed for Falkon >>>> data placement might help prepare for that. Need to discuss if that >>>> would be a good step or not. >>>> >>>>> >>>>> ii) do stage-ins using symlinks rather than file copying. this makes >>>>> sense when everything is living in a single filesystem, which >>>>> again >>>>> is not what Swift's data management was originally optimised >>>>> for. >>>> >>>> I assume you mean symlinks from shared/ back to the user's input >>>> files? >>>> >>>> That sounds worth testing: find out if symlink creation is fast on >>>> NFS and GPFS. >>>> >>>> Is another approach to copy direct from the user's files to the >>>> /tmp workdir (ie wrapper.sh pulls the data in)? Measurement will >>>> tell if symlinks alone get adequate performance. Symlinks do seem >>>> an easier first step. >>>> >>>>> I think option ii) is substantially easier to implement (on the >>>>> order of days) and is generally useful in the single-cluster, >>>>> local-source-data situation that appears to be what people want to >>>>> do for running on the BG/P and scicortex (that is, pretty much >>>>> ignoring anything grid-like at all). >>>> >>>> Grid-like might mean pulling data to the /tmp workdir directly by >>>> the wrapper - but that seems like a harder step, and would need >>>> measurement and prototyping of such code before attempting. Data >>>> transfer clients that the wrapper script can count on might be an >>>> obstacle. >>>> >>>>> >>>>> Option i) is much harder (on the order of months), needing a very >>>>> different interface between Swift and Falkon than exists at the >>>>> moment. >>>>> >>>>> >>>>> >>>> >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Sun Apr 13 17:41:20 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 13 Apr 2008 17:41:20 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <480295CB.3010806@cs.uchicago.edu> References: <47FEB489.509@mcs.anl.gov> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du> <480280B8.9040605@mcs.anl.! gov> <4802902F.7050704@uchicago.edu> <1208124543.3191.7.camel@blabla.mcs.anl.gov> <480295CB.3010806@cs.uchicago.edu> Message-ID: <1208126480.3732.3.camel@blabla.mcs.anl.gov> On Sun, 2008-04-13 at 18:22 -0500, Ioan Raicu wrote: > We are not using GridFTP on the BG/P, where this test was done. Files > are already on GPFS, so the stageins are probably just cp (or ln -s) > from one place to another on GPFS. Is your suggestion still to set that > 2000 back down to 100? I see. So it's the local provider. Well, mileage may vary. 100 concurrent transfers doesn't seem very far from what I'd expect if we're talking about small files. > > Ioan > > Mihael Hategan wrote: > > Then my guess is that the system itself (swift + server + FS) cannot > > sustain a much higher rate than 100 things per second. In principle > > setting those throttles to 2000 pretty much means that you're trying to > > start 2000 gridftp connections and hence 2000 gridftp processes on the > > server. > > > > On Sun, 2008-04-13 at 17:58 -0500, Zhao Zhang wrote: > > > >> Hi, Mike > >> > >> It is just a typo in the email. I my property file, it is > >> "throttle.file.operations=2000". Thanks. > >> > >> zhao > >> > >> Michael Wilde wrote: > >> > >>>>> If its set right, any chance that Swift or Karajan is limiting it > >>>>> somewhere? > >>>>> > >>>> 2000 for sure, > >>>> throttle.submit=off > >>>> throttle.host.submit=off > >>>> throttle.score.job.factor=off > >>>> throttle.transfers=2000 > >>>> throttle.file.operation=2000 > >>>> > >>> Looks like a typo in your properties, Zhao - if the text above came > >>> from your swift.properties directly: > >>> > >>> throttle.file.operation=2000 > >>> > >>> vs operations with an s as per the properties doc: > >>> > >>> throttle.file.operations=8 > >>> #throttle.file.operations=off > >>> > >>> Which doesnt explain why we're seeing 100 when the default is 8 ??? > >>> > >>> - Mike > >>> > >>> > >>> On 4/13/08 3:39 PM, Zhao Zhang wrote: > >>> > >>>> Hi, Mike > >>>> > >>>> Michael Wilde wrote: > >>>> > >>>>> Ben, your analysis sounds very good. Some notes below, including > >>>>> questions for Zhao. > >>>>> > >>>>> On 4/13/08 2:57 PM, Ben Clifford wrote: > >>>>> > >>>>>>> Ben, can you point me to the graphs for this run? 
(Zhao's > >>>>>>> *99cy0z4g.log) > >>>>>>> > >>>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g > >>>>>> > >>>>>> > >>>>>>> Once stage-ins start to complete, are the corresponding jobs > >>>>>>> initiated quickly, or is Swift doing mostly stage-ins for some > >>>>>>> period? > >>>>>>> > >>>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to > >>>>>> falkon) pretty much right as the corresponding stagein completes. I > >>>>>> have no deeper information about when the worker actually starts to > >>>>>> run. > >>>>>> > >>>>>> > >>>>>>> Zhao indicated he saw data indicating there was about a 700 second > >>>>>>> lag from > >>>>>>> workflow start time till the first Falkon jobs started, if I > >>>>>>> understood > >>>>>>> correctly. Do the graphs confirm this or say something different? > >>>>>>> > >>>>>> There is a period of about 500s or so until stuff starts to happen; > >>>>>> I haven't looked at it. That is before stage-ins start too, though, > >>>>>> which means that i think this... > >>>>>> > >>>>>> > >>>>>>> If the 700-second delay figure is true, and stage-in was > >>>>>>> eliminated by copying > >>>>>>> input files right to the /tmp workdir rather than first to > >>>>>>> /shared, then we'd > >>>>>>> have: > >>>>>>> > >>>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency > >>>>>>> > >>>>>> calculation is not meaningful. > >>>>>> > >>>>>> I have not looked at what is going on during that 500s startup > >>>>>> time, but I plan to. > >>>>>> > >>>>> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper > >>>>> logging problem a few weeks ago. Could that cause such a delay, Ben? > >>>>> It would be very obvious in the swift log. > >>>>> > >>>> The version is Swift svn swift-r1780 cog-r1956 > >>>> > >>>>>>> I assume we're paying the same staging price on the output side? > >>>>>>> > >>>>>> not really - the output stageouts go very fast, and also because > >>>>>> job ending is staggered, they don't happen all at once. > >>>>>> > >>>>>> This is the same with most of the large runs I've seen (of any > >>>>>> application) - stageout tends not to be a problem (or at least, no > >>>>>> where near the problems of stagein). > >>>>>> > >>>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. > >>>>>> There's rate limiting still on file operations (100 max) and file > >>>>>> transfers (2000 max) which is being hit still. > >>>>>> > >>>>> I thought Zhao set file operations throttle to 2000 as well. Sounds > >>>>> like we can test with the latter higher, and find out what's > >>>>> limiting the former. > >>>>> > >>>>> Zhao, what are your settings for property throttle.file.operations? > >>>>> I assume you have throttle.transfers set to 2000. > >>>>> > >>>>> If its set right, any chance that Swift or Karajan is limiting it > >>>>> somewhere? > >>>>> > >>>> 2000 for sure, > >>>> throttle.submit=off > >>>> throttle.host.submit=off > >>>> throttle.score.job.factor=off > >>>> throttle.transfers=2000 > >>>> throttle.file.operation=2000 > >>>> > >>>>>> I think there's two directions to proceed in here that make sense > >>>>>> for actual use on single clusters running falkon (rather than > >>>>>> trying to cut out stuff randomly to push up numbers): > >>>>>> > >>>>>> i) use some of the data placement features in falkon, rather than > >>>>>> Swift's > >>>>>> relatively simple data management that was designed more for > >>>>>> running > >>>>>> on the grid. 
> >>>>>> > >>>>> Long term: we should consider how the Coaster implementation could > >>>>> eventually do a similar data placement approach. In the meantime > >>>>> (mid term) examining what interface changes are needed for Falkon > >>>>> data placement might help prepare for that. Need to discuss if that > >>>>> would be a good step or not. > >>>>> > >>>>> > >>>>>> ii) do stage-ins using symlinks rather than file copying. this makes > >>>>>> sense when everything is living in a single filesystem, which > >>>>>> again > >>>>>> is not what Swift's data management was originally optimised for. > >>>>>> > >>>>> I assume you mean symlinks from shared/ back to the user's input files? > >>>>> > >>>>> That sounds worth testing: find out if symlink creation is fast on > >>>>> NFS and GPFS. > >>>>> > >>>>> Is another approach to copy direct from the user's files to the /tmp > >>>>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if > >>>>> symlinks alone get adequate performance. Symlinks do seem an easier > >>>>> first step. > >>>>> > >>>>> > >>>>>> I think option ii) is substantially easier to implement (on the > >>>>>> order of days) and is generally useful in the single-cluster, > >>>>>> local-source-data situation that appears to be what people want to > >>>>>> do for running on the BG/P and scicortex (that is, pretty much > >>>>>> ignoring anything grid-like at all). > >>>>>> > >>>>> Grid-like might mean pulling data to the /tmp workdir directly by > >>>>> the wrapper - but that seems like a harder step, and would need > >>>>> measurement and prototyping of such code before attempting. Data > >>>>> transfer clients that the wrapper script can count on might be an > >>>>> obstacle. > >>>>> > >>>>> > >>>>>> Option i) is much harder (on the order of months), needing a very > >>>>>> different interface between Swift and Falkon than exists at the > >>>>>> moment. > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > From iraicu at cs.uchicago.edu Sun Apr 13 18:45:28 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 13 Apr 2008 18:45:28 -0500 Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <1208126480.3732.3.camel@blabla.mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du> <480280B8.9040605@mcs.anl.! gov> <4802902F.7050704@uchicago.edu> <1208124543.3191.7.camel@blabla.mcs.anl.gov> <480295CB.3010806@cs.uchicago.edu> <1208126480.3732.3.camel@blabla.mcs.anl.gov> Message-ID: <48029B18.5080900@cs.uchicago.edu> OK, sounds good. Zhao, when you get the chance, maybe you could try 100 instead of 2000 for the file limits? I don't know if this is a top priority right now, but certainly for after SC. "throttle.file.operations=100" Ioan Mihael Hategan wrote: > On Sun, 2008-04-13 at 18:22 -0500, Ioan Raicu wrote: > >> We are not using GridFTP on the BG/P, where this test was done. Files >> are already on GPFS, so the stageins are probably just cp (or ln -s) >> from one place to another on GPFS. Is your suggestion still to set that >> 2000 back down to 100? >> > > I see. So it's the local provider. Well, mileage may vary. 100 > concurrent transfers doesn't seem very far from what I'd expect if we're > talking about small files. > > >> Ioan >> >> Mihael Hategan wrote: >> >>> Then my guess is that the system itself (swift + server + FS) cannot >>> sustain a much higher rate than 100 things per second. In principle >>> setting those throttles to 2000 pretty much means that you're trying to >>> start 2000 gridftp connections and hence 2000 gridftp processes on the >>> server. >>> >>> On Sun, 2008-04-13 at 17:58 -0500, Zhao Zhang wrote: >>> >>> >>>> Hi, Mike >>>> >>>> It is just a typo in the email. I my property file, it is >>>> "throttle.file.operations=2000". Thanks. >>>> >>>> zhao >>>> >>>> Michael Wilde wrote: >>>> >>>> >>>>>>> If its set right, any chance that Swift or Karajan is limiting it >>>>>>> somewhere? >>>>>>> >>>>>>> >>>>>> 2000 for sure, >>>>>> throttle.submit=off >>>>>> throttle.host.submit=off >>>>>> throttle.score.job.factor=off >>>>>> throttle.transfers=2000 >>>>>> throttle.file.operation=2000 >>>>>> >>>>>> >>>>> Looks like a typo in your properties, Zhao - if the text above came >>>>> from your swift.properties directly: >>>>> >>>>> throttle.file.operation=2000 >>>>> >>>>> vs operations with an s as per the properties doc: >>>>> >>>>> throttle.file.operations=8 >>>>> #throttle.file.operations=off >>>>> >>>>> Which doesnt explain why we're seeing 100 when the default is 8 ??? >>>>> >>>>> - Mike >>>>> >>>>> >>>>> On 4/13/08 3:39 PM, Zhao Zhang wrote: >>>>> >>>>> >>>>>> Hi, Mike >>>>>> >>>>>> Michael Wilde wrote: >>>>>> >>>>>> >>>>>>> Ben, your analysis sounds very good. 
Some notes below, including >>>>>>> questions for Zhao. >>>>>>> >>>>>>> On 4/13/08 2:57 PM, Ben Clifford wrote: >>>>>>> >>>>>>> >>>>>>>>> Ben, can you point me to the graphs for this run? (Zhao's >>>>>>>>> *99cy0z4g.log) >>>>>>>>> >>>>>>>>> >>>>>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Once stage-ins start to complete, are the corresponding jobs >>>>>>>>> initiated quickly, or is Swift doing mostly stage-ins for some >>>>>>>>> period? >>>>>>>>> >>>>>>>>> >>>>>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to >>>>>>>> falkon) pretty much right as the corresponding stagein completes. I >>>>>>>> have no deeper information about when the worker actually starts to >>>>>>>> run. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Zhao indicated he saw data indicating there was about a 700 second >>>>>>>>> lag from >>>>>>>>> workflow start time till the first Falkon jobs started, if I >>>>>>>>> understood >>>>>>>>> correctly. Do the graphs confirm this or say something different? >>>>>>>>> >>>>>>>>> >>>>>>>> There is a period of about 500s or so until stuff starts to happen; >>>>>>>> I haven't looked at it. That is before stage-ins start too, though, >>>>>>>> which means that i think this... >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> If the 700-second delay figure is true, and stage-in was >>>>>>>>> eliminated by copying >>>>>>>>> input files right to the /tmp workdir rather than first to >>>>>>>>> /shared, then we'd >>>>>>>>> have: >>>>>>>>> >>>>>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency >>>>>>>>> >>>>>>>>> >>>>>>>> calculation is not meaningful. >>>>>>>> >>>>>>>> I have not looked at what is going on during that 500s startup >>>>>>>> time, but I plan to. >>>>>>>> >>>>>>>> >>>>>>> Zhao, what SVN rev is your Swift at? Ben fixed an N^2 mapper >>>>>>> logging problem a few weeks ago. Could that cause such a delay, Ben? >>>>>>> It would be very obvious in the swift log. >>>>>>> >>>>>>> >>>>>> The version is Swift svn swift-r1780 cog-r1956 >>>>>> >>>>>> >>>>>>>>> I assume we're paying the same staging price on the output side? >>>>>>>>> >>>>>>>>> >>>>>>>> not really - the output stageouts go very fast, and also because >>>>>>>> job ending is staggered, they don't happen all at once. >>>>>>>> >>>>>>>> This is the same with most of the large runs I've seen (of any >>>>>>>> application) - stageout tends not to be a problem (or at least, no >>>>>>>> where near the problems of stagein). >>>>>>>> >>>>>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. >>>>>>>> There's rate limiting still on file operations (100 max) and file >>>>>>>> transfers (2000 max) which is being hit still. >>>>>>>> >>>>>>>> >>>>>>> I thought Zhao set file operations throttle to 2000 as well. Sounds >>>>>>> like we can test with the latter higher, and find out what's >>>>>>> limiting the former. >>>>>>> >>>>>>> Zhao, what are your settings for property throttle.file.operations? >>>>>>> I assume you have throttle.transfers set to 2000. >>>>>>> >>>>>>> If its set right, any chance that Swift or Karajan is limiting it >>>>>>> somewhere? 
>>>>>>> >>>>>>> >>>>>> 2000 for sure, >>>>>> throttle.submit=off >>>>>> throttle.host.submit=off >>>>>> throttle.score.job.factor=off >>>>>> throttle.transfers=2000 >>>>>> throttle.file.operation=2000 >>>>>> >>>>>> >>>>>>>> I think there's two directions to proceed in here that make sense >>>>>>>> for actual use on single clusters running falkon (rather than >>>>>>>> trying to cut out stuff randomly to push up numbers): >>>>>>>> >>>>>>>> i) use some of the data placement features in falkon, rather than >>>>>>>> Swift's >>>>>>>> relatively simple data management that was designed more for >>>>>>>> running >>>>>>>> on the grid. >>>>>>>> >>>>>>>> >>>>>>> Long term: we should consider how the Coaster implementation could >>>>>>> eventually do a similar data placement approach. In the meantime >>>>>>> (mid term) examining what interface changes are needed for Falkon >>>>>>> data placement might help prepare for that. Need to discuss if that >>>>>>> would be a good step or not. >>>>>>> >>>>>>> >>>>>>> >>>>>>>> ii) do stage-ins using symlinks rather than file copying. this makes >>>>>>>> sense when everything is living in a single filesystem, which >>>>>>>> again >>>>>>>> is not what Swift's data management was originally optimised for. >>>>>>>> >>>>>>>> >>>>>>> I assume you mean symlinks from shared/ back to the user's input files? >>>>>>> >>>>>>> That sounds worth testing: find out if symlink creation is fast on >>>>>>> NFS and GPFS. >>>>>>> >>>>>>> Is another approach to copy direct from the user's files to the /tmp >>>>>>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if >>>>>>> symlinks alone get adequate performance. Symlinks do seem an easier >>>>>>> first step. >>>>>>> >>>>>>> >>>>>>> >>>>>>>> I think option ii) is substantially easier to implement (on the >>>>>>>> order of days) and is generally useful in the single-cluster, >>>>>>>> local-source-data situation that appears to be what people want to >>>>>>>> do for running on the BG/P and scicortex (that is, pretty much >>>>>>>> ignoring anything grid-like at all). >>>>>>>> >>>>>>>> >>>>>>> Grid-like might mean pulling data to the /tmp workdir directly by >>>>>>> the wrapper - but that seems like a harder step, and would need >>>>>>> measurement and prototyping of such code before attempting. Data >>>>>>> transfer clients that the wrapper script can count on might be an >>>>>>> obstacle. >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Option i) is much harder (on the order of months), needing a very >>>>>>>> different interface between Swift and Falkon than exists at the >>>>>>>> moment. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 
58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> >> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Sun Apr 13 23:26:15 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 14 Apr 2008 04:26:15 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <480289B7.4050207@mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <480289B7.4050207@mcs.anl.gov> Message-ID: On Sun, 13 Apr 2008, Michael Wilde wrote: > That might be a low rate, but its also not clear why its creating so many > handles: I thought we only have about 6K jobs here, with 2 files in and 1 file > out per job: all the local variables, intermediate values and individual pieces of structures count as datasets. 18 for each iteration of the loop seems roughly correct. -- From benc at hawaga.org.uk Sun Apr 13 23:28:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 14 Apr 2008 04:28:19 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: <1208124543.3191.7.camel@blabla.mcs.anl.gov> References: <47FEB489.509@mcs.anl.gov> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> <48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du> <480280B8.9040605@mcs.anl.! gov> <4802902F.7050704@uchicago.edu> <1208124543.3191.7.camel@blabla.mcs.anl.gov> Message-ID: On Sun, 13 Apr 2008, Mihael Hategan wrote: > Then my guess is that the system itself (swift + server + FS) cannot > sustain a much higher rate than 100 things per second The graph shows it maxing out at exactly 100, which is suspicious if its a soft load limit. 
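To make the point about handle counts concrete: every declared variable, every structure member, and every intermediate value in the loop body gets its own dataset handle when an iteration is elaborated, so two or three mapped files per job easily turns into 15-20 handles per iteration. A hypothetical SwiftScript fragment in the style of the DOCK runs (illustrative only, not Zhao's actual script; mapper parameters written from memory of the user guide):

    type MolFile;
    type DockOut;

    type DockParams {
        string receptor;
        string flags;
    }

    app (DockOut out) dock (MolFile mol, string receptor, string flags) {
        dock6 @mol receptor flags stdout=@out;
    }

    MolFile mols[] <filesys_mapper; location="input", suffix=".mol2">;
    DockOut results[] <simple_mapper; prefix="out.">;

    foreach m, i in mols {
        // Separate handles per iteration: m, i, results[i], the struct p,
        // p.receptor, p.flags, plus handles for constants and intermediates.
        DockParams p;
        p.receptor = "target.pdb";
        p.flags    = "-fast";
        results[i] = dock(m, p.receptor, p.flags);
    }

The constant-interning change in r1790, discussed a little further down, targets exactly the string constants in such a loop body, creating them once per program rather than once per iteration.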
-- From wilde at mcs.anl.gov Mon Apr 14 13:49:24 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 14 Apr 2008 13:49:24 -0500 Subject: [Swift-devel] Can not reach advertised gridftp server on Abe Message-ID: <4803A734.9030407@mcs.anl.gov> Hi Help Team, When I try to reach the gridftp server on Abe advertised in the Userinfo pages at, which is: gridftp-abe.ncsa.teragrid.org on page: http://www.teragrid.org/userinfo/hardware/resources.php?type=compute&select=single&id=50&PHPSESSID=2379360d3ce483f8f90532609354cd73 then I get the following error: # my cert is ok: login$ globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname honest4.ncsa.uiuc.edu # this fails: login$ globus-url-copy file:///etc/passwd gsiftp://gridftp-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci error: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : IPC connection failed. 530-globus_xio_gsi: gss_init_sec_context failed. 530-GSS Major Status: Unexpected Gatekeeper or Service Name 530-globus_gsi_gssapi: Authorization denied: The name of the remote host (abe-ipib-gw02.ncsa.uiuc.edu), and the expected name for the remote host (abe-gw02) do not match. This happens when the name in the host certificate does not match the information obtained from DNS and is often a DNS configuration problem. 530 End. # while gridftp to the GRAM gatekeeper host works: login$ globus-url-copy file:///etc/passwd gsiftp://grid-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci login$ -- A problem due to multi-homed hosts or multiple hostname aliases? Is gridftp-abe a beefier data server than grid-abe, and if so should the problem above get fixed? Thanks, - Mike From benc at hawaga.org.uk Mon Apr 14 14:31:41 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 14 Apr 2008 19:31:41 +0000 (GMT) Subject: [Swift-devel] Re: Another performance comparison of DOCK In-Reply-To: References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov> <47FFA226.3070303@uchicago.edu> <47FFCD24.2040704@uchicago.edu> <48008BAD.20501@uchicago.edu> <4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu> <48025C64.9020502@mcs.anl.gov> Message-ID: > What happens in the log file for this period is lots of DSHandle creation > (vdl:new) - it creates 115596 datasets in 451 seconds Swift r1790 introduces constant interning. When constants are used in a foreach or iterate loop, this should reduce the number of DSHandles created (once per SwiftScript program per constant rather than once per iteration per constant in from Michael Wilde on Mon, 14 Apr 2008 13:49:24 -0500 Message-ID: <200804142004.m3EK49sM024366@zanamavir.ncsa.uiuc.edu> FROM: Jackson, Weddie (Concerning ticket No. 154784) ============================== Hello Michael, Our Grid Services Folks are working on the issue, in the meantime you can use login-abe.ncsa.teragrid.org instead of gridftp-abe.ncsa.teragrid.org and we will notify you once the issue has been resolved. We appologize for any inconvenience. 
Thanks, -Weddie ------------------------ Weddie Jackson NCSA Consulting Services ------------------------ Michael Wilde writes: >Hi Help Team, > >When I try to reach the gridftp server on Abe advertised in the Userinfo >pages at, which is: > > gridftp-abe.ncsa.teragrid.org > >on page: > >http://www.teragrid.org/userinfo/hardware/resources.php? type=compute&select=single&id=50&PHPSESSID=2379360d3ce483f8f90532609354cd73 > >then I get the following error: > ># my cert is ok: > >login$ globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname >honest4.ncsa.uiuc.edu > ># this fails: > >login$ globus-url-copy file:///etc/passwd >gsiftp://gridftp-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci > >error: globus_ftp_client: the server responded with an error >530 530-Login incorrect. : IPC connection failed. >530-globus_xio_gsi: gss_init_sec_context failed. >530-GSS Major Status: Unexpected Gatekeeper or Service Name >530-globus_gsi_gssapi: Authorization denied: The name of the remote host >(abe-ipib-gw02.ncsa.uiuc.edu), and the expected name for the remote host >(abe-gw02) do not match. This happens when the name in the host >certificate does not match the information obtained from DNS and is >often a DNS configuration problem. >530 End. > > ># while gridftp to the GRAM gatekeeper host works: > >login$ globus-url-copy file:///etc/passwd >gsiftp://grid-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci >login$ From help at teragrid.org Mon Apr 14 15:38:28 2008 From: help at teragrid.org (help at teragrid.org) Date: Mon, 14 Apr 2008 15:38:28 -0500 Subject: [Swift-devel] [Fwd: Re: Can not reach advertised gridftp server on Abe ] Message-ID: <200804142038.m3EKcSp6019304@amantadine.ncsa.uiuc.edu> FROM: Jackson, Weddie (Concerning ticket No. 154784) ============================== Hello Michael, Can you try reach the Abe's GridFTP server "gridftp-abe.ncsa.teragrid.org" again, our Grid Folks beleived that they have resolved the issue. Please let us know whether or not you are still seeing issues when using "gridftp-abe.ncsa.teragrid.org". Thanks, -Weddie ------------------------ Weddie Jackson NCSA Consulting Services ------------------------ __________Original Message__________ From: help at teragrid.org To: Michael Wilde Subj: Re: Can not reach advertised gridftp server on Abe Cc: swift-devel , Mike Kubal FROM: Jackson, Weddie (Concerning ticket No. 154784) ============================== Hello Michael, Our Grid Services Folks are working on the issue, in the meantime you can use login-abe.ncsa.teragrid.org instead of gridftp-abe.ncsa.teragrid.org and we will notify you once the issue has been resolved. We appologize for any inconvenience. Thanks, -Weddie ------------------------ Weddie Jackson NCSA Consulting Services ------------------------ Michael Wilde writes: >Hi Help Team, > >When I try to reach the gridftp server on Abe advertised in the Userinfo >pages at, which is: > > gridftp-abe.ncsa.teragrid.org > >on page: > >http://www.teragrid.org/userinfo/hardware/resources.php? type=compute&select=single&id=50&PHPSESSID=2379360d3ce483f8f90532609354cd73 > >then I get the following error: > ># my cert is ok: > >login$ globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname >honest4.ncsa.uiuc.edu > ># this fails: > >login$ globus-url-copy file:///etc/passwd >gsiftp://gridftp-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci > >error: globus_ftp_client: the server responded with an error >530 530-Login incorrect. : IPC connection failed. >530-globus_xio_gsi: gss_init_sec_context failed. 
>530-GSS Major Status: Unexpected Gatekeeper or Service Name >530-globus_gsi_gssapi: Authorization denied: The name of the remote host >(abe-ipib-gw02.ncsa.uiuc.edu), and the expected name for the remote host >(abe-gw02) do not match. This happens when the name in the host >certificate does not match the information obtained from DNS and is >often a DNS configuration problem. >530 End. > > ># while gridftp to the GRAM gatekeeper host works: > >login$ globus-url-copy file:///etc/passwd >gsiftp://grid-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci >login$ From benc at hawaga.org.uk Mon Apr 14 18:21:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 14 Apr 2008 23:21:17 +0000 (GMT) Subject: [Swift-devel] hardlinks instead of copies on local file systems Message-ID: I hacked up a quick provider which uses unix hard links instead of copying in order to transfer files. This is a dirty hack to see if it has any performance improvements of copying, and lacks error handling. Most notably, Swift will fail in strange ways when: i) an output file already exists (other providers tend to overwrite) and ii) when the input data file is on a different file system (so hard links cannot work) to the site shared working directory. To try this out: i) untar http://www.ci.uchicago.edu/~benc/provider-ln-20080414.tar.gz into cog/modules/ ii) edit cog/modules/vdsk/dependencies.xml to include a new target provider-ln (like the existing karajan, provider-localscheduler and provider-dcache targets). iii) ant redist in vdsk/ iv) set your sites file to refer to provider-ln, like this: /var/tmp v) fire! I've tested this on my laptop. I haven't tested it on GPFS. I deliberately use hard links rather than symlinks here: i) when hard linking, the new link is a first order reference to the file, just like the original. deleting the original link does not delete the file. this is important for stageout - the output file needs to stay on the file system, not be deleted with the site working directory. ii) symlinks require access to the original directory, whilst hardlinks go straight to the inode without indirecting via the original directory. this is probably important for GPFS scalability - it means there is one less filesystem object to interact with when opening the file. -- From wilde at mcs.anl.gov Tue Apr 15 00:27:25 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 15 Apr 2008 00:27:25 -0500 Subject: [Swift-devel] hardlinks instead of copies on local file systems In-Reply-To: References: Message-ID: <48043CBD.5080004@mcs.anl.gov> Excellent! Hope to try later in the week. - Mike On 4/14/08 6:21 PM, Ben Clifford wrote: > I hacked up a quick provider which uses unix hard links instead of copying > in order to transfer files. This is a dirty hack to see if it has any > performance improvements of copying, and lacks error handling. Most > notably, Swift will fail in strange ways when: i) an output file already > exists (other providers tend to overwrite) and ii) when the input data > file is on a different file system (so hard links cannot work) to the site > shared working directory. > > To try this out: > > i) untar http://www.ci.uchicago.edu/~benc/provider-ln-20080414.tar.gz into > cog/modules/ > > ii) edit cog/modules/vdsk/dependencies.xml to include a new target > provider-ln (like the existing karajan, provider-localscheduler and > provider-dcache targets). > > iii) ant redist in vdsk/ > > iv) set your sites file to refer to provider-ln, like this: > > > > > /var/tmp > > > v) fire! 
> > I've tested this on my laptop. I haven't tested it on GPFS. > > I deliberately use hard links rather than symlinks here: > > i) when hard linking, the new link is a first order reference to the > file, just like the original. deleting the original link does not delete > the file. this is important for stageout - the output file needs to stay > on the file system, not be deleted with the site working directory. > > ii) symlinks require access to the original directory, whilst hardlinks > go straight to the inode without indirecting via the original directory. > this is probably important for GPFS scalability - it means there is one > less filesystem object to interact with when opening the file. > From duxu at mcs.anl.gov Tue Apr 15 08:47:48 2008 From: duxu at mcs.anl.gov (Xu Du) Date: Tue, 15 Apr 2008 08:47:48 -0500 Subject: [Swift-devel] SWIFT INNOVATION FOR BOINC: prototype has been worked out Message-ID: <001001c89eff$59890460$9a01a8c0@karen> Dear Mike, By hard working of last week, a prototype has been worked out, of course, we will continue to improve it. The following is the last weekly report. Any suggestion and comment are welcome. Regards, Xu -------------------------------------------------------------------------------------- Weekly Report Mar.7-Apr.13 Done: 1. A prototype has been worked out. Up to now, applications can be dispatched from swift to BOINC normally, and the results can also be returned correctly after the jobs are computed by BOINC. Issues 2. A lot parameters required by BOINC are not added at this moment, such as deadline, and resource consuming specification, etc. It seems that Swift is not very open to add additional parameters. To Do: 1. Test and debug the system; 2. Update the design document; 3. Draft a document about how to write and modify providers. From benc at hawaga.org.uk Wed Apr 16 14:42:39 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 Apr 2008 19:42:39 +0000 (GMT) Subject: [Swift-devel] Swift 0.5 released. Message-ID: Swift 0.5 is now available for download from http://www.ci.uchicago.edu/swift/packages/vdsk-0.5.tar.gz This is intended to address a number of bugs that were present in 0.4, most notably data channel reuse in GridFTP and a number of problems with recent compiler enhancements. For more information about Swift, visit http://www.ci.uchicago.edu/swift/ -- From skenny at uchicago.edu Thu Apr 17 11:17:20 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Thu, 17 Apr 2008 11:17:20 -0500 (CDT) Subject: [Swift-devel] cleanup.sh Message-ID: <20080417111720.BDP20122@m4500-02.uchicago.edu> hey kids, i've got a simple little script for cleaning up my project directory after a run and also committing log files to ben's repository...mike i think you mentioned wanting to include such a thing with swift (?) does anyone want this? if so where should i put it? sarah From mikekubal at yahoo.com Thu Apr 17 11:56:47 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Thu, 17 Apr 2008 09:56:47 -0700 (PDT) Subject: [Swift-devel] syntax for mapping files on remote resource Message-ID: <98521.23636.qm@web52312.mail.re2.yahoo.com> Hypothetically, what would the Swift syntax be if I wanted to map all the files in a directory on the remote resource, say /cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like, into file fls[] ? Thanks, Mike ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. 
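A quick shell illustration of points i) and ii): after the original name is removed (as happens when the site working directory is cleaned up), a hard link still reaches the data because it is a second name for the same inode, while a symlink is left dangling. Paths here are throwaway placeholders.

    #!/bin/bash
    set -e
    work=$(mktemp -d)                 # stands in for the site working directory
    cd "$work"
    mkdir -p shared outdir

    echo "output data" > shared/result.dat

    ln    shared/result.dat           outdir/result.hard   # hard link: same inode
    ln -s "$work/shared/result.dat"   outdir/result.sym    # symlink: just a path

    rm -rf shared                     # simulate deleting the work directory

    cat outdir/result.hard            # still prints "output data"
    cat outdir/result.sym || echo "dangling symlink"       # target path is gone

    ls -i outdir                      # result.hard still carries the file's inode

Both names must live on the same filesystem for the hard link to be possible, which is the failure mode mentioned at the start of the provider-ln message when input data and the site work directory are on different filesystems.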
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From benc at hawaga.org.uk Thu Apr 17 12:54:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 17 Apr 2008 17:54:30 +0000 (GMT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <98521.23636.qm@web52312.mail.re2.yahoo.com> References: <98521.23636.qm@web52312.mail.re2.yahoo.com> Message-ID: > Hypothetically, what would the Swift syntax be if I > wanted to map all the files in a directory on the > remote resource, say > /cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like, > into file fls[] ? try something like: file fls[] I think something like that should work, though I haven't tried it out as I'm working on not-Swift this week. -- From wilde at mcs.anl.gov Thu Apr 17 13:31:33 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 17 Apr 2008 13:31:33 -0500 Subject: [Swift-devel] Re: cleanup.sh In-Reply-To: <20080417111720.BDP20122@m4500-02.uchicago.edu> References: <20080417111720.BDP20122@m4500-02.uchicago.edu> Message-ID: <48079785.900@mcs.anl.gov> Hi Sarah, Yes, everyone wants this! For the moment, make a SwiftTools directory in the Swift SVN at the same level as the main directory, just like Ben's log tools. As soon as the tools are "production ready" we should move them to swift/bin. Keeping them in SwiftTools while theyre young will help us remind users that they are preliminary. We should discuss how to document commands. For the moment, having the command emit a man-page-like --help note would be a good start. - Mike ps. I'll send you my old version of this function for you to peruse for additional things to capture. On 4/17/08 11:17 AM, skenny at uchicago.edu wrote: > hey kids, i've got a simple little script for cleaning > up my project directory after a run and also committing log > files to ben's repository...mike i think you mentioned wanting > to include such a thing with swift (?) > > does anyone want this? if so where should i put it? > > sarah > > From mikekubal at yahoo.com Thu Apr 17 13:51:38 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Thu, 17 Apr 2008 11:51:38 -0700 (PDT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: Message-ID: <430068.70879.qm@web52306.mail.re2.yahoo.com> Hi Ben, Please take a look at the Test_FRED logs in ~mkubal/Swift_for_LigandAtlas on wiggum. I'm attempting to run on NCSA's Abe. I tried different variations on the syntax: file fls[]; I get the error: java.lang.RuntimeException: org.globus.cog.abstraction.impl.file.FileResourceException: Could not get list of files in cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500 from server Caused by: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 451 refusing to store with active mode org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) Thanks, Mike --- Ben Clifford wrote: > > > Hypothetically, what would the Swift syntax be if > I > > wanted to map all the files in a directory on the > > remote resource, say > > > /cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like, > > into file fls[] ? 
> > try something like: > > file fls[] > > > I think something like that should work, though I > haven't tried it out as > I'm working on not-Swift this week. > > -- > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From benc at hawaga.org.uk Thu Apr 17 13:55:10 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 17 Apr 2008 18:55:10 +0000 (GMT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <430068.70879.qm@web52306.mail.re2.yahoo.com> References: <430068.70879.qm@web52306.mail.re2.yahoo.com> Message-ID: > setPassive() must match store() and setActive() - > retrieve() (error code 2) I've seen errors like that with some problems that were fixed with data channel reuse for file transfers recently (after 0.4 and before 0.5). How recent/old is your Swift install? (paste the version line from the start of a run) -- From mikekubal at yahoo.com Thu Apr 17 14:04:39 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Thu, 17 Apr 2008 12:04:39 -0700 (PDT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: Message-ID: <929699.91539.qm@web52307.mail.re2.yahoo.com> Swift svn swift-r1771 cog-r1936 --- Ben Clifford wrote: > > > setPassive() must match store() and setActive() - > > > retrieve() (error code 2) > > I've seen errors like that with some problems that > were fixed with data > channel reuse for file transfers recently (after 0.4 > and before 0.5). > > How recent/old is your Swift install? (paste the > version line from the > start of a run) > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From benc at hawaga.org.uk Thu Apr 17 14:51:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 17 Apr 2008 19:51:19 +0000 (GMT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <929699.91539.qm@web52307.mail.re2.yahoo.com> References: <929699.91539.qm@web52307.mail.re2.yahoo.com> Message-ID: On Thu, 17 Apr 2008, Mike Kubal wrote: > Swift svn swift-r1771 cog-r1936 ok. You need cog at least r1956 to get the fixes in data channel reuse that stopped this error message in other situations. Can you get the latest swift and cog SVNs and see if those fix this. -- From wilde at mcs.anl.gov Thu Apr 17 14:53:46 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 17 Apr 2008 14:53:46 -0500 Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: References: <929699.91539.qm@web52307.mail.re2.yahoo.com> Message-ID: <4807AACA.7030105@mcs.anl.gov> Mike, Ive been tied up, but I'll get this installed on Abe etc - tomorrow i hope. But feel free to push on ahead of me. Mike On 4/17/08 2:51 PM, Ben Clifford wrote: > On Thu, 17 Apr 2008, Mike Kubal wrote: > >> Swift svn swift-r1771 cog-r1936 > > ok. You need cog at least r1956 to get the fixes in data channel reuse > that stopped this error message in other situations. Can you get the > latest swift and cog SVNs and see if those fix this. 
> From mikekubal at yahoo.com Thu Apr 17 15:44:53 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Thu, 17 Apr 2008 13:44:53 -0700 (PDT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: Message-ID: <583991.50629.qm@web52312.mail.re2.yahoo.com> I updated to Swift svn swift-r1791 cog-r1962 but I'm still getting the java.lang.RuntimeException: org.globus.cog.abstraction.impl.file.FileResourceException: Could not get list of files in cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500 from server Caused by: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed. : System error in stat: No such file or directory Thanks, Mike --- Ben Clifford wrote: > > On Thu, 17 Apr 2008, Mike Kubal wrote: > > > Swift svn swift-r1771 cog-r1936 > > ok. You need cog at least r1956 to get the fixes in > data channel reuse > that stopped this error message in other situations. > Can you get the > latest swift and cog SVNs and see if those fix this. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From hategan at mcs.anl.gov Thu Apr 17 15:48:29 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 17 Apr 2008 15:48:29 -0500 Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <583991.50629.qm@web52312.mail.re2.yahoo.com> References: <583991.50629.qm@web52312.mail.re2.yahoo.com> Message-ID: <1208465309.9676.12.camel@localhost> Right. You'll need a few more slashes after the host name: grid-abe....org//cfs/... or maybe even grid-abe....org///cfs/.... Mihael On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal wrote: > I updated to Swift svn swift-r1791 cog-r1962 > > but I'm still getting the > > java.lang.RuntimeException: > org.globus.cog.abstraction.impl.file.FileResourceException: > Could not get list of files in > cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500 > from server > Caused by: > Server refused performing the request. Custom > message: (error code 1) [Nested exception message: > Custom message: Unexpected reply: 500-Command failed. > : System error in stat: No such file or directory > > Thanks, > > Mike > --- Ben Clifford wrote: > > > > > On Thu, 17 Apr 2008, Mike Kubal wrote: > > > > > Swift svn swift-r1771 cog-r1936 > > > > ok. You need cog at least r1956 to get the fixes in > > data channel reuse > > that stopped this error message in other situations. > > Can you get the > > latest swift and cog SVNs and see if those fix this. > > > > -- > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. 
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Thu Apr 17 15:49:22 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 17 Apr 2008 20:49:22 +0000 (GMT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <583991.50629.qm@web52312.mail.re2.yahoo.com> References: <583991.50629.qm@web52312.mail.re2.yahoo.com> Message-ID: try gsiftp://hostname//cfs/scratch/wherever with *two* / after the hostname and before cfs. On Thu, 17 Apr 2008, Mike Kubal wrote: > I updated to Swift svn swift-r1791 cog-r1962 > > but I'm still getting the > > java.lang.RuntimeException: > org.globus.cog.abstraction.impl.file.FileResourceException: > Could not get list of files in > cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500 > from server > Caused by: > Server refused performing the request. Custom > message: (error code 1) [Nested exception message: > Custom message: Unexpected reply: 500-Command failed. > : System error in stat: No such file or directory > > Thanks, > > Mike > --- Ben Clifford wrote: > > > > > On Thu, 17 Apr 2008, Mike Kubal wrote: > > > > > Swift svn swift-r1771 cog-r1936 > > > > ok. You need cog at least r1956 to get the fixes in > > data channel reuse > > that stopped this error message in other situations. > > Can you get the > > latest swift and cog SVNs and see if those fix this. > > > > -- > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > From mikekubal at yahoo.com Thu Apr 17 15:58:16 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Thu, 17 Apr 2008 13:58:16 -0700 (PDT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <1208465309.9676.12.camel@localhost> Message-ID: <699514.86613.qm@web52309.mail.re2.yahoo.com> I had tried //, so I tried /// but no luck, but at least a different error with the latest cog and swift: RunID: 20080417-1555-t9mx93ad Execution failed: java.lang.RuntimeException: java.lang.NullPointerException Caused by: java.lang.NullPointerException .... --- Mihael Hategan wrote: > Right. You'll need a few more slashes after the host > name: > grid-abe....org//cfs/... > > or maybe even > > grid-abe....org///cfs/.... > > Mihael > > On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal wrote: > > I updated to Swift svn swift-r1791 cog-r1962 > > > > but I'm still getting the > > > > java.lang.RuntimeException: > > > org.globus.cog.abstraction.impl.file.FileResourceException: > > Could not get list of files in > > > cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500 > > from server > > Caused by: > > Server refused performing the request. > Custom > > message: (error code 1) [Nested exception > message: > > Custom message: Unexpected reply: 500-Command > failed.
> > : System error in stat: No such file or directory > > > > Thanks, > > > > Mike > > --- Ben Clifford wrote: > > > > > > > > On Thu, 17 Apr 2008, Mike Kubal wrote: > > > > > > > Swift svn swift-r1771 cog-r1936 > > > > > > ok. You need cog at least r1956 to get the fixes > in > > > data channel reuse > > > that stopped this error message in other > situations. > > > Can you get the > > > latest swift and cog SVNs and see if those fix > this. > > > > > > -- > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Be a better friend, newshound, and > > know-it-all with Yahoo! Mobile. Try it now. > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From benc at hawaga.org.uk Thu Apr 17 15:53:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 17 Apr 2008 20:53:23 +0000 (GMT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <583991.50629.qm@web52312.mail.re2.yahoo.com> References: <583991.50629.qm@web52312.mail.re2.yahoo.com> Message-ID: > but I'm still getting the different error, btw - before it was gridftp error code 451; now its 500. not sure which is better ;) -- From hategan at mcs.anl.gov Thu Apr 17 16:08:24 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 17 Apr 2008 16:08:24 -0500 Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <699514.86613.qm@web52309.mail.re2.yahoo.com> References: <699514.86613.qm@web52309.mail.re2.yahoo.com> Message-ID: <1208466504.10990.0.camel@localhost> The log file may have a full stack trace. On Thu, 2008-04-17 at 13:58 -0700, Mike Kubal wrote: > I had tried //, so I tired /// but no luck, but at > least a different error with the latest cog and swift: > > RunID: 20080417-1555-t9mx93ad > Execution failed: > java.lang.RuntimeException: > java.lang.NullPointerException > Caused by: > java.lang.NullPointerException .... > > > --- Mihael Hategan wrote: > > > Right. You'll need a few more slashes after the host > > name: > > grid-abe....org//cfs/... > > > > or maybe even > > > > grid-abe....org///cfs/.... > > > > Mihael > > > > On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal wrote: > > > I updated to Swift svn swift-r1791 cog-r1962 > > > > > > but I'm still getting the > > > > > > java.lang.RuntimeException: > > > > > > org.globus.cog.abstraction.impl.file.FileResourceException: > > > Could not get list of files in > > > > > > cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500 > > > from server > > > Caused by: > > > Server refused performing the request. > > Custom > > > message: (error code 1) [Nested exception > > message: > > > Custom message: Unexpected reply: 500-Command > > failed. 
> > > : System error in stat: No such file or directory > > > > > > Thanks, > > > > > > Mike > > > --- Ben Clifford wrote: > > > > > > > > > > > On Thu, 17 Apr 2008, Mike Kubal wrote: > > > > > > > > > Swift svn swift-r1771 cog-r1936 > > > > > > > > ok. You need cog at least r1956 to get the fixes > > in > > > > data channel reuse > > > > that stopped this error message in other > > situations. > > > > Can you get the > > > > latest swift and cog SVNs and see if those fix > > this. > > > > > > > > -- > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Be a better friend, newshound, and > > > know-it-all with Yahoo! Mobile. Try it now. > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > From mikekubal at yahoo.com Thu Apr 17 16:08:54 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Thu, 17 Apr 2008 14:08:54 -0700 (PDT) Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <699514.86613.qm@web52309.mail.re2.yahoo.com> Message-ID: <635441.69220.qm@web52312.mail.re2.yahoo.com> It may be something with Abe's ftp server? --- Mike Kubal wrote: > I had tried //, so I tired /// but no luck, but at > least a different error with the latest cog and > swift: > > RunID: 20080417-1555-t9mx93ad > Execution failed: > java.lang.RuntimeException: > java.lang.NullPointerException > Caused by: > java.lang.NullPointerException .... > > > --- Mihael Hategan wrote: > > > Right. You'll need a few more slashes after the > host > > name: > > grid-abe....org//cfs/... > > > > or maybe even > > > > grid-abe....org///cfs/.... > > > > Mihael > > > > On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal > wrote: > > > I updated to Swift svn swift-r1791 cog-r1962 > > > > > > but I'm still getting the > > > > > > java.lang.RuntimeException: > > > > > > org.globus.cog.abstraction.impl.file.FileResourceException: > > > Could not get list of files in > > > > > > cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500 > > > from server > > > Caused by: > > > Server refused performing the request. > > Custom > > > message: (error code 1) [Nested exception > > message: > > > Custom message: Unexpected reply: 500-Command > > failed. > > > : System error in stat: No such file or > directory > > > > > > Thanks, > > > > > > Mike > > > --- Ben Clifford wrote: > > > > > > > > > > > On Thu, 17 Apr 2008, Mike Kubal wrote: > > > > > > > > > Swift svn swift-r1771 cog-r1936 > > > > > > > > ok. You need cog at least r1956 to get the > fixes > > in > > > > data channel reuse > > > > that stopped this error message in other > > situations. > > > > Can you get the > > > > latest swift and cog SVNs and see if those fix > > this. 
> > > > > > > > -- > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Be a better friend, newshound, and > > > know-it-all with Yahoo! Mobile. Try it now. > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From hategan at mcs.anl.gov Thu Apr 17 16:14:46 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 17 Apr 2008 16:14:46 -0500 Subject: [Swift-devel] syntax for mapping files on remote resource In-Reply-To: <635441.69220.qm@web52312.mail.re2.yahoo.com> References: <635441.69220.qm@web52312.mail.re2.yahoo.com> Message-ID: <1208466886.10990.10.camel@localhost> Perhaps, but swift isn't right either in throwing a null pointer exception. On Thu, 2008-04-17 at 14:08 -0700, Mike Kubal wrote: > It may be something with Abe's ftp server? > > --- Mike Kubal wrote: > > > I had tried //, so I tired /// but no luck, but at > > least a different error with the latest cog and > > swift: > > > > RunID: 20080417-1555-t9mx93ad > > Execution failed: > > java.lang.RuntimeException: > > java.lang.NullPointerException > > Caused by: > > java.lang.NullPointerException .... > > > > > > --- Mihael Hategan wrote: > > > > > Right. You'll need a few more slashes after the > > host > > > name: > > > grid-abe....org//cfs/... > > > > > > or maybe even > > > > > > grid-abe....org///cfs/.... > > > > > > Mihael > > > > > > On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal > > wrote: > > > > I updated to Swift svn swift-r1791 cog-r1962 > > > > > > > > but I'm still getting the > > > > > > > > java.lang.RuntimeException: > > > > > > > > > > org.globus.cog.abstraction.impl.file.FileResourceException: > > > > Could not get list of files in > > > > > > > > > > cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500 > > > > from server > > > > Caused by: > > > > Server refused performing the request. > > > Custom > > > > message: (error code 1) [Nested exception > > > message: > > > > Custom message: Unexpected reply: 500-Command > > > failed. > > > > : System error in stat: No such file or > > directory > > > > > > > > Thanks, > > > > > > > > Mike > > > > --- Ben Clifford wrote: > > > > > > > > > > > > > > On Thu, 17 Apr 2008, Mike Kubal wrote: > > > > > > > > > > > Swift svn swift-r1771 cog-r1936 > > > > > > > > > > ok. 
You need cog at least r1956 to get the > > fixes > > > in > > > > > data channel reuse > > > > > that stopped this error message in other > > > situations. > > > > > Can you get the > > > > > latest swift and cog SVNs and see if those fix > > > this. > > > > > > > > > > -- > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Be a better friend, newshound, and > > > > know-it-all with Yahoo! Mobile. Try it now. > > > > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Be a better friend, newshound, and > > know-it-all with Yahoo! Mobile. Try it now. > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > From hategan at mcs.anl.gov Fri Apr 18 15:57:16 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 18 Apr 2008 15:57:16 -0500 Subject: [Swift-devel] assignment Message-ID: <1208552236.5064.6.camel@localhost> We have to define what it means to make a mapped-var to mapped-var assignment. And it should probably be a file copy. Mihael From benc at hawaga.org.uk Fri Apr 18 16:24:49 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 18 Apr 2008 21:24:49 +0000 (GMT) Subject: [Swift-devel] assignment In-Reply-To: <1208552236.5064.6.camel@localhost> References: <1208552236.5064.6.camel@localhost> Message-ID: On Fri, 18 Apr 2008, Mihael Hategan wrote: > We have to define what it means to make a mapped-var to mapped-var > assignment. And it should probably be a file copy. yes. -- From benc at hawaga.org.uk Sat Apr 19 08:01:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 19 Apr 2008 13:01:46 +0000 (GMT) Subject: [Swift-devel] r1793 restoring of sites.xml LRC Message-ID: r1793 puts back in the sites.xml element. I removed that deliberately as a config option that does nothing, has done nothing, and likely will not do anything either ever or for a long time. Its displeasing to my sense of user interface aesthetics to have configuration options that (deliberately) do nothing; they take up documentation space (should anyone bother documenting them); lead to user confusion when a user experiments with changing it to no effect; they lead to cruft buildup in the code and in configuration files. If this config option is going to stay, then I think it should at least print a warning indicating that it is ignored when specified rather than silently being ignored. 
-- From hategan at mcs.anl.gov Sat Apr 19 08:50:55 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 19 Apr 2008 08:50:55 -0500 Subject: [Swift-devel] Re: r1793 restoring of sites.xml LRC In-Reply-To: References: Message-ID: <1208613055.9907.1.camel@localhost> I forgot we had it before. The idea was to allow old sites.xml files to be used. I'll remove it. On Sat, 2008-04-19 at 13:01 +0000, Ben Clifford wrote: > r1793 puts back in the sites.xml element. I removed that deliberately as a > config option that does nothing, has done nothing, and likely will not do > anything either ever or for a long time. > > Its displeasing to my sense of user interface aesthetics to have > configuration options that (deliberately) do nothing; they take up > documentation space (should anyone bother documenting them); lead to user > confusion when a user experiments with changing it to no effect; they lead > to cruft buildup in the code and in configuration files. > > If this config option is going to stay, then I think it should at least > print a warning indicating that it is ignored when specified rather than > silently being ignored. > From benc at hawaga.org.uk Sun Apr 20 12:05:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 20 Apr 2008 17:05:45 +0000 (GMT) Subject: [Swift-devel] CLASSPATH construction order Message-ID: The bin/swift wrapper currently constructs a classpath for swift automatically, as whatever is already on the classpath in the current environment followed by all of the swift classes. This order seems to cause more problems than it solves - specifically, when there are overlapping classes specified in the environment (which I have seen with falkon and pegasus users, and potentially is also a problem for people with the Globus Toolkit installed). I think it would be better to construct the classpath the other way round. This would remove the ability for people to override internal Swift classes by presetting classpath, which, in the above cases, happens accidentally and produces obscure errors. If there is a desire to still be able to override swift classes from the environment, which I think there is not, then another Swift specific environment variable should be used for the prefix. -- From hategan at mcs.anl.gov Mon Apr 21 22:12:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Apr 2008 22:12:54 -0500 Subject: [Swift-devel] swift 0.5 and gt2 Message-ID: <1208833974.9368.1.camel@localhost> It may be that if you submit jobs through gt2 with swift 0.5 they may be run with fork even though other job managers are specified. This needs to be checked, but that's what my code shows. From hategan at mcs.anl.gov Mon Apr 21 22:22:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Apr 2008 22:22:31 -0500 Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: <1208833974.9368.1.camel@localhost> References: <1208833974.9368.1.camel@localhost> Message-ID: <1208834551.9368.3.camel@localhost> This is fixed in cog r1964. On Mon, 2008-04-21 at 22:12 -0500, Mihael Hategan wrote: > It may be that if you submit jobs through gt2 with swift 0.5 they may be > run with fork even though other job managers are specified. > > This needs to be checked, but that's what my code shows. 
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Apr 21 22:54:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 03:54:07 +0000 (GMT) Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: <1208833974.9368.1.camel@localhost> References: <1208833974.9368.1.camel@localhost> Message-ID: On Mon, 21 Apr 2008, Mihael Hategan wrote: > It may be that if you submit jobs through gt2 with swift 0.5 they may be > run with fork even though other job managers are specified. > > This needs to be checked, but that's what my code shows. That's contrary to what the per-site testin in tests/sites/ shows. I just checked again with the 0.5 tarball using tests/sites/tgtacc-lsf-gram2.xml and I see jobs going into LSF there. Did you use a jobmanager specification syntax that differs from the syntax used in that file? If so, what? -- From hategan at mcs.anl.gov Tue Apr 22 07:49:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 07:49:36 -0500 Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: References: <1208833974.9368.1.camel@localhost> Message-ID: <1208868576.10512.3.camel@localhost> On Tue, 2008-04-22 at 03:54 +0000, Ben Clifford wrote: > > On Mon, 21 Apr 2008, Mihael Hategan wrote: > > > It may be that if you submit jobs through gt2 with swift 0.5 they may be > > run with fork even though other job managers are specified. > > > > This needs to be checked, but that's what my code shows. > > That's contrary to what the per-site testin in tests/sites/ shows. I just > checked again with the 0.5 tarball using tests/sites/tgtacc-lsf-gram2.xml > and I see jobs going into LSF there. > > Did you use a jobmanager specification syntax that differs from the syntax > used in that file? If so, what? The gt2 provider was using undocumented stuff to read the job manager from a description. And that undocumented stuff has changed. Can it be that LSF is the default there? > From benc at hawaga.org.uk Tue Apr 22 08:16:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 13:16:51 +0000 (GMT) Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: <1208868576.10512.3.camel@localhost> References: <1208833974.9368.1.camel@localhost> <1208868576.10512.3.camel@localhost> Message-ID: On Tue, 22 Apr 2008, Mihael Hategan wrote: > The gt2 provider was using undocumented stuff to read the job manager > from a description. And that undocumented stuff has changed. Can it be > that LSF is the default there? If I use jobmanager-fii instead of jobmanager-lsf, I get this: Caused by: Cannot submit job Caused by: The gatekeeper failed to find the requested service -- From hategan at mcs.anl.gov Tue Apr 22 08:23:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 08:23:31 -0500 Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: References: <1208833974.9368.1.camel@localhost> <1208868576.10512.3.camel@localhost> Message-ID: <1208870611.11068.0.camel@localhost> But not in the url. I'm talking about using the jobManager attribute to or . If you put it in the url, it's fine. On Tue, 2008-04-22 at 13:16 +0000, Ben Clifford wrote: > On Tue, 22 Apr 2008, Mihael Hategan wrote: > > > The gt2 provider was using undocumented stuff to read the job manager > > from a description. And that undocumented stuff has changed. Can it be > > that LSF is the default there? 
> > If I use jobmanager-fii instead of jobmanager-lsf, I get this: > > Caused by: > Cannot submit job > Caused by: > The gatekeeper failed to find the requested service > > From hategan at mcs.anl.gov Tue Apr 22 08:37:55 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 08:37:55 -0500 Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: <1208870611.11068.0.camel@localhost> References: <1208833974.9368.1.camel@localhost> <1208868576.10512.3.camel@localhost> <1208870611.11068.0.camel@localhost> Message-ID: <1208871475.11385.0.camel@localhost> Maybe we should keep a "known issues" file for each version. On Tue, 2008-04-22 at 08:23 -0500, Mihael Hategan wrote: > But not in the url. I'm talking about using the jobManager attribute to > or . If you put it in the url, it's fine. > > On Tue, 2008-04-22 at 13:16 +0000, Ben Clifford wrote: > > On Tue, 22 Apr 2008, Mihael Hategan wrote: > > > > > The gt2 provider was using undocumented stuff to read the job manager > > > from a description. And that undocumented stuff has changed. Can it be > > > that LSF is the default there? > > > > If I use jobmanager-fii instead of jobmanager-lsf, I get this: > > > > Caused by: > > Cannot submit job > > Caused by: > > The gatekeeper failed to find the requested service > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Apr 22 09:12:22 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 14:12:22 +0000 (GMT) Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: <1208870611.11068.0.camel@localhost> References: <1208833974.9368.1.camel@localhost> <1208868576.10512.3.camel@localhost> <1208870611.11068.0.camel@localhost> Message-ID: On Tue, 22 Apr 2008, Mihael Hategan wrote: > But not in the url. I'm talking about using the jobManager attribute to > or . If you put it in the url, it's fine. ok. That is what I was trying to ascertain. -- From hategan at mcs.anl.gov Tue Apr 22 09:15:39 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 09:15:39 -0500 Subject: [Swift-devel] coasters Message-ID: <1208873739.12384.7.camel@localhost> I committed a preliminary coaster code to SVN. It's called provider-coaster, and it works pretty much like any other provider, with the following notes: The job manager is made of 2 or 3 parts: :[:]. So if you want to, say, start the service on teraport using gt2, then start workers using gt4 on PBS, you'd say: jobManager="gt2:gt4:pbs". Or if you wanted to start the service with ssh and then use the local (to the service) pbs provider to start workers, it would be jobManager="ssh:pbs". It's missing a bunch of things. One of them is that the service, once started, won't shut down by itself, so you should log into the machine you started it on, and kill it. Another is a better strategy for allocating workers than "as many as there are jobs". And so on... org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main() contains some code to test this. Please be careful if running on a cluster: don't submit too many jobs. 
From benc at hawaga.org.uk Tue Apr 22 09:35:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 14:35:17 +0000 (GMT) Subject: [Swift-devel] coasters In-Reply-To: <1208873739.12384.7.camel@localhost> References: <1208873739.12384.7.camel@localhost> Message-ID: I tried this with swift, by adding a dependency on provider-coaster and setting: This is on my os x laptop on UC wireless. examples/first.swift fails like this: Execution failed: Exception in echo: Arguments: [Hello, world!] Host: localhost Directory: first-20080422-0932-v5v2jxa7/jobs/p/echo-p9ytklri stderr.txt: stdout.txt: ---- Caused by: Could not submit job Caused by: Failed to start channel GSSC-https://:1984 Caused by: port out of range:-1 -- From hategan at mcs.anl.gov Tue Apr 22 10:23:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 10:23:52 -0500 Subject: [Swift-devel] coasters In-Reply-To: References: <1208873739.12384.7.camel@localhost> Message-ID: <1208877832.13505.3.camel@localhost> You need a valid IP in your cog.properties. On Tue, 2008-04-22 at 14:35 +0000, Ben Clifford wrote: > I tried this with swift, by adding a dependency on provider-coaster and > setting: > > > > This is on my os x laptop on UC wireless. > > examples/first.swift fails like this: > > Execution failed: > Exception in echo: > Arguments: [Hello, world!] > Host: localhost > Directory: first-20080422-0932-v5v2jxa7/jobs/p/echo-p9ytklri > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Could not submit job > Caused by: > Failed to start channel GSSC-https://:1984 > Caused by: > port out of range:-1 > > From benc at hawaga.org.uk Tue Apr 22 10:41:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 15:41:52 +0000 (GMT) Subject: [Swift-devel] coasters In-Reply-To: <1208877832.13505.3.camel@localhost> References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> Message-ID: On Tue, 22 Apr 2008, Mihael Hategan wrote: > You need a valid IP in your cog.properties. Is there a way for me to set that in a way that doesn't involve fiddling with files outside of my install/run tree? (neither GLOBUS_HOSTNAME env or -ip.addr swift command line parameter changes the error) -- From hategan at mcs.anl.gov Tue Apr 22 10:46:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 10:46:14 -0500 Subject: [Swift-devel] coasters In-Reply-To: References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> Message-ID: <1208879174.13851.2.camel@localhost> GLOBUS_HOSTNAME should work, so it might be another problem. Send log file. Also, try invoking the integrated test (org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main()). On Tue, 2008-04-22 at 15:41 +0000, Ben Clifford wrote: > > On Tue, 22 Apr 2008, Mihael Hategan wrote: > > > You need a valid IP in your cog.properties. > > Is there a way for me to set that in a way that doesn't involve fiddling > with files outside of my install/run tree? 
> > (neither GLOBUS_HOSTNAME env or -ip.addr swift command line parameter > changes the error) > From benc at hawaga.org.uk Tue Apr 22 10:58:26 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 15:58:26 +0000 (GMT) Subject: [Swift-devel] coasters In-Reply-To: <1208879174.13851.2.camel@localhost> References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> <1208879174.13851.2.camel@localhost> Message-ID: On Tue, 22 Apr 2008, Mihael Hategan wrote: > Also, try invoking the integrated test > (org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main()). I transplated that class name into the swift wrapper script to get the environmental setup instead of running Loader: $ diff swift coaster-test 3c3 < EXEC=org.griphyn.vdl.karajan.Loader --- > EXEC=org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler and I get this error: $ ./coaster-test Started local service: 128.135.199.187:50000 org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:81) at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submitTask(JobSubmissionTaskHandler.java:229) at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main(JobSubmissionTaskHandler.java:238) Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:80) at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:68) ... 2 more Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not find bootstrap script in classpath at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:163) at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:102) at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:74) ... 3 more Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not find bootstrap script in classpath at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.loadBootstrapScript(ServiceManager.java:175) at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:143) ... 5 more -- From hategan at mcs.anl.gov Tue Apr 22 11:01:20 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 11:01:20 -0500 Subject: [Swift-devel] coasters In-Reply-To: References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> <1208879174.13851.2.camel@localhost> Message-ID: <1208880080.14131.0.camel@localhost> Hmm. Right. Copy libexec/bootstrap.sh to resources/ On Tue, 2008-04-22 at 15:58 +0000, Ben Clifford wrote: > On Tue, 22 Apr 2008, Mihael Hategan wrote: > > > Also, try invoking the integrated test > > (org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main()). 
> > I transplated that class name into the swift wrapper script to get the > environmental setup instead of running Loader: > > $ diff swift coaster-test > 3c3 > < EXEC=org.griphyn.vdl.karajan.Loader > --- > > > EXEC=org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler > > and I get this error: > > $ ./coaster-test > Started local service: 128.135.199.187:50000 > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could > not submit job > at > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:81) > at > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submitTask(JobSubmissionTaskHandler.java:229) > at > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main(JobSubmissionTaskHandler.java:238) > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could > not start coaster service > at > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:80) > at > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:68) > ... 2 more > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could > not find bootstrap script in classpath > at > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:163) > at > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:102) > at > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:74) > ... 3 more > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could > not find bootstrap script in classpath > at > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.loadBootstrapScript(ServiceManager.java:175) > at > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:143) > ... 5 more > From hategan at mcs.anl.gov Tue Apr 22 11:04:08 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 11:04:08 -0500 Subject: [Swift-devel] coasters In-Reply-To: <1208880080.14131.0.camel@localhost> References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> <1208879174.13851.2.camel@localhost> <1208880080.14131.0.camel@localhost> Message-ID: <1208880248.14131.2.camel@localhost> Or update to 1970 and re-compile. On Tue, 2008-04-22 at 11:01 -0500, Mihael Hategan wrote: > Hmm. Right. Copy libexec/bootstrap.sh to resources/ > > On Tue, 2008-04-22 at 15:58 +0000, Ben Clifford wrote: > > On Tue, 22 Apr 2008, Mihael Hategan wrote: > > > > > Also, try invoking the integrated test > > > (org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main()). 
> > > > I transplated that class name into the swift wrapper script to get the > > environmental setup instead of running Loader: > > > > $ diff swift coaster-test > > 3c3 > > < EXEC=org.griphyn.vdl.karajan.Loader > > --- > > > > > EXEC=org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler > > > > and I get this error: > > > > $ ./coaster-test > > Started local service: 128.135.199.187:50000 > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could > > not submit job > > at > > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:81) > > at > > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submitTask(JobSubmissionTaskHandler.java:229) > > at > > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main(JobSubmissionTaskHandler.java:238) > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could > > not start coaster service > > at > > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:80) > > at > > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:68) > > ... 2 more > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could > > not find bootstrap script in classpath > > at > > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:163) > > at > > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:102) > > at > > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:74) > > ... 3 more > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could > > not find bootstrap script in classpath > > at > > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.loadBootstrapScript(ServiceManager.java:175) > > at > > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:143) > > ... 5 more > > From benc at hawaga.org.uk Tue Apr 22 11:06:38 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 16:06:38 +0000 (GMT) Subject: [Swift-devel] coasters In-Reply-To: References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> <1208879174.13851.2.camel@localhost> Message-ID: here is a log file for a swift run attempt: http://www.ci.uchicago.edu/~benc/tmp/first-20080422-1104-491mdor6.log -- From benc at hawaga.org.uk Tue Apr 22 11:37:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 16:37:13 +0000 (GMT) Subject: [Swift-devel] coasters In-Reply-To: <1208880248.14131.2.camel@localhost> References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> <1208879174.13851.2.camel@localhost> <1208880080.14131.0.camel@localhost> <1208880248.14131.2.camel@localhost> Message-ID: On Tue, 22 Apr 2008, Mihael Hategan wrote: > Or update to 1970 and re-compile. so now that test class outputs the following, veeery slowly. And then seems to do sit and do nothing. On my laptop it doesn't seem to be using CPU, and on 128.135.125.118, tp-grid1, I see nothing in the PBS queue. 
On tp-grid1, I see these java processes running: /soft/java-1.5.0_06-sun-r1/bin/java -Djava.home=/soft/java-1.5.0_06-sun-r1 -DX509_USER_PROXY=/home/benc/.globus/job/tp-grid1.ci.uchicago.edu/7656.1208881472/x509_up -DGLOBUS_HOSTNAME=tp-grid1.ci.uchicago. edu -jar bootstrap.lP7914 http://128.135.199.187:50001 4da44a90a961d5f9f4965b1a8a2ce85e https://128.135.199.187:50000 357791723 9835 ? Sl 0:04 /soft/java-1.5.0_06-sun-r1/bin/java -DX509_USER_PROXY=/home/benc/.globus/job/tp-grid1.ci.uchicago.edu/7656.1208881472/x509_up -DGLOBUS_HOSTNAME=tp-grid1.ci.uchicago.edu -cp /home/benc/.globus/coasters/cac he/cog-provider-coaster-0.1-1139af49204eed1884ffa46465f9704f.jar:/home/benc/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/benc/.globus/coasters/cache/cog-abstraction-common-2.2-a4301bae7 66fd4d64d0fefabb539e3f7.jar:/home/benc/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/benc/.globus/coasters/cache/cog-karajan-0.36-dev-b695c30b273bc90fb60b5b62764cdfe4.jar:/home/benc/.globu s/coasters/cache/cog-provider-gt2-2.3-cd8fd68d5d520178a507723c4027885b.jar:/home/benc/.globus/coasters/cache/cog-provider-gt4_0_0-2.4-79fe87623a5d5a052d546a1735b09aad.jar:/home/benc/.globus/coasters/cache/cog-provider-local-2.1-9a4 1ac57fae7d518e5ae7fae894c457c.jar:/home/benc/.globus/coasters/cache/cog-provider-localscheduler-0.2-6cf4a8df6e05d1a0547de8a31c2eca7c.jar:/home/benc/.globus/coasters/cache/cog-provider-ssh-2.3-8d82acf0a5048350e7ef89119027890a.jar:/h ome/benc/.globus/coasters/cache/cog-util-0.92-0e560de7e37434887f39f389de6fac57.jar:/home/benc/.globus/coasters/cache/commons-logging-1.1-6b62417e77b000a87de66ee3935edbf5.jar:/home/benc/.globus/coasters/cache/cryptix-asn1-87c4cf848c 81d102bd29e33681b80e8a.jar:/home/benc/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/benc/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/benc/.globus/coasters/cache/j2ssh-comm on-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/benc/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/benc/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/benc/ .globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/benc/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/benc/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.ja r:/home/benc/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar org.globus.cog.abstraction.coaster.service.CoasterService https://128.135.199.187:50000 357791723 Neither of them have any child processes (and one is a child of the other). $ ./coaster-test Started local service: 128.135.199.187:50000 Socket bound. 
URL is http://128.135.199.187:50001 [/128.135.125.118:35262]GET /coaster-bootstrap.jar HTTP/1.0 [/128.135.125.118:35329]GET /list HTTP/1.1 [/128.135.125.118:35334]GET /backport-util-concurrent.jar HTTP/1.1 org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:81) at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submitTask(JobSubmissionTaskHandler.java:229) at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main(JobSubmissionTaskHandler.java:238) Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:80) at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:68) ... 2 more Caused by: java.io.IOException: Timed out waiting for registration for 357791723 at org.globus.cog.abstraction.coaster.service.local.LocalService.waitForRegistration(LocalService.java:71) at org.globus.cog.abstraction.coaster.service.local.LocalService.waitForRegistration(LocalService.java:61) at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:104) at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:74) ... 3 more [/128.135.125.118:35508]GET /cog-abstraction-common-2.2.jar HTTP/1.1 [/128.135.125.118:35578]GET /cog-jglobus-dev-080222.jar HTTP/1.1 [/128.135.125.118:35580]GET /cog-karajan-0.36-dev.jar HTTP/1.1 [/128.135.125.118:35583]GET /cog-provider-coaster-0.1.jar HTTP/1.1 [/128.135.125.118:35584]GET /cog-provider-gt2-2.3.jar HTTP/1.1 [/128.135.125.118:35586]GET /cog-provider-gt4_0_0-2.4.jar HTTP/1.1 [/128.135.125.118:35589]GET /cog-provider-local-2.1.jar HTTP/1.1 [/128.135.125.118:35593]GET /cog-provider-localscheduler-0.2.jar HTTP/1.1 [/128.135.125.118:35595]GET /cog-provider-ssh-2.3.jar HTTP/1.1 [/128.135.125.118:35596]GET /cog-util-0.92.jar HTTP/1.1 [/128.135.125.118:35597]GET /commons-logging-1.1.jar HTTP/1.1 [/128.135.125.118:35598]GET /cryptix-asn1.jar HTTP/1.1 [/128.135.125.118:35599]GET /cryptix.jar HTTP/1.1 [/128.135.125.118:35601]GET /cryptix32.jar HTTP/1.1 [/128.135.125.118:35605]GET /j2ssh-common-0.2.2.jar HTTP/1.1 [/128.135.125.118:35609]GET /j2ssh-core-0.2.2-patched.jar HTTP/1.1 [/128.135.125.118:35610]GET /jaxrpc.jar HTTP/1.1 [/128.135.125.118:35611]GET /jce-jdk13-131.jar HTTP/1.1 [/128.135.125.118:35612]GET /jgss.jar HTTP/1.1 [/128.135.125.118:35613]GET /log4j-1.2.8.jar HTTP/1.1 [/128.135.125.118:35617]GET /puretls.jar HTTP/1.1 nullChannel started Channel id: -2c8466ac:11976f60889:-8000:-7cd08a9f:11976f5f16c:-8000 MetaChannel: 8254578 -> null.bind -> GSSC-null -- From benc at hawaga.org.uk Tue Apr 22 11:41:54 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Apr 2008 16:41:54 +0000 (GMT) Subject: [Swift-devel] coasters In-Reply-To: References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> <1208879174.13851.2.camel@localhost> <1208880080.14131.0.camel@localhost> <1208880248.14131.2.camel@localhost> Message-ID: also with r1970, I get the same 'Failed to start channel GSSC-https://:1984' error. 
-- From hategan at mcs.anl.gov Tue Apr 22 13:13:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Apr 2008 13:13:09 -0500 Subject: [Swift-devel] coasters In-Reply-To: References: <1208873739.12384.7.camel@localhost> <1208877832.13505.3.camel@localhost> <1208879174.13851.2.camel@localhost> <1208880080.14131.0.camel@localhost> <1208880248.14131.2.camel@localhost> Message-ID: <1208887989.17457.0.camel@localhost> 1971 is out. On Tue, 2008-04-22 at 16:41 +0000, Ben Clifford wrote: > also with r1970, I get the same 'Failed to start channel > GSSC-https://:1984' error. From benc at hawaga.org.uk Wed Apr 23 16:15:57 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 23 Apr 2008 21:15:57 +0000 (GMT) Subject: [Swift-devel] making this list no longer require subscription approval Message-ID: This list has traditionally required subscription approval; I'd like to make it so that admins do not have to approve people for subscription any more - it takes time and I don't see that there is any value gained by this. So I'm going to remove the requirement from the config for this list - subscriptions will then act like swift-user subscriptions do now. -- From benc at hawaga.org.uk Thu Apr 24 09:23:24 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 24 Apr 2008 14:23:24 +0000 (GMT) Subject: [Swift-devel] today's coaster report Message-ID: Using cog r1977, the test coaster client will run jobs successfully but then issues a warning: WARN - Failed to shut down service https://127.0.0.1:55731 and a stack trace beginning: org.globus.cog.karajan.workflow.service.ProtocolException: at org.globus.cog.karajan.workflow.service.commands.Command.execute(Command.java:118) at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager$ServiceReaper.run(ServiceManager.java:274) (and indeed, something continues to listen on that port) Also, GLOBUS_HOSTNAME appears to be respected, but GLOBUS_TCP_PORT_RANGE appears to not be in some cases, using a standard cog launcher around the test client - some stuff listens on a port in the port range (eg. the server where the jar files get downloaded from); but whatever service is referred to in: > WARN - Failed to shut down service https://127.0.0.1:55751 doesn't listen on that port range. -- From benc at hawaga.org.uk Thu Apr 24 09:42:33 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 24 Apr 2008 14:42:33 +0000 (GMT) Subject: [Swift-devel] Re: today's coaster report In-Reply-To: References: Message-ID: also, I had to change the path to md5sum that is hard coded in the bowels of the coaster code to point to the appropriate md5sum executable on my machine. I have successfully run a job through swift into the coaster mechanism. hurrah. 
-- From hategan at mcs.anl.gov Thu Apr 24 10:05:23 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 24 Apr 2008 10:05:23 -0500 Subject: [Swift-devel] today's coaster report In-Reply-To: References: Message-ID: <1209049523.18858.2.camel@localhost> On Thu, 2008-04-24 at 14:23 +0000, Ben Clifford wrote: > Using cog r1977, the test coaster client will run jobs successfully but > then issues a warning: > > WARN - Failed to shut down service https://127.0.0.1:55731 > > and a stack trace beginning: > org.globus.cog.karajan.workflow.service.ProtocolException: > at > org.globus.cog.karajan.workflow.service.commands.Command.execute(Command.java:118) > at > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager$ServiceReaper.run(ServiceManager.java:274) > > > > (and indeed, something continues to listen on that port) > > Also, GLOBUS_HOSTNAME appears to be respected, but GLOBUS_TCP_PORT_RANGE > appears to not be in some cases, using a standard cog launcher around the > test client - some stuff listens on a port in the port range (eg. the > server where the jar files get downloaded from); but whatever service is > referred to in: > > > WARN - Failed to shut down service https://127.0.0.1:55751 > > doesn't listen on that port range. Right. That's the remote service. The port range does not apply for it. > From benc at hawaga.org.uk Thu Apr 24 10:06:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 24 Apr 2008 15:06:18 +0000 (GMT) Subject: [Swift-devel] Re: today's coaster report In-Reply-To: References: Message-ID: though the lack of service shutdown makes my laptop happy when I run the 97 language behaviour tests through coaster... kaboom! -- From benc at hawaga.org.uk Thu Apr 24 10:10:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 24 Apr 2008 15:10:19 +0000 (GMT) Subject: [Swift-devel] today's coaster report In-Reply-To: <1209049523.18858.2.camel@localhost> References: <1209049523.18858.2.camel@localhost> Message-ID: On Thu, 24 Apr 2008, Mihael Hategan wrote: > Right. That's the remote service. The port range does not apply for it. In real deployment, the remote service should probably be made to use the GLOBUS_TCP_PORT_RANGE that it inherits from the environment on the remote side; if the system administrator of the remote site has configured a global GLOBUS_TCP_PORT_RANGE in the environment of a job to suit that site's firewall configuration, then the remote service part of coasters should probably respect that. -- From hategan at mcs.anl.gov Thu Apr 24 10:14:15 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 24 Apr 2008 10:14:15 -0500 Subject: [Swift-devel] today's coaster report In-Reply-To: References: <1209049523.18858.2.camel@localhost> Message-ID: <1209050055.18858.12.camel@localhost> On Thu, 2008-04-24 at 15:10 +0000, Ben Clifford wrote: > On Thu, 24 Apr 2008, Mihael Hategan wrote: > > > Right. That's the remote service. The port range does not apply for it. > > In real deployment, the remote service should probably be made to use the > GLOBUS_TCP_PORT_RANGE that it inherits from the environment on the remote > side; if the system administrator of the remote site has configured a > global GLOBUS_TCP_PORT_RANGE in the environment of a job to suit that > site's firewall configuration, then the remote service part of coasters > should probably respect that. Good point. I'll pass that on to the programmers. 
> From hategan at mcs.anl.gov Thu Apr 24 10:16:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 24 Apr 2008 10:16:02 -0500 Subject: [Swift-devel] Re: today's coaster report In-Reply-To: References: Message-ID: <1209050162.18858.15.camel@localhost> On Thu, 2008-04-24 at 15:06 +0000, Ben Clifford wrote: > though the lack of service shutdown makes my laptop happy happy or unhappy? I reckon 97*2 JVMs can't be a very good thing. > when I run the > 97 language behaviour tests through coaster... kaboom! From bugzilla-daemon at mcs.anl.gov Thu Apr 24 17:34:45 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 24 Apr 2008 17:34:45 -0500 (CDT) Subject: [Swift-devel] [Bug 110] move OPTIONS out of swift executable In-Reply-To: Message-ID: <20080424223445.7986D164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=110 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from benc at hawaga.org.uk 2008-04-24 17:34 ------- documented as of r1803 -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From hategan at mcs.anl.gov Thu Apr 24 21:27:19 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 24 Apr 2008 21:27:19 -0500 Subject: [Swift-devel] coasters update Message-ID: <1209090439.4719.4.camel@localhost> A bunch of fixes and updates went in. I was able to submit 512 jobs to TGUC (using gt2+local pbs). There are occasional "job was killed because it exceeded walltime" mails. When shutting down the service, running and queued workers are killed, but it seems like some fall through the cracks. I guess that part needs more work. From benc at hawaga.org.uk Fri Apr 25 06:23:47 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 25 Apr 2008 11:23:47 +0000 (GMT) Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: <1208870611.11068.0.camel@localhost> References: <1208833974.9368.1.camel@localhost> <1208868576.10512.3.camel@localhost> <1208870611.11068.0.camel@localhost> Message-ID: On Tue, 22 Apr 2008, Mihael Hategan wrote: > But not in the url. I'm talking about using the jobManager attribute to > or . If you put it in the url, it's fine. r1806 adds a second test of TGUC+gt2+pbs to the tests/sites/ directory, using and the jobmanager attribute; the other, already existing test, uses The jobmanager element doesn't take a jobmanager attribute. -- From benc at hawaga.org.uk Fri Apr 25 06:40:31 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 25 Apr 2008 11:40:31 +0000 (GMT) Subject: [Swift-devel] swift 0.5 and gt2 In-Reply-To: <1208871475.11385.0.camel@localhost> References: <1208833974.9368.1.camel@localhost> <1208868576.10512.3.camel@localhost> <1208870611.11068.0.camel@localhost> <1208871475.11385.0.camel@localhost> Message-ID: On Tue, 22 Apr 2008, Mihael Hategan wrote: > Maybe we should keep a "known issues" file for each version. r1807 puts a release notes file for 0.5 into SVN (the 0.4 release notes were kept as an on-webserver file) with this issue listed. 
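On the workers that "fall through the cracks" at shutdown in the coasters update above, a blunt site-side cleanup could be something along these lines; the process-name pattern is hypothetical and not taken from the actual coaster worker scripts.

  # Kill any worker processes left behind after the service has shut down.
  pattern="coaster.*worker"          # hypothetical pattern
  leftover=$(pgrep -f "$pattern")
  if [ -n "$leftover" ]; then
      echo "killing leftover workers:" $leftover
      kill $leftover
      sleep 5
      pkill -9 -f "$pattern"         # escalate if anything survives
  fi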
-- From mikekubal at yahoo.com Fri Apr 25 10:08:23 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Fri, 25 Apr 2008 08:08:23 -0700 (PDT) Subject: [Swift-devel] code to test for file existence in swift Message-ID: <229543.28080.qm@web52304.mail.re2.yahoo.com> Could someone suggest swift code for testing for the existence of a file? Thanks, Mike ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From wilde at mcs.anl.gov Fri Apr 25 10:19:34 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 25 Apr 2008 10:19:34 -0500 Subject: [Swift-devel] code to test for file existence in swift In-Reply-To: <229543.28080.qm@web52304.mail.re2.yahoo.com> References: <229543.28080.qm@web52304.mail.re2.yahoo.com> Message-ID: <4811F686.8000504@mcs.anl.gov> Mike, could you elaborate on the use case for this? Do you want the swift code to execute a procedure only if a file exists? Exists on the submit host? One way is write a tiny shell script that returns 1 or True if the file exists and zero otherwise. You'll need, I think, to use @extractint: the return value (from the test) in fact needs to go into a file, then extractint can return t/f based on the value of that file: @extractint(file) will read the specified file, parse an integer from the file contents and return that integer. if, switch or iterate could then be used to act on the exists/doesnt-exist condition. I'm not sure if there's a more elegant way to do this. Depends a bit on your actual use case. - Mike On 4/25/08 10:08 AM, Mike Kubal wrote: > Could someone suggest swift code for testing for the > existence of a file? > > Thanks, > > Mike > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Fri Apr 25 10:20:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 25 Apr 2008 10:20:13 -0500 Subject: [Swift-devel] code to test for file existence in swift In-Reply-To: <229543.28080.qm@web52304.mail.re2.yahoo.com> References: <229543.28080.qm@web52304.mail.re2.yahoo.com> Message-ID: <1209136813.27187.1.camel@localhost> Can you provide more details about the scenario? In principle, no such facilities exist (or should exist) in swift. Mihael On Fri, 2008-04-25 at 08:08 -0700, Mike Kubal wrote: > Could someone suggest swift code for testing for the > existence of a file? > > Thanks, > > Mike > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. 
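The tiny shell script Mike Wilde describes could be as small as the sketch below; the script name, argument order and file names are invented for illustration. It writes 1 or 0 to an output file so that @extractint can read the result back on the Swift side.

  #!/bin/bash
  # usage: checkexists.sh <path-to-check> <result-file>   (hypothetical interface)
  path="$1"
  result="$2"
  if [ -e "$path" ]; then
      echo 1 > "$result"
  else
      echo 0 > "$result"
  fi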
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From mikekubal at yahoo.com Fri Apr 25 10:37:36 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Fri, 25 Apr 2008 08:37:36 -0700 (PDT) Subject: [Swift-devel] code to test for file existence in swift In-Reply-To: <1209136813.27187.1.camel@localhost> Message-ID: <77710.52229.qm@web52309.mail.re2.yahoo.com> I was running a job that did not finish due to my grid-proxy expiring (I should have set it longer in the first place) that iterates through 670 input files. I wanted to add some code to my swift script that would check to see if the corresponding output file existed so as not to resubmit when iterating through the input files. if(results_file exist){ print("done already")} else{run job(input_file)} Mihael Hategan wrote: Can you provide more details about the scenario? In principle, no such facilities exist (or should exist) in swift. Mihael On Fri, 2008-04-25 at 08:08 -0700, Mike Kubal wrote: > Could someone suggest swift code for testing for the > existence of a file? > > Thanks, > > Mike > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel --------------------------------- Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Fri Apr 25 10:45:41 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 25 Apr 2008 15:45:41 +0000 (GMT) Subject: [Swift-devel] code to test for file existence in swift In-Reply-To: <77710.52229.qm@web52309.mail.re2.yahoo.com> References: <77710.52229.qm@web52309.mail.re2.yahoo.com> Message-ID: What you might want there is for bug 107 to get fixed, at which point there would be a restart mechanism that would do this sort of thing for you. http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107 On Fri, 25 Apr 2008, Mike Kubal wrote: > I was running a job that did not finish due to my grid-proxy expiring (I should have set it longer in the first place) that iterates through 670 input files. I wanted to add some code to my swift script that would check to see if the corresponding output file existed so as not to resubmit when iterating through the input files. > > if(results_file exist){ print("done already")} > else{run job(input_file)} > > Mihael Hategan wrote: Can you provide more details about the scenario? > In principle, no such facilities exist (or should exist) in swift. > > Mihael > > On Fri, 2008-04-25 at 08:08 -0700, Mike Kubal wrote: > > Could someone suggest swift code for testing for the > > existence of a file? > > > > Thanks, > > > > Mike > > > > > > ____________________________________________________________________________________ > > Be a better friend, newshound, and > > know-it-all with Yahoo! Mobile. Try it now. 
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > --------------------------------- > Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. From wilde at mcs.anl.gov Fri Apr 25 10:49:14 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 25 Apr 2008 10:49:14 -0500 Subject: [Swift-devel] code to test for file existence in swift In-Reply-To: <77710.52229.qm@web52309.mail.re2.yahoo.com> References: <77710.52229.qm@web52309.mail.re2.yahoo.com> Message-ID: <4811FD7A.2000202@mcs.anl.gov> Ah - this kind of situation is different: it should be covered by the Swift retry mechanism, I think. You should be able to restart the workflows after (most) failures and it should pick up where it left off, which is I think what you want in this case, right? In other words, this kind of file existence checking should require no code. The state of the workflow - based on the file mapping that were done when the workflow still ran - should be retained in the Swift recovery log. See: http://www.ci.uchicago.edu/swift/guides/userguide.php#restart At one point I think there was a bug in Swift recovery. Mihael or Ben, do you know what the state of this feature is? - Mike On 4/25/08 10:37 AM, Mike Kubal wrote: > I was running a job that did not finish due to my grid-proxy expiring (I > should have set it longer in the first place) that iterates through 670 > input files. I wanted to add some code to my swift script that would > check to see if the corresponding output file existed so as not to > resubmit when iterating through the input files. > > if(results_file exist){ print("done already")} > else{run job(input_file)} > > */Mihael Hategan /* wrote: > > Can you provide more details about the scenario? > In principle, no such facilities exist (or should exist) in swift. > > Mihael > > On Fri, 2008-04-25 at 08:08 -0700, Mike Kubal wrote: > > Could someone suggest swift code for testing for the > > existence of a file? > > > > Thanks, > > > > Mike > > > > > > > ____________________________________________________________________________________ > > Be a better friend, newshound, and > > know-it-all with Yahoo! Mobile. Try it now. > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > ------------------------------------------------------------------------ > Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try > it now. 
> > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Apr 28 12:55:17 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Apr 2008 12:55:17 -0500 Subject: [Swift-devel] data errors Message-ID: <1209405317.28655.0.camel@localhost> They just look ugly: Execution failed: java.lang.RuntimeException: Data set initialization failed for org.griphyn.vdl.mapping.RootDataNode identifier tag:benc @ci.uchicago.edu,2008:swift:dataset:20080428-1253-1p0peih1:720000000006 with no value at dataset=f32 (closed). Missing r equired field: g From bugzilla-daemon at mcs.anl.gov Mon Apr 28 14:58:49 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 28 Apr 2008 14:58:49 -0500 (CDT) Subject: [Swift-devel] [Bug 107] restarts broken (by generalisation of data file handling) In-Reply-To: Message-ID: <20080428195849.2688F164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107 hategan at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #2 from hategan at mcs.anl.gov 2008-04-28 14:58 ------- This should be fixed in r1819. It now saves symbolic swift variable names (relies on dbgname). It will break if the swift script is re-compiled. In order to avoid that, I would need the mappers to support forced file names. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Apr 28 18:48:09 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 28 Apr 2008 18:48:09 -0500 (CDT) Subject: [Swift-devel] [Bug 107] restarts broken (by generalisation of data file handling) In-Reply-To: Message-ID: <20080428234809.BCA1E164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107 ------- Comment #3 from benc at hawaga.org.uk 2008-04-28 18:48 ------- r1822 introduces a test for restarts -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Mon Apr 28 18:57:12 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Apr 2008 23:57:12 +0000 (GMT) Subject: [Swift-devel] code to test for file existence in swift In-Reply-To: <77710.52229.qm@web52309.mail.re2.yahoo.com> References: <77710.52229.qm@web52309.mail.re2.yahoo.com> Message-ID: > I wanted to add some code to my swift script that would > check to see if the corresponding output file existed so as not to > resubmit when iterating through the input files. Restarts (at least to the extent that I have tested them, which is not in a huge depth) look like they work now (with recent cog and swift, after Mihael's changes today). When a run fails, you will get a .rlog file for that run in pwd. 
You can restart by saying: $ swift -resume foo-99999.rlog foo.swift -- From benc at hawaga.org.uk Tue Apr 29 07:43:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Apr 2008 12:43:53 +0000 (GMT) Subject: [Swift-devel] per-site scheduler parameters Message-ID: I implemented the following primarily for skenny and Xi Li. Building swift with cog >=r1991 will give two new profile keys that can be used in the sites.xml site catalog: initialScore - this allows the initial score for a site to be set to something other than 0, so that initial job submission rate can be set higher without setting the throttle factor higher. jobThrottle - this behaves like the job throttle set in swift.properties, but affects only the site for which it is set. This should allow (for example) local execution to have a throttle of 0, GRAM2 sites to have a throttle of 0.2, GRAM4 sites to have a throttle of 4 and Falkon to have the throttle off, defined in the site definitions for those sites rather than having to reconfigure it manually when switching between sites (the numbers here being my rough preferred values for each type of site). -- From iraicu at cs.uchicago.edu Tue Apr 29 11:24:10 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 29 Apr 2008 11:24:10 -0500 Subject: [Swift-devel] talk today's talk: Swift Innovation for BOINC Message-ID: <48174BAA.7040005@cs.uchicago.edu> Hi all, Today's talk will be by Xu Du, and is titled "Swift Innovation for BOINC". The talk abstract is: Swift is a professional distributed computing platform which performs excellently at workflow control. But the previous swift system can only work with Grid sites. However, besides of Grid, volunteer distributed computing systems are also important computing resources. Among those volunteer distributed computing systems, BOINC is the most outstanding one. So it is significant that Swift can not only work with Grid but also work with BOINC. Swift Innovation for BOINC is such a project that innovates Swift so that it can work with BOINC. The main task of the project is to design and implement a ?BOINC provider? in SWIFT and a ?SWIFT adapter? in BOINC, by which Swift can submit jobs to BOINC and get back result after the jobs are executed. Up to now, a prototype has been worked out. This talk will first introduce the design and implement of Swift Innovation for BOINC, and then demo the prototype. See you at 4:30PM in RI405! Cheers, Ioan http://dsl-wiki.cs.uchicago.edu/index.php/Wiki:ScheduleSpring08 -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Tue Apr 29 18:39:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Apr 2008 23:39:23 +0000 (GMT) Subject: [Swift-devel] coaster wget dependency Message-ID: in the excitement of a new OS install, I have discovered that coaster has a wget dependency somewhere on the service side. that is common but not always installed everywhere (especially in places that prefer curl...). 
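A fallback of the kind discussed in the messages that follow (try wget, then curl, then give up with a clear message) might look roughly like this sketch; the URL and output file name are placeholders, and this is not the real bootstrap script.

  url="http://example.org/coaster-bootstrap.jar"   # placeholder URL
  out="coaster-bootstrap.jar"
  if command -v wget >/dev/null 2>&1; then
      wget -q -O "$out" "$url"
  elif command -v curl >/dev/null 2>&1; then
      curl -s -f -o "$out" "$url"
  else
      echo "error: no wget or curl found" >&2
      exit 1
  fi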
the swift+coaster docs (if/when they appear) should probably document additional dependencies like this (and like GNU md5sum) that are potentially not installed places. -- From hategan at mcs.anl.gov Tue Apr 29 18:42:07 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Apr 2008 18:42:07 -0500 Subject: [Swift-devel] coaster wget dependency In-Reply-To: References: Message-ID: <1209512527.30126.1.camel@localhost> On Tue, 2008-04-29 at 23:39 +0000, Ben Clifford wrote: > in the excitement of a new OS install, I have discovered that coaster has > a wget dependency somewhere on the service side. that is common but not > always installed everywhere (especially in places that prefer curl...). The bootstrap script can be made to use curl instead of wget. > > the swift+coaster docs (if/when they appear) should probably document > additional dependencies like this (and like GNU md5sum) that are > potentially not installed places. It's unfortunate, but some means to download a jar and some means to do and md5 sum must exist. If not, the service would need to be started manually on the target site. > From benc at hawaga.org.uk Tue Apr 29 18:46:36 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Apr 2008 23:46:36 +0000 (GMT) Subject: [Swift-devel] coaster wget dependency In-Reply-To: <1209512527.30126.1.camel@localhost> References: <1209512527.30126.1.camel@localhost> Message-ID: > > the swift+coaster docs (if/when they appear) should probably document > > additional dependencies like this (and like GNU md5sum) that are > > potentially not installed places. > > It's unfortunate, but some means to download a jar and some means to do > and md5 sum must exist. If not, the service would need to be started > manually on the target site. right. its fine to have those as prerequisites, I think, but its nicer to discover them in a README than in a process of of "log into remote site, figure out where is log file, read error message" -- From hategan at mcs.anl.gov Tue Apr 29 20:13:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Apr 2008 20:13:31 -0500 Subject: [Swift-devel] coaster wget dependency In-Reply-To: References: <1209512527.30126.1.camel@localhost> Message-ID: <1209518011.30623.1.camel@localhost> On Tue, 2008-04-29 at 23:46 +0000, Ben Clifford wrote: > > > the swift+coaster docs (if/when they appear) should probably document > > > additional dependencies like this (and like GNU md5sum) that are > > > potentially not installed places. > > > > It's unfortunate, but some means to download a jar and some means to do > > and md5 sum must exist. If not, the service would need to be started > > manually on the target site. > > right. its fine to have those as prerequisites, I think, but its nicer to > discover them in a README than in a process of of "log into remote site, > figure out where is log file, read error message" Or... it could be an error message saying "no wget or curl were found". As long as we keep that script below the max argv character length of the various pieces involved. > From benc at hawaga.org.uk Tue Apr 29 21:20:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 30 Apr 2008 02:20:06 +0000 (GMT) Subject: [Swift-devel] coaster->tg-ncsa Message-ID: I just tried to run coasters on TG NCSA (my first attempt to run them outside of my laptop). I tried with gt2 and with gt4 to submit: I put the two sites files in tests/sites/coaster/ - the two filenames should be apparent upon observing that directory with ls. 
Using gt2 and pbs, I get this error: Caused by: Could not submit job Caused by: Could not start coaster service Caused by: Cannot parse the given RSL Caused by: Problems while creating a Gass Server Caused by: Could not determine this host's IP address. Please set an IP address in cog.properties Its not clear from that message which host 'this host' is. There are potentially at least three involved. A hostname or some other useful information might be nice there. For the attempt with gt4, I get this failure: Caused by: Could not submit job Caused by: Could not start coaster service Caused by: The gt4.0.0 provider does not support redirection -- From hategan at mcs.anl.gov Tue Apr 29 21:33:58 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Apr 2008 21:33:58 -0500 Subject: [Swift-devel] Re: coaster->tg-ncsa In-Reply-To: References: Message-ID: <1209522838.702.8.camel@localhost> On Wed, 2008-04-30 at 02:20 +0000, Ben Clifford wrote: > I just tried to run coasters on TG NCSA (my first attempt to run them > outside of my laptop). I tried with gt2 and with gt4 to submit: > > I put the two sites files in tests/sites/coaster/ - the two filenames > should be apparent upon observing that directory with ls. > > Using gt2 and pbs, I get this error: > > Caused by: > Could not submit job > Caused by: > Could not start coaster service > Caused by: > Cannot parse the given RSL > Caused by: > Problems while creating a Gass Server > Caused by: > Could not determine this host's IP address. Please set an IP > address in cog.properties > > Its not clear from that message which host 'this host' is. Uhm... yes... it's 127.0.0.1 :) Although given that it's not a remote exception, it looks like your machine. > There are > potentially at least three involved. A hostname or some other useful > information might be nice there. > > For the attempt with gt4, I get this failure: > > Caused by: > Could not submit job > Caused by: > Could not start coaster service > Caused by: > The gt4.0.0 provider does not support redirection Grr. Right. I removed the redirections (r1992). Unfortunately it is now harder to troubleshoot. Maybe the ws-gram provider should only print a warning if redirection is requested instead of failing. > From benc at hawaga.org.uk Tue Apr 29 21:47:58 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 30 Apr 2008 02:47:58 +0000 (GMT) Subject: [Swift-devel] Re: coaster->tg-ncsa In-Reply-To: <1209522838.702.8.camel@localhost> References: <1209522838.702.8.camel@localhost> Message-ID: I set GLOBUS_HOSTNAME and ran with gram2 again. It got further... sits and waits for quite a long time and then the below error (because I deliberately don't have a default project on teragrid sites, instead setting it via a profile entry in the appropriate sites file to whichever of the three I'm in) Caused by: ERROR: you must either specify an account (project=xxx) or log in to the system to set a default account. 
Here are your accounts: NCSA Days Service Units Avail for Proj TG Account Type Status Remaining Remaining Batch Jobs ---- ---------- ---- ------ --------- ------------- --------- bdd TG-CDA070002 prb Active 1 28387 yes brn TG-CCR080001 prb Active 155 28970 yes brv TG-CCR080002 prb Active 155 29559 yes *** Job not submitted *** null org.globus.gram.GramException: The job failed when the job manager attempted to run it at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:476) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:534) -- From hategan at mcs.anl.gov Tue Apr 29 21:51:01 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Apr 2008 21:51:01 -0500 Subject: [Swift-devel] Re: coaster->tg-ncsa In-Reply-To: References: <1209522838.702.8.camel@localhost> Message-ID: <1209523861.1278.0.camel@localhost> Odd. Attributes should be copied from the original task. I'll look into that. On Wed, 2008-04-30 at 02:47 +0000, Ben Clifford wrote: > I set GLOBUS_HOSTNAME and ran with gram2 again. It got further... sits and > waits for quite a long time and then the below error (because I > deliberately don't have a default project on teragrid sites, instead > setting it via a profile entry in the appropriate sites file to whichever > of the three I'm in) > > Caused by: > > > ERROR: you must either specify an account (project=xxx) or log in to > the system to set a default account. Here are your accounts: > > NCSA Days Service Units Avail for > Proj TG Account Type Status Remaining Remaining Batch Jobs > ---- ---------- ---- ------ --------- ------------- --------- > bdd TG-CDA070002 prb Active 1 28387 yes > brn TG-CCR080001 prb Active 155 28970 yes > brv TG-CCR080002 prb Active 155 29559 yes > > *** Job not submitted *** > > null > org.globus.gram.GramException: The job failed when the job manager > attempted to run it > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:476) > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:534) > From benc at hawaga.org.uk Tue Apr 29 21:55:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 30 Apr 2008 02:55:17 +0000 (GMT) Subject: [Swift-devel] Re: coaster->tg-ncsa In-Reply-To: <1209523861.1278.0.camel@localhost> References: <1209522838.702.8.camel@localhost> <1209523861.1278.0.camel@localhost> Message-ID: On Tue, 29 Apr 2008, Mihael Hategan wrote: > Odd. Attributes should be copied from the original task. I'll look into > that. also, this using , which I modified from SVN to use gt2:gt2:pbs rather than gt2:pbs as is in SVN now. If properties should be copied across then I guess either of those should have worked if the project key is propagated. -- From hategan at mcs.anl.gov Tue Apr 29 21:55:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Apr 2008 21:55:49 -0500 Subject: [Swift-devel] Re: coaster->tg-ncsa In-Reply-To: References: <1209522838.702.8.camel@localhost> <1209523861.1278.0.camel@localhost> Message-ID: <1209524149.1377.0.camel@localhost> On Wed, 2008-04-30 at 02:55 +0000, Ben Clifford wrote: > On Tue, 29 Apr 2008, Mihael Hategan wrote: > > > Odd. Attributes should be copied from the original task. I'll look into > > that. 
> 
> also, this using url="grid-hg.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />, which I 
> modified from SVN to use gt2:gt2:pbs rather than gt2:pbs as is in SVN now. 
> 
> If properties should be copied across then I guess either of those should 
> have worked if the project key is propagated. 

Right. So it must mean they are not.

> 
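Independent of the attribute-propagation question, the project id itself can be exercised directly against the GRAM2 gatekeeper to confirm the account is accepted. The contact string below is guessed from the sites entry quoted above, the project value is one of the accounts from the earlier error listing, and the plain globusrun invocation is shown only as an assumption about how one might test it.

  globusrun -o -r grid-hg.ncsa.teragrid.org/jobmanager-pbs \
      '&(executable=/bin/true)(project=TG-CCR080001)'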