From benc at hawaga.org.uk  Tue Apr  1 03:58:52 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 Apr 2008 08:58:52 +0000 (GMT)
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <47E88D8C.4090207@mcs.anl.gov>
References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov>
	<47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov>
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>


On Tue, 25 Mar 2008, Michael Wilde wrote:

> With this fixed, the total time in wrapper.sh including the app is now about
> 15 seconds, with 3 being in the app-wrapper itself. The time seems about
> evenly spread over the several wrapper.sh operations, which is not surprising
> when 500 wrappers hit NFS all at once.

Does this machine have a higher (/different) performance shared file 
system such as PVFS or GPFS? We spent some time in november layout out the 
filesystem to be sympathetic to GPFS to help avoid bottlenecks like you 
are seeing here. It would be kinda sad if either it isn't available or you 
aren't using it even though it is available.

-- 


From benc at hawaga.org.uk  Tue Apr  1 05:05:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 Apr 2008 10:05:46 +0000 (GMT)
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov>
	<47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov>
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>


On Tue, 1 Apr 2008, Ben Clifford wrote:

> > With this fixed, the total time in wrapper.sh including the app is now about
> > 15 seconds, with 3 being in the app-wrapper itself. The time seems about
> > evenly spread over the several wrapper.sh operations, which is not surprising
> > when 500 wrappers hit NFS all at once.
> 
> Does this machine have a higher (/different) performance shared file 
> system such as PVFS or GPFS? We spent some time in november layout out the 
> filesystem to be sympathetic to GPFS to help avoid bottlenecks like you 
> are seeing here. It would be kinda sad if either it isn't available or you 
> aren't using it even though it is available.

>From what I can tell from the web, PVFS and/or GPFS are available on all 
of the Argonne Blue Gene machines. Is this true? I don't want to provide 
more scalability support for NFS-on-bluegene if it is.

-- 


From wilde at mcs.anl.gov  Tue Apr  1 08:04:04 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 01 Apr 2008 08:04:04 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov>
	<47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov>
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
Message-ID: <47F232C4.3080607@mcs.anl.gov>

We're only working on the BG/P system, and GPFS is the only shared 
filesystem there.

GPFS access, however, remains a big scalabiity issue. Frequent small 
accesses to GPFS in our measurements really slow down the workflow. We 
did a lot of micro-benchmark tests.

Zhao, can you gather a set of these tests into a small suite and post 
numbers so the Swift developers can get an understanding of the system's 
GPFS access performance?

Also note: the only local filesystem is RAM disk on /tmp or /dev/shm. 
(Ioan and Zhao should confirm if they verified that /tmp is on RAM).

- Mike

On 4/1/08 5:05 AM, Ben Clifford wrote:
> On Tue, 1 Apr 2008, Ben Clifford wrote:
> 
>>> With this fixed, the total time in wrapper.sh including the app is now about
>>> 15 seconds, with 3 being in the app-wrapper itself. The time seems about
>>> evenly spread over the several wrapper.sh operations, which is not surprising
>>> when 500 wrappers hit NFS all at once.
>> Does this machine have a higher (/different) performance shared file 
>> system such as PVFS or GPFS? We spent some time in november layout out the 
>> filesystem to be sympathetic to GPFS to help avoid bottlenecks like you 
>> are seeing here. It would be kinda sad if either it isn't available or you 
>> aren't using it even though it is available.
> 
>>From what I can tell from the web, PVFS and/or GPFS are available on all 
> of the Argonne Blue Gene machines. Is this true? I don't want to provide 
> more scalability support for NFS-on-bluegene if it is.
> 


From iraicu at cs.uchicago.edu  Tue Apr  1 08:37:52 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 01 Apr 2008 08:37:52 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov>
	<47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov>
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
Message-ID: <47F23AB0.8010109@cs.uchicago.edu>

Ben,
The #s below were from the SiCortex, which only has NFS.  We are using 
the latest Swift from SVN, so if the Swift improvements to avoid these 
bottlenecks are enabled by default in Swift, then we are using them already!

On the BG/P, we have GPFS and PVFS, but we found GPFS to handle 
meta-data better, so we are using GPFS for all our tests.

Ioan

Ben Clifford wrote:
> On Tue, 25 Mar 2008, Michael Wilde wrote:
>
>   
>> With this fixed, the total time in wrapper.sh including the app is now about
>> 15 seconds, with 3 being in the app-wrapper itself. The time seems about
>> evenly spread over the several wrapper.sh operations, which is not surprising
>> when 500 wrappers hit NFS all at once.
>>     
>
> Does this machine have a higher (/different) performance shared file 
> system such as PVFS or GPFS? We spent some time in november layout out the 
> filesystem to be sympathetic to GPFS to help avoid bottlenecks like you 
> are seeing here. It would be kinda sad if either it isn't available or you 
> aren't using it even though it is available.
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080401/ea20680f/attachment.html>

From iraicu at cs.uchicago.edu  Tue Apr  1 08:39:04 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 01 Apr 2008 08:39:04 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
References: <47E7DC3D.6040704@cs.uchicago.edu>
	<47E7E30A.2020208@mcs.anl.gov>	<47E7E684.5070303@cs.uchicago.edu>
	<47E7EC03.1090609@mcs.anl.gov>	<1206381792.11561.1.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>	<47E86E5A.6080303@mcs.anl.gov>
	<47E88D8C.4090207@mcs.anl.gov>	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
Message-ID: <47F23AF8.2000508@cs.uchicago.edu>

NFS on SiCortex ( in the future, there will be PVFS, but that is not today)
GPFS on BG/P

Ioan

Ben Clifford wrote:
> On Tue, 1 Apr 2008, Ben Clifford wrote:
>
>   
>>> With this fixed, the total time in wrapper.sh including the app is now about
>>> 15 seconds, with 3 being in the app-wrapper itself. The time seems about
>>> evenly spread over the several wrapper.sh operations, which is not surprising
>>> when 500 wrappers hit NFS all at once.
>>>       
>> Does this machine have a higher (/different) performance shared file 
>> system such as PVFS or GPFS? We spent some time in november layout out the 
>> filesystem to be sympathetic to GPFS to help avoid bottlenecks like you 
>> are seeing here. It would be kinda sad if either it isn't available or you 
>> aren't using it even though it is available.
>>     
>
> >From what I can tell from the web, PVFS and/or GPFS are available on all 
> of the Argonne Blue Gene machines. Is this true? I don't want to provide 
> more scalability support for NFS-on-bluegene if it is.
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080401/a3f9134f/attachment.html>

From iraicu at cs.uchicago.edu  Tue Apr  1 10:26:45 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 01 Apr 2008 10:26:45 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <47F232C4.3080607@mcs.anl.gov>
References: <47E7DC3D.6040704@cs.uchicago.edu>
	<47E7E30A.2020208@mcs.anl.gov>	<47E7E684.5070303@cs.uchicago.edu>
	<47E7EC03.1090609@mcs.anl.gov>	<1206381792.11561.1.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>	<47E86E5A.6080303@mcs.anl.gov>
	<47E88D8C.4090207@mcs.anl.gov>	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
	<47F232C4.3080607@mcs.anl.gov>
Message-ID: <47F25435.8080105@cs.uchicago.edu>


Michael Wilde wrote:
> We're only working on the BG/P system, and GPFS is the only shared 
> filesystem there.
There is PVFS, but that performed even worse in our tests.
>
> GPFS access, however, remains a big scalabiity issue. Frequent small 
> accesses to GPFS in our measurements really slow down the workflow. We 
> did a lot of micro-benchmark tests.
Yes!  The BG/P's GPFS probably performs the worst out of all GPFSes I 
have worked on, in terms of small granular accesses.  For example, 
reading 1 byte files, invoking a trivial script (i.e. exit 0), etc... 
all perform extremely poor, to the point that we need to move away from 
GPFS almost completely.  For example, the things that we eventually need 
to avoid on GPFS for the BG/P are:
invoking wrapper.sh
mkdir
any logging to GPFS

There are probably others.
>
> Zhao, can you gather a set of these tests into a small suite and post 
> numbers so the Swift developers can get an understanding of the 
> system's GPFS access performance?
>
> Also note: the only local filesystem is RAM disk on /tmp or /dev/shm. 
> (Ioan and Zhao should confirm if they verified that /tmp is on RAM).
Yes, there are no local disks on either BG/P or SiCortex.  Both machines 
have /tmp and dev/shm mounted as ram disks.

Ioan
>
> - Mike
>
> On 4/1/08 5:05 AM, Ben Clifford wrote:
>> On Tue, 1 Apr 2008, Ben Clifford wrote:
>>
>>>> With this fixed, the total time in wrapper.sh including the app is 
>>>> now about
>>>> 15 seconds, with 3 being in the app-wrapper itself. The time seems 
>>>> about
>>>> evenly spread over the several wrapper.sh operations, which is not 
>>>> surprising
>>>> when 500 wrappers hit NFS all at once.
>>> Does this machine have a higher (/different) performance shared file 
>>> system such as PVFS or GPFS? We spent some time in november layout 
>>> out the filesystem to be sympathetic to GPFS to help avoid 
>>> bottlenecks like you are seeing here. It would be kinda sad if 
>>> either it isn't available or you aren't using it even though it is 
>>> available.
>>
>>> From what I can tell from the web, PVFS and/or GPFS are available on 
>>> all 
>> of the Argonne Blue Gene machines. Is this true? I don't want to 
>> provide more scalability support for NFS-on-bluegene if it is.
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From hategan at mcs.anl.gov  Tue Apr  1 10:32:03 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 Apr 2008 10:32:03 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <47F25435.8080105@cs.uchicago.edu>
References: <47E7DC3D.6040704@cs.uchicago.edu>
	<47E7E30A.2020208@mcs.anl.gov>	<47E7E684.5070303@cs.uchicago.edu>
	<47E7EC03.1090609@mcs.anl.gov>
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
	<47F232C4.3080607@mcs.anl.gov>  <47F25435.8080105@cs.uchicago.edu>
Message-ID: <1207063923.30798.0.camel@blabla.mcs.anl.gov>


On Tue, 2008-04-01 at 10:26 -0500, Ioan Raicu wrote:
> 
> Michael Wilde wrote:
> > We're only working on the BG/P system, and GPFS is the only shared 
> > filesystem there.
> There is PVFS, but that performed even worse in our tests.
> >
> > GPFS access, however, remains a big scalabiity issue. Frequent small 
> > accesses to GPFS in our measurements really slow down the workflow. We 
> > did a lot of micro-benchmark tests.
> Yes!  The BG/P's GPFS probably performs the worst out of all GPFSes I 
> have worked on, in terms of small granular accesses.  For example, 
> reading 1 byte files, invoking a trivial script (i.e. exit 0), etc... 
> all perform extremely poor, to the point that we need to move away from 
> GPFS almost completely.  For example, the things that we eventually need 
> to avoid on GPFS for the BG/P are:
> invoking wrapper.sh
> mkdir
> any logging to GPFS

Doing nothing can be incredibly fast.

> 
> There are probably others.
> >
> > Zhao, can you gather a set of these tests into a small suite and post 
> > numbers so the Swift developers can get an understanding of the 
> > system's GPFS access performance?
> >
> > Also note: the only local filesystem is RAM disk on /tmp or /dev/shm. 
> > (Ioan and Zhao should confirm if they verified that /tmp is on RAM).
> Yes, there are no local disks on either BG/P or SiCortex.  Both machines 
> have /tmp and dev/shm mounted as ram disks.
> 
> Ioan
> >
> > - Mike
> >
> > On 4/1/08 5:05 AM, Ben Clifford wrote:
> >> On Tue, 1 Apr 2008, Ben Clifford wrote:
> >>
> >>>> With this fixed, the total time in wrapper.sh including the app is 
> >>>> now about
> >>>> 15 seconds, with 3 being in the app-wrapper itself. The time seems 
> >>>> about
> >>>> evenly spread over the several wrapper.sh operations, which is not 
> >>>> surprising
> >>>> when 500 wrappers hit NFS all at once.
> >>> Does this machine have a higher (/different) performance shared file 
> >>> system such as PVFS or GPFS? We spent some time in november layout 
> >>> out the filesystem to be sympathetic to GPFS to help avoid 
> >>> bottlenecks like you are seeing here. It would be kinda sad if 
> >>> either it isn't available or you aren't using it even though it is 
> >>> available.
> >>
> >>> From what I can tell from the web, PVFS and/or GPFS are available on 
> >>> all 
> >> of the Argonne Blue Gene machines. Is this true? I don't want to 
> >> provide more scalability support for NFS-on-bluegene if it is.
> >>
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> 


From iraicu at cs.uchicago.edu  Tue Apr  1 10:43:16 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 01 Apr 2008 10:43:16 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <1207063923.30798.0.camel@blabla.mcs.anl.gov>
References: <47E7DC3D.6040704@cs.uchicago.edu>	
	<47E7E30A.2020208@mcs.anl.gov>	<47E7E684.5070303@cs.uchicago.edu>	
	<47E7EC03.1090609@mcs.anl.gov>	
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>	
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>	
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>	
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>	
	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>	
	<47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu>
	<1207063923.30798.0.camel@blabla.mcs.anl.gov>
Message-ID: <47F25814.8050301@cs.uchicago.edu>


Mihael Hategan wrote:
> On Tue, 2008-04-01 at 10:26 -0500, Ioan Raicu wrote:
>   
>> Michael Wilde wrote:
>>     
>>> We're only working on the BG/P system, and GPFS is the only shared 
>>> filesystem there.
>>>       
>> There is PVFS, but that performed even worse in our tests.
>>     
>>> GPFS access, however, remains a big scalabiity issue. Frequent small 
>>> accesses to GPFS in our measurements really slow down the workflow. We 
>>> did a lot of micro-benchmark tests.
>>>       
>> Yes!  The BG/P's GPFS probably performs the worst out of all GPFSes I 
>> have worked on, in terms of small granular accesses.  For example, 
>> reading 1 byte files, invoking a trivial script (i.e. exit 0), etc... 
>> all perform extremely poor, to the point that we need to move away from 
>> GPFS almost completely.  For example, the things that we eventually need 
>> to avoid on GPFS for the BG/P are:
>> invoking wrapper.sh
>> mkdir
>> any logging to GPFS
>>     
>
> Doing nothing can be incredibly fast.
>   
What I meant is that we need to move these operations to the local file 
system, i.e. RAM.  We have run applications on BG/P via Falkon only, and 
implemented a caching strategy that caches all scripts, binaries, and 
input data, to RAM... once the task execution (all from RAM) completes, 
and has written its output to RAM, then there is a single copy operation 
of the output data from RAM to GPFS.  We control how frequently this 
copy operation occurs, so we can essentially scale quite nicely and 
linearly with this approach.  The hope is that we can eventually work 
this kind of functionality in the wrapper.sh, or in Swift itself.  So, a 
reply to your statement, we would like to preserve the functionality of 
the wrapper.sh, but move as much as possible of that functionality from 
a shared file system to a local disk.

Ioan
>   
>> There are probably others.
>>     
>>> Zhao, can you gather a set of these tests into a small suite and post 
>>> numbers so the Swift developers can get an understanding of the 
>>> system's GPFS access performance?
>>>
>>> Also note: the only local filesystem is RAM disk on /tmp or /dev/shm. 
>>> (Ioan and Zhao should confirm if they verified that /tmp is on RAM).
>>>       
>> Yes, there are no local disks on either BG/P or SiCortex.  Both machines 
>> have /tmp and dev/shm mounted as ram disks.
>>
>> Ioan
>>     
>>> - Mike
>>>
>>> On 4/1/08 5:05 AM, Ben Clifford wrote:
>>>       
>>>> On Tue, 1 Apr 2008, Ben Clifford wrote:
>>>>
>>>>         
>>>>>> With this fixed, the total time in wrapper.sh including the app is 
>>>>>> now about
>>>>>> 15 seconds, with 3 being in the app-wrapper itself. The time seems 
>>>>>> about
>>>>>> evenly spread over the several wrapper.sh operations, which is not 
>>>>>> surprising
>>>>>> when 500 wrappers hit NFS all at once.
>>>>>>             
>>>>> Does this machine have a higher (/different) performance shared file 
>>>>> system such as PVFS or GPFS? We spent some time in november layout 
>>>>> out the filesystem to be sympathetic to GPFS to help avoid 
>>>>> bottlenecks like you are seeing here. It would be kinda sad if 
>>>>> either it isn't available or you aren't using it even though it is 
>>>>> available.
>>>>>           
>>>>> From what I can tell from the web, PVFS and/or GPFS are available on 
>>>>> all 
>>>>>           
>>>> of the Argonne Blue Gene machines. Is this true? I don't want to 
>>>> provide more scalability support for NFS-on-bluegene if it is.
>>>>
>>>>         
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>>       
>
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080401/206e127a/attachment.html>

From hategan at mcs.anl.gov  Tue Apr  1 11:08:24 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 Apr 2008 11:08:24 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <47F25814.8050301@cs.uchicago.edu>
References: <47E7DC3D.6040704@cs.uchicago.edu>
	<47E7E30A.2020208@mcs.anl.gov>	<47E7E684.5070303@cs.uchicago.edu>
	<47E7EC03.1090609@mcs.anl.gov>
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
	<47F232C4.3080607@mcs.anl.gov>  <47F25435.8080105@cs.uchicago.edu>
	<1207063923.30798.0.camel@blabla.mcs.anl.gov>
	<47F25814.8050301@cs.uchicago.edu>
Message-ID: <1207066105.434.0.camel@blabla.mcs.anl.gov>

> > >     
> > 
> > Doing nothing can be incredibly fast.
> >   
> What I meant is that we need to move these operations to the local
> file system, i.e. RAM.  We have run applications on BG/P via Falkon
> only, and implemented a caching strategy that caches all scripts,
> binaries, and input data, to RAM... once the task execution (all from
> RAM) completes, and has written its output to RAM, then there is a
> single copy operation of the output data from RAM to GPFS.  We control
> how frequently this copy operation occurs, so we can essentially scale
> quite nicely and linearly with this approach.  The hope is that we can
> eventually work this kind of functionality in the wrapper.sh, or in
> Swift itself.  So, a reply to your statement, we would like to
> preserve the functionality of the wrapper.sh, but move as much as
> possible of that functionality from a shared file system to a local
> disk. 
> 

Having optimized wrappers for different architectures is a perfectly
valid option.

Mihael


From wilde at mcs.anl.gov  Tue Apr  1 11:25:28 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 01 Apr 2008 11:25:28 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <1207066105.434.0.camel@blabla.mcs.anl.gov>
References: <47E7DC3D.6040704@cs.uchicago.edu>	
	<47E7E30A.2020208@mcs.anl.gov>	<47E7E684.5070303@cs.uchicago.edu>	
	<47E7EC03.1090609@mcs.anl.gov>	
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>	
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>	
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>	
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>	
	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>	
	<47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu>	
	<1207063923.30798.0.camel@blabla.mcs.anl.gov>	
	<47F25814.8050301@cs.uchicago.edu>
	<1207066105.434.0.camel@blabla.mcs.anl.gov>
Message-ID: <47F261F8.1050207@mcs.anl.gov>


On 4/1/08 11:08 AM, Mihael Hategan wrote:
> ...
> Having optimized wrappers for different architectures is a perfectly
> valid option.

I agree.

Also to consider is having the wrappers behave differently (e.g. use 
local vs shared filesystem) based on knowledge of the app's size and I/O 
volume vs available space and transfer rates.

I'm in favor of heading to an approach where we have good fast default 
configurations for all our locally used systems (TG, OSG and the 
supercomputers) that work well for most apps, and some well documented 
guidelines tell users under what conditions they need to change the 
settings.


From benc at hawaga.org.uk  Tue Apr  1 11:34:03 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 Apr 2008 16:34:03 +0000 (GMT)
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <47F261F8.1050207@mcs.anl.gov>
References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov>
	<47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> 
	<1206381792.11561.1.camel@blabla.mcs.anl.gov> 
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
	<47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu>
	<1207063923.30798.0.camel@blabla.mcs.anl.gov>
	<47F25814.8050301@cs.uchicago.edu>
	<1207066105.434.0.camel@blabla.mcs.anl.gov>
	<47F261F8.1050207@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804011630590.5372@dildano.hawaga.org.uk>


On Tue, 1 Apr 2008, Michael Wilde wrote:

> I'm in favor of heading to an approach where we have good fast default
> configurations for all our locally used systems (TG, OSG and the

good != fast

for example, debuggable and non-desturctive-to-target-resource are other 
desirable characteristics. the debuggable one is especially important. 
Crippling the logging system to achieve faster execution is something that 
should be turned on, not off - that moves error reporting back to a 
boolean WRONG! style of reporting rather than the (I think) more useful 
stuff that we have at the moment. Likewise, pushing stuff up to the limit 
of what a site can handle (especially using GRAM2) is something that 
should be approached with caution and not by default.


-- 


From iraicu at cs.uchicago.edu  Tue Apr  1 21:14:54 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 01 Apr 2008 21:14:54 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <Pine.LNX.4.64.0804011630590.5372@dildano.hawaga.org.uk>
References: <47E7DC3D.6040704@cs.uchicago.edu>
	<47E7E30A.2020208@mcs.anl.gov>	<47E7E684.5070303@cs.uchicago.edu>
	<47E7EC03.1090609@mcs.anl.gov>
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>	<47E86E5A.6080303@mcs.anl.gov>
	<47E88D8C.4090207@mcs.anl.gov>	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>	<47F232C4.3080607@mcs.anl.gov>
	<47F25435.8080105@cs.uchicago.edu>	<1207063923.30798.0.camel@blabla.mcs.anl.gov>	<47F25814.8050301@cs.uchicago.edu>	<1207066105.434.0.camel@blabla.mcs.anl.gov>	<47F261F8.1050207@mcs.anl.gov>
	<Pine.LNX.4.64.0804011630590.5372@dildano.hawaga.org.uk>
Message-ID: <47F2EC1E.4030301@cs.uchicago.edu>

What I think would be nice to have, is a "high performance" option, 
which would disable all logging everywhere in Swift, except for the bare 
essential for Swift to be operational, in order to allow Swift to get 
the best performance possible.  This doesn't have to be the default, but 
could allow a user to simply toggle a parameter and go from fast 
performance mode to slow debug mode.  I think what we are trying to say 
with our recent experience with the BG/P is that we (as the users of 
Swift on BG/P) would be willing to live with a boolean error code if it 
meant that we could get significantly better performance, which in turn 
would give us higher resource utilization.

Ioan

Ben Clifford wrote:
> On Tue, 1 Apr 2008, Michael Wilde wrote:
>
>   
>> I'm in favor of heading to an approach where we have good fast default
>> configurations for all our locally used systems (TG, OSG and the
>>     
>
> good != fast
>
> for example, debuggable and non-desturctive-to-target-resource are other 
> desirable characteristics. the debuggable one is especially important. 
> Crippling the logging system to achieve faster execution is something that 
> should be turned on, not off - that moves error reporting back to a 
> boolean WRONG! style of reporting rather than the (I think) more useful 
> stuff that we have at the moment. Likewise, pushing stuff up to the limit 
> of what a site can handle (especially using GRAM2) is something that 
> should be approached with caution and not by default.
>
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080401/8d7682b8/attachment.html>

From benc at hawaga.org.uk  Tue Apr  1 23:11:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Apr 2008 04:11:23 +0000 (GMT)
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <1207066105.434.0.camel@blabla.mcs.anl.gov>
References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov>
	<47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov>
	<1206381792.11561.1.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>
	<47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov>
	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>
	<47F232C4.3080607@mcs.anl.gov> <47F25435.8080105@cs.uchicago.edu>
	<1207063923.30798.0.camel@blabla.mcs.anl.gov>
	<47F25814.8050301@cs.uchicago.edu>
	<1207066105.434.0.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804020405530.5372@dildano.hawaga.org.uk>


On Tue, 1 Apr 2008, Mihael Hategan wrote:

> Having optimized wrappers for different architectures is a perfectly
> valid option.

it might also be possible to replace the entire execute2 layer of 
stagein-execute-stageout if falkon wants to do its own worker-node data 
placement - Swift would be the same down to calling execute2 but the 
execute2-replacement would be falkon specific rather than using the 
present model which assumes a shared filesystem for stageins.

I think that might fit in better with what Falkon is trying to do, letting 
it know which files are required by which job, rather than assuming a 
cluster-wide shared filesystem to fetch data from.

(The same might apply for using condor with no shared filesystem, which 
isthe situation in many campus workstation labs that I've seen - an 
execute2 layer that submits a bundled up stagein/stageout/execute as a 
single condor submission - I have been mulling over that for a few months 
and maybe will get someone to play with that in google summer of code)

-- 


From wilde at mcs.anl.gov  Wed Apr  2 07:21:35 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 02 Apr 2008 07:21:35 -0500
Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to
	explain plateaus...]
In-Reply-To: <Pine.LNX.4.64.0804020405530.5372@dildano.hawaga.org.uk>
References: <47E7DC3D.6040704@cs.uchicago.edu>
	<47E7E30A.2020208@mcs.anl.gov>	<47E7E684.5070303@cs.uchicago.edu>
	<47E7EC03.1090609@mcs.anl.gov>	<1206381792.11561.1.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0803242335510.5372@dildano.hawaga.org.uk>	<47E86E5A.6080303@mcs.anl.gov>
	<47E88D8C.4090207@mcs.anl.gov>	<Pine.LNX.4.64.0804010855270.9854@dildano.hawaga.org.uk>	<Pine.LNX.4.64.0804011004080.5372@dildano.hawaga.org.uk>	<47F232C4.3080607@mcs.anl.gov>
	<47F25435.8080105@cs.uchicago.edu>	<1207063923.30798.0.camel@blabla.mcs.anl.gov>	<47F25814.8050301@cs.uchicago.edu>	<1207066105.434.0.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804020405530.5372@dildano.hawaga.org.uk>
Message-ID: <47F37A4F.4000606@mcs.anl.gov>

Im very much in favor of this approach.

- Mike


On 4/1/08 11:11 PM, Ben Clifford wrote:
> On Tue, 1 Apr 2008, Mihael Hategan wrote:
> 
>> Having optimized wrappers for different architectures is a perfectly
>> valid option.
> 
> it might also be possible to replace the entire execute2 layer of 
> stagein-execute-stageout if falkon wants to do its own worker-node data 
> placement - Swift would be the same down to calling execute2 but the 
> execute2-replacement would be falkon specific rather than using the 
> present model which assumes a shared filesystem for stageins.
> 
> I think that might fit in better with what Falkon is trying to do, letting 
> it know which files are required by which job, rather than assuming a 
> cluster-wide shared filesystem to fetch data from.
> 
> (The same might apply for using condor with no shared filesystem, which 
> isthe situation in many campus workstation labs that I've seen - an 
> execute2 layer that submits a bundled up stagein/stageout/execute as a 
> single condor submission - I have been mulling over that for a few months 
> and maybe will get someone to play with that in google summer of code)
> 


From bugzilla-daemon at mcs.anl.gov  Thu Apr  3 00:18:12 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu,  3 Apr 2008 00:18:12 -0500 (CDT)
Subject: [Swift-devel] [Bug 110] move OPTIONS out of swift executable
In-Reply-To: <bug-110-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080403051812.96936164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=110


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|hategan at mcs.anl.gov         |benc at hawaga.org.uk


------- Comment #1 from benc at hawaga.org.uk  2008-04-03 00:18 -------
There is a COG_OPTS environment variable that can be used for this. Probably
should be documented in the user guide.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From benc at hawaga.org.uk  Thu Apr  3 00:07:09 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 05:07:09 +0000 (GMT)
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>


On Thu, 20 Mar 2008, Ben Clifford wrote:

> There was a long long pause between swift 0.3 and swift 0.4; and 
> consequently a bunch of bugs have been discovered. so I'd like to put out 
> a 0.5 sometime in the next couple weeks to release those bugfixes.

I would like to do that this week as 0.4 got a bunch of fairly big 
bugfixes right after release.

However, I don't like that data channel caching doesn't work for a bunch 
of sites; one straightforward thing to do there is disable data channel 
caching entirely (that's what I have done in my personal development 
codebase). Opinion?

-- 


From bugzilla-daemon at mcs.anl.gov  Thu Apr  3 00:26:57 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu,  3 Apr 2008 00:26:57 -0500 (CDT)
Subject: [Swift-devel] [Bug 128] New: out of memory situations sometimes
	cause silent hangs
Message-ID: <bug-128-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=128

           Summary: out of memory situations sometimes cause silent hangs
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: hategan at mcs.anl.gov
        ReportedBy: benc at hawaga.org.uk


Sometimes when Swift gets low on or runs out of memory (as indicated by the
heap size log lines), the execution hangs doing nothing without reporting an
error, rather than cleanly exiting. This is a poor user experience.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Thu Apr  3 00:44:13 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu,  3 Apr 2008 00:44:13 -0500 (CDT)
Subject: [Swift-devel] [Bug 129] New: ENV profiles using GRAM2 cause console
	output of environment variable value
Message-ID: <bug-129-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=129

           Summary: ENV profiles using GRAM2 cause console output of
                    environment variable value
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: minor
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk


When profile entries in the ENV namespace are used with the GRAM2 provider,
there is spurious console output of the value of those ENV profile entries.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at mcs.anl.gov  Thu Apr  3 00:57:44 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu,  3 Apr 2008 00:57:44 -0500 (CDT)
Subject: [Swift-devel] [Bug 130] New: submitting to TG NCSA Mercury PBS with
	PATH env profile set causes job to hang
Message-ID: <bug-130-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=130

           Summary: submitting to TG NCSA Mercury PBS with PATH env profile
                    set causes job to hang
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: nobody at mcs.anl.gov
        ReportedBy: benc at hawaga.org.uk
                CC: skenny at uchicago.edu


Submititng to TG NCSA Mercury PBS with PATH env profile set causes the job to
hang on the worker node.

Submitting without that PATH set does not cause the hang. Submitting to
jobmanager-fork on that machine (instead of PBS) does not cause the hang.
Submitting with PATH env profile to teraport PBS does not cause hang.

This happens with every swift script i have tried.

skenny has also seen similar behaviour (and is the instigator of this
investigation)

Here is a sites.xml entry that causes hangs for me:

  <pool handle="tgncsa-hg" >
    <gridftp  url="gsiftp://gt2-gridftp-hg.ncsa.teragrid.org" />.
    <jobmanager universe="vanilla"
url="grid-hg.ncsa.teragrid.org/jobmanager-pbs" major="2" />
    <workdirectory >/home/ac/benc</workdirectory>
    <profile namespace="globus" key="project">TG-CCR080002N</profile>
    <profile namespace="env" key="PATH">/:$PATH</profile>
  </pool>

Removing that PATH profile entry makes things work again.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From foster at mcs.anl.gov  Thu Apr  3 01:23:25 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 03 Apr 2008 01:23:25 -0500
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
Message-ID: <47F477DD.6040001@mcs.anl.gov>

Ben:

Can you explain a bit more about the "data channel doesn't work for a 
bunch of sites" problem?

Ian.

Ben Clifford wrote:
> On Thu, 20 Mar 2008, Ben Clifford wrote:
>
>   
>> There was a long long pause between swift 0.3 and swift 0.4; and 
>> consequently a bunch of bugs have been discovered. so I'd like to put out 
>> a 0.5 sometime in the next couple weeks to release those bugfixes.
>>     
>
> I would like to do that this week as 0.4 got a bunch of fairly big 
> bugfixes right after release.
>
> However, I don't like that data channel caching doesn't work for a bunch 
> of sites; one straightforward thing to do there is disable data channel 
> caching entirely (that's what I have done in my personal development 
> codebase). Opinion?
>
>   
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080403/28959b1c/attachment.html>

From benc at hawaga.org.uk  Thu Apr  3 01:27:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 06:27:23 +0000 (GMT)
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <47F477DD.6040001@mcs.anl.gov>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
	<47F477DD.6040001@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>


> Can you explain a bit more about the "data channel doesn't work for a bunch of
> sites" problem?

There's some channel reuse code that went into cog in the past few months. 
It gets enabled when it detects that it has been pointed at a specific 
version of the GridFTP server (which is in itself a bug as it should 
really work for lots of versions). The code appears to not work. So when 
Swift is pointed at a gridftp server of that version, it cannot stage in 
or out files. When it is pointed at a different version gridftp server, 
the two bugs cancel each other out - data channel reuse is not used, and 
so Swift can stage files in and out.

-- 


From hategan at mcs.anl.gov  Thu Apr  3 04:22:51 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 03 Apr 2008 04:22:51 -0500
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
	<47F477DD.6040001@mcs.anl.gov>
	<Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>
Message-ID: <1207214571.14147.5.camel@blabla.mcs.anl.gov>


On Thu, 2008-04-03 at 06:27 +0000, Ben Clifford wrote:
> 
> > Can you explain a bit more about the "data channel doesn't work for a bunch of
> > sites" problem?
> 
> There's some channel reuse code that went into cog in the past few months. 
> It gets enabled when it detects that it has been pointed at a specific 
> version of the GridFTP server (which is in itself a bug as it should 
> really work for lots of versions).

Not quite. I've put that in precisely because some versions didn't work.

>  The code appears to not work. So when 
> Swift is pointed at a gridftp server of that version, it cannot stage in 
> or out files. When it is pointed at a different version gridftp server, 
> the two bugs cancel each other out - data channel reuse is not used, and 
> so Swift can stage files in and out.
> 


From bugzilla-daemon at mcs.anl.gov  Thu Apr  3 04:26:17 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu,  3 Apr 2008 04:26:17 -0500 (CDT)
Subject: [Swift-devel] [Bug 130] submitting to TG NCSA Mercury PBS with PATH
	env profile set causes job to hang
In-Reply-To: <bug-130-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080403092617.146E31650A@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=130


------- Comment #1 from hategan at mcs.anl.gov  2008-04-03 04:26 -------
<profile namespace="env" key="PATH">/:$PATH</profile>

I don't think that trick works (i.e. that the existing $PATH will be
substituted). What probably happens is that the job runs without /bin and
/usr/bin in the path.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From wilde at mcs.anl.gov  Thu Apr  3 07:14:13 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 03 Apr 2008 07:14:13 -0500
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <1207214571.14147.5.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>	<47F477DD.6040001@mcs.anl.gov>	<Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>
	<1207214571.14147.5.camel@blabla.mcs.anl.gov>
Message-ID: <47F4CA15.9090303@mcs.anl.gov>

Mihael, Ben,

How well do you understand the problem and whats your confidence in 
being able to reliably fix it?

It seems best to disable it (or adjust things) to get more reliable 
operation across all or more sites, at the expense of performance on 
some sites, while its being fixed.

Mihael, is this bug on your plate? Whats your estimate of effort involved?

- Mike


On 4/3/08 4:22 AM, Mihael Hategan wrote:
> On Thu, 2008-04-03 at 06:27 +0000, Ben Clifford wrote:
>>> Can you explain a bit more about the "data channel doesn't work for a bunch of
>>> sites" problem?
>> There's some channel reuse code that went into cog in the past few months. 
>> It gets enabled when it detects that it has been pointed at a specific 
>> version of the GridFTP server (which is in itself a bug as it should 
>> really work for lots of versions).
> 
> Not quite. I've put that in precisely because some versions didn't work.
> 
>>  The code appears to not work. So when 
>> Swift is pointed at a gridftp server of that version, it cannot stage in 
>> or out files. When it is pointed at a different version gridftp server, 
>> the two bugs cancel each other out - data channel reuse is not used, and 
>> so Swift can stage files in and out.
>>
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 


From hategan at mcs.anl.gov  Thu Apr  3 07:19:27 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 03 Apr 2008 07:19:27 -0500
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <47F4CA15.9090303@mcs.anl.gov>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
	<47F477DD.6040001@mcs.anl.gov>
	<Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>
	<1207214571.14147.5.camel@blabla.mcs.anl.gov>
	<47F4CA15.9090303@mcs.anl.gov>
Message-ID: <1207225168.27151.1.camel@blabla.mcs.anl.gov>


On Thu, 2008-04-03 at 07:14 -0500, Michael Wilde wrote:
> Mihael, Ben,
> 
> How well do you understand the problem and whats your confidence in 
> being able to reliably fix it?
> 
> It seems best to disable it (or adjust things) to get more reliable 
> operation across all or more sites, at the expense of performance on 
> some sites, while its being fixed.
> 
> Mihael, is this bug on your plate? 

Yes.

> Whats your estimate of effort involved?

This was discussed before. In the bigger context of the
small-file-optimization there is one week of real time left.

> 
> - Mike
> 
> 
> 
> 
> On 4/3/08 4:22 AM, Mihael Hategan wrote:
> > On Thu, 2008-04-03 at 06:27 +0000, Ben Clifford wrote:
> >>> Can you explain a bit more about the "data channel doesn't work for a bunch of
> >>> sites" problem?
> >> There's some channel reuse code that went into cog in the past few months. 
> >> It gets enabled when it detects that it has been pointed at a specific 
> >> version of the GridFTP server (which is in itself a bug as it should 
> >> really work for lots of versions).
> > 
> > Not quite. I've put that in precisely because some versions didn't work.
> > 
> >>  The code appears to not work. So when 
> >> Swift is pointed at a gridftp server of that version, it cannot stage in 
> >> or out files. When it is pointed at a different version gridftp server, 
> >> the two bugs cancel each other out - data channel reuse is not used, and 
> >> so Swift can stage files in and out.
> >>
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> > 
> 


From benc at hawaga.org.uk  Thu Apr  3 13:40:29 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 18:40:29 +0000 (GMT)
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <47F4CA15.9090303@mcs.anl.gov>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
	<47F477DD.6040001@mcs.anl.gov>
	<Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>
	<1207214571.14147.5.camel@blabla.mcs.anl.gov>
	<47F4CA15.9090303@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804031834260.9854@dildano.hawaga.org.uk>


> How well do you understand the problem and whats your confidence in being able
> to reliably fix it?

for a this-week 0.5 release, disabling this is a one liner that I'm pretty 
confident doesn't break things (in as much as it passes the site tests 
that I have). that satisfies my urge to get something out fairly quickly 
that is less shitty than 0.4.

-- 


From wilde at mcs.anl.gov  Thu Apr  3 14:03:49 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 03 Apr 2008 14:03:49 -0500
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <Pine.LNX.4.64.0804031834260.9854@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
	<47F477DD.6040001@mcs.anl.gov>
	<Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>
	<1207214571.14147.5.camel@blabla.mcs.anl.gov>
	<47F4CA15.9090303@mcs.anl.gov>
	<Pine.LNX.4.64.0804031834260.9854@dildano.hawaga.org.uk>
Message-ID: <47F52A15.6060209@mcs.anl.gov>

OK.  Hopefully it can be fixed in the next few weeks but no need to 
delay 0.5 for it.

On 4/3/08 1:40 PM, Ben Clifford wrote:
>> How well do you understand the problem and whats your confidence in being able
>> to reliably fix it?
> 
> for a this-week 0.5 release, disabling this is a one liner that I'm pretty 
> confident doesn't break things (in as much as it passes the site tests 
> that I have). that satisfies my urge to get something out fairly quickly 
> that is less shitty than 0.4.
> 


From foster at mcs.anl.gov  Thu Apr  3 14:31:35 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 03 Apr 2008 14:31:35 -0500
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <47F52A15.6060209@mcs.anl.gov>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>	<47F477DD.6040001@mcs.anl.gov>	<Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>	<1207214571.14147.5.camel@blabla.mcs.anl.gov>	<47F4CA15.9090303@mcs.anl.gov>	<Pine.LNX.4.64.0804031834260.9854@dildano.hawaga.org.uk>
	<47F52A15.6060209@mcs.anl.gov>
Message-ID: <47F53097.3000909@mcs.anl.gov>

could we have a flag so that if people want to turn it on, they can? 
(assuming it can work in some settings.)

Michael Wilde wrote:
> OK.  Hopefully it can be fixed in the next few weeks but no need to 
> delay 0.5 for it.
>
> On 4/3/08 1:40 PM, Ben Clifford wrote:
>>> How well do you understand the problem and whats your confidence in 
>>> being able
>>> to reliably fix it?
>>
>> for a this-week 0.5 release, disabling this is a one liner that I'm 
>> pretty confident doesn't break things (in as much as it passes the 
>> site tests that I have). that satisfies my urge to get something out 
>> fairly quickly that is less shitty than 0.4.
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>


From benc at hawaga.org.uk  Thu Apr  3 14:40:26 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 19:40:26 +0000 (GMT)
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <47F53097.3000909@mcs.anl.gov>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
	<47F477DD.6040001@mcs.anl.gov>
	<Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>
	<1207214571.14147.5.camel@blabla.mcs.anl.gov>
	<47F4CA15.9090303@mcs.anl.gov>
	<Pine.LNX.4.64.0804031834260.9854@dildano.hawaga.org.uk>
	<47F52A15.6060209@mcs.anl.gov> <47F53097.3000909@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804031938150.9854@dildano.hawaga.org.uk>


> could we have a flag so that if people want to turn it on, they can? (assuming
> it can work in some settings.)

it doesn't work in any situation i've tried at the moment. when mihael has 
done his stuff it will work just fine. until then what I'm looking for is 
a quick non-damaging fix to cover from now until that point that lies 
somewhere (hopefully in the next six months) in the future.

-- 


From bugzilla-daemon at mcs.anl.gov  Fri Apr  4 01:12:44 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri,  4 Apr 2008 01:12:44 -0500 (CDT)
Subject: [Swift-devel] [Bug 111] stage out -info and cluster logs in the
	same fashion as kickstart records.
In-Reply-To: <bug-111-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080404061244.0EE51164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=111


------- Comment #1 from benc at hawaga.org.uk  2008-04-04 01:12 -------
This is done for info logs.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From bugzilla-daemon at mcs.anl.gov  Fri Apr  4 04:32:47 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri,  4 Apr 2008 04:32:47 -0500 (CDT)
Subject: [Swift-devel] [Bug 41] Deadlock in atomic procedures
In-Reply-To: <bug-41-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080404093247.44E5B164EC@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=41


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|hategan at mcs.anl.gov         |benc at hawaga.org.uk
          Component|General                     |SwiftScript language


------- Comment #1 from benc at hawaga.org.uk  2008-04-04 04:32 -------
probably the compiler should catch this. for example:

i) it should detect that the parameter is of an inappropriate type (we should
only allow simple types here)

ii) it should detect that a variable that is know to be an output is being used
in an input context.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From hategan at mcs.anl.gov  Fri Apr  4 04:39:37 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 04 Apr 2008 04:39:37 -0500
Subject: [Swift-devel] coaster status summary
Message-ID: <1207301977.1658.14.camel@blabla.mcs.anl.gov>

I've been asked for a summary of the status of the coaster prototype, so
here it is:
- It's a prototype so bugs are plenty
- It's self deployed (you don't need to start a service on the target
cluster)
- You can also use it while starting a service on the target cluster
- There is a worker written in Perl
- It uses encryption between client and coaster service
- It uses UDP between the service and the workers (this may prove to be
better or worse choice than TCP)
- A preliminary test done locally shows an amortized throughput of
around 180 jobs/s (/bin/date). This was done with encryption and with 10
workers. Pretty picture attached (total time vs. # of jobs)

To do:
- The scheduling algorithm in the service needs a bit more work
- When worker messages are lost, some jobs may get lost (i.e. needs more
fault tolerance)
- Start testing it on actual clusters
- Do some memory consumption benchmarks
- Better allocation strategy for workers

Mihael
-------------- next part --------------
A non-text attachment was scrubbed...
Name: speed.pdf
Type: application/pdf
Size: 18168 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080404/511fde25/attachment.pdf>

From benc at hawaga.org.uk  Fri Apr  4 04:41:20 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 4 Apr 2008 09:41:20 +0000 (GMT)
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>

are you going to put the source somewhere visible?
-- 


From wilde at mcs.anl.gov  Fri Apr  4 06:59:28 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 04 Apr 2008 06:59:28 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
Message-ID: <47F61820.3090705@mcs.anl.gov>

Mihael, this is great progress - very exciting.
Some questions (dont need answers right away):

How would the end user use it? Manually start a service?
Is the service a separate process, or in the swift jvm?
How are the number of workers set or adjusted?
Does a service manage workers on one cluster or many?
At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
worker and service?

Do you want to try this on the workflows we're running on Falkon on the 
BGP and SiCortex?

Im eager to try it when you feel its ready for others to test.

Nice work!

- Mike


On 4/4/08 4:39 AM, Mihael Hategan wrote:
> I've been asked for a summary of the status of the coaster prototype, so
> here it is:
> - It's a prototype so bugs are plenty
> - It's self deployed (you don't need to start a service on the target
> cluster)
> - You can also use it while starting a service on the target cluster
> - There is a worker written in Perl
> - It uses encryption between client and coaster service
> - It uses UDP between the service and the workers (this may prove to be
> better or worse choice than TCP)
> - A preliminary test done locally shows an amortized throughput of
> around 180 jobs/s (/bin/date). This was done with encryption and with 10
> workers. Pretty picture attached (total time vs. # of jobs)
> 
> To do:
> - The scheduling algorithm in the service needs a bit more work
> - When worker messages are lost, some jobs may get lost (i.e. needs more
> fault tolerance)
> - Start testing it on actual clusters
> - Do some memory consumption benchmarks
> - Better allocation strategy for workers
> 
> Mihael
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Fri Apr  4 07:02:22 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 04 Apr 2008 07:02:22 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
Message-ID: <1207310542.7171.0.camel@blabla.mcs.anl.gov>

Of course. Just haven't done it yet.

On Fri, 2008-04-04 at 09:41 +0000, Ben Clifford wrote:
> are you going to put the source somewhere visible?


From hategan at mcs.anl.gov  Fri Apr  4 07:12:47 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 04 Apr 2008 07:12:47 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47F61820.3090705@mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<47F61820.3090705@mcs.anl.gov>
Message-ID: <1207311167.7171.12.camel@blabla.mcs.anl.gov>


On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
> Mihael, this is great progress - very exciting.
> Some questions (dont need answers right away):
> 
> How would the end user use it? Manually start a service?
> Is the service a separate process, or in the swift jvm?

I though the lines below answered some of these.

A user would specify the coaster provider in sites.xml. The provider
will then automatically deploy a service on the target machine without
the user having to do so. Given that the service is on a different
machine than the client, they can't be in the same JVM.

> How are the number of workers set or adjusted?

Currently workers are requested as much as needed, up to a maximum. This
is preliminary hence "Better allocation strategy for workers".

> Does a service manage workers on one cluster or many?

One service per cluster.

> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
> worker and service?

I faintly recall them being at less than 50% for some reason I don't
understand.

> 
> Do you want to try this on the workflows we're running on Falkon on the 
> BGP and SiCortex?

Let me repeat "prototype" and "more testing". In no way do I want to do
preliminary testing with an application that is shaky on an architecture
that is also shaky.

Mihael

> 
> Im eager to try it when you feel its ready for others to test.
> 
> Nice work!
> 
> - Mike
> 
> 
> 
> On 4/4/08 4:39 AM, Mihael Hategan wrote:
> > I've been asked for a summary of the status of the coaster prototype, so
> > here it is:
> > - It's a prototype so bugs are plenty
> > - It's self deployed (you don't need to start a service on the target
> > cluster)
> > - You can also use it while starting a service on the target cluster
> > - There is a worker written in Perl
> > - It uses encryption between client and coaster service
> > - It uses UDP between the service and the workers (this may prove to be
> > better or worse choice than TCP)
> > - A preliminary test done locally shows an amortized throughput of
> > around 180 jobs/s (/bin/date). This was done with encryption and with 10
> > workers. Pretty picture attached (total time vs. # of jobs)
> > 
> > To do:
> > - The scheduling algorithm in the service needs a bit more work
> > - When worker messages are lost, some jobs may get lost (i.e. needs more
> > fault tolerance)
> > - Start testing it on actual clusters
> > - Do some memory consumption benchmarks
> > - Better allocation strategy for workers
> > 
> > Mihael
> > 
> > 
> > ------------------------------------------------------------------------
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From bugzilla-daemon at mcs.anl.gov  Fri Apr  4 07:36:50 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri,  4 Apr 2008 07:36:50 -0500 (CDT)
Subject: [Swift-devel] [Bug 131] New: Clarify documentation on profiles
Message-ID: <bug-131-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=131

           Summary: Clarify documentation on profiles
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Documentation
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: wilde at mcs.anl.gov


clarify what profiles are for (few sentences)
explain that profiles are associated with sites, in sites.xml, or with apps, in
tc.data

move profile section down after sites & tc description. where its at now, its
confusing what profiles are for when you first encounter the section.

list the set of recognized profiles, by namespace
refer to other docs for parameter values that are beyond the scope of this doc
(but show the common examples, which is mostly OK, but scattered throughout the
UG at the moment)

for Globus, explain more about how queue and maxwalltime interact to determine
how your job is queued, and how to find the queue information (eg a few
pointers to UC/TG and Teraport info in the local users section)

in "local users" section list profile info relevant to uc/osg/tg environment
- cputype is the main missing one I think for UC/TG only.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From iraicu at cs.uchicago.edu  Fri Apr  4 19:02:44 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 04 Apr 2008 19:02:44 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207311167.7171.12.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>	<47F61820.3090705@mcs.anl.gov>
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>
Message-ID: <47F6C1A4.5030200@cs.uchicago.edu>

You say that you use UDP on the workers.  This might be more light 
weight, but might also pose practical issues. 

Some of those are:
- might not work well on any network other than a LAN
- won't be friendly to firewalls or NATs, no matter if you the service 
pushes jobs, or workers pull jobs; the logic is that you need 2 way 
communication, and using UDP (being a connectionless protocol), its like 
having a server socket and a client socket on both ends of the 
communication at the same time.  This might not matter if the service 
and the worker are on the same LAN with no NATs or firewalls in the 
middle, but, it would matter on a machine such as the BG/P, as there is 
a NAT inbetween the login nodes and the compute nodes.  In essence, for 
this to work on the BG/P, you'll need to avoid having server side 
sockets on the compute nodes (workers), and you'll probably only be able 
to do that via a connection oriented protocol (i.e. TCP).  Is switching 
to TCP a relatively straight forward option?  If not, it might be worth 
implementing to make the implementation more flexible
- loosing messages and recovering from them will likely be harder than 
anticipated; I have a UDP version of the notification engine that Falkon 
uses, and after much debugging, I gave up and switched over to TCP.  It 
worked most of the time, but the occasional lost message (1 in 1000s, 
maybe even more rare) made Falkon unreliable, and hence I stopped using 
it. 

Is the 180 tasks/sec the overall throughput measured from Swift's point 
of view, including overhead of wrapper.sh?  Or is that a micro-benchmark 
measuring just the coaster performance? 

Ioan


Mihael Hategan wrote:
> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
>   
>> Mihael, this is great progress - very exciting.
>> Some questions (dont need answers right away):
>>
>> How would the end user use it? Manually start a service?
>> Is the service a separate process, or in the swift jvm?
>>     
>
> I though the lines below answered some of these.
>
> A user would specify the coaster provider in sites.xml. The provider
> will then automatically deploy a service on the target machine without
> the user having to do so. Given that the service is on a different
> machine than the client, they can't be in the same JVM.
>
>   
>> How are the number of workers set or adjusted?
>>     
>
> Currently workers are requested as much as needed, up to a maximum. This
> is preliminary hence "Better allocation strategy for workers".
>
>   
>> Does a service manage workers on one cluster or many?
>>     
>
> One service per cluster.
>
>   
>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
>> worker and service?
>>     
>
> I faintly recall them being at less than 50% for some reason I don't
> understand.
>
>   
>> Do you want to try this on the workflows we're running on Falkon on the 
>> BGP and SiCortex?
>>     
>
> Let me repeat "prototype" and "more testing". In no way do I want to do
> preliminary testing with an application that is shaky on an architecture
> that is also shaky.
>
> Mihael
>
>   
>> Im eager to try it when you feel its ready for others to test.
>>
>> Nice work!
>>
>> - Mike
>>
>>
>>
>> On 4/4/08 4:39 AM, Mihael Hategan wrote:
>>     
>>> I've been asked for a summary of the status of the coaster prototype, so
>>> here it is:
>>> - It's a prototype so bugs are plenty
>>> - It's self deployed (you don't need to start a service on the target
>>> cluster)
>>> - You can also use it while starting a service on the target cluster
>>> - There is a worker written in Perl
>>> - It uses encryption between client and coaster service
>>> - It uses UDP between the service and the workers (this may prove to be
>>> better or worse choice than TCP)
>>> - A preliminary test done locally shows an amortized throughput of
>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
>>> workers. Pretty picture attached (total time vs. # of jobs)
>>>
>>> To do:
>>> - The scheduling algorithm in the service needs a bit more work
>>> - When worker messages are lost, some jobs may get lost (i.e. needs more
>>> fault tolerance)
>>> - Start testing it on actual clusters
>>> - Do some memory consumption benchmarks
>>> - Better allocation strategy for workers
>>>
>>> Mihael
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>       
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080404/91080066/attachment.html>

From benc at hawaga.org.uk  Sat Apr  5 04:30:38 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 5 Apr 2008 09:30:38 +0000 (GMT)
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47F6C1A4.5030200@cs.uchicago.edu>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<47F61820.3090705@mcs.anl.gov>
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>
	<47F6C1A4.5030200@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804050930070.5372@dildano.hawaga.org.uk>

> it would matter on a machine such as
> the BG/P, as there is a NAT inbetween the login nodes and the compute nodes.

wierd. is there a description of that somewhere?

-- 


From hategan at mcs.anl.gov  Sat Apr  5 04:45:54 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 05 Apr 2008 04:45:54 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47F6C1A4.5030200@cs.uchicago.edu>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<47F61820.3090705@mcs.anl.gov>
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>
	<47F6C1A4.5030200@cs.uchicago.edu>
Message-ID: <1207388755.10629.12.camel@blabla.mcs.anl.gov>


On Fri, 2008-04-04 at 19:02 -0500, Ioan Raicu wrote:
> You say that you use UDP on the workers.  This might be more light
> weight, but might also pose practical issues.  

Of course. That is the trade-off.

> 
> Some of those are:
> - might not work well on any network other than a LAN

It works exactly as it's supposed to: no guarantee of uniqueness, no
guarantee of order, no guarantee of integrity, and no guarantee of
reliability. One has to drop duplicates, do checksums, re-order, have
time-outs.

> - won't be friendly to firewalls or NATs, no matter if you the service
> pushes jobs, or workers pull jobs; the logic is that you need 2 way
> communication, and using UDP (being a connectionless protocol), its
> like having a server socket and a client socket on both ends of the
> communication at the same time.

Precisely so. In Java you can use one UDP socket as both client and
server. Perl seems to be nastier as it won't let you send and receive on
the same socket (at least in the implementation I've seen).

>   This might not matter if the service and the worker are on the same
> LAN with no NATs or firewalls in the middle, but, it would matter on a
> machine such as the BG/P, as there is a NAT inbetween the login nodes
> and the compute nodes.

That's odd. Do you have anything to back that up?

>   In essence, for this to work on the BG/P, you'll need to avoid
> having server side sockets on the compute nodes (workers), and you'll
> probably only be able to do that via a connection oriented protocol
> (i.e. TCP).  Is switching to TCP a relatively straight forward option?
> If not, it might be worth implementing to make the implementation more
> flexible
> - loosing messages and recovering from them will likely be harder than
> anticipated; I have a UDP version of the notification engine that
> Falkon uses, and after much debugging, I gave up and switched over to
> TCP.  It worked most of the time, but the occasional lost message (1
> in 1000s, maybe even more rare) made Falkon unreliable, and hence I
> stopped using it.

Of course it's unreliable unless you deal with the reliability issues as
outlined above.

> 
> Is the 180 tasks/sec the overall throughput measured from Swift's
> point of view, including overhead of wrapper.sh?  Or is that a
> micro-benchmark measuring just the coaster performance?  

It's at the provider level. No wrapper.sh.

> 
> Ioan
> 
> 
> Mihael Hategan wrote: 
> > On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
> >   
> > > Mihael, this is great progress - very exciting.
> > > Some questions (dont need answers right away):
> > > 
> > > How would the end user use it? Manually start a service?
> > > Is the service a separate process, or in the swift jvm?
> > >     
> > 
> > I though the lines below answered some of these.
> > 
> > A user would specify the coaster provider in sites.xml. The provider
> > will then automatically deploy a service on the target machine without
> > the user having to do so. Given that the service is on a different
> > machine than the client, they can't be in the same JVM.
> > 
> >   
> > > How are the number of workers set or adjusted?
> > >     
> > 
> > Currently workers are requested as much as needed, up to a maximum. This
> > is preliminary hence "Better allocation strategy for workers".
> > 
> >   
> > > Does a service manage workers on one cluster or many?
> > >     
> > 
> > One service per cluster.
> > 
> >   
> > > At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
> > > worker and service?
> > >     
> > 
> > I faintly recall them being at less than 50% for some reason I don't
> > understand.
> > 
> >   
> > > Do you want to try this on the workflows we're running on Falkon on the 
> > > BGP and SiCortex?
> > >     
> > 
> > Let me repeat "prototype" and "more testing". In no way do I want to do
> > preliminary testing with an application that is shaky on an architecture
> > that is also shaky.
> > 
> > Mihael
> > 
> >   
> > > Im eager to try it when you feel its ready for others to test.
> > > 
> > > Nice work!
> > > 
> > > - Mike
> > > 
> > > 
> > > 
> > > On 4/4/08 4:39 AM, Mihael Hategan wrote:
> > >     
> > > > I've been asked for a summary of the status of the coaster prototype, so
> > > > here it is:
> > > > - It's a prototype so bugs are plenty
> > > > - It's self deployed (you don't need to start a service on the target
> > > > cluster)
> > > > - You can also use it while starting a service on the target cluster
> > > > - There is a worker written in Perl
> > > > - It uses encryption between client and coaster service
> > > > - It uses UDP between the service and the workers (this may prove to be
> > > > better or worse choice than TCP)
> > > > - A preliminary test done locally shows an amortized throughput of
> > > > around 180 jobs/s (/bin/date). This was done with encryption and with 10
> > > > workers. Pretty picture attached (total time vs. # of jobs)
> > > > 
> > > > To do:
> > > > - The scheduling algorithm in the service needs a bit more work
> > > > - When worker messages are lost, some jobs may get lost (i.e. needs more
> > > > fault tolerance)
> > > > - Start testing it on actual clusters
> > > > - Do some memory consumption benchmarks
> > > > - Better allocation strategy for workers
> > > > 
> > > > Mihael
> > > > 
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >       
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> >   
> 
> -- 
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
> 


From hategan at mcs.anl.gov  Sat Apr  5 04:54:46 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 05 Apr 2008 04:54:46 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207388755.10629.12.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<47F61820.3090705@mcs.anl.gov>
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>
	<47F6C1A4.5030200@cs.uchicago.edu>
	<1207388755.10629.12.camel@blabla.mcs.anl.gov>
Message-ID: <1207389286.10629.16.camel@blabla.mcs.anl.gov>

> >   This might not matter if the service and the worker are on the same
> > LAN with no NATs or firewalls in the middle, but, it would matter on a
> > machine such as the BG/P, as there is a NAT inbetween the login nodes
> > and the compute nodes.
> 
> That's odd. Do you have anything to back that up?
> 

Really really odd. I mean MPI has to work between any two worker nodes.
If they are on separate networks with NAT in-between, this would be
rather difficult.


From benc at hawaga.org.uk  Sat Apr  5 07:07:39 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 5 Apr 2008 12:07:39 +0000 (GMT)
Subject: [Swift-devel] swift 0.5rc1
Message-ID: <Pine.LNX.4.64.0804051204530.31934@dildano.hawaga.org.uk>

Swift 0.5 release candidate 1 is at 
http://www.ci.uchicago.edu/~benc/vdsk-0.5rc1.tar.gz

This is primarily bugfixes for bugs that were found around the time of the 
0.4 release - syntax error handling that was poorly tested before 0.4; and 
data channel caching problems. There shouldn't be many (if any) new 
features here.

Please test.

If there are no significant problems, I'll put this out as 0.5 on tuesday.

-- 


From iraicu at cs.uchicago.edu  Sat Apr  5 08:16:12 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 05 Apr 2008 08:16:12 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47F61820.3090705@mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<47F61820.3090705@mcs.anl.gov>
Message-ID: <47F77B9C.2060200@cs.uchicago.edu>

I meant to send this before, but somehow it seems to have gotten stuck 
in my draft folder ;(

We are running out of time on the papers we are writing now, but it 
would certainly been a good comparison of implementations, assumptions, 
trade-offs, performance, etc... for a future paper!

I am eager to learn more about it!

Ioan

Michael Wilde wrote:
> Mihael, this is great progress - very exciting.
> Some questions (dont need answers right away):
>
> How would the end user use it? Manually start a service?
> Is the service a separate process, or in the swift jvm?
> How are the number of workers set or adjusted?
> Does a service manage workers on one cluster or many?
> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
> worker and service?
>
> Do you want to try this on the workflows we're running on Falkon on 
> the BGP and SiCortex?
>
> Im eager to try it when you feel its ready for others to test.
>
> Nice work!
>
> - Mike
>
>
>
> On 4/4/08 4:39 AM, Mihael Hategan wrote:
>> I've been asked for a summary of the status of the coaster prototype, so
>> here it is:
>> - It's a prototype so bugs are plenty
>> - It's self deployed (you don't need to start a service on the target
>> cluster)
>> - You can also use it while starting a service on the target cluster
>> - There is a worker written in Perl
>> - It uses encryption between client and coaster service
>> - It uses UDP between the service and the workers (this may prove to be
>> better or worse choice than TCP)
>> - A preliminary test done locally shows an amortized throughput of
>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
>> workers. Pretty picture attached (total time vs. # of jobs)
>>
>> To do:
>> - The scheduling algorithm in the service needs a bit more work
>> - When worker messages are lost, some jobs may get lost (i.e. needs more
>> fault tolerance)
>> - Start testing it on actual clusters
>> - Do some memory consumption benchmarks
>> - Better allocation strategy for workers
>>
>> Mihael
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From iraicu at cs.uchicago.edu  Sat Apr  5 08:25:36 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 05 Apr 2008 08:25:36 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <Pine.LNX.4.64.0804050930070.5372@dildano.hawaga.org.uk>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<47F61820.3090705@mcs.anl.gov>
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>
	<47F6C1A4.5030200@cs.uchicago.edu>
	<Pine.LNX.4.64.0804050930070.5372@dildano.hawaga.org.uk>
Message-ID: <47F77DD0.8040302@cs.uchicago.edu>

I looked around for some docs on the networking structure, but couldn't 
find anything.

There are several networks available on the BG/P: Torus, Tree, Barrier, 
RAS, 10Gig Ethernet.

Of all these, we are only using the Ethernet network, which allows us to 
communicate via TCP/IP (or potentially UDP/IP) between compute nodes and 
I/O nodes, or between compute nodes and login nodes.  For the rest of 
the discussion, we assume only Ethernet communication.  There is 1 I/O 
node per 64 compute nodes (what we call a P-SET), and the I/O node can 
only communicate with compute nodes that it manages within the same 
P-SET (the 64 nodes).  A compute node from one P-SET cannot directly 
communicate with another compute from a different P-SET.  This is 
primarily because compute nodes have private addresses (192.168.x.x), 
I/O nodes are the NAT between the public IP and the private IP, and the 
login nodes only have a public IP.  So, the compute nodes all have the 
same IP addresses, 192.168.x.x, and they repeat for every P-SET, and the 
I/O nodes handle their traffic in and out. 

Zhao, if you have any docs on the Ethernet network and the NAT that sits 
on the I/O node, can you please send it to the mailing list?

Ioan

Ben Clifford wrote:
>> it would matter on a machine such as
>> the BG/P, as there is a NAT inbetween the login nodes and the compute nodes.
>>     
>
> wierd. is there a description of that somewhere?
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From iraicu at cs.uchicago.edu  Sat Apr  5 08:36:18 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 05 Apr 2008 08:36:18 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207388755.10629.12.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>	
	<47F61820.3090705@mcs.anl.gov>	
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>	
	<47F6C1A4.5030200@cs.uchicago.edu>
	<1207388755.10629.12.camel@blabla.mcs.anl.gov>
Message-ID: <47F78052.5020702@cs.uchicago.edu>


Mihael Hategan wrote:
> On Fri, 2008-04-04 at 19:02 -0500, Ioan Raicu wrote:
>   
>> You say that you use UDP on the workers.  This might be more light
>> weight, but might also pose practical issues.  
>>     
>
> Of course. That is the trade-off.
>
>   
Right, there will be, the key is to be able to switch between TCP and 
UDP easily.
>> Some of those are:
>> - might not work well on any network other than a LAN
>>     
>
> It works exactly as it's supposed to: no guarantee of uniqueness, no
> guarantee of order, no guarantee of integrity, and no guarantee of
> reliability. One has to drop duplicates, do checksums, re-order, have
> time-outs.
>
>   
>> - won't be friendly to firewalls or NATs, no matter if you the service
>> pushes jobs, or workers pull jobs; the logic is that you need 2 way
>> communication, and using UDP (being a connectionless protocol), its
>> like having a server socket and a client socket on both ends of the
>> communication at the same time.
>>     
>
> Precisely so. In Java you can use one UDP socket as both client and
> server. 
But even if the abstraction is OK and allows you to use the same socket 
for both reads and writes, that doesn't mean that the NAT will actually 
set up the coresponding entries for you to have 2-way communication.  
With TCP, given the connection oriented protocol, NATs are fine as long 
as one initiates the connection from the inside the NAT, but with UDP, 
you will only be able to have outgoing messages, but incoming messages 
to the NAT will not have the rules setup.  The only way I could see UDP 
working through the NAT is to have static rules setup ahead of time, 
that map between some PORTs on the NAT and IP:PORT on the compute nodes....
> Perl seems to be nastier as it won't let you send and receive on
> the same socket (at least in the implementation I've seen).
>
>   
>>   This might not matter if the service and the worker are on the same
>> LAN with no NATs or firewalls in the middle, but, it would matter on a
>> machine such as the BG/P, as there is a NAT inbetween the login nodes
>> and the compute nodes.
>>     
>
> That's odd. Do you have anything to back that up?
>
>   
Compute nodes have a private address per P-SET (64 nodes), and there are 
16 P-SETs in the current machine we use, and there will be 640 P-SETs in 
the final machine.  The I/O nodes (1 per P-SET) act as a NAT and have 
network connectivity on both public and private networks, and the login 
nodes only have access to the public network.  This has been our 
experience for the past 2 months in using TCP/IP on the BG/P.  Zhao, if 
you have anything else to add (especially links to docs confirming what 
I just said), please do so.
>>   In essence, for this to work on the BG/P, you'll need to avoid
>> having server side sockets on the compute nodes (workers), and you'll
>> probably only be able to do that via a connection oriented protocol
>> (i.e. TCP).  Is switching to TCP a relatively straight forward option?
>> If not, it might be worth implementing to make the implementation more
>> flexible
>> - loosing messages and recovering from them will likely be harder than
>> anticipated; I have a UDP version of the notification engine that
>> Falkon uses, and after much debugging, I gave up and switched over to
>> TCP.  It worked most of the time, but the occasional lost message (1
>> in 1000s, maybe even more rare) made Falkon unreliable, and hence I
>> stopped using it.
>>     
>
> Of course it's unreliable unless you deal with the reliability issues as
> outlined above.
>   
I did deal with them, duplicates, out of order, retries, timeouts, 
etc... yet, I still couldn't get a 100% reliable implementation, and I 
gave up... in theory, UDP should work given that you deal with all the 
reliability issues you outlined.  I am just pointing out that after lots 
of debugging, I gave in and swapped UDP for TCP to avoid the unexplained 
lost message once in a while.  I am positive it was a bug in my code, so 
perhaps you'll have better luck!
>   
>> Is the 180 tasks/sec the overall throughput measured from Swift's
>> point of view, including overhead of wrapper.sh?  Or is that a
>> micro-benchmark measuring just the coaster performance?  
>>     
>
> It's at the provider level. No wrapper.sh.
>   
OK, great!

Ioan
>   
>> Ioan
>>
>>
>> Mihael Hategan wrote: 
>>     
>>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
>>>   
>>>       
>>>> Mihael, this is great progress - very exciting.
>>>> Some questions (dont need answers right away):
>>>>
>>>> How would the end user use it? Manually start a service?
>>>> Is the service a separate process, or in the swift jvm?
>>>>     
>>>>         
>>> I though the lines below answered some of these.
>>>
>>> A user would specify the coaster provider in sites.xml. The provider
>>> will then automatically deploy a service on the target machine without
>>> the user having to do so. Given that the service is on a different
>>> machine than the client, they can't be in the same JVM.
>>>
>>>   
>>>       
>>>> How are the number of workers set or adjusted?
>>>>     
>>>>         
>>> Currently workers are requested as much as needed, up to a maximum. This
>>> is preliminary hence "Better allocation strategy for workers".
>>>
>>>   
>>>       
>>>> Does a service manage workers on one cluster or many?
>>>>     
>>>>         
>>> One service per cluster.
>>>
>>>   
>>>       
>>>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
>>>> worker and service?
>>>>     
>>>>         
>>> I faintly recall them being at less than 50% for some reason I don't
>>> understand.
>>>
>>>   
>>>       
>>>> Do you want to try this on the workflows we're running on Falkon on the 
>>>> BGP and SiCortex?
>>>>     
>>>>         
>>> Let me repeat "prototype" and "more testing". In no way do I want to do
>>> preliminary testing with an application that is shaky on an architecture
>>> that is also shaky.
>>>
>>> Mihael
>>>
>>>   
>>>       
>>>> Im eager to try it when you feel its ready for others to test.
>>>>
>>>> Nice work!
>>>>
>>>> - Mike
>>>>
>>>>
>>>>
>>>> On 4/4/08 4:39 AM, Mihael Hategan wrote:
>>>>     
>>>>         
>>>>> I've been asked for a summary of the status of the coaster prototype, so
>>>>> here it is:
>>>>> - It's a prototype so bugs are plenty
>>>>> - It's self deployed (you don't need to start a service on the target
>>>>> cluster)
>>>>> - You can also use it while starting a service on the target cluster
>>>>> - There is a worker written in Perl
>>>>> - It uses encryption between client and coaster service
>>>>> - It uses UDP between the service and the workers (this may prove to be
>>>>> better or worse choice than TCP)
>>>>> - A preliminary test done locally shows an amortized throughput of
>>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
>>>>> workers. Pretty picture attached (total time vs. # of jobs)
>>>>>
>>>>> To do:
>>>>> - The scheduling algorithm in the service needs a bit more work
>>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more
>>>>> fault tolerance)
>>>>> - Start testing it on actual clusters
>>>>> - Do some memory consumption benchmarks
>>>>> - Better allocation strategy for workers
>>>>>
>>>>> Mihael
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>       
>>>>>           
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>>   
>>>       
>> -- 
>> ===================================================
>> Ioan Raicu
>> Ph.D. Candidate
>> ===================================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ===================================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>> http://dev.globus.org/wiki/Incubator/Falkon
>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>> ===================================================
>> ===================================================
>>
>>     
>
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From iraicu at cs.uchicago.edu  Sat Apr  5 08:47:09 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 05 Apr 2008 08:47:09 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207389286.10629.16.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>	
	<47F61820.3090705@mcs.anl.gov>	
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>	
	<47F6C1A4.5030200@cs.uchicago.edu>	
	<1207388755.10629.12.camel@blabla.mcs.anl.gov>
	<1207389286.10629.16.camel@blabla.mcs.anl.gov>
Message-ID: <47F782DD.5070805@cs.uchicago.edu>


Mihael Hategan wrote:
>>>   This might not matter if the service and the worker are on the same
>>> LAN with no NATs or firewalls in the middle, but, it would matter on a
>>> machine such as the BG/P, as there is a NAT inbetween the login nodes
>>> and the compute nodes.
>>>       
>> That's odd. Do you have anything to back that up?
>>
>>     
>
> Really really odd. I mean MPI has to work between any two worker nodes.
> If they are on separate networks with NAT in-between, this would be
> rather difficult.
>   
MPI doesn't use the Ethernet network.  There are 5 networks to choose 
from (Torus, Tree, Barrier, RAS, 10Gig Ethernet), and I bet the NAT is 
only on one of them.  However, the Ethernet network is important, 
because we want to use TCP/UDP/IP so we can leverage code and systems 
that work in a typical Linux environment that traditionally only has 
Ethernet networks.  So, if you are willing to use MPI to communicate 
between service and workers, then you will likely not have to deal with 
a NAT.  However, then this might limit the generality of the 
implementation, as some Linux clusters might not have the necessary MPI 
packages installed.  The middle ground that we found useful, use TCP, 
and initiate all communication from the workers; this approach has 
worked for us great so far!  We have been able to scale on the BG/P to 
4K workers, and on the SiCortex with 5.8K workers.  I expect our current 
TCP-based implementation to scale to at least 10K workers per service, 
maybe more.  More testing is needed to find the upper bound of how many 
workers we can manage with the current login nodes memory capacity (4GB) 
and the quad-cpu systems we have. 

Ioan
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From hategan at mcs.anl.gov  Sat Apr  5 13:06:51 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 05 Apr 2008 13:06:51 -0500
Subject: [Swift-devel] Re: plan for 0.5 release
In-Reply-To: <Pine.LNX.4.64.0804031938150.9854@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0803200342000.28951@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804030503130.5372@dildano.hawaga.org.uk>
	<47F477DD.6040001@mcs.anl.gov>
	<Pine.LNX.4.64.0804030624310.9854@dildano.hawaga.org.uk>
	<1207214571.14147.5.camel@blabla.mcs.anl.gov>
	<47F4CA15.9090303@mcs.anl.gov>
	<Pine.LNX.4.64.0804031834260.9854@dildano.hawaga.org.uk>
	<47F52A15.6060209@mcs.anl.gov> <47F53097.3000909@mcs.anl.gov>
	<Pine.LNX.4.64.0804031938150.9854@dildano.hawaga.org.uk>
Message-ID: <1207418811.23834.0.camel@blabla.mcs.anl.gov>

Issue should be fixed in cog r1956.

On Thu, 2008-04-03 at 19:40 +0000, Ben Clifford wrote:
> > could we have a flag so that if people want to turn it on, they can? (assuming
> > it can work in some settings.)
> 
> it doesn't work in any situation i've tried at the moment. when mihael has 
> done his stuff it will work just fine. until then what I'm looking for is 
> a quick non-damaging fix to cover from now until that point that lies 
> somewhere (hopefully in the next six months) in the future.
> 


From hategan at mcs.anl.gov  Sun Apr  6 04:14:11 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 06 Apr 2008 04:14:11 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47F77DD0.8040302@cs.uchicago.edu>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<47F61820.3090705@mcs.anl.gov>
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>
	<47F6C1A4.5030200@cs.uchicago.edu>
	<Pine.LNX.4.64.0804050930070.5372@dildano.hawaga.org.uk>
	<47F77DD0.8040302@cs.uchicago.edu>
Message-ID: <1207473251.10063.1.camel@blabla.mcs.anl.gov>


On Sat, 2008-04-05 at 08:25 -0500, Ioan Raicu wrote:
> I looked around for some docs on the networking structure, but couldn't 
> find anything.
> 
> There are several networks available on the BG/P: Torus, Tree, Barrier, 
> RAS, 10Gig Ethernet.
> 
> Of all these, we are only using the Ethernet network, which allows us to 
> communicate via TCP/IP (or potentially UDP/IP) between compute nodes and 
> I/O nodes, or between compute nodes and login nodes.  For the rest of 
> the discussion, we assume only Ethernet communication.  There is 1 I/O 
> node per 64 compute nodes (what we call a P-SET), and the I/O node can 
> only communicate with compute nodes that it manages within the same 
> P-SET (the 64 nodes).  A compute node from one P-SET cannot directly 
> communicate with another compute from a different P-SET.  This is 
> primarily because compute nodes have private addresses (192.168.x.x), 
> I/O nodes are the NAT between the public IP and the private IP, and the 
> login nodes only have a public IP.  So, the compute nodes all have the 
> same IP addresses, 192.168.x.x, and they repeat for every P-SET, and the 
> I/O nodes handle their traffic in and out.

You are describing NAT. I understand what NAT is. I was looking for an
independent source confirming this.

>  
> 
> Zhao, if you have any docs on the Ethernet network and the NAT that sits 
> on the I/O node, can you please send it to the mailing list?
> 
> Ioan
> 
> Ben Clifford wrote:
> >> it would matter on a machine such as
> >> the BG/P, as there is a NAT inbetween the login nodes and the compute nodes.
> >>     
> >
> > wierd. is there a description of that somewhere?
> >
> >   
> 


From hategan at mcs.anl.gov  Sun Apr  6 04:17:22 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 06 Apr 2008 04:17:22 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47F78052.5020702@cs.uchicago.edu>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<47F61820.3090705@mcs.anl.gov>
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>
	<47F6C1A4.5030200@cs.uchicago.edu>
	<1207388755.10629.12.camel@blabla.mcs.anl.gov>
	<47F78052.5020702@cs.uchicago.edu>
Message-ID: <1207473442.10063.3.camel@blabla.mcs.anl.gov>

> >
> > Of course it's unreliable unless you deal with the reliability issues as
> > outlined above.
> >   
> I did deal with them, duplicates, out of order, retries, timeouts, 
> etc... yet, I still couldn't get a 100% reliable implementation,

Of course you couldn't. It's impossible.

>  and I 
> gave up... in theory, UDP should work given that you deal with all the 
> reliability issues you outlined.  I am just pointing out that after lots 
> of debugging, I gave in and swapped UDP for TCP to avoid the unexplained 
> lost message once in a while.  I am positive it was a bug in my code, so 
> perhaps you'll have better luck!
> >   
> >> Is the 180 tasks/sec the overall throughput measured from Swift's
> >> point of view, including overhead of wrapper.sh?  Or is that a
> >> micro-benchmark measuring just the coaster performance?  
> >>     
> >
> > It's at the provider level. No wrapper.sh.
> >   
> OK, great!
> 
> Ioan
> >   
> >> Ioan
> >>
> >>
> >> Mihael Hategan wrote: 
> >>     
> >>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
> >>>   
> >>>       
> >>>> Mihael, this is great progress - very exciting.
> >>>> Some questions (dont need answers right away):
> >>>>
> >>>> How would the end user use it? Manually start a service?
> >>>> Is the service a separate process, or in the swift jvm?
> >>>>     
> >>>>         
> >>> I though the lines below answered some of these.
> >>>
> >>> A user would specify the coaster provider in sites.xml. The provider
> >>> will then automatically deploy a service on the target machine without
> >>> the user having to do so. Given that the service is on a different
> >>> machine than the client, they can't be in the same JVM.
> >>>
> >>>   
> >>>       
> >>>> How are the number of workers set or adjusted?
> >>>>     
> >>>>         
> >>> Currently workers are requested as much as needed, up to a maximum. This
> >>> is preliminary hence "Better allocation strategy for workers".
> >>>
> >>>   
> >>>       
> >>>> Does a service manage workers on one cluster or many?
> >>>>     
> >>>>         
> >>> One service per cluster.
> >>>
> >>>   
> >>>       
> >>>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
> >>>> worker and service?
> >>>>     
> >>>>         
> >>> I faintly recall them being at less than 50% for some reason I don't
> >>> understand.
> >>>
> >>>   
> >>>       
> >>>> Do you want to try this on the workflows we're running on Falkon on the 
> >>>> BGP and SiCortex?
> >>>>     
> >>>>         
> >>> Let me repeat "prototype" and "more testing". In no way do I want to do
> >>> preliminary testing with an application that is shaky on an architecture
> >>> that is also shaky.
> >>>
> >>> Mihael
> >>>
> >>>   
> >>>       
> >>>> Im eager to try it when you feel its ready for others to test.
> >>>>
> >>>> Nice work!
> >>>>
> >>>> - Mike
> >>>>
> >>>>
> >>>>
> >>>> On 4/4/08 4:39 AM, Mihael Hategan wrote:
> >>>>     
> >>>>         
> >>>>> I've been asked for a summary of the status of the coaster prototype, so
> >>>>> here it is:
> >>>>> - It's a prototype so bugs are plenty
> >>>>> - It's self deployed (you don't need to start a service on the target
> >>>>> cluster)
> >>>>> - You can also use it while starting a service on the target cluster
> >>>>> - There is a worker written in Perl
> >>>>> - It uses encryption between client and coaster service
> >>>>> - It uses UDP between the service and the workers (this may prove to be
> >>>>> better or worse choice than TCP)
> >>>>> - A preliminary test done locally shows an amortized throughput of
> >>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
> >>>>> workers. Pretty picture attached (total time vs. # of jobs)
> >>>>>
> >>>>> To do:
> >>>>> - The scheduling algorithm in the service needs a bit more work
> >>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more
> >>>>> fault tolerance)
> >>>>> - Start testing it on actual clusters
> >>>>> - Do some memory consumption benchmarks
> >>>>> - Better allocation strategy for workers
> >>>>>
> >>>>> Mihael
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------------
> >>>>>
> >>>>> _______________________________________________
> >>>>> Swift-devel mailing list
> >>>>> Swift-devel at ci.uchicago.edu
> >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>       
> >>>>>           
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>
> >>>   
> >>>       
> >> -- 
> >> ===================================================
> >> Ioan Raicu
> >> Ph.D. Candidate
> >> ===================================================
> >> Distributed Systems Laboratory
> >> Computer Science Department
> >> University of Chicago
> >> 1100 E. 58th Street, Ryerson Hall
> >> Chicago, IL 60637
> >> ===================================================
> >> Email: iraicu at cs.uchicago.edu
> >> Web:   http://www.cs.uchicago.edu/~iraicu
> >> http://dev.globus.org/wiki/Incubator/Falkon
> >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> >> ===================================================
> >> ===================================================
> >>
> >>     
> >
> >
> >   
> 


From wilde at mcs.anl.gov  Sun Apr  6 20:20:27 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 06 Apr 2008 20:20:27 -0500
Subject: [Swift-devel] Problem calling atomic procedure with multiple args
	via Falkon
Message-ID: <47F976DB.2060500@mcs.anl.gov>

I'm debugging a problem where I changed an atomic proc from one input 
arg to two.

Im using the patched wrapper.sh (Ben's 3 patches to run in /tmp).

Seems to work for local execution.

With Falkon execution on BGP Im getting this error:

bg$ cat ./status/h/dockwrap1-hpaczuqi-error
Missing -of argument

It looks like Falkon is getting the following command from swift and 
sending it to its bgp worker:

Sent task to worker 172.16.3.12:33161: 426 
urn:0-1-1-1207529458407#/bin/bash#shared/wrapper.sh dockwrap1-hpaczuqi 
-jobdir h -e /home/wilde/dock/bin/dockwrap1.cn -out stdout.txt -err 
stderr.txt -i -d mol-1M/8269 -if 
mol-1M/8269/058269.in|mol-1M/8269/058269.mol2 -of mol-1M/8269/058269.out 
-k  -a /home/wilde/dock/DOCK5_bgp_ram.tgz dock_bgp_login 
mol-1M/8269/058269.in mol-1M/8269/058269.mol2 mol-1M/8269/058269.out # 
#/home/wilde/swiftwork/dock2-20080406-1950-krt29l04#

The problem seems to stem from this arg:

-if mol-1M/8269/058269.in|mol-1M/8269/058269.mol2

The value after the -if arg needs to be in quotes to shield it from the 
shell, as falkon takes the string and involves wrapper.sh using 
system(). The IFS char "|" is causing the cmd to end there.

Its not clear if the arg is not quoted because in other providers its 
somehow shielded from shell evaluation, or if Falkon or the deef 
provider is pulling off the quotes.

Can anyone spot where the problem is?


From zhoujianghua1017 at 163.com  Sun Apr  6 20:15:39 2008
From: zhoujianghua1017 at 163.com (jezhee)
Date: Mon, 7 Apr 2008 09:15:39 +0800
Subject: [Swift-devel] The method getting result back
Message-ID: <200804070915377736963@163.com>

 After the tasks are dispatcjed to the computing nodes, what should swift do? Now, it simply blocks and waits until the tasks are completed and send the result back. It's suitable for the light-weighted tasks or the situation swift handles a few applications. When Swift runs as a server for several applications, simple block will lead to waste and inconvenience.
 In the future version, I think this work should be done by a independent thread running as a demon. When the tasks has been transfered to the computing grid, the main thread switched the work to this thread. This thread will do the left stuff. Even for the simple application, there is no harm too.
?Jezhee
							2008-04-07


From benc at hawaga.org.uk  Mon Apr  7 01:18:49 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 7 Apr 2008 06:18:49 +0000 (GMT)
Subject: [Swift-devel] Problem calling atomic procedure with multiple
	args via Falkon
In-Reply-To: <47F976DB.2060500@mcs.anl.gov>
References: <47F976DB.2060500@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804070610120.5372@dildano.hawaga.org.uk>


On Sun, 6 Apr 2008, Michael Wilde wrote:

> involves wrapper.sh using system(). The IFS char "|" is causing the cmd 
> to end there.

[...]

> Can anyone spot where the problem is?

Using system to invoke the command is perhaps a bad thing to do - the 
other layers in the stack (including in Falkon in the Java worker) keep 
the arguments in array-like data structures to help avoid need for 
quoting.

The C worker isn't portable enough to build on my laptop so I can't easily 
play there, but you might try yourself replacing the system call with 
execve or something like that.

-- 


From benc at hawaga.org.uk  Mon Apr  7 02:03:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 7 Apr 2008 07:03:27 +0000 (GMT)
Subject: [Swift-devel] coaster status summary
In-Reply-To: <Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>


there's an interesting issue here of how much we should (in our general 
codebase, rather than in special customisations such as plugging in 
falkon) be supporting 'wierd' systems such as the BG/P which have, 
apparently, neither a decent shared filesystem or IP layer network fabric, 
vs support for more traditional clusters which seem to have both IP layer 
interconnect between nodes and decent shared filesystems.

-- 


From hategan at mcs.anl.gov  Mon Apr  7 02:50:11 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 07 Apr 2008 02:50:11 -0500
Subject: [Swift-devel] The method getting result back
In-Reply-To: <200804070915377736963@163.com>
References: <200804070915377736963@163.com>
Message-ID: <1207554611.22686.0.camel@blabla.mcs.anl.gov>

Swift already uses lightweight threading. There is a limited number of
OS threads that do the work.

On Mon, 2008-04-07 at 09:15 +0800, jezhee wrote:
> After the tasks are dispatcjed to the computing nodes, what should swift do? Now, it simply blocks and waits until the tasks are completed and send the result back. It's suitable for the light-weighted tasks or the situation swift handles a few applications. When Swift runs as a server for several applications, simple block will lead to waste and inconvenience.
>  In the future version, I think this work should be done by a independent thread running as a demon. When the tasks has been transfered to the computing grid, the main thread switched the work to this thread. This thread will do the left stuff. Even for the simple application, there is no harm too.
> ?Jezhee
> 							2008-04-07
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Mon Apr  7 02:59:52 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 07 Apr 2008 02:59:52 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>
Message-ID: <1207555192.22686.6.camel@blabla.mcs.anl.gov>


On Mon, 2008-04-07 at 07:03 +0000, Ben Clifford wrote:
> there's an interesting issue here of how much we should (in our general 
> codebase, rather than in special customisations such as plugging in 
> falkon) be supporting 'wierd' systems such as the BG/P which have, 
> apparently, neither a decent shared filesystem or IP layer network fabric, 
> vs support for more traditional clusters which seem to have both IP layer 
> interconnect between nodes and decent shared filesystems.

That is a good point. However, UDP was chosen to support systems with a
very large number of CPUs in the first place. In other words if UDP
won't work on BG/P, I don't see much reason for going with it.

> 


From hategan at mcs.anl.gov  Mon Apr  7 03:04:20 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 07 Apr 2008 03:04:20 -0500
Subject: [Swift-devel] swift 0.5rc1
In-Reply-To: <Pine.LNX.4.64.0804051204530.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804051204530.31934@dildano.hawaga.org.uk>
Message-ID: <1207555460.22686.8.camel@blabla.mcs.anl.gov>

cog r1961 fixes some issues that would prevent the gridftp connection
caching mechanism from caching things. I think it's worth an rc2.

On Sat, 2008-04-05 at 12:07 +0000, Ben Clifford wrote:
> Swift 0.5 release candidate 1 is at 
> http://www.ci.uchicago.edu/~benc/vdsk-0.5rc1.tar.gz
> 
> This is primarily bugfixes for bugs that were found around the time of the 
> 0.4 release - syntax error handling that was poorly tested before 0.4; and 
> data channel caching problems. There shouldn't be many (if any) new 
> features here.
> 
> Please test.
> 
> If there are no significant problems, I'll put this out as 0.5 on tuesday.
> 


From hategan at mcs.anl.gov  Mon Apr  7 03:14:23 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 07 Apr 2008 03:14:23 -0500
Subject: [Swift-devel] Problem calling atomic procedure with multiple
	args via Falkon
In-Reply-To: <47F976DB.2060500@mcs.anl.gov>
References: <47F976DB.2060500@mcs.anl.gov>
Message-ID: <1207556063.22686.14.camel@blabla.mcs.anl.gov>

> The problem seems to stem from this arg:
> 
> -if mol-1M/8269/058269.in|mol-1M/8269/058269.mol2

I'd say the problem stems from improper passing of the arguments by some
layer somewhere.

> 
> The value after the -if arg needs to be in quotes to shield it from the 
> shell, as falkon takes the string and involves wrapper.sh using 
> system().

...presumably by concatenating the arguments into a single string and
hoping system() will split them correctly. Falkon should use execve().

>  The IFS char "|" is causing the cmd to end there.
> 
> Its not clear if the arg is not quoted because in other providers its 
> somehow shielded from shell evaluation, or if Falkon or the deef 
> provider is pulling off the quotes.
> 
> Can anyone spot where the problem is?
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From foster at mcs.anl.gov  Mon Apr  7 07:35:41 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 07 Apr 2008 07:35:41 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207555192.22686.6.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>
	<1207555192.22686.6.camel@blabla.mcs.anl.gov>
Message-ID: <47FA151D.1080504@mcs.anl.gov>

I wonder whether we should be making use of MPI on the BG/P where we can 
... I suspect that is what is optimized, rather than the IP stack.

Mihael Hategan wrote:
> On Mon, 2008-04-07 at 07:03 +0000, Ben Clifford wrote:
>   
>> there's an interesting issue here of how much we should (in our general 
>> codebase, rather than in special customisations such as plugging in 
>> falkon) be supporting 'wierd' systems such as the BG/P which have, 
>> apparently, neither a decent shared filesystem or IP layer network fabric, 
>> vs support for more traditional clusters which seem to have both IP layer 
>> interconnect between nodes and decent shared filesystems.
>>     
>
> That is a good point. However, UDP was chosen to support systems with a
> very large number of CPUs in the first place. In other words if UDP
> won't work on BG/P, I don't see much reason for going with it.
>
>   
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080407/cd00a190/attachment.html>

From hategan at mcs.anl.gov  Mon Apr  7 07:45:35 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 07 Apr 2008 07:45:35 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47FA151D.1080504@mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>
	<1207555192.22686.6.camel@blabla.mcs.anl.gov>
	<47FA151D.1080504@mcs.anl.gov>
Message-ID: <1207572335.27797.5.camel@blabla.mcs.anl.gov>

One unfortunate part there is the lack of a decent Java MPI
implementation.

Another unfortunate part may be that MPI may not work very well with
processes that come and go dynamically, but I guess that can be
addressed in a way or another.

On Mon, 2008-04-07 at 07:35 -0500, Ian Foster wrote:
> I wonder whether we should be making use of MPI on the BG/P where we
> can ... I suspect that is what is optimized, rather than the IP stack.
> 
> Mihael Hategan wrote: 
> > On Mon, 2008-04-07 at 07:03 +0000, Ben Clifford wrote:
> >   
> > > there's an interesting issue here of how much we should (in our general 
> > > codebase, rather than in special customisations such as plugging in 
> > > falkon) be supporting 'wierd' systems such as the BG/P which have, 
> > > apparently, neither a decent shared filesystem or IP layer network fabric, 
> > > vs support for more traditional clusters which seem to have both IP layer 
> > > interconnect between nodes and decent shared filesystems.
> > >     
> > 
> > That is a good point. However, UDP was chosen to support systems with a
> > very large number of CPUs in the first place. In other words if UDP
> > won't work on BG/P, I don't see much reason for going with it.
> > 
> >   
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> >   


From benc at hawaga.org.uk  Mon Apr  7 07:49:40 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 7 Apr 2008 12:49:40 +0000 (GMT)
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207572335.27797.5.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov> 
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk> 
	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk> 
	<1207555192.22686.6.camel@blabla.mcs.anl.gov>
	<47FA151D.1080504@mcs.anl.gov>
	<1207572335.27797.5.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>


Wary of excessive optimisation of job completion notification speed in 
order to get high 'trivial/useless job' numbers, when there also seem to 
be problems getting shared filesystem access fast enough for non-useless 
jobs. Getting a ridiculously high trivial job throughput is not (in my 
eyes) a design goal of this coaster work.

-- 


From foster at mcs.anl.gov  Mon Apr  7 07:59:25 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 07 Apr 2008 07:59:25 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>
	<1207555192.22686.6.camel@blabla.mcs.anl.gov>
	<47FA151D.1080504@mcs.anl.gov>
	<1207572335.27797.5.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>
Message-ID: <47FA1AAD.5080009@mcs.anl.gov>

YES! I agree absolutely.

Ben Clifford wrote:
> Wary of excessive optimisation of job completion notification speed in 
> order to get high 'trivial/useless job' numbers, when there also seem to 
> be problems getting shared filesystem access fast enough for non-useless 
> jobs. Getting a ridiculously high trivial job throughput is not (in my 
> eyes) a design goal of this coaster work.
>
>   


From hategan at mcs.anl.gov  Mon Apr  7 08:08:29 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 07 Apr 2008 08:08:29 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>
	<1207555192.22686.6.camel@blabla.mcs.anl.gov>
	<47FA151D.1080504@mcs.anl.gov>
	<1207572335.27797.5.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>
Message-ID: <1207573709.27797.16.camel@blabla.mcs.anl.gov>

On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote:
> Wary of excessive optimisation of job completion notification speed in 
> order to get high 'trivial/useless job' numbers, when there also seem to 
> be problems getting shared filesystem access fast enough for non-useless 
> jobs. Getting a ridiculously high trivial job throughput is not (in my 
> eyes) a design goal of this coaster work.

200 j/s should be enough for anybody.

Joking aside, the issue was ability to scale to large number of jobs
rather than speed. But it looks like the issue is only an issue for
monsters such as the BG/P.

> 


From benc at hawaga.org.uk  Mon Apr  7 03:40:34 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 7 Apr 2008 08:40:34 +0000 (GMT)
Subject: [Swift-devel] swift 0.5rc1
In-Reply-To: <1207555460.22686.8.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0804051204530.31934@dildano.hawaga.org.uk>
	<1207555460.22686.8.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804070840190.31934@dildano.hawaga.org.uk>


ok. I will put one out later.

On Mon, 7 Apr 2008, Mihael Hategan wrote:

> cog r1961 fixes some issues that would prevent the gridftp connection
> caching mechanism from caching things. I think it's worth an rc2.

-- 


From wilde at mcs.anl.gov  Mon Apr  7 12:18:26 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 07 Apr 2008 12:18:26 -0500
Subject: [Swift-devel] Problem calling atomic procedure with multiple
	args via Falkon
In-Reply-To: <Pine.LNX.4.64.0804070610120.5372@dildano.hawaga.org.uk>
References: <47F976DB.2060500@mcs.anl.gov>
	<Pine.LNX.4.64.0804070610120.5372@dildano.hawaga.org.uk>
Message-ID: <47FA5762.2070408@mcs.anl.gov>


On 4/7/08 1:18 AM, Ben Clifford wrote:
> On Sun, 6 Apr 2008, Michael Wilde wrote:
> 
>> involves wrapper.sh using system(). The IFS char "|" is causing the cmd 
>> to end there.
> 
> [...]
> 
>> Can anyone spot where the problem is?
> 
> Using system to invoke the command is perhaps a bad thing to do - the 
> other layers in the stack (including in Falkon in the Java worker) keep 
> the arguments in array-like data structures to help avoid need for 
> quoting.

That makes sense. Can you start some documentation fragments in the 
users guide on how quoting and tokenization works for atomic procedures 
from the swift declaration down to the actual invocation?

something like:

- each token in the app {} declaration becomes one arg to execve()
- strings must be enclosed in quotes
- quotes and other special chars within the strings can be represented as...
- quotes are expected to pass through the provider interfaces(GRAMs, 
PBS, Falkon etc) without further processing or alteration...???
-etc

> The C worker isn't portable enough to build on my laptop so I can't easily 
> play there, but you might try yourself replacing the system call with 
> execve or something like that.

That sounds reasonable - we can do that.

I was able to work around the problem for the moment by changing | to 
"," just before the system() call, but I agree that execve() is the 
right way to do things.

- Mike


From iraicu at cs.uchicago.edu  Mon Apr  7 13:16:25 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 07 Apr 2008 13:16:25 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207473442.10063.3.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>	
	<47F61820.3090705@mcs.anl.gov>	
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>	
	<47F6C1A4.5030200@cs.uchicago.edu>	
	<1207388755.10629.12.camel@blabla.mcs.anl.gov>	
	<47F78052.5020702@cs.uchicago.edu>
	<1207473442.10063.3.camel@blabla.mcs.anl.gov>
Message-ID: <47FA64F9.8060005@cs.uchicago.edu>

Although, when switching to TCP, most of my problems magically went 
away... obviously TCP's error recovery mechanisms are more robust than 
what I implemented.  The moral of the story is from my experience, have 
a UDP option for potentially better performance and scalability, but 
have TCP as a configurable option for potentially better reliability and 
robustness.

Ioan

Mihael Hategan wrote:
>>> Of course it's unreliable unless you deal with the reliability issues as
>>> outlined above.
>>>   
>>>       
>> I did deal with them, duplicates, out of order, retries, timeouts, 
>> etc... yet, I still couldn't get a 100% reliable implementation,
>>     
>
> Of course you couldn't. It's impossible.
>
>   
>>  and I 
>> gave up... in theory, UDP should work given that you deal with all the 
>> reliability issues you outlined.  I am just pointing out that after lots 
>> of debugging, I gave in and swapped UDP for TCP to avoid the unexplained 
>> lost message once in a while.  I am positive it was a bug in my code, so 
>> perhaps you'll have better luck!
>>     
>>>   
>>>       
>>>> Is the 180 tasks/sec the overall throughput measured from Swift's
>>>> point of view, including overhead of wrapper.sh?  Or is that a
>>>> micro-benchmark measuring just the coaster performance?  
>>>>     
>>>>         
>>> It's at the provider level. No wrapper.sh.
>>>   
>>>       
>> OK, great!
>>
>> Ioan
>>     
>>>   
>>>       
>>>> Ioan
>>>>
>>>>
>>>> Mihael Hategan wrote: 
>>>>     
>>>>         
>>>>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> Mihael, this is great progress - very exciting.
>>>>>> Some questions (dont need answers right away):
>>>>>>
>>>>>> How would the end user use it? Manually start a service?
>>>>>> Is the service a separate process, or in the swift jvm?
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> I though the lines below answered some of these.
>>>>>
>>>>> A user would specify the coaster provider in sites.xml. The provider
>>>>> will then automatically deploy a service on the target machine without
>>>>> the user having to do so. Given that the service is on a different
>>>>> machine than the client, they can't be in the same JVM.
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> How are the number of workers set or adjusted?
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Currently workers are requested as much as needed, up to a maximum. This
>>>>> is preliminary hence "Better allocation strategy for workers".
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> Does a service manage workers on one cluster or many?
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> One service per cluster.
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
>>>>>> worker and service?
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> I faintly recall them being at less than 50% for some reason I don't
>>>>> understand.
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> Do you want to try this on the workflows we're running on Falkon on the 
>>>>>> BGP and SiCortex?
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Let me repeat "prototype" and "more testing". In no way do I want to do
>>>>> preliminary testing with an application that is shaky on an architecture
>>>>> that is also shaky.
>>>>>
>>>>> Mihael
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> Im eager to try it when you feel its ready for others to test.
>>>>>>
>>>>>> Nice work!
>>>>>>
>>>>>> - Mike
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4/4/08 4:39 AM, Mihael Hategan wrote:
>>>>>>     
>>>>>>         
>>>>>>             
>>>>>>> I've been asked for a summary of the status of the coaster prototype, so
>>>>>>> here it is:
>>>>>>> - It's a prototype so bugs are plenty
>>>>>>> - It's self deployed (you don't need to start a service on the target
>>>>>>> cluster)
>>>>>>> - You can also use it while starting a service on the target cluster
>>>>>>> - There is a worker written in Perl
>>>>>>> - It uses encryption between client and coaster service
>>>>>>> - It uses UDP between the service and the workers (this may prove to be
>>>>>>> better or worse choice than TCP)
>>>>>>> - A preliminary test done locally shows an amortized throughput of
>>>>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
>>>>>>> workers. Pretty picture attached (total time vs. # of jobs)
>>>>>>>
>>>>>>> To do:
>>>>>>> - The scheduling algorithm in the service needs a bit more work
>>>>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more
>>>>>>> fault tolerance)
>>>>>>> - Start testing it on actual clusters
>>>>>>> - Do some memory consumption benchmarks
>>>>>>> - Better allocation strategy for workers
>>>>>>>
>>>>>>> Mihael
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Swift-devel mailing list
>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>> -- 
>>>> ===================================================
>>>> Ioan Raicu
>>>> Ph.D. Candidate
>>>> ===================================================
>>>> Distributed Systems Laboratory
>>>> Computer Science Department
>>>> University of Chicago
>>>> 1100 E. 58th Street, Ryerson Hall
>>>> Chicago, IL 60637
>>>> ===================================================
>>>> Email: iraicu at cs.uchicago.edu
>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>> ===================================================
>>>> ===================================================
>>>>
>>>>     
>>>>         
>>>   
>>>       
>
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080407/1b023ebc/attachment.html>

From iraicu at cs.uchicago.edu  Mon Apr  7 13:16:53 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 07 Apr 2008 13:16:53 -0500
Subject: [Swift-devel] Problem calling atomic procedure with multiple
	args via Falkon
In-Reply-To: <Pine.LNX.4.64.0804070610120.5372@dildano.hawaga.org.uk>
References: <47F976DB.2060500@mcs.anl.gov>
	<Pine.LNX.4.64.0804070610120.5372@dildano.hawaga.org.uk>
Message-ID: <47FA6515.3060000@cs.uchicago.edu>


Ben Clifford wrote:
> On Sun, 6 Apr 2008, Michael Wilde wrote:
>
>   
>> involves wrapper.sh using system(). The IFS char "|" is causing the cmd 
>> to end there.
>>     
>
> [...]
>
>   
>> Can anyone spot where the problem is?
>>     
>
> Using system to invoke the command is perhaps a bad thing to do - the 
> other layers in the stack (including in Falkon in the Java worker) keep 
> the arguments in array-like data structures to help avoid need for 
> quoting.
>   
I agree, and one of these days (maybe sooner rather than later), we'll 
switch to fork() and exec(), rather than system.
> The C worker isn't portable enough to build on my laptop so I can't easily 
> play there, 
The C worker is quite basic, what error do you get that it doesn't 
compile?  It has compiled for me on numerous platforms as is, so if its 
something we need to fix in general to help it be more portable, let us 
know.

Ioan
> but you might try yourself replacing the system call with 
> execve or something like that.
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080407/5548bcc0/attachment.html>

From iraicu at cs.uchicago.edu  Mon Apr  7 13:18:12 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 07 Apr 2008 13:18:12 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207573709.27797.16.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>	<1207555192.22686.6.camel@blabla.mcs.anl.gov>	<47FA151D.1080504@mcs.anl.gov>	<1207572335.27797.5.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>
	<1207573709.27797.16.camel@blabla.mcs.anl.gov>
Message-ID: <47FA6564.60008@cs.uchicago.edu>

I agree that the BG/P is the only system I can think of right now that 
won't work with the UDP scheme you currently have, assuming that you 
will run the service on a login node that has access to both compute 
nodes and external world (i.e. Swift).  The compute nodes don't support 
Java, so you'd have to have some C/Fortran code, or maybe some scripting 
language (which I don't know what kind of support there is).   If you 
use C or Fortran, MPI becomes a viable alternative.  TCP has always been 
an alternative.  Anyways, if UDP doesn't work on the BG/P, and the BG/P 
is the only scale large enough (today) that warrants a connectionless 
protocol, then I suggest you switch to TCP (which has worked for us well 
on the BG/P, and is general enough to work in most environments) or even 
MPI (but you loose the generality of TCP, but might gain performance).

Ioan

Mihael Hategan wrote:
> On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote:
>   
>> Wary of excessive optimisation of job completion notification speed in 
>> order to get high 'trivial/useless job' numbers, when there also seem to 
>> be problems getting shared filesystem access fast enough for non-useless 
>> jobs. Getting a ridiculously high trivial job throughput is not (in my 
>> eyes) a design goal of this coaster work.
>>     
>
> 200 j/s should be enough for anybody.
>
> Joking aside, the issue was ability to scale to large number of jobs
> rather than speed. But it looks like the issue is only an issue for
> monsters such as the BG/P.
>
>   
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080407/cec7f073/attachment.html>

From iraicu at cs.uchicago.edu  Mon Apr  7 13:19:25 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 07 Apr 2008 13:19:25 -0500
Subject: [Swift-devel] Problem calling atomic procedure with multiple
	args via Falkon
In-Reply-To: <1207556063.22686.14.camel@blabla.mcs.anl.gov>
References: <47F976DB.2060500@mcs.anl.gov>
	<1207556063.22686.14.camel@blabla.mcs.anl.gov>
Message-ID: <47FA65AD.1060509@cs.uchicago.edu>


Mihael Hategan wrote:
>> The problem seems to stem from this arg:
>>
>> -if mol-1M/8269/058269.in|mol-1M/8269/058269.mol2
>>     
>
> I'd say the problem stems from improper passing of the arguments by some
> layer somewhere.
>
>   
>> The value after the -if arg needs to be in quotes to shield it from the 
>> shell, as falkon takes the string and involves wrapper.sh using 
>> system().
>>     
>
> ...presumably by concatenating the arguments into a single string and
> hoping system() will split them correctly. Falkon should use execve().
>
>   
Right... its on the to-do list!
http://bugzilla.globus.org/globus/show_bug.cgi?id=5987

>>  The IFS char "|" is causing the cmd to end there.
>>
>> Its not clear if the arg is not quoted because in other providers its 
>> somehow shielded from shell evaluation, or if Falkon or the deef 
>> provider is pulling off the quotes.
>>
>> Can anyone spot where the problem is?
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>>     
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080407/fe290d01/attachment.html>

From iraicu at cs.uchicago.edu  Mon Apr  7 15:48:47 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 07 Apr 2008 15:48:47 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207473251.10063.1.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>	
	<47F61820.3090705@mcs.anl.gov>	
	<1207311167.7171.12.camel@blabla.mcs.anl.gov>	
	<47F6C1A4.5030200@cs.uchicago.edu>	
	<Pine.LNX.4.64.0804050930070.5372@dildano.hawaga.org.uk>	
	<47F77DD0.8040302@cs.uchicago.edu>
	<1207473251.10063.1.camel@blabla.mcs.anl.gov>
Message-ID: <47FA88AF.7050302@cs.uchicago.edu>

If you won't take my word for it, when I have been on the machine and 
have seen what I described first hand, then feel free to write tech 
support for the BG/P!

Here is their email address: ALCF Support <support at alcf.anl.gov>

Cheers,
Ioan

Mihael Hategan wrote:
> On Sat, 2008-04-05 at 08:25 -0500, Ioan Raicu wrote:
>   
>> I looked around for some docs on the networking structure, but couldn't 
>> find anything.
>>
>> There are several networks available on the BG/P: Torus, Tree, Barrier, 
>> RAS, 10Gig Ethernet.
>>
>> Of all these, we are only using the Ethernet network, which allows us to 
>> communicate via TCP/IP (or potentially UDP/IP) between compute nodes and 
>> I/O nodes, or between compute nodes and login nodes.  For the rest of 
>> the discussion, we assume only Ethernet communication.  There is 1 I/O 
>> node per 64 compute nodes (what we call a P-SET), and the I/O node can 
>> only communicate with compute nodes that it manages within the same 
>> P-SET (the 64 nodes).  A compute node from one P-SET cannot directly 
>> communicate with another compute from a different P-SET.  This is 
>> primarily because compute nodes have private addresses (192.168.x.x), 
>> I/O nodes are the NAT between the public IP and the private IP, and the 
>> login nodes only have a public IP.  So, the compute nodes all have the 
>> same IP addresses, 192.168.x.x, and they repeat for every P-SET, and the 
>> I/O nodes handle their traffic in and out.
>>     
>
> You are describing NAT. I understand what NAT is. I was looking for an
> independent source confirming this.
>
>   
>>  
>>
>> Zhao, if you have any docs on the Ethernet network and the NAT that sits 
>> on the I/O node, can you please send it to the mailing list?
>>
>> Ioan
>>
>> Ben Clifford wrote:
>>     
>>>> it would matter on a machine such as
>>>> the BG/P, as there is a NAT inbetween the login nodes and the compute nodes.
>>>>     
>>>>         
>>> wierd. is there a description of that somewhere?
>>>
>>>   
>>>       
>
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080407/7e511bfb/attachment.html>

From duxu at mcs.anl.gov  Mon Apr  7 09:57:55 2008
From: duxu at mcs.anl.gov (Xu Du)
Date: Mon, 7 Apr 2008 09:57:55 -0500
Subject: [Swift-devel] Swift Innovation for BOINC : Design Spec
References: <00ec01c89361$e21c5ec0$6a08dd8c@karen>
	<47F22DCE.5070405@mcs.anl.gov>
Message-ID: <004601c898bf$c6335ed0$9a01a8c0@karen>

Dear Mike,

I have updated the design spec of "Swift Innovation for BOINC", please find it in the attachement.

I also copy it to the swift develop members, any suggestion is welcome.

Thanks,
Du, Xu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SWIFT-SDS-0000-D0.2_080406.pdf
Type: application/pdf
Size: 80731 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080407/5cf7ad21/attachment.pdf>

From duxu at mcs.anl.gov  Mon Apr  7 10:01:01 2008
From: duxu at mcs.anl.gov (Xu Du)
Date: Mon, 7 Apr 2008 10:01:01 -0500
Subject: [Swift-devel] SWIFT INNOVATION FOR BOINC: Weekly Report Mar.31-Apr.6
References: <00ec01c89361$e21c5ec0$6a08dd8c@karen>
	<47F22DCE.5070405@mcs.anl.gov>
Message-ID: <005301c898c0$32052580$9a01a8c0@karen>

Dear Mike,

The following is the last weekly report. Any suggestion and comment are welcome.

Regards,
Xu

--------------------------------------------------------------------------------------
              Weekly Report Mar.31-Apr.6

Done:
1. Traced the source code of Cogkit and Swift,  made clear about how Swift works. The ?Boinc provider? can be found by Swift now, but it still does not work well with Swift
2. Swift adaptor can now handle the situation when jobs are submitted simultaneously, even though they ask the same application to execute with different input files or arguments.

Issues: 
1. As for the problem of ?Boinc provider? can not work well with Swift,  since ?Boinc provider? employs the same mechanism as ?ssh provider?, so we try ?ssh provider?, and the same problem occurred. User can not login to ssh server using Swift, the SSH server can only receive the header of the authentication message, and then the connection lost.

To Do:
1.  Solve the problem of ?ssh? and work out the prototype, and then test the whole system.
2. Make the program be able to handle job state query request from BOINC provider.

From hategan at mcs.anl.gov  Mon Apr  7 19:22:33 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 07 Apr 2008 19:22:33 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47FA6564.60008@cs.uchicago.edu>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>
	<1207555192.22686.6.camel@blabla.mcs.anl.gov>
	<47FA151D.1080504@mcs.anl.gov>
	<1207572335.27797.5.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>
	<1207573709.27797.16.camel@blabla.mcs.anl.gov>
	<47FA6564.60008@cs.uchicago.edu>
Message-ID: <1207614154.6864.0.camel@blabla.mcs.anl.gov>

Do you tweak the TCP window size or do you use the default?

On Mon, 2008-04-07 at 13:18 -0500, Ioan Raicu wrote:
> I agree that the BG/P is the only system I can think of right now that
> won't work with the UDP scheme you currently have, assuming that you
> will run the service on a login node that has access to both compute
> nodes and external world (i.e. Swift).  The compute nodes don't
> support Java, so you'd have to have some C/Fortran code, or maybe some
> scripting language (which I don't know what kind of support there is).
> If you use C or Fortran, MPI becomes a viable alternative.  TCP has
> always been an alternative.  Anyways, if UDP doesn't work on the BG/P,
> and the BG/P is the only scale large enough (today) that warrants a
> connectionless protocol, then I suggest you switch to TCP (which has
> worked for us well on the BG/P, and is general enough to work in most
> environments) or even MPI (but you loose the generality of TCP, but
> might gain performance).
> 
> Ioan
> 
> Mihael Hategan wrote: 
> > On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote:
> >   
> > > Wary of excessive optimisation of job completion notification speed in 
> > > order to get high 'trivial/useless job' numbers, when there also seem to 
> > > be problems getting shared filesystem access fast enough for non-useless 
> > > jobs. Getting a ridiculously high trivial job throughput is not (in my 
> > > eyes) a design goal of this coaster work.
> > >     
> > 
> > 200 j/s should be enough for anybody.
> > 
> > Joking aside, the issue was ability to scale to large number of jobs
> > rather than speed. But it looks like the issue is only an issue for
> > monsters such as the BG/P.
> > 
> >   
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> >   
> 
> -- 
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
> 


From benc at hawaga.org.uk  Tue Apr  8 02:54:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 8 Apr 2008 07:54:55 +0000 (GMT)
Subject: [Swift-devel] Swift Innovation for BOINC : Design Spec
In-Reply-To: <004601c898bf$c6335ed0$9a01a8c0@karen>
References: <00ec01c89361$e21c5ec0$6a08dd8c@karen>
	<47F22DCE.5070405@mcs.anl.gov>
	<004601c898bf$c6335ed0$9a01a8c0@karen>
Message-ID: <Pine.LNX.4.64.0804080749210.5372@dildano.hawaga.org.uk>


Hi.

When running a program in Swift, there is a requirement that the input 
files for a job are placed in the current working directory that the unix 
process runs on.

Usually, that is achieved by using a shared filesystem between every 
worker node. But, I think that in a BOINC deployment, there will not be a 
shared filesystem that is shared between every worker node.

I see that you intend to do 'file transfer' with BOINC, but it is not 
clear to me how those files will be connected with the jobs that want to 
use them.

Is this addressed in your design?

-- 


From benc at hawaga.org.uk  Tue Apr  8 03:03:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 8 Apr 2008 08:03:55 +0000 (GMT)
Subject: [Swift-devel] Problem calling atomic procedure with multiple
	args via Falkon
In-Reply-To: <47FA6515.3060000@cs.uchicago.edu>
References: <47F976DB.2060500@mcs.anl.gov>
	<Pine.LNX.4.64.0804070610120.5372@dildano.hawaga.org.uk>
	<47FA6515.3060000@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804080802190.31934@dildano.hawaga.org.uk>


On Mon, 7 Apr 2008, Ioan Raicu wrote:

> > The C worker isn't portable enough to build on my laptop so I can't easily
> > play there, 

> The C worker is quite basic, what error do you get that it doesn't compile?
> It has compiled for me on numerous platforms as is, so if its something we
> need to fix in general to help it be more portable, let us know.

$ ./make.worker-c.sh 
Compiling C Executor
BGexec.c: In function 'set_sockopt':
BGexec.c:48: error: 'SOL_TCP' undeclared (first use in this function)
BGexec.c:48: error: (Each undeclared identifier is reported only once
BGexec.c:48: error: for each function it appears in.)
BGexec.c:48: error: 'TCP_KEEPCNT' undeclared (first use in this function)
BGexec.c:53: error: 'TCP_KEEPIDLE' undeclared (first use in this function)
BGexec.c:58: error: 'TCP_KEEPINTVL' undeclared (first use in this 
function)

$ uname -a
Darwin soju.hawaga.org.uk 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 
18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386


If/when you rearrange the source code so it can be easily checked out, you 
can have multi-platform testing of this on a bunch of platforms in NMI 
build-and-test.

-- 


From benc at hawaga.org.uk  Tue Apr  8 05:01:04 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 8 Apr 2008 10:01:04 +0000 (GMT)
Subject: [Swift-devel] cog r1957 breaks swift ftp usage (when port not
	specified?)
Message-ID: <Pine.LNX.4.64.0804080950520.31934@dildano.hawaga.org.uk>


CoG r1957 appears to break handling of gsiftp URLs specified in the Swift 
site catalog.

All of the site tests in 
https://svn.ci.uchicago.edu/svn/vdl2/trunk/tests/sites are configured that 
way and are broken by r1957.

When I apply svn diff -r 1957:1956 to my checkout, things work better (I 
ran the site tests on a few of the sites, but not all as I got tired of 
waiting)

First part of stack trace is:

Using sites file: ../sites/tp-fork-gram2-ftpport.xml
Using tc.data: ../sites/tc.data
For input string: ""
For input string: ""
        task:service @ vdl-sc.k, line: 23
        sys:if @ vdl-sc.k, line: 21
        gridftp @ tp-fork-gram2-ftpport.xml, line: 4
        pool @ tp-fork-gram2-ftpport.xml, line: 4
        pool @ tp-fork-gram2-ftpport.xml, line: 4
        org.globus.cog.karajan.workflow.nodes.Sequential @ 
tp-fork-gram2-ftpport.xml
        sys:executefile @ vdl-sc.k, line: 59
        task:resources @ vdl-sc.k, line: 59
        vdl:sitecatalog @ scheduler.xml, line: 42
        task:scheduler @ scheduler.xml, line: 27
        kernel:import @ scheduler.xml, line: 3
        kernel:project @ 061-cattwo.kml, line: 2
        061-cattwo-20080408-1100-eibujbf8
Caused by: java.lang.NumberFormatException: For input string: ""
        at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Integer.parseInt(Integer.java:468)
        at java.lang.Integer.parseInt(Integer.java:497)
        at 
org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.parse(ServiceContactImpl.java:81)
        at 
org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.<init>(ServiceContactImpl.java:26)
        at 
org.globus.cog.karajan.workflow.nodes.grid.ServiceNode.function(ServiceNode.java:123)
        at 
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
        at 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
        at 
org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)


-- 


From hategan at mcs.anl.gov  Tue Apr  8 06:10:03 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 08 Apr 2008 06:10:03 -0500
Subject: [Swift-devel] Re: cog r1957 breaks swift ftp usage (when port not
	specified?)
In-Reply-To: <Pine.LNX.4.64.0804080950520.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804080950520.31934@dildano.hawaga.org.uk>
Message-ID: <1207653003.7262.0.camel@blabla.mcs.anl.gov>

Grr!
Fix coming up.

On Tue, 2008-04-08 at 10:01 +0000, Ben Clifford wrote:
> CoG r1957 appears to break handling of gsiftp URLs specified in the Swift 
> site catalog.
> 
> All of the site tests in 
> https://svn.ci.uchicago.edu/svn/vdl2/trunk/tests/sites are configured that 
> way and are broken by r1957.
> 
> When I apply svn diff -r 1957:1956 to my checkout, things work better (I 
> ran the site tests on a few of the sites, but not all as I got tired of 
> waiting)
> 
> First part of stack trace is:
> 
> Using sites file: ../sites/tp-fork-gram2-ftpport.xml
> Using tc.data: ../sites/tc.data
> For input string: ""
> For input string: ""
>         task:service @ vdl-sc.k, line: 23
>         sys:if @ vdl-sc.k, line: 21
>         gridftp @ tp-fork-gram2-ftpport.xml, line: 4
>         pool @ tp-fork-gram2-ftpport.xml, line: 4
>         pool @ tp-fork-gram2-ftpport.xml, line: 4
>         org.globus.cog.karajan.workflow.nodes.Sequential @ 
> tp-fork-gram2-ftpport.xml
>         sys:executefile @ vdl-sc.k, line: 59
>         task:resources @ vdl-sc.k, line: 59
>         vdl:sitecatalog @ scheduler.xml, line: 42
>         task:scheduler @ scheduler.xml, line: 27
>         kernel:import @ scheduler.xml, line: 3
>         kernel:project @ 061-cattwo.kml, line: 2
>         061-cattwo-20080408-1100-eibujbf8
> Caused by: java.lang.NumberFormatException: For input string: ""
>         at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>         at java.lang.Integer.parseInt(Integer.java:468)
>         at java.lang.Integer.parseInt(Integer.java:497)
>         at 
> org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.parse(ServiceContactImpl.java:81)
>         at 
> org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.<init>(ServiceContactImpl.java:26)
>         at 
> org.globus.cog.karajan.workflow.nodes.grid.ServiceNode.function(ServiceNode.java:123)
>         at 
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
>         at 
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>         at 
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> 
> 
> 


From hategan at mcs.anl.gov  Tue Apr  8 06:21:15 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 08 Apr 2008 06:21:15 -0500
Subject: [Swift-devel] Re: cog r1957 breaks swift ftp usage (when port
	not specified?)
In-Reply-To: <1207653003.7262.0.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0804080950520.31934@dildano.hawaga.org.uk>
	<1207653003.7262.0.camel@blabla.mcs.anl.gov>
Message-ID: <1207653675.7262.2.camel@blabla.mcs.anl.gov>

Ok. Try r1962.

On Tue, 2008-04-08 at 06:10 -0500, Mihael Hategan wrote:
> Grr!
> Fix coming up.
> 
> On Tue, 2008-04-08 at 10:01 +0000, Ben Clifford wrote:
> > CoG r1957 appears to break handling of gsiftp URLs specified in the Swift 
> > site catalog.
> > 
> > All of the site tests in 
> > https://svn.ci.uchicago.edu/svn/vdl2/trunk/tests/sites are configured that 
> > way and are broken by r1957.
> > 
> > When I apply svn diff -r 1957:1956 to my checkout, things work better (I 
> > ran the site tests on a few of the sites, but not all as I got tired of 
> > waiting)
> > 
> > First part of stack trace is:
> > 
> > Using sites file: ../sites/tp-fork-gram2-ftpport.xml
> > Using tc.data: ../sites/tc.data
> > For input string: ""
> > For input string: ""
> >         task:service @ vdl-sc.k, line: 23
> >         sys:if @ vdl-sc.k, line: 21
> >         gridftp @ tp-fork-gram2-ftpport.xml, line: 4
> >         pool @ tp-fork-gram2-ftpport.xml, line: 4
> >         pool @ tp-fork-gram2-ftpport.xml, line: 4
> >         org.globus.cog.karajan.workflow.nodes.Sequential @ 
> > tp-fork-gram2-ftpport.xml
> >         sys:executefile @ vdl-sc.k, line: 59
> >         task:resources @ vdl-sc.k, line: 59
> >         vdl:sitecatalog @ scheduler.xml, line: 42
> >         task:scheduler @ scheduler.xml, line: 27
> >         kernel:import @ scheduler.xml, line: 3
> >         kernel:project @ 061-cattwo.kml, line: 2
> >         061-cattwo-20080408-1100-eibujbf8
> > Caused by: java.lang.NumberFormatException: For input string: ""
> >         at 
> > java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> >         at java.lang.Integer.parseInt(Integer.java:468)
> >         at java.lang.Integer.parseInt(Integer.java:497)
> >         at 
> > org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.parse(ServiceContactImpl.java:81)
> >         at 
> > org.globus.cog.abstraction.impl.common.task.ServiceContactImpl.<init>(ServiceContactImpl.java:26)
> >         at 
> > org.globus.cog.karajan.workflow.nodes.grid.ServiceNode.function(ServiceNode.java:123)
> >         at 
> > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
> >         at 
> > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> >         at 
> > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> > 
> > 
> > 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From benc at hawaga.org.uk  Tue Apr  8 06:40:52 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 8 Apr 2008 11:40:52 +0000 (GMT)
Subject: [Swift-devel] Re: cog r1957 breaks swift ftp usage (when port
	not specified?)
In-Reply-To: <1207653675.7262.2.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0804080950520.31934@dildano.hawaga.org.uk> 
	<1207653003.7262.0.camel@blabla.mcs.anl.gov>
	<1207653675.7262.2.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804081140390.31934@dildano.hawaga.org.uk>


Seems to work better.

On Tue, 8 Apr 2008, Mihael Hategan wrote:

> Ok. Try r1962.
> 
> On Tue, 2008-04-08 at 06:10 -0500, Mihael Hategan wrote:
> > Grr!
> > Fix coming up.
> > 
> > On Tue, 2008-04-08 at 10:01 +0000, Ben Clifford wrote:
> > > CoG r1957 appears to break handling of gsiftp URLs specified in the Swift 
> > > site catalog.

-- 


From iraicu at cs.uchicago.edu  Tue Apr  8 08:29:35 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 08 Apr 2008 08:29:35 -0500
Subject: [Swift-devel] Problem calling atomic procedure with multiple
	args via Falkon
In-Reply-To: <Pine.LNX.4.64.0804080802190.31934@dildano.hawaga.org.uk>
References: <47F976DB.2060500@mcs.anl.gov>
	<Pine.LNX.4.64.0804070610120.5372@dildano.hawaga.org.uk>
	<47FA6515.3060000@cs.uchicago.edu>
	<Pine.LNX.4.64.0804080802190.31934@dildano.hawaga.org.uk>
Message-ID: <47FB733F.2060200@cs.uchicago.edu>

Thanks for the error output.... they seem to be TCP related variables 
that are not found, so I assume that we have to include additional 
header files for your platform to ensure that it finds these variables. 

We'll track it down and fix it!
http://bugzilla.globus.org/globus/show_bug.cgi?id=5990

Ioan

Ben Clifford wrote:
> On Mon, 7 Apr 2008, Ioan Raicu wrote:
>
>   
>>> The C worker isn't portable enough to build on my laptop so I can't easily
>>> play there, 
>>>       
>
>   
>> The C worker is quite basic, what error do you get that it doesn't compile?
>> It has compiled for me on numerous platforms as is, so if its something we
>> need to fix in general to help it be more portable, let us know.
>>     
>
> $ ./make.worker-c.sh 
> Compiling C Executor
> BGexec.c: In function 'set_sockopt':
> BGexec.c:48: error: 'SOL_TCP' undeclared (first use in this function)
> BGexec.c:48: error: (Each undeclared identifier is reported only once
> BGexec.c:48: error: for each function it appears in.)
> BGexec.c:48: error: 'TCP_KEEPCNT' undeclared (first use in this function)
> BGexec.c:53: error: 'TCP_KEEPIDLE' undeclared (first use in this function)
> BGexec.c:58: error: 'TCP_KEEPINTVL' undeclared (first use in this 
> function)
>
> $ uname -a
> Darwin soju.hawaga.org.uk 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 
> 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386
>
>
> If/when you rearrange the source code so it can be easily checked out, you 
> can have multi-platform testing of this on a bunch of platforms in NMI 
> build-and-test.
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From iraicu at cs.uchicago.edu  Tue Apr  8 08:31:59 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 08 Apr 2008 08:31:59 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <1207614154.6864.0.camel@blabla.mcs.anl.gov>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>	
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>	
	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>	
	<1207555192.22686.6.camel@blabla.mcs.anl.gov>	
	<47FA151D.1080504@mcs.anl.gov>	
	<1207572335.27797.5.camel@blabla.mcs.anl.gov>	
	<Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>	
	<1207573709.27797.16.camel@blabla.mcs.anl.gov>	
	<47FA6564.60008@cs.uchicago.edu>
	<1207614154.6864.0.camel@blabla.mcs.anl.gov>
Message-ID: <47FB73CF.4090509@cs.uchicago.edu>

We use the default.  For the SiCortex, we had to tweak the TCP 
keepalives to ensure that the TCP connections were not getting 
disconnected by the firewall on the SiCortex, which only allowed 180 
seconds of inactivity before it disconnected connections.  This meant 
that any job that took more than 180 seconds, or any Falkon idleness for 
more than 180 seconds resulted in TCP connection terminations.  BTW, we 
did not experience this kind of firewall rules when running in other 
environments, so it took us a week to debug and find the root of the 
problem.  This also happens because the Falkon service was running 
outside the SiCortex home network, but we had to do this as the SiCortex 
doesn't support Java, and at the time, didn't have access to any system 
within the internal network that supported Java.

Ioan

Mihael Hategan wrote:
> Do you tweak the TCP window size or do you use the default?
>
> On Mon, 2008-04-07 at 13:18 -0500, Ioan Raicu wrote:
>   
>> I agree that the BG/P is the only system I can think of right now that
>> won't work with the UDP scheme you currently have, assuming that you
>> will run the service on a login node that has access to both compute
>> nodes and external world (i.e. Swift).  The compute nodes don't
>> support Java, so you'd have to have some C/Fortran code, or maybe some
>> scripting language (which I don't know what kind of support there is).
>> If you use C or Fortran, MPI becomes a viable alternative.  TCP has
>> always been an alternative.  Anyways, if UDP doesn't work on the BG/P,
>> and the BG/P is the only scale large enough (today) that warrants a
>> connectionless protocol, then I suggest you switch to TCP (which has
>> worked for us well on the BG/P, and is general enough to work in most
>> environments) or even MPI (but you loose the generality of TCP, but
>> might gain performance).
>>
>> Ioan
>>
>> Mihael Hategan wrote: 
>>     
>>> On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote:
>>>   
>>>       
>>>> Wary of excessive optimisation of job completion notification speed in 
>>>> order to get high 'trivial/useless job' numbers, when there also seem to 
>>>> be problems getting shared filesystem access fast enough for non-useless 
>>>> jobs. Getting a ridiculously high trivial job throughput is not (in my 
>>>> eyes) a design goal of this coaster work.
>>>>     
>>>>         
>>> 200 j/s should be enough for anybody.
>>>
>>> Joking aside, the issue was ability to scale to large number of jobs
>>> rather than speed. But it looks like the issue is only an issue for
>>> monsters such as the BG/P.
>>>
>>>   
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>>   
>>>       
>> -- 
>> ===================================================
>> Ioan Raicu
>> Ph.D. Candidate
>> ===================================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ===================================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>> http://dev.globus.org/wiki/Incubator/Falkon
>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>> ===================================================
>> ===================================================
>>
>>     
>
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From bugzilla-daemon at mcs.anl.gov  Tue Apr  8 09:09:34 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  8 Apr 2008 09:09:34 -0500 (CDT)
Subject: [Swift-devel] [Bug 106] Improve error messages for double-set and
	un-set variables
In-Reply-To: <bug-106-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080408140934.9EDB6164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=106


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


------- Comment #1 from benc at hawaga.org.uk  2008-04-08 09:09 -------
t10 is fairly easy to fix (I have a patch that I think does this)

t7 is harder. the variable m is marked as an input (because it is never
assigned to). in the case of a file-mapped variable, that would mean the rest
of execution would assume that the backing file exists at the start. In the
case of a variable that is being used as an in-memory unmapped variable like m,
then the present behaviour doesn't work. There perhaps need to be tighter
constraints on when it is permissible to extract a value from a closed dataset
in this situation.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Tue Apr  8 09:13:44 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  8 Apr 2008 09:13:44 -0500 (CDT)
Subject: [Swift-devel] [Bug 76] disable intermediate stageout of data
In-Reply-To: <bug-76-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080408141344.5E10D164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=76


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE


------- Comment #2 from benc at hawaga.org.uk  2008-04-08 09:13 -------


*** This bug has been marked as a duplicate of 29 ***


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From bugzilla-daemon at mcs.anl.gov  Tue Apr  8 09:13:44 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  8 Apr 2008 09:13:44 -0500 (CDT)
Subject: [Swift-devel] [Bug 29] Staging out of temporary files
In-Reply-To: <bug-29-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080408141344.A6EF416532@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nefedova at mcs.anl.gov


------- Comment #2 from benc at hawaga.org.uk  2008-04-08 09:13 -------
*** Bug 76 has been marked as a duplicate of this bug. ***


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From bugzilla-daemon at mcs.anl.gov  Tue Apr  8 09:13:45 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  8 Apr 2008 09:13:45 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080408141345.09D5516562@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


Bug 72 depends on bug 76, which changed state.

Bug 76 Summary: disable intermediate stageout of data
http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=76

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From bugzilla-daemon at mcs.anl.gov  Tue Apr  8 09:27:57 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  8 Apr 2008 09:27:57 -0500 (CDT)
Subject: [Swift-devel] [Bug 32] Hello world gone wild
In-Reply-To: <bug-32-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080408142757.C9E211650A@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=32


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #3 from benc at hawaga.org.uk  2008-04-08 09:27 -------
the mentioned jobmanager attribute was added in Swift 0.4


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From bugzilla-daemon at mcs.anl.gov  Tue Apr  8 09:29:15 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  8 Apr 2008 09:29:15 -0500 (CDT)
Subject: [Swift-devel] [Bug 9] Limitation when abusing the submission rate
In-Reply-To: <bug-9-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080408142915.3A8D1164EC@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=9


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Tue Apr  8 09:38:48 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  8 Apr 2008 09:38:48 -0500 (CDT)
Subject: [Swift-devel] [Bug 56] multiple variable definitions in a row hide
	previous ones rather than causing a syntax error.
In-Reply-To: <bug-56-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080408143848.8DABC164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=56


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from benc at hawaga.org.uk  2008-04-08 09:38 -------
A compile-time error for this was introduced in Swift 0.4.
r1786 introduces a test for this bug based on the below code.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.


From hategan at mcs.anl.gov  Tue Apr  8 10:00:48 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 08 Apr 2008 10:00:48 -0500
Subject: [Swift-devel] coaster status summary
In-Reply-To: <47FB73CF.4090509@cs.uchicago.edu>
References: <1207301977.1658.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804040941120.5372@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804070701010.31934@dildano.hawaga.org.uk>
	<1207555192.22686.6.camel@blabla.mcs.anl.gov>
	<47FA151D.1080504@mcs.anl.gov>
	<1207572335.27797.5.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804071246520.31934@dildano.hawaga.org.uk>
	<1207573709.27797.16.camel@blabla.mcs.anl.gov>
	<47FA6564.60008@cs.uchicago.edu>
	<1207614154.6864.0.camel@blabla.mcs.anl.gov>
	<47FB73CF.4090509@cs.uchicago.edu>
Message-ID: <1207666848.12977.5.camel@blabla.mcs.anl.gov>

You may want to try lowering the window size. The default is in the
order of 100K (as far as I understand from various sources). That may be
quite a bit if you have many connections. It may also be fairly useless
for local LAN connections used to send short messages (i.e. less than
the MTU/MSS).

On Tue, 2008-04-08 at 08:31 -0500, Ioan Raicu wrote:
> We use the default.  For the SiCortex, we had to tweak the TCP 
> keepalives to ensure that the TCP connections were not getting 
> disconnected by the firewall on the SiCortex, which only allowed 180 
> seconds of inactivity before it disconnected connections.  This meant 
> that any job that took more than 180 seconds, or any Falkon idleness for 
> more than 180 seconds resulted in TCP connection terminations.  BTW, we 
> did not experience this kind of firewall rules when running in other 
> environments, so it took us a week to debug and find the root of the 
> problem.  This also happens because the Falkon service was running 
> outside the SiCortex home network, but we had to do this as the SiCortex 
> doesn't support Java, and at the time, didn't have access to any system 
> within the internal network that supported Java.
> 
> Ioan
> 
> Mihael Hategan wrote:
> > Do you tweak the TCP window size or do you use the default?
> >
> > On Mon, 2008-04-07 at 13:18 -0500, Ioan Raicu wrote:
> >   
> >> I agree that the BG/P is the only system I can think of right now that
> >> won't work with the UDP scheme you currently have, assuming that you
> >> will run the service on a login node that has access to both compute
> >> nodes and external world (i.e. Swift).  The compute nodes don't
> >> support Java, so you'd have to have some C/Fortran code, or maybe some
> >> scripting language (which I don't know what kind of support there is).
> >> If you use C or Fortran, MPI becomes a viable alternative.  TCP has
> >> always been an alternative.  Anyways, if UDP doesn't work on the BG/P,
> >> and the BG/P is the only scale large enough (today) that warrants a
> >> connectionless protocol, then I suggest you switch to TCP (which has
> >> worked for us well on the BG/P, and is general enough to work in most
> >> environments) or even MPI (but you loose the generality of TCP, but
> >> might gain performance).
> >>
> >> Ioan
> >>
> >> Mihael Hategan wrote: 
> >>     
> >>> On Mon, 2008-04-07 at 12:49 +0000, Ben Clifford wrote:
> >>>   
> >>>       
> >>>> Wary of excessive optimisation of job completion notification speed in 
> >>>> order to get high 'trivial/useless job' numbers, when there also seem to 
> >>>> be problems getting shared filesystem access fast enough for non-useless 
> >>>> jobs. Getting a ridiculously high trivial job throughput is not (in my 
> >>>> eyes) a design goal of this coaster work.
> >>>>     
> >>>>         
> >>> 200 j/s should be enough for anybody.
> >>>
> >>> Joking aside, the issue was ability to scale to large number of jobs
> >>> rather than speed. But it looks like the issue is only an issue for
> >>> monsters such as the BG/P.
> >>>
> >>>   
> >>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>
> >>>   
> >>>       
> >> -- 
> >> ===================================================
> >> Ioan Raicu
> >> Ph.D. Candidate
> >> ===================================================
> >> Distributed Systems Laboratory
> >> Computer Science Department
> >> University of Chicago
> >> 1100 E. 58th Street, Ryerson Hall
> >> Chicago, IL 60637
> >> ===================================================
> >> Email: iraicu at cs.uchicago.edu
> >> Web:   http://www.cs.uchicago.edu/~iraicu
> >> http://dev.globus.org/wiki/Incubator/Falkon
> >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> >> ===================================================
> >> ===================================================
> >>
> >>     
> >
> >
> >   
> 


From duxu at mcs.anl.gov  Tue Apr  8 11:37:29 2008
From: duxu at mcs.anl.gov (Xu Du)
Date: Tue, 8 Apr 2008 11:37:29 -0500
Subject: [Swift-devel] Swift Innovation for BOINC : Design Spec
References: <00ec01c89361$e21c5ec0$6a08dd8c@karen>
	<47F22DCE.5070405@mcs.anl.gov>
	<004601c898bf$c6335ed0$9a01a8c0@karen>
	<Pine.LNX.4.64.0804080749210.5372@dildano.hawaga.org.uk>
Message-ID: <002401c89996$d600cd20$6a08dd8c@karen>

Hi Ben,

The object of the project is to enable swift dispatch a job to BOINC. Currently, BOINC and Swift are independent.  BOINC has itself mechanism to manage the jobs( tasks),  it doesn't employ the shared file system. Simple speaking, a tasks will be created and the relate data (input files) will be put into BOINC Date Base when we submit a job to BOINC sever. In fact, there is no change on the files (including the executable program and the data) during so called "file transfer", what is done here is only to put the files to BOINC DB and register a new task on BOINC sever.  After a task submitted, BOINC server will handle the task, all BOINC clients connect with BOINC server, the procedure of BOINC system processing the task is transparent to Swift. 

Thanks,

Xu

----- Original Message ----- 
From: "Ben Clifford" <benc at hawaga.org.uk>
To: "Xu Du" <duxu at 263.net>
Cc: "Michael Wilde" <wilde at mcs.anl.gov>; <swift-devel at ci.uchicago.edu>
Sent: Tuesday, April 08, 2008 2:54 AM
Subject: Re: [Swift-devel] Swift Innovation for BOINC : Design Spec


> 
> Hi.
> 
> When running a program in Swift, there is a requirement that the input 
> files for a job are placed in the current working directory that the unix 
> process runs on.
> 
> Usually, that is achieved by using a shared filesystem between every 
> worker node. But, I think that in a BOINC deployment, there will not be a 
> shared filesystem that is shared between every worker node.
> 
> I see that you intend to do 'file transfer' with BOINC, but it is not 
> clear to me how those files will be connected with the jobs that want to 
> use them.
> 
> Is this addressed in your design?
> 
> -- 
> 
> 
>

From benc at hawaga.org.uk  Wed Apr  9 04:58:30 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 9 Apr 2008 09:58:30 +0000 (GMT)
Subject: [Swift-devel] swift 0.5rc2
Message-ID: <Pine.LNX.4.64.0804090946250.5372@dildano.hawaga.org.uk>


release candidate 2 for swift 0.5 is available here:

http://www.ci.uchicago.edu/~benc/vdsk-0.5rc2.tar.gz

Please test. If no significant fixes required, I'll put it out at the 
weekend as final release.

-- 


From bugzilla-daemon at mcs.anl.gov  Wed Apr  9 08:58:41 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed,  9 Apr 2008 08:58:41 -0500 (CDT)
Subject: [Swift-devel] [Bug 106] Improve error messages for double-set and
	un-set variables
In-Reply-To: <bug-106-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080409135841.0975E164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=106


------- Comment #2 from benc at hawaga.org.uk  2008-04-09 08:58 -------
r1785 adds multiple assignment detection (which I thought I'd put in
previously, but apparently not); this addresses the first part of the bug
report.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From benc at hawaga.org.uk  Wed Apr  9 10:08:24 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 9 Apr 2008 15:08:24 +0000 (GMT)
Subject: [Swift-devel] how to put different wrapper behaviour into production
Message-ID: <Pine.LNX.4.64.0804091458430.31934@dildano.hawaga.org.uk>


Last week or so I made some patches that change wrapper.sh to copy lots of 
stuff around to a(n assumed) worker-local filesystem rather than using the 
shared filesystem.

I don't particularly like this for general use - it means doing more 
steps, and more stuff to go wrong. Most especially, the worker-node-local 
info log files mean that if something goes wrong during execution (as 
often happens) there is a much greater level of difficulty in getting hold 
of those logs to debug.

There are two paths that I see:

i) add a swift runtime option that is passed to the wrapper, to select 
more-worker-node-local or less-worker-node-local behaviour; with one 
wrapper script able to function in both modes.

or

ii) allow the wrapper script to be specified as a runtime option; supply 
the standard wrapper script and the worker-node local script.


Option i leads down a path of perhaps having lots of different options 
passed to the worker. This might be a good thing or might not.

Option ii allows more open ended customisation of the wrapper scripts, but 
is likely to result in people keeping their own versions of the wrapper 
script around which will quickly stagnate and cause problems when they try 
to use.

I'm somewhat inclined towards option ii.

-- 


From bugzilla-daemon at mcs.anl.gov  Wed Apr  9 10:26:05 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed,  9 Apr 2008 10:26:05 -0500 (CDT)
Subject: [Swift-devel] [Bug 40] source location indication in execution-time
	error messages
In-Reply-To: <bug-40-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080409152605.228AD164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=40


------- Comment #2 from benc at hawaga.org.uk  2008-04-09 10:26 -------
In Swift 0.4, better compile-time line number handling was added, as was better
compile time error checking. Whilst not directly addressing this bug, many
situations where this was a problem before are now caught by the compile time
changes.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at mcs.anl.gov  Wed Apr  9 10:29:38 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed,  9 Apr 2008 10:29:38 -0500 (CDT)
Subject: [Swift-devel] [Bug 42] paper(s) on Swift web not externally readable
In-Reply-To: <bug-42-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080409152938.8B393164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=42


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from benc at hawaga.org.uk  2008-04-09 10:29 -------
both of the mentioned papers are in the CI Swift WWW space and are accessible
(to me).


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Wed Apr  9 10:46:45 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed,  9 Apr 2008 10:46:45 -0500 (CDT)
Subject: [Swift-devel] [Bug 132] New: order of stdin and stdout on app
	commandline can cause XML validation exceptions
Message-ID: <bug-132-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=132

           Summary: order of stdin and stdout on app commandline can cause
                    XML validation exceptions
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk


The order in which stdin and stdout (and presumably stderr) elements are placed
in the intermediate XML appears to be the same as the order in which they
appear in source text. However, only one order is valid according to the XML
schema.

This validates:
        echo "hello" stdin=@filename(t) stdout=@filename(q);

This does not:
        echo "hello" stdout=@filename(q) stdin=@filename(t);


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at mcs.anl.gov  Wed Apr  9 12:13:42 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed,  9 Apr 2008 12:13:42 -0500 (CDT)
Subject: [Swift-devel] [Bug 132] order of stdin and stdout on app
	commandline can cause XML validation exceptions
In-Reply-To: <bug-132-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080409171342.15F27164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=132


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from benc at hawaga.org.uk  2008-04-09 12:13 -------
this should be fixed in r1787


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at mcs.anl.gov  Wed Apr  9 16:19:04 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed,  9 Apr 2008 16:19:04 -0500 (CDT)
Subject: [Swift-devel] [Bug 101] failure in site initialisation appears to
	cause job to fail rather than be retried elsewhere.
In-Reply-To: <bug-101-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080409211904.29AFE164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101


------- Comment #2 from benc at hawaga.org.uk  2008-04-09 16:19 -------
There's another example of this in ccf-perm-wf-20080409-1511-kz872673.log

Looks like permission error on one site means that it becomes available for use
again rapidly, whilst the other sites (3 of them) are occupied running jobs
successfully.

So a failed job is retried on the only free resource, the broken one, over and
over until it fails.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
You reported the bug, or are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Thu Apr 10 03:24:18 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 10 Apr 2008 03:24:18 -0500 (CDT)
Subject: [Swift-devel] [Bug 101] failure in site initialisation appears to
	cause job to fail rather than be retried elsewhere.
In-Reply-To: <bug-101-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080410082418.15FFA164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|failure in site             |failure in site
                   |initialisation appears to   |initialisation appears to
                   |cause job to fail rather    |cause job to fail  rather
                   |than be retried elsewhere.  |than be retried elsewhere.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
You reported the bug, or are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Thu Apr 10 03:25:24 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 10 Apr 2008 03:25:24 -0500 (CDT)
Subject: [Swift-devel] [Bug 101] failure in site initialisation appears to
	cause job to fail rather than be retried elsewhere.
In-Reply-To: <bug-101-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080410082524.7A30E164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|failure in site             |failure in site
                   |initialisation appears to   |initialisation appears to
                   |cause job to fail  rather   |cause job to fail   rather
                   |than be retried elsewhere.  |than be retried elsewhere.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
You reported the bug, or are watching the reporter.


From wilde at mcs.anl.gov  Thu Apr 10 07:31:51 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 10 Apr 2008 07:31:51 -0500
Subject: [Swift-devel] SSH support
In-Reply-To: <1195854124.12780.7.camel@blabla.mcs.anl.gov>
References: <1195854124.12780.7.camel@blabla.mcs.anl.gov>
Message-ID: <47FE08B7.9070606@mcs.anl.gov>

I just tried this for the first time and I cant get it to work, Mihael. 
Can you take a look?

I get these errors:

2008-04-09 23:04:47,221-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, 
identity=urn:0-1207800286502) setting status to Failed 
org.globus.cog.abstraction.impl.file.FileResourceException: Error while 
communi\
cating with the SSH server on login.ci.uchicago.edu:22
Could not initialize shared directory on login.ci
...
Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: 
Error while communicating with the SSH server on login.ci.uchicago.edu:22
...
Caused by: 
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
org.globus.cog.abstraction.impl.file.FileResourceException: Error while 
communicating with the SSH server on login.ci.uchica\
go.edu:22
Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: 
Error while communicating with the SSH server on login.ci.uchicago.edu:22
Caused by: java.lang.NullPointerException


All the related files and logs are in ~benc/swift/logs/wilde/run354

Im running swift on terminable, with a 1-job test workflow to login.ci.

I created a new rsa key, with a passphrase, and added it to 
authorized-keys. I tested the key and can manually ssh to login.ci from 
terminable with it, and verified the passphrase. (see file keytest)

Also, once we have this working, can I eliminate the passphrase from 
auth.defaults if I use an agent?

Thanks,

- Mike


On 11/23/07 3:42 PM, Mihael Hategan wrote:
> I've updated the SSH provider in cog to do a few things:
> - make better use of connections (cache them). SSH has this nifty thing:
> On one connection you can configure multiple independent channels
> (OpenSSH servers seem to support up to 10 such channels per connection).
> With this you get up to 10 independent shells without authenticating
> again.
> - access remote filesystems (a file op provider) with SFTP
> - get default authentication information from a file
> (~/.ssh/auth.defaults). I attached a sample. I need to document this.
> 
> I also added a filesystem element in the site catalog, which works in a
> similar way to the execution element:
>  <pool handle="plussed" sysinfo="INTEL32::LINUX">
>     <filesystem provider="ssh" url="plussed.mcs.anl.gov"
> storage="/homes/hategan/tmp" />
>     <execution provider="ssh" url="plussed.mcs.anl.gov" />
>     <workdirectory>/homes/hategan/tmp</workdirectory>
>   </pool>
> 
> That basically allows Swift to work with SSH.
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From wilde at mcs.anl.gov  Thu Apr 10 07:55:06 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 10 Apr 2008 07:55:06 -0500
Subject: [Swift-devel] SSH support
In-Reply-To: <47FE08B7.9070606@mcs.anl.gov>
References: <1195854124.12780.7.camel@blabla.mcs.anl.gov>
	<47FE08B7.9070606@mcs.anl.gov>
Message-ID: <47FE0E2A.4000404@mcs.anl.gov>

Ooops, I have a typo in my sites file - I fixed it but must have saved 
into wrong place. Let me re-test before you look into this. Sorry.

- Mike


On 4/10/08 7:31 AM, Michael Wilde wrote:
> I just tried this for the first time and I cant get it to work, Mihael. 
> Can you take a look?
> 
> I get these errors:
> 
> 2008-04-09 23:04:47,221-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, 
> identity=urn:0-1207800286502) setting status to Failed 
> org.globus.cog.abstraction.impl.file.FileResourceException: Error while 
> communi\
> cating with the SSH server on login.ci.uchicago.edu:22
> Could not initialize shared directory on login.ci
> ...
> Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: 
> Error while communicating with the SSH server on login.ci.uchicago.edu:22
> ...
> Caused by: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
> org.globus.cog.abstraction.impl.file.FileResourceException: Error while 
> communicating with the SSH server on login.ci.uchica\
> go.edu:22
> Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: 
> Error while communicating with the SSH server on login.ci.uchicago.edu:22
> Caused by: java.lang.NullPointerException
> 
> 
> All the related files and logs are in ~benc/swift/logs/wilde/run354
> 
> Im running swift on terminable, with a 1-job test workflow to login.ci.
> 
> I created a new rsa key, with a passphrase, and added it to 
> authorized-keys. I tested the key and can manually ssh to login.ci from 
> terminable with it, and verified the passphrase. (see file keytest)
> 
> Also, once we have this working, can I eliminate the passphrase from 
> auth.defaults if I use an agent?
> 
> Thanks,
> 
> - Mike
> 
> 
> 
> On 11/23/07 3:42 PM, Mihael Hategan wrote:
>> I've updated the SSH provider in cog to do a few things:
>> - make better use of connections (cache them). SSH has this nifty thing:
>> On one connection you can configure multiple independent channels
>> (OpenSSH servers seem to support up to 10 such channels per connection).
>> With this you get up to 10 independent shells without authenticating
>> again.
>> - access remote filesystems (a file op provider) with SFTP
>> - get default authentication information from a file
>> (~/.ssh/auth.defaults). I attached a sample. I need to document this.
>>
>> I also added a filesystem element in the site catalog, which works in a
>> similar way to the execution element:
>>  <pool handle="plussed" sysinfo="INTEL32::LINUX">
>>     <filesystem provider="ssh" url="plussed.mcs.anl.gov"
>> storage="/homes/hategan/tmp" />
>>     <execution provider="ssh" url="plussed.mcs.anl.gov" />
>>     <workdirectory>/homes/hategan/tmp</workdirectory>
>>   </pool>
>>
>> That basically allows Swift to work with SSH.
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From hategan at mcs.anl.gov  Thu Apr 10 08:02:53 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 10 Apr 2008 08:02:53 -0500
Subject: [Swift-devel] SSH support
In-Reply-To: <47FE08B7.9070606@mcs.anl.gov>
References: <1195854124.12780.7.camel@blabla.mcs.anl.gov>
	<47FE08B7.9070606@mcs.anl.gov>
Message-ID: <1207832573.26566.13.camel@blabla.mcs.anl.gov>


On Thu, 2008-04-10 at 07:31 -0500, Michael Wilde wrote:
> I just tried this for the first time and I cant get it to work, Mihael. 
> Can you take a look?
> 
> I get these errors:
> 
> 2008-04-09 23:04:47,221-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, 
> identity=urn:0-1207800286502) setting status to Failed 
> org.globus.cog.abstraction.impl.file.FileResourceException: Error while 
> communi\
> cating with the SSH server on login.ci.uchicago.edu:22
> Could not initialize shared directory on login.ci
> ...
> Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: 
> Error while communicating with the SSH server on login.ci.uchicago.edu:22
> ...
> Caused by: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
> org.globus.cog.abstraction.impl.file.FileResourceException: Error while 
> communicating with the SSH server on login.ci.uchica\
> go.edu:22
> Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: 
> Error while communicating with the SSH server on login.ci.uchicago.edu:22
> Caused by: java.lang.NullPointerException
> 
> 
> All the related files and logs are in ~benc/swift/logs/wilde/run354

I'll take a look.

> 
> Im running swift on terminable, with a 1-job test workflow to login.ci.
> 
> I created a new rsa key, with a passphrase, and added it to 
> authorized-keys. I tested the key and can manually ssh to login.ci from 
> terminable with it, and verified the passphrase. (see file keytest)
> 
> Also, once we have this working, can I eliminate the passphrase from 
> auth.defaults if I use an agent?

No. The agent is not supported.

> 
> Thanks,
> 
> - Mike
> 
> 
> 
> On 11/23/07 3:42 PM, Mihael Hategan wrote:
> > I've updated the SSH provider in cog to do a few things:
> > - make better use of connections (cache them). SSH has this nifty thing:
> > On one connection you can configure multiple independent channels
> > (OpenSSH servers seem to support up to 10 such channels per connection).
> > With this you get up to 10 independent shells without authenticating
> > again.
> > - access remote filesystems (a file op provider) with SFTP
> > - get default authentication information from a file
> > (~/.ssh/auth.defaults). I attached a sample. I need to document this.
> > 
> > I also added a filesystem element in the site catalog, which works in a
> > similar way to the execution element:
> >  <pool handle="plussed" sysinfo="INTEL32::LINUX">
> >     <filesystem provider="ssh" url="plussed.mcs.anl.gov"
> > storage="/homes/hategan/tmp" />
> >     <execution provider="ssh" url="plussed.mcs.anl.gov" />
> >     <workdirectory>/homes/hategan/tmp</workdirectory>
> >   </pool>
> > 
> > That basically allows Swift to work with SSH.
> > 
> > 
> > ------------------------------------------------------------------------
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From wilde at mcs.anl.gov  Thu Apr 10 08:10:41 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 10 Apr 2008 08:10:41 -0500
Subject: [Swift-devel] SSH support
In-Reply-To: <47FE0E2A.4000404@mcs.anl.gov>
References: <1195854124.12780.7.camel@blabla.mcs.anl.gov>
	<47FE08B7.9070606@mcs.anl.gov> <47FE0E2A.4000404@mcs.anl.gov>
Message-ID: <47FE11D1.8050408@mcs.anl.gov>

Found and fixed a typo, and it works.  Very nice! I'll use this to 
access the SiCortex.

- Mike

On 4/10/08 7:55 AM, Michael Wilde wrote:
> Ooops, I have a typo in my sites file - I fixed it but must have saved 
> into wrong place. Let me re-test before you look into this. Sorry.
> 
> - Mike
> 
> 
> On 4/10/08 7:31 AM, Michael Wilde wrote:
>> I just tried this for the first time and I cant get it to work, 
>> Mihael. Can you take a look?
>>
>> I get these errors:
>>
>> 2008-04-09 23:04:47,221-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, 
>> identity=urn:0-1207800286502) setting status to Failed 
>> org.globus.cog.abstraction.impl.file.FileResourceException: Error 
>> while communi\
>> cating with the SSH server on login.ci.uchicago.edu:22
>> Could not initialize shared directory on login.ci
>> ...
>> Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: 
>> Error while communicating with the SSH server on login.ci.uchicago.edu:22
>> ...
>> Caused by: 
>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
>> org.globus.cog.abstraction.impl.file.FileResourceException: Error 
>> while communicating with the SSH server on login.ci.uchica\
>> go.edu:22
>> Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: 
>> Error while communicating with the SSH server on login.ci.uchicago.edu:22
>> Caused by: java.lang.NullPointerException
>>
>>
>> All the related files and logs are in ~benc/swift/logs/wilde/run354
>>
>> Im running swift on terminable, with a 1-job test workflow to login.ci.
>>
>> I created a new rsa key, with a passphrase, and added it to 
>> authorized-keys. I tested the key and can manually ssh to login.ci 
>> from terminable with it, and verified the passphrase. (see file keytest)
>>
>> Also, once we have this working, can I eliminate the passphrase from 
>> auth.defaults if I use an agent?
>>
>> Thanks,
>>
>> - Mike
>>
>>
>>
>> On 11/23/07 3:42 PM, Mihael Hategan wrote:
>>> I've updated the SSH provider in cog to do a few things:
>>> - make better use of connections (cache them). SSH has this nifty thing:
>>> On one connection you can configure multiple independent channels
>>> (OpenSSH servers seem to support up to 10 such channels per connection).
>>> With this you get up to 10 independent shells without authenticating
>>> again.
>>> - access remote filesystems (a file op provider) with SFTP
>>> - get default authentication information from a file
>>> (~/.ssh/auth.defaults). I attached a sample. I need to document this.
>>>
>>> I also added a filesystem element in the site catalog, which works in a
>>> similar way to the execution element:
>>>  <pool handle="plussed" sysinfo="INTEL32::LINUX">
>>>     <filesystem provider="ssh" url="plussed.mcs.anl.gov"
>>> storage="/homes/hategan/tmp" />
>>>     <execution provider="ssh" url="plussed.mcs.anl.gov" />
>>>     <workdirectory>/homes/hategan/tmp</workdirectory>
>>>   </pool>
>>>
>>> That basically allows Swift to work with SSH.
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
> 


From benc at hawaga.org.uk  Fri Apr 11 08:43:28 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 11 Apr 2008 13:43:28 +0000 (GMT)
Subject: [Swift-devel] fast-failing jobs
Message-ID: <Pine.LNX.4.64.0804111335210.31934@dildano.hawaga.org.uk>


bug 101 discusses a class of site-selection failures that look like this:

two (or more) sites:
  site G works
  site F fails all jobs submitted to it, very rapidly.

Submit 10 non-trivial jobs for scheduling. At present, the minimum number 
of simultaneous jobs that will be sent to a site is 2. Two jobs go to site 
G, and occupy it (for eg 20 minutes); two jobs go to site F and fail 
(within eg. 10 seconds). two more jobs go to site F and fail (within eg 10 
seconds). All jobs apart from the two jobs that went to site G are 
repeatedly submitted to site F and fail, exhausting all their retries and 
causing a workflow failure.

One approach to stopping this is to slow down submission to poorly scoring 
sites. However, in this case, the delay between submissions would need to 
be on the scale of minutes .. tens of minutes to avoid this.

However, the delay needs to be on roughly the same scale as the length of 
a job, which varies widely depending on usage (some people are putting 
through half hour jobs, some people put through jobs that are a few 
seconds long). That seems difficult to determine at startup.

It seems undesirable to block a site from execution entirely based on poor 
performance because much can change over the duration of a long run 
(working sites break and non-working sites unbreak).

Related to the need for job execution length information here is stuff 
we've talked about in the past where jobs should be unselected/relaunched 
at a different site if they take 'too long', where 'too long' is 
determined based perhaps on some statistical analysis of other jobs that 
have executed successfully.

-- 


From iraicu at cs.uchicago.edu  Fri Apr 11 09:11:42 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 11 Apr 2008 09:11:42 -0500
Subject: [Swift-devel] fast-failing jobs
In-Reply-To: <Pine.LNX.4.64.0804111335210.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804111335210.31934@dildano.hawaga.org.uk>
Message-ID: <47FF719E.2040404@cs.uchicago.edu>

We addressed this in Falkon by suspending bad nodes (within Falkon).  
About trying to solve the problem in general, here is an idea.

The retry counter is on a per site basis.  Lets assume the max retry is 
set to 3, and we have 4 sites, of which 3 are broken (fail fast, seconds 
per job), and only 1 site is good (computes for minutes per job).  
Assuming we have 10 jobs in total to do, within 1 minute, all 10 jobs 
will have failed 3 times per site, and the only site left that could 
potentially run these 10 jobs is the 4th site that is working at a few 
minutes per job.  Now, the 3 sites that are bad aren't penalized in any 
way, if there are jobs that have not run there yet and failed, then they 
will be tried...

This sounds like it would fix your problem, but, I am not sure how easy 
it is to keep track of the retry per site, and only fail a job if it has 
failed the max number of times at all sites!

Ioan

Ben Clifford wrote:
> bug 101 discusses a class of site-selection failures that look like this:
>
> two (or more) sites:
>   site G works
>   site F fails all jobs submitted to it, very rapidly.
>
> Submit 10 non-trivial jobs for scheduling. At present, the minimum number 
> of simultaneous jobs that will be sent to a site is 2. Two jobs go to site 
> G, and occupy it (for eg 20 minutes); two jobs go to site F and fail 
> (within eg. 10 seconds). two more jobs go to site F and fail (within eg 10 
> seconds). All jobs apart from the two jobs that went to site G are 
> repeatedly submitted to site F and fail, exhausting all their retries and 
> causing a workflow failure.
>
> One approach to stopping this is to slow down submission to poorly scoring 
> sites. However, in this case, the delay between submissions would need to 
> be on the scale of minutes .. tens of minutes to avoid this.
>
> However, the delay needs to be on roughly the same scale as the length of 
> a job, which varies widely depending on usage (some people are putting 
> through half hour jobs, some people put through jobs that are a few 
> seconds long). That seems difficult to determine at startup.
>
> It seems undesirable to block a site from execution entirely based on poor 
> performance because much can change over the duration of a long run 
> (working sites break and non-working sites unbreak).
>
> Related to the need for job execution length information here is stuff 
> we've talked about in the past where jobs should be unselected/relaunched 
> at a different site if they take 'too long', where 'too long' is 
> determined based perhaps on some statistical analysis of other jobs that 
> have executed successfully.
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From wilde at mcs.anl.gov  Sat Apr 12 08:45:19 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 12 Apr 2008 08:45:19 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
Message-ID: <4800BCEF.5040206@mcs.anl.gov>

Ben, can you confirm this: to turn off all job submission throttling 
(for Falkon), the correct setting for each of the following props is "off"?

# remove all limits on job submit rates
throttle.submit=off
throttle.host.submit=off
throttle.score.job.factor=off

Long ago (circa Nov) it seemed "off" didn't give the the wide-open 
throttle effect I was looking for, but "off" is a more clear setting 
than "big numbers" if we know it to be working as expected.

- Mike


On 4/12/08 5:22 AM, Ben Clifford wrote:
> On Sat, 12 Apr 2008, Zhao Zhang wrote:
> 
>>   i) what files are shared (so swift will only stage-in once)
>>        how many files per job and how big in total per job?
>>   
>>
>> I am not familiar with definition of "stage-in". Mike, could you help to
>> explain this?
> 
> When a job runs in Swift, it looks like this:
> 
>     stage in the input files - copy the input files from where they are
>        stored to where they are needed for execution
>     execute (using falkon in this case)
>     stage out the output files
> 
> The execute stage is what falkon is involved in; there are other 
> mechanisms to move files around into the appropraite
> 
> So really I want to know what are the input files to your jobs.
> 
>> on GPFS of Blue Gene
> 
> If everything is on the same filesystem on bluegene, an interesting idea 
> springs to mind of potentially only symlinking the input files rather than 
> copying them around.
> 
> I can have a play with that next week perhaps.
> 


From hategan at mcs.anl.gov  Sat Apr 12 13:08:43 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 12 Apr 2008 13:08:43 -0500
Subject: [Swift-devel] fast-failing jobs
In-Reply-To: <Pine.LNX.4.64.0804111335210.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804111335210.31934@dildano.hawaga.org.uk>
Message-ID: <1208023723.2963.13.camel@blabla.mcs.anl.gov>


On Fri, 2008-04-11 at 13:43 +0000, Ben Clifford wrote:
> bug 101 discusses a class of site-selection failures that look like this:
> 
> two (or more) sites:
>   site G works
>   site F fails all jobs submitted to it, very rapidly.
> 
> Submit 10 non-trivial jobs for scheduling. At present, the minimum number 
> of simultaneous jobs that will be sent to a site is 2. Two jobs go to site 
> G, and occupy it (for eg 20 minutes); two jobs go to site F and fail 
> (within eg. 10 seconds). two more jobs go to site F and fail (within eg 10 
> seconds). All jobs apart from the two jobs that went to site G are 
> repeatedly submitted to site F and fail, exhausting all their retries and 
> causing a workflow failure.
> 
> One approach to stopping this is to slow down submission to poorly scoring 
> sites. However, in this case, the delay between submissions would need to 
> be on the scale of minutes .. tens of minutes to avoid this.
> 
> However, the delay needs to be on roughly the same scale as the length of 
> a job, which varies widely depending on usage (some people are putting 
> through half hour jobs, some people put through jobs that are a few 
> seconds long).

That's pretty much what a low score does if there's throttling based on
score. Perhaps our solution is to have a low job throttle and a higher
score range (i.e. T=1000 instead of 100).

That or we could enforce a submission rate (j/s) based on score.

>  That seems difficult to determine at startup.

That, again, is in the nature of things. A good score approximation is
difficult to determine at startup.


From hategan at mcs.anl.gov  Sat Apr 12 13:11:38 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 12 Apr 2008 13:11:38 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <4800BCEF.5040206@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov>
Message-ID: <1208023898.2963.17.camel@blabla.mcs.anl.gov>


On Sat, 2008-04-12 at 08:45 -0500, Michael Wilde wrote:
> Ben, can you confirm this: to turn off all job submission throttling 
> (for Falkon), the correct setting for each of the following props is "off"?
> 
> # remove all limits on job submit rates
> throttle.submit=off
> throttle.host.submit=off
> throttle.score.job.factor=off
> 
> Long ago (circa Nov) it seemed "off" didn't give the the wide-open 
> throttle effect I was looking for, but "off" is a more clear setting 
> than "big numbers" if we know it to be working as expected.

If "off" doesn't work as expected we should figure out why, not invent
another "off". So please use "off" and report any problems.

> 
> - Mike
> 
> 
> 
> On 4/12/08 5:22 AM, Ben Clifford wrote:
> > On Sat, 12 Apr 2008, Zhao Zhang wrote:
> > 
> >>   i) what files are shared (so swift will only stage-in once)
> >>        how many files per job and how big in total per job?
> >>   
> >>
> >> I am not familiar with definition of "stage-in". Mike, could you help to
> >> explain this?
> > 
> > When a job runs in Swift, it looks like this:
> > 
> >     stage in the input files - copy the input files from where they are
> >        stored to where they are needed for execution
> >     execute (using falkon in this case)
> >     stage out the output files
> > 
> > The execute stage is what falkon is involved in; there are other 
> > mechanisms to move files around into the appropraite
> > 
> > So really I want to know what are the input files to your jobs.
> > 
> >> on GPFS of Blue Gene
> > 
> > If everything is on the same filesystem on bluegene, an interesting idea 
> > springs to mind of potentially only symlinking the input files rather than 
> > copying them around.
> > 
> > I can have a play with that next week perhaps.
> > 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From zhaozhang at uchicago.edu  Sat Apr 12 16:46:18 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Sat, 12 Apr 2008 16:46:18 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <4800BCEF.5040206@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov>
Message-ID: <48012DAA.2000308@uchicago.edu>

Hi, Ioan

Check the log file at BGP, 
/home/falkon/DOCK_swift+falkon/DOCK_swift+falkon_4x512_6084_2008.04.12_15.38.31

I ran 6084 DOCK tasks, and it indeed runs on 2048 cores.

zhao

Michael Wilde wrote:
> Ben, can you confirm this: to turn off all job submission throttling 
> (for Falkon), the correct setting for each of the following props is 
> "off"?
>
> # remove all limits on job submit rates
> throttle.submit=off
> throttle.host.submit=off
> throttle.score.job.factor=off
>
> Long ago (circa Nov) it seemed "off" didn't give the the wide-open 
> throttle effect I was looking for, but "off" is a more clear setting 
> than "big numbers" if we know it to be working as expected.
>
> - Mike
>
>
>
> On 4/12/08 5:22 AM, Ben Clifford wrote:
>> On Sat, 12 Apr 2008, Zhao Zhang wrote:
>>
>>>   i) what files are shared (so swift will only stage-in once)
>>>        how many files per job and how big in total per job?
>>>  
>>> I am not familiar with definition of "stage-in". Mike, could you 
>>> help to
>>> explain this?
>>
>> When a job runs in Swift, it looks like this:
>>
>>     stage in the input files - copy the input files from where they are
>>        stored to where they are needed for execution
>>     execute (using falkon in this case)
>>     stage out the output files
>>
>> The execute stage is what falkon is involved in; there are other 
>> mechanisms to move files around into the appropraite
>>
>> So really I want to know what are the input files to your jobs.
>>
>>> on GPFS of Blue Gene
>>
>> If everything is on the same filesystem on bluegene, an interesting 
>> idea springs to mind of potentially only symlinking the input files 
>> rather than copying them around.
>>
>> I can have a play with that next week perhaps.
>>
>


From zhaozhang at uchicago.edu  Sat Apr 12 16:50:33 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Sat, 12 Apr 2008 16:50:33 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <4800BCEF.5040206@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov>
Message-ID: <48012EA9.1060102@uchicago.edu>

Hi, Ben

I got a log file of 6084 successful runs on BGP. Check it here, 
terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log

zhao

Michael Wilde wrote:
> Ben, can you confirm this: to turn off all job submission throttling 
> (for Falkon), the correct setting for each of the following props is 
> "off"?
>
> # remove all limits on job submit rates
> throttle.submit=off
> throttle.host.submit=off
> throttle.score.job.factor=off
>
> Long ago (circa Nov) it seemed "off" didn't give the the wide-open 
> throttle effect I was looking for, but "off" is a more clear setting 
> than "big numbers" if we know it to be working as expected.
>
> - Mike
>
>
>
> On 4/12/08 5:22 AM, Ben Clifford wrote:
>> On Sat, 12 Apr 2008, Zhao Zhang wrote:
>>
>>>   i) what files are shared (so swift will only stage-in once)
>>>        how many files per job and how big in total per job?
>>>  
>>> I am not familiar with definition of "stage-in". Mike, could you 
>>> help to
>>> explain this?
>>
>> When a job runs in Swift, it looks like this:
>>
>>     stage in the input files - copy the input files from where they are
>>        stored to where they are needed for execution
>>     execute (using falkon in this case)
>>     stage out the output files
>>
>> The execute stage is what falkon is involved in; there are other 
>> mechanisms to move files around into the appropraite
>>
>> So really I want to know what are the input files to your jobs.
>>
>>> on GPFS of Blue Gene
>>
>> If everything is on the same filesystem on bluegene, an interesting 
>> idea springs to mind of potentially only symlinking the input files 
>> rather than copying them around.
>>
>> I can have a play with that next week perhaps.
>>
>


From benc at hawaga.org.uk  Sun Apr 13 09:52:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 13 Apr 2008 14:52:15 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <4800BCEF.5040206@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804131451280.31934@dildano.hawaga.org.uk>


On Sat, 12 Apr 2008, Michael Wilde wrote:

> Long ago (circa Nov) it seemed "off" didn't give the the wide-open throttle
> effect I was looking for, but "off" is a more clear setting than "big numbers"
> if we know it to be working as expected.

If that was the angle stuff that we were doing for SC, a lot of the time 
(or all of the time?) that was constrained by stagein speed, not by job 
submission speed.

-- 


From benc at hawaga.org.uk  Sun Apr 13 11:19:04 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 13 Apr 2008 16:19:04 +0000 (GMT)
Subject: [Swift-devel] fast-failing jobs
In-Reply-To: <1208023723.2963.13.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0804111335210.31934@dildano.hawaga.org.uk>
	<1208023723.2963.13.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804131615370.31934@dildano.hawaga.org.uk>


On Sat, 12 Apr 2008, Mihael Hategan wrote:

> That's pretty much what a low score does if there's throttling based on
> score. Perhaps our solution is to have a low job throttle and a higher
> score range (i.e. T=1000 instead of 100).

The present scoring system won't ever go below 2 jobs per site, so pretty 
much whatever the parameters are tweaked to, a fast-fail site will eat 2 
jobs per fast-fail cycle.

> That or we could enforce a submission rate (j/s) based on score.

That is perhaps better.

It would make lower scores more punitive than at the moment, which may be 
a problem given the way that in certain other failure modes the score gets 
reduced catastrophically. (eg a transient problem where all jobs fail that 
are in progress, with a large number of jobs in progress - this was why I 
put that lower bound in on the scores - to prevent the score actually 
getting really low)

-- 


From hategan at mcs.anl.gov  Sun Apr 13 11:22:40 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 13 Apr 2008 11:22:40 -0500
Subject: [Swift-devel] fast-failing jobs
In-Reply-To: <Pine.LNX.4.64.0804131615370.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804111335210.31934@dildano.hawaga.org.uk>
	<1208023723.2963.13.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0804131615370.31934@dildano.hawaga.org.uk>
Message-ID: <1208103760.15803.1.camel@blabla.mcs.anl.gov>


On Sun, 2008-04-13 at 16:19 +0000, Ben Clifford wrote:
> On Sat, 12 Apr 2008, Mihael Hategan wrote:
> 
> > That's pretty much what a low score does if there's throttling based on
> > score. Perhaps our solution is to have a low job throttle and a higher
> > score range (i.e. T=1000 instead of 100).
> 
> The present scoring system won't ever go below 2 jobs per site, so pretty 
> much whatever the parameters are tweaked to, a fast-fail site will eat 2 
> jobs per fast-fail cycle.

That being one thing that probably should be changed.

> 
> > That or we could enforce a submission rate (j/s) based on score.
> 
> That is perhaps better.
> 
> It would make lower scores more punitive than at the moment, which may be 
> a problem given the way that in certain other failure modes the score gets 
> reduced catastrophically. (eg a transient problem where all jobs fail that 
> are in progress, with a large number of jobs in progress - this was why I 
> put that lower bound in on the scores - to prevent the score actually 
> getting really low)
> 


From benc at hawaga.org.uk  Sun Apr 13 11:37:40 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 13 Apr 2008 16:37:40 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48012EA9.1060102@uchicago.edu>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>

On Sat, 12 Apr 2008, Zhao Zhang wrote:

> Hi, Ben
> 
> I got a log file of 6084 successful runs on BGP. Check it here,
> terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log

This one runs better - it gets up to a peak of 5000 jobs submitted into 
Falkon simultaneously, and spends a considerable amount of time over the 
2048 level that I suppose is what you need to be over to get all 2048 CPUs 
used.

There's a lot of stage-in activity that probably could be eliminated / 
changed for the single-filesytem case.

-- 


From wilde at mcs.anl.gov  Sun Apr 13 14:17:56 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 13 Apr 2008 14:17:56 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
Message-ID: <48025C64.9020502@mcs.anl.gov>

Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log)

Here's a high-level summary of this run:

Swift end   16:42:17
Swift start 16:09:07
Runtime        33:10 = 1990 seconds

2048 cores

Total app wall time = 1190260 seconds

1190260 / ( 1990 * 2048 ) = .29 efficiency

Once stage-ins start to complete, are the corresponding jobs initiated 
quickly, or is Swift doing mostly stage-ins for some period?

Zhao indicated he saw data indicating there was about a 700 second lag 
from workflow start time till the first Falkon jobs started, if I 
understood correctly. Do the graphs confirm this or say something different?

If the 700-second delay figure is true, and stage-in was eliminated by 
copying input files right to the /tmp workdir rather than first to 
/shared, then we'd have:

1190260 / ( 1290 * 2048 ) = .45 efficiency

A good gain, but only partway to a number that looks good.

I assume we're paying the same staging price on the output side?

What I think we learned from the MARS app run, which had no input data 
and only tiny output data files (10 bytes vs 10K bytes), was that the 
optimized wrapper achieved somewhere between .7 to .8 efficiency.

I'd like to look at whatever data we can get from this or similar 
subsequent runs to learn what steps we could take next to increase the 
efficiency metric.  Guidance welcome.

Thanks,

Mike


On 4/13/08 11:37 AM, Ben Clifford wrote:
> On Sat, 12 Apr 2008, Zhao Zhang wrote:
> 
>> Hi, Ben
>>
>> I got a log file of 6084 successful runs on BGP. Check it here,
>> terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log
> 
> This one runs better - it gets up to a peak of 5000 jobs submitted into 
> Falkon simultaneously, and spends a considerable amount of time over the 
> 2048 level that I suppose is what you need to be over to get all 2048 CPUs 
> used.
> 
> There's a lot of stage-in activity that probably could be eliminated / 
> changed for the single-filesytem case.
> 


- Mike


From benc at hawaga.org.uk  Sun Apr 13 14:43:45 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 13 Apr 2008 19:43:45 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48012EA9.1060102@uchicago.edu>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804131940210.31934@dildano.hawaga.org.uk>


> I got a log file of 6084 successful runs on BGP. Check it here,
> terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log

ok.

It is useful if you also supply the -info files if there is doubt about 
how time is being spent on the worker nodes (which I think there is some 
of).

-- 


From benc at hawaga.org.uk  Sun Apr 13 14:57:06 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 13 Apr 2008 19:57:06 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48025C64.9020502@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEF367.1060902@uchicago.edu>
	<47FEF3D5.9060707@cs.uchicago.edu> <47FEF4FE.6030406@uchicago.edu>
	<47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu>
	<47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu>
	<47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu>
	<47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov>
	<47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>


> Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log)

http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g

> Once stage-ins start to complete, are the corresponding jobs initiated 
> quickly, or is Swift doing mostly stage-ins for some period?

In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to falkon) 
pretty much right as the corresponding stagein completes. I have no deeper 
information about when the worker actually starts to run.

> Zhao indicated he saw data indicating there was about a 700 second lag from
> workflow start time till the first Falkon jobs started, if I understood
> correctly. Do the graphs confirm this or say something different?

There is a period of about 500s or so until stuff starts to happen; I 
haven't looked at it. That is before stage-ins start too, though, which 
means that i think this...

> If the 700-second delay figure is true, and stage-in was eliminated by copying
> input files right to the /tmp workdir rather than first to /shared, then we'd
> have:
> 
> 1190260 / ( 1290 * 2048 ) = .45 efficiency

calculation is not meaningful.

I have not looked at what is going on during that 500s startup time, but I 
plan to.

> I assume we're paying the same staging price on the output side?

not really - the output stageouts go very fast, and also because job 
ending is staggered, they don't happen all at once.

This is the same with most of the large runs I've seen (of any 
application) - stageout tends not to be a problem (or at least, no where 
near the problems of stagein).

All stageins happen over a period t=400 to t=1100 fairly smoothly. There's 
rate limiting still on file operations (100 max) and file transfers (2000 
max) which is being hit still.

I think there's two directions to proceed in here that make sense for 
actual use on single clusters running falkon (rather than trying to cut 
out stuff randomly to push up numbers):

 i) use some of the data placement features in falkon, rather than Swift's
    relatively simple data management that was designed more for running
    on the grid.

 ii) do stage-ins using symlinks rather than file copying. this makes
     sense when everything is living in a single filesystem, which again
     is not what Swift's data management was originally optimised for.

I think option ii) is substantially easier to implement (on the order of 
days) and is generally useful in the single-cluster, local-source-data 
situation that appears to be what people want to do for running on the 
BG/P and scicortex (that is, pretty much ignoring anything grid-like at 
all).

Option i) is much harder (on the order of months), needing a very 
different interface between Swift and Falkon than exists at the moment.


-- 


From zhaozhang at uchicago.edu  Sun Apr 13 15:03:19 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Sun, 13 Apr 2008 15:03:19 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
Message-ID: <48026707.4070709@uchicago.edu>

An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080413/36f25a8b/attachment.html>

From iraicu at cs.uchicago.edu  Sun Apr 13 15:23:39 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 13 Apr 2008 15:23:39 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48025C64.9020502@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEEB6E.3060506@cs.uchicago.edu>
	<47FEF367.1060902@uchicago.edu> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
Message-ID: <48026BCB.8060108@cs.uchicago.edu>

Sorry for being late to the party, putting out other fires :)

Here is what Falkon logs say for this run:
2544.996 0 0 35 2048 2048 0 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0 
0 100 1536 1331 1536
2545.996 1 1 35 2048 2008 0 40 0 0 40 0 0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 
1 0 99 1536 1322 1536
...
3814.999 1 1 35 2048 2047 0 1 0 0 1 0 6083 0.0 6083 0 0 0 0 0 0 0 0 0.0 
0.0 0 0 100 1536 1291 1536
3815.999 0 1 35 2048 2048 0 0 0 0 0 0 6084 1.0 6084 0 0 0 0 0 0 0 0 0.0 
0.0 1 1 98 1536 1291 1536

At 2545.996, it was the first time that Swift sent anything...
and at 3815, it was the time that the last job exit code was reported to 
Swift. 

So, runtime of 1270 seconds.  BTW, time 0 in the log maps back to
//0-time is 1208032721191ms

Also, the total CPU time from Falkon's point of view (accurate to the 
ms), is 1914115.25 CPU seconds, not 1190260.  So, by my numbers, I get:
1914115.25 / (1270 * 2048 ) = 0.735926446

This is already looking OK, isn't it?  Now, this doesn't actually look 
at the efficiency of the app, as it scaled up, which we would have to do 
by either repeating the same workload on 1 node, or taking a small 
sample of the workload and running on 1 node to compare against.

Ioan


Michael Wilde wrote:
> Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log)
>
> Here's a high-level summary of this run:
>
> Swift end   16:42:17
> Swift start 16:09:07
> Runtime        33:10 = 1990 seconds
>
> 2048 cores
>
> Total app wall time = 1190260 seconds
>
> 1190260 / ( 1990 * 2048 ) = .29 efficiency
>
> Once stage-ins start to complete, are the corresponding jobs initiated 
> quickly, or is Swift doing mostly stage-ins for some period?
>
> Zhao indicated he saw data indicating there was about a 700 second lag 
> from workflow start time till the first Falkon jobs started, if I 
> understood correctly. Do the graphs confirm this or say something 
> different?
>
> If the 700-second delay figure is true, and stage-in was eliminated by 
> copying input files right to the /tmp workdir rather than first to 
> /shared, then we'd have:
>
> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>
> A good gain, but only partway to a number that looks good.
>
> I assume we're paying the same staging price on the output side?
>
> What I think we learned from the MARS app run, which had no input data 
> and only tiny output data files (10 bytes vs 10K bytes), was that the 
> optimized wrapper achieved somewhere between .7 to .8 efficiency.
>
> I'd like to look at whatever data we can get from this or similar 
> subsequent runs to learn what steps we could take next to increase the 
> efficiency metric.  Guidance welcome.
>
> Thanks,
>
> Mike
>
>
> On 4/13/08 11:37 AM, Ben Clifford wrote:
>> On Sat, 12 Apr 2008, Zhao Zhang wrote:
>>
>>> Hi, Ben
>>>
>>> I got a log file of 6084 successful runs on BGP. Check it here,
>>> terninable:/home/zzhang/swift_file/dock2-20080412-1609-99cy0z4g.log
>>
>> This one runs better - it gets up to a peak of 5000 jobs submitted 
>> into Falkon simultaneously, and spends a considerable amount of time 
>> over the 2048 level that I suppose is what you need to be over to get 
>> all 2048 CPUs used.
>>
>> There's a lot of stage-in activity that probably could be eliminated 
>> / changed for the single-filesytem case.
>>
>
>
> - Mike
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From benc at hawaga.org.uk  Sun Apr 13 15:30:16 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 13 Apr 2008 20:30:16 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48026BCB.8060108@cs.uchicago.edu>
References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov> <48026BCB.8060108@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804132028590.31934@dildano.hawaga.org.uk>


> At 2545.996, it was the first time that Swift sent anything...
[...]
> So, runtime of 1270 seconds. 

There is a period of about 500s before Swift sends anything to falkon, 
though.

-- 


From wilde at mcs.anl.gov  Sun Apr 13 15:35:23 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 13 Apr 2008 15:35:23 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
Message-ID: <48026E8B.501@mcs.anl.gov>

Ben, your analysis sounds very good. Some notes below, including 
questions for Zhao.

On 4/13/08 2:57 PM, Ben Clifford wrote:
> 
>> Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log)
> 
> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
> 
>> Once stage-ins start to complete, are the corresponding jobs initiated 
>> quickly, or is Swift doing mostly stage-ins for some period?
> 
> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to falkon) 
> pretty much right as the corresponding stagein completes. I have no deeper 
> information about when the worker actually starts to run.
> 
>> Zhao indicated he saw data indicating there was about a 700 second lag from
>> workflow start time till the first Falkon jobs started, if I understood
>> correctly. Do the graphs confirm this or say something different?
> 
> There is a period of about 500s or so until stuff starts to happen; I 
> haven't looked at it. That is before stage-ins start too, though, which 
> means that i think this...
> 
>> If the 700-second delay figure is true, and stage-in was eliminated by copying
>> input files right to the /tmp workdir rather than first to /shared, then we'd
>> have:
>>
>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
> 
> calculation is not meaningful.
> 
> I have not looked at what is going on during that 500s startup time, but I 
> plan to.

Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper logging 
problem a few weeks ago. Could that cause such a delay, Ben? It would be 
very obvious in the swift log.

> 
>> I assume we're paying the same staging price on the output side?
> 
> not really - the output stageouts go very fast, and also because job 
> ending is staggered, they don't happen all at once.
> 
> This is the same with most of the large runs I've seen (of any 
> application) - stageout tends not to be a problem (or at least, no where 
> near the problems of stagein).
> 
> All stageins happen over a period t=400 to t=1100 fairly smoothly. There's 
> rate limiting still on file operations (100 max) and file transfers (2000 
> max) which is being hit still.

I thought Zhao set file operations throttle to 2000 as well.  Sounds 
like we can test with the latter higher, and find out what's limiting 
the former.

Zhao, what are your settings for property throttle.file.operations?
I assume you have throttle.transfers set to 2000.

If its set right, any chance that Swift or Karajan is limiting it somewhere?
> 
> I think there's two directions to proceed in here that make sense for 
> actual use on single clusters running falkon (rather than trying to cut 
> out stuff randomly to push up numbers):
> 
>  i) use some of the data placement features in falkon, rather than Swift's
>     relatively simple data management that was designed more for running
>     on the grid.

Long term: we should consider how the Coaster implementation could 
eventually do a similar data placement approach. In the meantime (mid 
term) examining what interface changes are needed for Falkon data 
placement might help prepare for that. Need to discuss if that would be 
a good step or not.

> 
>  ii) do stage-ins using symlinks rather than file copying. this makes
>      sense when everything is living in a single filesystem, which again
>      is not what Swift's data management was originally optimised for.

I assume you mean symlinks from shared/ back to the user's input files?

That sounds worth testing: find out if symlink creation is fast on NFS 
and GPFS.

Is another approach to copy direct from the user's files to the /tmp 
workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
symlinks alone get adequate performance. Symlinks do seem an easier 
first step.

> I think option ii) is substantially easier to implement (on the order of 
> days) and is generally useful in the single-cluster, local-source-data 
> situation that appears to be what people want to do for running on the 
> BG/P and scicortex (that is, pretty much ignoring anything grid-like at 
> all).

Grid-like might mean pulling data to the /tmp workdir directly by the 
wrapper - but that seems like a harder step, and would need measurement 
and prototyping of such code before attempting. Data transfer clients 
that the wrapper script can count on might be an obstacle.

> 
> Option i) is much harder (on the order of months), needing a very 
> different interface between Swift and Falkon than exists at the moment.
> 
> 
> 


From zhaozhang at uchicago.edu  Sun Apr 13 15:39:07 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Sun, 13 Apr 2008 15:39:07 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48026E8B.501@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov>
Message-ID: <48026F6B.9060300@uchicago.edu>

An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080413/f6002ad8/attachment.html>

From iraicu at cs.uchicago.edu  Sun Apr 13 15:39:31 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 13 Apr 2008 15:39:31 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEF3D5.9060707@cs.uchicago.edu>
	<47FEF4FE.6030406@uchicago.edu> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
Message-ID: <48026F83.4060507@cs.uchicago.edu>


Ben Clifford wrote:
>   
> I think there's two directions to proceed in here that make sense for 
> actual use on single clusters running falkon (rather than trying to cut 
> out stuff randomly to push up numbers):
>
>  i) use some of the data placement features in falkon, rather than Swift's
>     relatively simple data management that was designed more for running
>     on the grid.
>
>  ii) do stage-ins using symlinks rather than file copying. this makes
>      sense when everything is living in a single filesystem, which again
>      is not what Swift's data management was originally optimised for.
>
> I think option ii) is substantially easier to implement (on the order of 
> days) and is generally useful in the single-cluster, local-source-data 
> situation that appears to be what people want to do for running on the 
> BG/P and scicortex (that is, pretty much ignoring anything grid-like at 
> all).
>   
I think this is worth a try, although it will probably be post SC 
(tomorrow night at midnight EST).
> Option i) is much harder (on the order of months), needing a very 
> different interface between Swift and Falkon than exists at the moment.
>   
I agree.  I have another deadline on May 8th, but I think we can start 
the discussions in May, and hope to have some Swift apps running over 
the data diffusion mechanism in Falkon over the summer months.

Ioan
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From iraicu at cs.uchicago.edu  Sun Apr 13 15:43:46 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 13 Apr 2008 15:43:46 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804132028590.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu>
	<47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov> <48026BCB.8060108@cs.uchicago.edu>
	<Pine.LNX.4.64.0804132028590.31934@dildano.hawaga.org.uk>
Message-ID: <48027082.1050609@cs.uchicago.edu>

There are 6K jobs... how long would it take Swift to unroll a for loop 
with 6K iterations, and prep things to start sending these jobs to the 
falkon provider?  For example, doing a for loop with 6K sleeps in a for 
loop, how long does that take to start?  In this case, its more like 6K 
jobs, but 12K input files, so perhaps the 12K input files and checking 
dependencies between them (although there are none to be found) is what 
takes some time.  I vaguely remember doing 32K or 64K sleep jobs from 
Swift, and having it take at least minutes, maybe more to start up and 
have activity showing up in Falkon...

Ioan

Ben Clifford wrote:
>> At 2545.996, it was the first time that Swift sent anything...
>>     
> [...]
>   
>> So, runtime of 1270 seconds. 
>>     
>
> There is a period of about 500s before Swift sends anything to falkon, 
> though.
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From iraicu at cs.uchicago.edu  Sun Apr 13 15:51:21 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 13 Apr 2008 15:51:21 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48026F6B.9060300@uchicago.edu>
References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu>
	<47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.ed u>
Message-ID: <48027249.2070208@cs.uchicago.edu>

But we have 2X input files as opposed to number of jobs and CPUs.  We 
have 2048 CPUs, shouldn't we set all file I/O operations to at least 
4096... and that means that files won't be ready for the next jobs once 
the first ones start completing... so we should really set things to 
twice that, so 8192 is the number I'd set on all file operations for 
this app on 2K CPUs. 

Ioan

Zhao Zhang wrote:
> Hi, Mike
>
> Michael Wilde wrote:
>> Ben, your analysis sounds very good. Some notes below, including 
>> questions for Zhao.
>>
>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>
>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>> *99cy0z4g.log)
>>>
>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>
>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>> initiated quickly, or is Swift doing mostly stage-ins for some period?
>>>
>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>> falkon) pretty much right as the corresponding stagein completes. I 
>>> have no deeper information about when the worker actually starts to 
>>> run.
>>>
>>>> Zhao indicated he saw data indicating there was about a 700 second 
>>>> lag from
>>>> workflow start time till the first Falkon jobs started, if I 
>>>> understood
>>>> correctly. Do the graphs confirm this or say something different?
>>>
>>> There is a period of about 500s or so until stuff starts to happen; 
>>> I haven't looked at it. That is before stage-ins start too, though, 
>>> which means that i think this...
>>>
>>>> If the 700-second delay figure is true, and stage-in was eliminated 
>>>> by copying
>>>> input files right to the /tmp workdir rather than first to /shared, 
>>>> then we'd
>>>> have:
>>>>
>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>
>>> calculation is not meaningful.
>>>
>>> I have not looked at what is going on during that 500s startup time, 
>>> but I plan to.
>>
>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper logging 
>> problem a few weeks ago. Could that cause such a delay, Ben? It would 
>> be very obvious in the swift log.
> The version is Swift svn swift-r1780 cog-r1956
>>
>>>
>>>> I assume we're paying the same staging price on the output side?
>>>
>>> not really - the output stageouts go very fast, and also because job 
>>> ending is staggered, they don't happen all at once.
>>>
>>> This is the same with most of the large runs I've seen (of any 
>>> application) - stageout tends not to be a problem (or at least, no 
>>> where near the problems of stagein).
>>>
>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>> There's rate limiting still on file operations (100 max) and file 
>>> transfers (2000 max) which is being hit still.
>>
>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
>> like we can test with the latter higher, and find out what's limiting 
>> the former.
>>
>> Zhao, what are your settings for property throttle.file.operations?
>> I assume you have throttle.transfers set to 2000.
>>
>> If its set right, any chance that Swift or Karajan is limiting it 
>> somewhere?
> 2000 for sure,
> throttle.submit=off
> throttle.host.submit=off
> throttle.score.job.factor=off
> throttle.transfers=2000
> throttle.file.operation=2000
>>>
>>> I think there's two directions to proceed in here that make sense 
>>> for actual use on single clusters running falkon (rather than trying 
>>> to cut out stuff randomly to push up numbers):
>>>
>>>  i) use some of the data placement features in falkon, rather than 
>>> Swift's
>>>     relatively simple data management that was designed more for 
>>> running
>>>     on the grid.
>>
>> Long term: we should consider how the Coaster implementation could 
>> eventually do a similar data placement approach. In the meantime (mid 
>> term) examining what interface changes are needed for Falkon data 
>> placement might help prepare for that. Need to discuss if that would 
>> be a good step or not.
>>
>>>
>>>  ii) do stage-ins using symlinks rather than file copying. this makes
>>>      sense when everything is living in a single filesystem, which 
>>> again
>>>      is not what Swift's data management was originally optimised for.
>>
>> I assume you mean symlinks from shared/ back to the user's input files?
>>
>> That sounds worth testing: find out if symlink creation is fast on 
>> NFS and GPFS.
>>
>> Is another approach to copy direct from the user's files to the /tmp 
>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
>> symlinks alone get adequate performance. Symlinks do seem an easier 
>> first step.
>>
>>> I think option ii) is substantially easier to implement (on the 
>>> order of days) and is generally useful in the single-cluster, 
>>> local-source-data situation that appears to be what people want to 
>>> do for running on the BG/P and scicortex (that is, pretty much 
>>> ignoring anything grid-like at all).
>>
>> Grid-like might mean pulling data to the /tmp workdir directly by the 
>> wrapper - but that seems like a harder step, and would need 
>> measurement and prototyping of such code before attempting. Data 
>> transfer clients that the wrapper script can count on might be an 
>> obstacle.
>>
>>>
>>> Option i) is much harder (on the order of months), needing a very 
>>> different interface between Swift and Falkon than exists at the moment.
>>>
>>>
>>>
>>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From benc at hawaga.org.uk  Sun Apr 13 15:58:12 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 13 Apr 2008 20:58:12 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu>
	<47FEF5CA.60700@cs.uchicago.edu> <47FF0B1E.9030405@uchicago.edu>
	<47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu>
	<47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu>
	<47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov>
	<47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804132056281.31934@dildano.hawaga.org.uk>


On Sun, 13 Apr 2008, Ben Clifford wrote:

> There is a period of about 500s or so until stuff starts to happen; I 
> haven't looked at it.

What happens in the log file for this period is lots of DSHandle creation 
(vdl:new) - it creates 115596 datasets in 451 seconds, which is about 256 
per second. That seems quite a low rate.

-- 


From wilde at mcs.anl.gov  Sun Apr 13 16:52:56 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 13 Apr 2008 16:52:56 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48026F6B.9060300@uchicago.edu>
References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu>
	<47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du>
Message-ID: <480280B8.9040605@mcs.anl.gov>

 >> If its set right, any chance that Swift or Karajan is limiting it
 >> somewhere?
 > 2000 for sure,
 > throttle.submit=off
 > throttle.host.submit=off
 > throttle.score.job.factor=off
 > throttle.transfers=2000
 > throttle.file.operation=2000


Looks like a typo in your properties, Zhao - if the text above came from 
your swift.properties directly:

   throttle.file.operation=2000

vs operations with an s as per the properties doc:

throttle.file.operations=8
#throttle.file.operations=off

Which doesnt explain why we're seeing 100 when the default is 8 ???

- Mike


On 4/13/08 3:39 PM, Zhao Zhang wrote:
> Hi, Mike
> 
> Michael Wilde wrote:
>> Ben, your analysis sounds very good. Some notes below, including 
>> questions for Zhao.
>>
>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>
>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>> *99cy0z4g.log)
>>>
>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>
>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>> initiated quickly, or is Swift doing mostly stage-ins for some period?
>>>
>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>> falkon) pretty much right as the corresponding stagein completes. I 
>>> have no deeper information about when the worker actually starts to run.
>>>
>>>> Zhao indicated he saw data indicating there was about a 700 second 
>>>> lag from
>>>> workflow start time till the first Falkon jobs started, if I understood
>>>> correctly. Do the graphs confirm this or say something different?
>>>
>>> There is a period of about 500s or so until stuff starts to happen; I 
>>> haven't looked at it. That is before stage-ins start too, though, 
>>> which means that i think this...
>>>
>>>> If the 700-second delay figure is true, and stage-in was eliminated 
>>>> by copying
>>>> input files right to the /tmp workdir rather than first to /shared, 
>>>> then we'd
>>>> have:
>>>>
>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>
>>> calculation is not meaningful.
>>>
>>> I have not looked at what is going on during that 500s startup time, 
>>> but I plan to.
>>
>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper logging 
>> problem a few weeks ago. Could that cause such a delay, Ben? It would 
>> be very obvious in the swift log.
> The version is Swift svn swift-r1780 cog-r1956
>>
>>>
>>>> I assume we're paying the same staging price on the output side?
>>>
>>> not really - the output stageouts go very fast, and also because job 
>>> ending is staggered, they don't happen all at once.
>>>
>>> This is the same with most of the large runs I've seen (of any 
>>> application) - stageout tends not to be a problem (or at least, no 
>>> where near the problems of stagein).
>>>
>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>> There's rate limiting still on file operations (100 max) and file 
>>> transfers (2000 max) which is being hit still.
>>
>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
>> like we can test with the latter higher, and find out what's limiting 
>> the former.
>>
>> Zhao, what are your settings for property throttle.file.operations?
>> I assume you have throttle.transfers set to 2000.
>>
>> If its set right, any chance that Swift or Karajan is limiting it 
>> somewhere?
> 2000 for sure,
> throttle.submit=off
> throttle.host.submit=off
> throttle.score.job.factor=off
> throttle.transfers=2000
> throttle.file.operation=2000
>>>
>>> I think there's two directions to proceed in here that make sense for 
>>> actual use on single clusters running falkon (rather than trying to 
>>> cut out stuff randomly to push up numbers):
>>>
>>>  i) use some of the data placement features in falkon, rather than 
>>> Swift's
>>>     relatively simple data management that was designed more for running
>>>     on the grid.
>>
>> Long term: we should consider how the Coaster implementation could 
>> eventually do a similar data placement approach. In the meantime (mid 
>> term) examining what interface changes are needed for Falkon data 
>> placement might help prepare for that. Need to discuss if that would 
>> be a good step or not.
>>
>>>
>>>  ii) do stage-ins using symlinks rather than file copying. this makes
>>>      sense when everything is living in a single filesystem, which again
>>>      is not what Swift's data management was originally optimised for.
>>
>> I assume you mean symlinks from shared/ back to the user's input files?
>>
>> That sounds worth testing: find out if symlink creation is fast on NFS 
>> and GPFS.
>>
>> Is another approach to copy direct from the user's files to the /tmp 
>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
>> symlinks alone get adequate performance. Symlinks do seem an easier 
>> first step.
>>
>>> I think option ii) is substantially easier to implement (on the order 
>>> of days) and is generally useful in the single-cluster, 
>>> local-source-data situation that appears to be what people want to do 
>>> for running on the BG/P and scicortex (that is, pretty much ignoring 
>>> anything grid-like at all).
>>
>> Grid-like might mean pulling data to the /tmp workdir directly by the 
>> wrapper - but that seems like a harder step, and would need 
>> measurement and prototyping of such code before attempting. Data 
>> transfer clients that the wrapper script can count on might be an 
>> obstacle.
>>
>>>
>>> Option i) is much harder (on the order of months), needing a very 
>>> different interface between Swift and Falkon than exists at the moment.
>>>
>>>
>>>
>>


From wilde at mcs.anl.gov  Sun Apr 13 17:31:19 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 13 Apr 2008 17:31:19 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804132056281.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEF4FE.6030406@uchicago.edu>
	<47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804132056281.31934@dildano.hawaga.org.uk>
Message-ID: <480289B7.4050207@mcs.anl.gov>

That might be a low rate, but its also not clear why its creating so 
many handles: I thought we only have about 6K jobs here, with 2 files in 
and 1 file out per job:

doall(params pset[])
{
   foreach p in pset {
     DockIn ifile  <single_file_mapper;
                    file=@strcat("mol-1M/",p.filename,".in")>;
     Mol2 mfile    <single_file_mapper;
                    file=@strcat("mol-1M/",p.filename,".mol2")>;
     DockOut ofile <single_file_mapper;
                    file=@strcat("mol-1M/",p.filename,".out")>;
     ofile = dock(ifile, mfile);
   }
}

I would have expected more like 18,000 datasets.

Are we calling the mapper incorrectly in this script?

On 4/13/08 3:58 PM, Ben Clifford wrote:
> On Sun, 13 Apr 2008, Ben Clifford wrote:
> 
>> There is a period of about 500s or so until stuff starts to happen; I 
>> haven't looked at it.
> 
> What happens in the log file for this period is lots of DSHandle creation 
> (vdl:new) - it creates 115596 datasets in 451 seconds, which is about 256 
> per second. That seems quite a low rate.
> 


From wilde at mcs.anl.gov  Sun Apr 13 17:50:01 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 13 Apr 2008 17:50:01 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48027249.2070208@cs.uchicago.edu>
References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.ed u>
	<48027249.2070208@cs.uchica! go.edu>
Message-ID: <48028E19.7020400@mcs.anl.gov>

Its not clear to me whats best here, for 3 reasons:

1) We should set file.transfers and file.operations to a value that 
prevents Swift from adversely impacting performance on shared resources.

Since Swift must run on the login node and hits the shared cluster 
networks, we should test carefully.

2) Its not clear to me how many concurrent operations the login hosts 
can sustain before topping out, and how this number depends on file 
size. Do you know this from the GPFS benchmarks? And did you measure 
impact on system response during those benchmarks?

I think that the overall system would top out well before 2000 
concurrent transfers, but I could be wrong. Going much higher than the 
point where concurrency increases the data rate, it seems, would cause 
the rate to drop due to contention and context switching.

3) If I/O operations are fast compared to the job length and completion 
rate, you dont have to set these values to the same as the max number of 
input files that can be demanded at once.

I think we want to set the I/O operation concurrency to a value that 
achieves the highest operation rate we can sustain while keeping overall 
system performance at some acceptable level (tbd).

So first we need to find the concurrency level that maximizes ops/sec, 
(which may be filesize dependent) and then possibly back that off to 
reduce system impact.

It seems to me that finding the right I/O concurrency setting is complex 
and non-obvious, and I'm interested in what Ben and Mihael suggest here.

- Mike

On 4/13/08 3:51 PM, Ioan Raicu wrote:
> But we have 2X input files as opposed to number of jobs and CPUs.  We 
> have 2048 CPUs, shouldn't we set all file I/O operations to at least 
> 4096... and that means that files won't be ready for the next jobs once 
> the first ones start completing... so we should really set things to 
> twice that, so 8192 is the number I'd set on all file operations for 
> this app on 2K CPUs.
> Ioan
> 
> Zhao Zhang wrote:
>> Hi, Mike
>>
>> Michael Wilde wrote:
>>> Ben, your analysis sounds very good. Some notes below, including 
>>> questions for Zhao.
>>>
>>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>>
>>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>>> *99cy0z4g.log)
>>>>
>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>>
>>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>>> initiated quickly, or is Swift doing mostly stage-ins for some period?
>>>>
>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>>> falkon) pretty much right as the corresponding stagein completes. I 
>>>> have no deeper information about when the worker actually starts to 
>>>> run.
>>>>
>>>>> Zhao indicated he saw data indicating there was about a 700 second 
>>>>> lag from
>>>>> workflow start time till the first Falkon jobs started, if I 
>>>>> understood
>>>>> correctly. Do the graphs confirm this or say something different?
>>>>
>>>> There is a period of about 500s or so until stuff starts to happen; 
>>>> I haven't looked at it. That is before stage-ins start too, though, 
>>>> which means that i think this...
>>>>
>>>>> If the 700-second delay figure is true, and stage-in was eliminated 
>>>>> by copying
>>>>> input files right to the /tmp workdir rather than first to /shared, 
>>>>> then we'd
>>>>> have:
>>>>>
>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>>
>>>> calculation is not meaningful.
>>>>
>>>> I have not looked at what is going on during that 500s startup time, 
>>>> but I plan to.
>>>
>>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper logging 
>>> problem a few weeks ago. Could that cause such a delay, Ben? It would 
>>> be very obvious in the swift log.
>> The version is Swift svn swift-r1780 cog-r1956
>>>
>>>>
>>>>> I assume we're paying the same staging price on the output side?
>>>>
>>>> not really - the output stageouts go very fast, and also because job 
>>>> ending is staggered, they don't happen all at once.
>>>>
>>>> This is the same with most of the large runs I've seen (of any 
>>>> application) - stageout tends not to be a problem (or at least, no 
>>>> where near the problems of stagein).
>>>>
>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>>> There's rate limiting still on file operations (100 max) and file 
>>>> transfers (2000 max) which is being hit still.
>>>
>>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
>>> like we can test with the latter higher, and find out what's limiting 
>>> the former.
>>>
>>> Zhao, what are your settings for property throttle.file.operations?
>>> I assume you have throttle.transfers set to 2000.
>>>
>>> If its set right, any chance that Swift or Karajan is limiting it 
>>> somewhere?
>> 2000 for sure,
>> throttle.submit=off
>> throttle.host.submit=off
>> throttle.score.job.factor=off
>> throttle.transfers=2000
>> throttle.file.operation=2000
>>>>
>>>> I think there's two directions to proceed in here that make sense 
>>>> for actual use on single clusters running falkon (rather than trying 
>>>> to cut out stuff randomly to push up numbers):
>>>>
>>>>  i) use some of the data placement features in falkon, rather than 
>>>> Swift's
>>>>     relatively simple data management that was designed more for 
>>>> running
>>>>     on the grid.
>>>
>>> Long term: we should consider how the Coaster implementation could 
>>> eventually do a similar data placement approach. In the meantime (mid 
>>> term) examining what interface changes are needed for Falkon data 
>>> placement might help prepare for that. Need to discuss if that would 
>>> be a good step or not.
>>>
>>>>
>>>>  ii) do stage-ins using symlinks rather than file copying. this makes
>>>>      sense when everything is living in a single filesystem, which 
>>>> again
>>>>      is not what Swift's data management was originally optimised for.
>>>
>>> I assume you mean symlinks from shared/ back to the user's input files?
>>>
>>> That sounds worth testing: find out if symlink creation is fast on 
>>> NFS and GPFS.
>>>
>>> Is another approach to copy direct from the user's files to the /tmp 
>>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
>>> symlinks alone get adequate performance. Symlinks do seem an easier 
>>> first step.
>>>
>>>> I think option ii) is substantially easier to implement (on the 
>>>> order of days) and is generally useful in the single-cluster, 
>>>> local-source-data situation that appears to be what people want to 
>>>> do for running on the BG/P and scicortex (that is, pretty much 
>>>> ignoring anything grid-like at all).
>>>
>>> Grid-like might mean pulling data to the /tmp workdir directly by the 
>>> wrapper - but that seems like a harder step, and would need 
>>> measurement and prototyping of such code before attempting. Data 
>>> transfer clients that the wrapper script can count on might be an 
>>> obstacle.
>>>
>>>>
>>>> Option i) is much harder (on the order of months), needing a very 
>>>> different interface between Swift and Falkon than exists at the moment.
>>>>
>>>>
>>>>
>>>
> 


From zhaozhang at uchicago.edu  Sun Apr 13 17:58:55 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Sun, 13 Apr 2008 17:58:55 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <480280B8.9040605@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du>
	<480280B8.9040605@mcs.anl.! gov>
Message-ID: <4802902F.7050704@uchicago.edu>

Hi, Mike

It is just a typo in the email. I my property file, it is 
"throttle.file.operations=2000". Thanks.

zhao

Michael Wilde wrote:
> >> If its set right, any chance that Swift or Karajan is limiting it
> >> somewhere?
> > 2000 for sure,
> > throttle.submit=off
> > throttle.host.submit=off
> > throttle.score.job.factor=off
> > throttle.transfers=2000
> > throttle.file.operation=2000
>
>
> Looks like a typo in your properties, Zhao - if the text above came 
> from your swift.properties directly:
>
>   throttle.file.operation=2000
>
> vs operations with an s as per the properties doc:
>
> throttle.file.operations=8
> #throttle.file.operations=off
>
> Which doesnt explain why we're seeing 100 when the default is 8 ???
>
> - Mike
>
>
> On 4/13/08 3:39 PM, Zhao Zhang wrote:
>> Hi, Mike
>>
>> Michael Wilde wrote:
>>> Ben, your analysis sounds very good. Some notes below, including 
>>> questions for Zhao.
>>>
>>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>>
>>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>>> *99cy0z4g.log)
>>>>
>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>>
>>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>>> initiated quickly, or is Swift doing mostly stage-ins for some 
>>>>> period?
>>>>
>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>>> falkon) pretty much right as the corresponding stagein completes. I 
>>>> have no deeper information about when the worker actually starts to 
>>>> run.
>>>>
>>>>> Zhao indicated he saw data indicating there was about a 700 second 
>>>>> lag from
>>>>> workflow start time till the first Falkon jobs started, if I 
>>>>> understood
>>>>> correctly. Do the graphs confirm this or say something different?
>>>>
>>>> There is a period of about 500s or so until stuff starts to happen; 
>>>> I haven't looked at it. That is before stage-ins start too, though, 
>>>> which means that i think this...
>>>>
>>>>> If the 700-second delay figure is true, and stage-in was 
>>>>> eliminated by copying
>>>>> input files right to the /tmp workdir rather than first to 
>>>>> /shared, then we'd
>>>>> have:
>>>>>
>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>>
>>>> calculation is not meaningful.
>>>>
>>>> I have not looked at what is going on during that 500s startup 
>>>> time, but I plan to.
>>>
>>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper 
>>> logging problem a few weeks ago. Could that cause such a delay, Ben? 
>>> It would be very obvious in the swift log.
>> The version is Swift svn swift-r1780 cog-r1956
>>>
>>>>
>>>>> I assume we're paying the same staging price on the output side?
>>>>
>>>> not really - the output stageouts go very fast, and also because 
>>>> job ending is staggered, they don't happen all at once.
>>>>
>>>> This is the same with most of the large runs I've seen (of any 
>>>> application) - stageout tends not to be a problem (or at least, no 
>>>> where near the problems of stagein).
>>>>
>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>>> There's rate limiting still on file operations (100 max) and file 
>>>> transfers (2000 max) which is being hit still.
>>>
>>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
>>> like we can test with the latter higher, and find out what's 
>>> limiting the former.
>>>
>>> Zhao, what are your settings for property throttle.file.operations?
>>> I assume you have throttle.transfers set to 2000.
>>>
>>> If its set right, any chance that Swift or Karajan is limiting it 
>>> somewhere?
>> 2000 for sure,
>> throttle.submit=off
>> throttle.host.submit=off
>> throttle.score.job.factor=off
>> throttle.transfers=2000
>> throttle.file.operation=2000
>>>>
>>>> I think there's two directions to proceed in here that make sense 
>>>> for actual use on single clusters running falkon (rather than 
>>>> trying to cut out stuff randomly to push up numbers):
>>>>
>>>>  i) use some of the data placement features in falkon, rather than 
>>>> Swift's
>>>>     relatively simple data management that was designed more for 
>>>> running
>>>>     on the grid.
>>>
>>> Long term: we should consider how the Coaster implementation could 
>>> eventually do a similar data placement approach. In the meantime 
>>> (mid term) examining what interface changes are needed for Falkon 
>>> data placement might help prepare for that. Need to discuss if that 
>>> would be a good step or not.
>>>
>>>>
>>>>  ii) do stage-ins using symlinks rather than file copying. this makes
>>>>      sense when everything is living in a single filesystem, which 
>>>> again
>>>>      is not what Swift's data management was originally optimised for.
>>>
>>> I assume you mean symlinks from shared/ back to the user's input files?
>>>
>>> That sounds worth testing: find out if symlink creation is fast on 
>>> NFS and GPFS.
>>>
>>> Is another approach to copy direct from the user's files to the /tmp 
>>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
>>> symlinks alone get adequate performance. Symlinks do seem an easier 
>>> first step.
>>>
>>>> I think option ii) is substantially easier to implement (on the 
>>>> order of days) and is generally useful in the single-cluster, 
>>>> local-source-data situation that appears to be what people want to 
>>>> do for running on the BG/P and scicortex (that is, pretty much 
>>>> ignoring anything grid-like at all).
>>>
>>> Grid-like might mean pulling data to the /tmp workdir directly by 
>>> the wrapper - but that seems like a harder step, and would need 
>>> measurement and prototyping of such code before attempting. Data 
>>> transfer clients that the wrapper script can count on might be an 
>>> obstacle.
>>>
>>>>
>>>> Option i) is much harder (on the order of months), needing a very 
>>>> different interface between Swift and Falkon than exists at the 
>>>> moment.
>>>>
>>>>
>>>>
>>>
>


From hategan at mcs.anl.gov  Sun Apr 13 17:06:11 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 13 Apr 2008 17:06:11 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48028E19.7020400@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.ed u>
	<48027249.2070208@cs.uchica! go.edu>  <48028E19.7020400@mcs.anl.gov>
Message-ID: <1208124371.3191.4.camel@blabla.mcs.anl.gov>


On Sun, 2008-04-13 at 17:50 -0500, Michael Wilde wrote:
> Its not clear to me whats best here, for 3 reasons:
> 
> 1) We should set file.transfers and file.operations to a value that 
> prevents Swift from adversely impacting performance on shared resources.
> 
> Since Swift must run on the login node and hits the shared cluster 
> networks, we should test carefully.
> 
> 2) Its not clear to me how many concurrent operations the login hosts 
> can sustain before topping out, and how this number depends on file 
> size. Do you know this from the GPFS benchmarks? And did you measure 
> impact on system response during those benchmarks?
> 
> I think that the overall system would top out well before 2000 
> concurrent transfers, but I could be wrong. Going much higher than the 
> point where concurrency increases the data rate, it seems, would cause 
> the rate to drop due to contention and context switching.

The number is probably in the 10-100 range. With 2000 it's somewhat
likely that the transfers are long done before all the gridftp
connections can be started.

Mihael

> 
> 3) If I/O operations are fast compared to the job length and completion 
> rate, you dont have to set these values to the same as the max number of 
> input files that can be demanded at once.
> 
> I think we want to set the I/O operation concurrency to a value that 
> achieves the highest operation rate we can sustain while keeping overall 
> system performance at some acceptable level (tbd).
> 
> So first we need to find the concurrency level that maximizes ops/sec, 
> (which may be filesize dependent) and then possibly back that off to 
> reduce system impact.
> 
> It seems to me that finding the right I/O concurrency setting is complex 
> and non-obvious, and I'm interested in what Ben and Mihael suggest here.
> 
> - Mike
> 
> On 4/13/08 3:51 PM, Ioan Raicu wrote:
> > But we have 2X input files as opposed to number of jobs and CPUs.  We 
> > have 2048 CPUs, shouldn't we set all file I/O operations to at least 
> > 4096... and that means that files won't be ready for the next jobs once 
> > the first ones start completing... so we should really set things to 
> > twice that, so 8192 is the number I'd set on all file operations for 
> > this app on 2K CPUs.
> > Ioan
> > 
> > Zhao Zhang wrote:
> >> Hi, Mike
> >>
> >> Michael Wilde wrote:
> >>> Ben, your analysis sounds very good. Some notes below, including 
> >>> questions for Zhao.
> >>>
> >>> On 4/13/08 2:57 PM, Ben Clifford wrote:
> >>>>
> >>>>> Ben, can you point me to the graphs for this run? (Zhao's 
> >>>>> *99cy0z4g.log)
> >>>>
> >>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
> >>>>
> >>>>> Once stage-ins start to complete, are the corresponding jobs 
> >>>>> initiated quickly, or is Swift doing mostly stage-ins for some period?
> >>>>
> >>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
> >>>> falkon) pretty much right as the corresponding stagein completes. I 
> >>>> have no deeper information about when the worker actually starts to 
> >>>> run.
> >>>>
> >>>>> Zhao indicated he saw data indicating there was about a 700 second 
> >>>>> lag from
> >>>>> workflow start time till the first Falkon jobs started, if I 
> >>>>> understood
> >>>>> correctly. Do the graphs confirm this or say something different?
> >>>>
> >>>> There is a period of about 500s or so until stuff starts to happen; 
> >>>> I haven't looked at it. That is before stage-ins start too, though, 
> >>>> which means that i think this...
> >>>>
> >>>>> If the 700-second delay figure is true, and stage-in was eliminated 
> >>>>> by copying
> >>>>> input files right to the /tmp workdir rather than first to /shared, 
> >>>>> then we'd
> >>>>> have:
> >>>>>
> >>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
> >>>>
> >>>> calculation is not meaningful.
> >>>>
> >>>> I have not looked at what is going on during that 500s startup time, 
> >>>> but I plan to.
> >>>
> >>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper logging 
> >>> problem a few weeks ago. Could that cause such a delay, Ben? It would 
> >>> be very obvious in the swift log.
> >> The version is Swift svn swift-r1780 cog-r1956
> >>>
> >>>>
> >>>>> I assume we're paying the same staging price on the output side?
> >>>>
> >>>> not really - the output stageouts go very fast, and also because job 
> >>>> ending is staggered, they don't happen all at once.
> >>>>
> >>>> This is the same with most of the large runs I've seen (of any 
> >>>> application) - stageout tends not to be a problem (or at least, no 
> >>>> where near the problems of stagein).
> >>>>
> >>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
> >>>> There's rate limiting still on file operations (100 max) and file 
> >>>> transfers (2000 max) which is being hit still.
> >>>
> >>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
> >>> like we can test with the latter higher, and find out what's limiting 
> >>> the former.
> >>>
> >>> Zhao, what are your settings for property throttle.file.operations?
> >>> I assume you have throttle.transfers set to 2000.
> >>>
> >>> If its set right, any chance that Swift or Karajan is limiting it 
> >>> somewhere?
> >> 2000 for sure,
> >> throttle.submit=off
> >> throttle.host.submit=off
> >> throttle.score.job.factor=off
> >> throttle.transfers=2000
> >> throttle.file.operation=2000
> >>>>
> >>>> I think there's two directions to proceed in here that make sense 
> >>>> for actual use on single clusters running falkon (rather than trying 
> >>>> to cut out stuff randomly to push up numbers):
> >>>>
> >>>>  i) use some of the data placement features in falkon, rather than 
> >>>> Swift's
> >>>>     relatively simple data management that was designed more for 
> >>>> running
> >>>>     on the grid.
> >>>
> >>> Long term: we should consider how the Coaster implementation could 
> >>> eventually do a similar data placement approach. In the meantime (mid 
> >>> term) examining what interface changes are needed for Falkon data 
> >>> placement might help prepare for that. Need to discuss if that would 
> >>> be a good step or not.
> >>>
> >>>>
> >>>>  ii) do stage-ins using symlinks rather than file copying. this makes
> >>>>      sense when everything is living in a single filesystem, which 
> >>>> again
> >>>>      is not what Swift's data management was originally optimised for.
> >>>
> >>> I assume you mean symlinks from shared/ back to the user's input files?
> >>>
> >>> That sounds worth testing: find out if symlink creation is fast on 
> >>> NFS and GPFS.
> >>>
> >>> Is another approach to copy direct from the user's files to the /tmp 
> >>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
> >>> symlinks alone get adequate performance. Symlinks do seem an easier 
> >>> first step.
> >>>
> >>>> I think option ii) is substantially easier to implement (on the 
> >>>> order of days) and is generally useful in the single-cluster, 
> >>>> local-source-data situation that appears to be what people want to 
> >>>> do for running on the BG/P and scicortex (that is, pretty much 
> >>>> ignoring anything grid-like at all).
> >>>
> >>> Grid-like might mean pulling data to the /tmp workdir directly by the 
> >>> wrapper - but that seems like a harder step, and would need 
> >>> measurement and prototyping of such code before attempting. Data 
> >>> transfer clients that the wrapper script can count on might be an 
> >>> obstacle.
> >>>
> >>>>
> >>>> Option i) is much harder (on the order of months), needing a very 
> >>>> different interface between Swift and Falkon than exists at the moment.
> >>>>
> >>>>
> >>>>
> >>>
> > 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From hategan at mcs.anl.gov  Sun Apr 13 17:09:03 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 13 Apr 2008 17:09:03 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <4802902F.7050704@uchicago.edu>
References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du>
	<480280B8.9040605@mcs.anl.! gov>  <4802902F.7050704@uchicago.edu>
Message-ID: <1208124543.3191.7.camel@blabla.mcs.anl.gov>

Then my guess is that the system itself (swift + server + FS) cannot
sustain a much higher rate than 100 things per second. In principle
setting those throttles to 2000 pretty much means that you're trying to
start 2000 gridftp connections and hence 2000 gridftp processes on the
server.

On Sun, 2008-04-13 at 17:58 -0500, Zhao Zhang wrote:
> Hi, Mike
> 
> It is just a typo in the email. I my property file, it is 
> "throttle.file.operations=2000". Thanks.
> 
> zhao
> 
> Michael Wilde wrote:
> > >> If its set right, any chance that Swift or Karajan is limiting it
> > >> somewhere?
> > > 2000 for sure,
> > > throttle.submit=off
> > > throttle.host.submit=off
> > > throttle.score.job.factor=off
> > > throttle.transfers=2000
> > > throttle.file.operation=2000
> >
> >
> > Looks like a typo in your properties, Zhao - if the text above came 
> > from your swift.properties directly:
> >
> >   throttle.file.operation=2000
> >
> > vs operations with an s as per the properties doc:
> >
> > throttle.file.operations=8
> > #throttle.file.operations=off
> >
> > Which doesnt explain why we're seeing 100 when the default is 8 ???
> >
> > - Mike
> >
> >
> > On 4/13/08 3:39 PM, Zhao Zhang wrote:
> >> Hi, Mike
> >>
> >> Michael Wilde wrote:
> >>> Ben, your analysis sounds very good. Some notes below, including 
> >>> questions for Zhao.
> >>>
> >>> On 4/13/08 2:57 PM, Ben Clifford wrote:
> >>>>
> >>>>> Ben, can you point me to the graphs for this run? (Zhao's 
> >>>>> *99cy0z4g.log)
> >>>>
> >>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
> >>>>
> >>>>> Once stage-ins start to complete, are the corresponding jobs 
> >>>>> initiated quickly, or is Swift doing mostly stage-ins for some 
> >>>>> period?
> >>>>
> >>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
> >>>> falkon) pretty much right as the corresponding stagein completes. I 
> >>>> have no deeper information about when the worker actually starts to 
> >>>> run.
> >>>>
> >>>>> Zhao indicated he saw data indicating there was about a 700 second 
> >>>>> lag from
> >>>>> workflow start time till the first Falkon jobs started, if I 
> >>>>> understood
> >>>>> correctly. Do the graphs confirm this or say something different?
> >>>>
> >>>> There is a period of about 500s or so until stuff starts to happen; 
> >>>> I haven't looked at it. That is before stage-ins start too, though, 
> >>>> which means that i think this...
> >>>>
> >>>>> If the 700-second delay figure is true, and stage-in was 
> >>>>> eliminated by copying
> >>>>> input files right to the /tmp workdir rather than first to 
> >>>>> /shared, then we'd
> >>>>> have:
> >>>>>
> >>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
> >>>>
> >>>> calculation is not meaningful.
> >>>>
> >>>> I have not looked at what is going on during that 500s startup 
> >>>> time, but I plan to.
> >>>
> >>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper 
> >>> logging problem a few weeks ago. Could that cause such a delay, Ben? 
> >>> It would be very obvious in the swift log.
> >> The version is Swift svn swift-r1780 cog-r1956
> >>>
> >>>>
> >>>>> I assume we're paying the same staging price on the output side?
> >>>>
> >>>> not really - the output stageouts go very fast, and also because 
> >>>> job ending is staggered, they don't happen all at once.
> >>>>
> >>>> This is the same with most of the large runs I've seen (of any 
> >>>> application) - stageout tends not to be a problem (or at least, no 
> >>>> where near the problems of stagein).
> >>>>
> >>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
> >>>> There's rate limiting still on file operations (100 max) and file 
> >>>> transfers (2000 max) which is being hit still.
> >>>
> >>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
> >>> like we can test with the latter higher, and find out what's 
> >>> limiting the former.
> >>>
> >>> Zhao, what are your settings for property throttle.file.operations?
> >>> I assume you have throttle.transfers set to 2000.
> >>>
> >>> If its set right, any chance that Swift or Karajan is limiting it 
> >>> somewhere?
> >> 2000 for sure,
> >> throttle.submit=off
> >> throttle.host.submit=off
> >> throttle.score.job.factor=off
> >> throttle.transfers=2000
> >> throttle.file.operation=2000
> >>>>
> >>>> I think there's two directions to proceed in here that make sense 
> >>>> for actual use on single clusters running falkon (rather than 
> >>>> trying to cut out stuff randomly to push up numbers):
> >>>>
> >>>>  i) use some of the data placement features in falkon, rather than 
> >>>> Swift's
> >>>>     relatively simple data management that was designed more for 
> >>>> running
> >>>>     on the grid.
> >>>
> >>> Long term: we should consider how the Coaster implementation could 
> >>> eventually do a similar data placement approach. In the meantime 
> >>> (mid term) examining what interface changes are needed for Falkon 
> >>> data placement might help prepare for that. Need to discuss if that 
> >>> would be a good step or not.
> >>>
> >>>>
> >>>>  ii) do stage-ins using symlinks rather than file copying. this makes
> >>>>      sense when everything is living in a single filesystem, which 
> >>>> again
> >>>>      is not what Swift's data management was originally optimised for.
> >>>
> >>> I assume you mean symlinks from shared/ back to the user's input files?
> >>>
> >>> That sounds worth testing: find out if symlink creation is fast on 
> >>> NFS and GPFS.
> >>>
> >>> Is another approach to copy direct from the user's files to the /tmp 
> >>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
> >>> symlinks alone get adequate performance. Symlinks do seem an easier 
> >>> first step.
> >>>
> >>>> I think option ii) is substantially easier to implement (on the 
> >>>> order of days) and is generally useful in the single-cluster, 
> >>>> local-source-data situation that appears to be what people want to 
> >>>> do for running on the BG/P and scicortex (that is, pretty much 
> >>>> ignoring anything grid-like at all).
> >>>
> >>> Grid-like might mean pulling data to the /tmp workdir directly by 
> >>> the wrapper - but that seems like a harder step, and would need 
> >>> measurement and prototyping of such code before attempting. Data 
> >>> transfer clients that the wrapper script can count on might be an 
> >>> obstacle.
> >>>
> >>>>
> >>>> Option i) is much harder (on the order of months), needing a very 
> >>>> different interface between Swift and Falkon than exists at the 
> >>>> moment.
> >>>>
> >>>>
> >>>>
> >>>
> >
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From iraicu at cs.uchicago.edu  Sun Apr 13 18:22:51 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 13 Apr 2008 18:22:51 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <1208124543.3191.7.camel@blabla.mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FF7E6A.5010001@uchicago.edu>
	<47FF8ACC.1040603@cs.uchicago.edu>	<47FF8BB1.5000108@uchicago.edu>
	<47FF8CC9.6040903@cs.uchicago.edu>	<47FF8DC1.5040607@mcs.anl.gov>
	<47FF9662.2010205@uchicago.edu>	<47FF9D65.3040805@mcs.anl.gov>	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>	<47FFA226.3070303@uchicago.edu>	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>	<47FFCD24.2040704@uchicago.edu>	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>	<48008BAD.20501@uchicago.edu>	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>	<4800BCEF.5040206@mcs.anl.gov>
	<48012EA9.1060102@uchicago.edu>	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>	<48025C64.9020502@mcs.anl.gov>	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>	<48026E8B.501@mcs.anl.!
	gov> <48026F6B.9060300@uchicago.e!
	du>	<480280B8.9040605@mcs.anl.! gov>
	<4802902F.7050704@uchicago.edu>
	<1208124543.3191.7.camel@blabla.mcs.anl.gov>
Message-ID: <480295CB.3010806@cs.uchicago.edu>

We are not using GridFTP on the BG/P, where this test was done.  Files 
are already on GPFS, so the stageins are probably just cp (or ln -s) 
from one place to another on GPFS.  Is your suggestion still to set that 
2000 back down to 100?

Ioan

Mihael Hategan wrote:
> Then my guess is that the system itself (swift + server + FS) cannot
> sustain a much higher rate than 100 things per second. In principle
> setting those throttles to 2000 pretty much means that you're trying to
> start 2000 gridftp connections and hence 2000 gridftp processes on the
> server.
>
> On Sun, 2008-04-13 at 17:58 -0500, Zhao Zhang wrote:
>   
>> Hi, Mike
>>
>> It is just a typo in the email. I my property file, it is 
>> "throttle.file.operations=2000". Thanks.
>>
>> zhao
>>
>> Michael Wilde wrote:
>>     
>>>>> If its set right, any chance that Swift or Karajan is limiting it
>>>>> somewhere?
>>>>>           
>>>> 2000 for sure,
>>>> throttle.submit=off
>>>> throttle.host.submit=off
>>>> throttle.score.job.factor=off
>>>> throttle.transfers=2000
>>>> throttle.file.operation=2000
>>>>         
>>> Looks like a typo in your properties, Zhao - if the text above came 
>>> from your swift.properties directly:
>>>
>>>   throttle.file.operation=2000
>>>
>>> vs operations with an s as per the properties doc:
>>>
>>> throttle.file.operations=8
>>> #throttle.file.operations=off
>>>
>>> Which doesnt explain why we're seeing 100 when the default is 8 ???
>>>
>>> - Mike
>>>
>>>
>>> On 4/13/08 3:39 PM, Zhao Zhang wrote:
>>>       
>>>> Hi, Mike
>>>>
>>>> Michael Wilde wrote:
>>>>         
>>>>> Ben, your analysis sounds very good. Some notes below, including 
>>>>> questions for Zhao.
>>>>>
>>>>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>>>           
>>>>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>>>>> *99cy0z4g.log)
>>>>>>>               
>>>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>>>>
>>>>>>             
>>>>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>>>>> initiated quickly, or is Swift doing mostly stage-ins for some 
>>>>>>> period?
>>>>>>>               
>>>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>>>>> falkon) pretty much right as the corresponding stagein completes. I 
>>>>>> have no deeper information about when the worker actually starts to 
>>>>>> run.
>>>>>>
>>>>>>             
>>>>>>> Zhao indicated he saw data indicating there was about a 700 second 
>>>>>>> lag from
>>>>>>> workflow start time till the first Falkon jobs started, if I 
>>>>>>> understood
>>>>>>> correctly. Do the graphs confirm this or say something different?
>>>>>>>               
>>>>>> There is a period of about 500s or so until stuff starts to happen; 
>>>>>> I haven't looked at it. That is before stage-ins start too, though, 
>>>>>> which means that i think this...
>>>>>>
>>>>>>             
>>>>>>> If the 700-second delay figure is true, and stage-in was 
>>>>>>> eliminated by copying
>>>>>>> input files right to the /tmp workdir rather than first to 
>>>>>>> /shared, then we'd
>>>>>>> have:
>>>>>>>
>>>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>>>>>               
>>>>>> calculation is not meaningful.
>>>>>>
>>>>>> I have not looked at what is going on during that 500s startup 
>>>>>> time, but I plan to.
>>>>>>             
>>>>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper 
>>>>> logging problem a few weeks ago. Could that cause such a delay, Ben? 
>>>>> It would be very obvious in the swift log.
>>>>>           
>>>> The version is Swift svn swift-r1780 cog-r1956
>>>>         
>>>>>>> I assume we're paying the same staging price on the output side?
>>>>>>>               
>>>>>> not really - the output stageouts go very fast, and also because 
>>>>>> job ending is staggered, they don't happen all at once.
>>>>>>
>>>>>> This is the same with most of the large runs I've seen (of any 
>>>>>> application) - stageout tends not to be a problem (or at least, no 
>>>>>> where near the problems of stagein).
>>>>>>
>>>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>>>>> There's rate limiting still on file operations (100 max) and file 
>>>>>> transfers (2000 max) which is being hit still.
>>>>>>             
>>>>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
>>>>> like we can test with the latter higher, and find out what's 
>>>>> limiting the former.
>>>>>
>>>>> Zhao, what are your settings for property throttle.file.operations?
>>>>> I assume you have throttle.transfers set to 2000.
>>>>>
>>>>> If its set right, any chance that Swift or Karajan is limiting it 
>>>>> somewhere?
>>>>>           
>>>> 2000 for sure,
>>>> throttle.submit=off
>>>> throttle.host.submit=off
>>>> throttle.score.job.factor=off
>>>> throttle.transfers=2000
>>>> throttle.file.operation=2000
>>>>         
>>>>>> I think there's two directions to proceed in here that make sense 
>>>>>> for actual use on single clusters running falkon (rather than 
>>>>>> trying to cut out stuff randomly to push up numbers):
>>>>>>
>>>>>>  i) use some of the data placement features in falkon, rather than 
>>>>>> Swift's
>>>>>>     relatively simple data management that was designed more for 
>>>>>> running
>>>>>>     on the grid.
>>>>>>             
>>>>> Long term: we should consider how the Coaster implementation could 
>>>>> eventually do a similar data placement approach. In the meantime 
>>>>> (mid term) examining what interface changes are needed for Falkon 
>>>>> data placement might help prepare for that. Need to discuss if that 
>>>>> would be a good step or not.
>>>>>
>>>>>           
>>>>>>  ii) do stage-ins using symlinks rather than file copying. this makes
>>>>>>      sense when everything is living in a single filesystem, which 
>>>>>> again
>>>>>>      is not what Swift's data management was originally optimised for.
>>>>>>             
>>>>> I assume you mean symlinks from shared/ back to the user's input files?
>>>>>
>>>>> That sounds worth testing: find out if symlink creation is fast on 
>>>>> NFS and GPFS.
>>>>>
>>>>> Is another approach to copy direct from the user's files to the /tmp 
>>>>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
>>>>> symlinks alone get adequate performance. Symlinks do seem an easier 
>>>>> first step.
>>>>>
>>>>>           
>>>>>> I think option ii) is substantially easier to implement (on the 
>>>>>> order of days) and is generally useful in the single-cluster, 
>>>>>> local-source-data situation that appears to be what people want to 
>>>>>> do for running on the BG/P and scicortex (that is, pretty much 
>>>>>> ignoring anything grid-like at all).
>>>>>>             
>>>>> Grid-like might mean pulling data to the /tmp workdir directly by 
>>>>> the wrapper - but that seems like a harder step, and would need 
>>>>> measurement and prototyping of such code before attempting. Data 
>>>>> transfer clients that the wrapper script can count on might be an 
>>>>> obstacle.
>>>>>
>>>>>           
>>>>>> Option i) is much harder (on the order of months), needing a very 
>>>>>> different interface between Swift and Falkon than exists at the 
>>>>>> moment.
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>>     
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From iraicu at cs.uchicago.edu  Sun Apr 13 18:31:42 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 13 Apr 2008 18:31:42 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <48028E19.7020400@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FF0B1E.9030405@uchicago.edu>
	<47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu>
	<47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu>
	<47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.ed u>
	<48027249.2070208@cs.uchica! go.edu> <48028E19.7020400@mcs.an
	l.gov>
Message-ID: <480297DE.3010501@cs.uchicago.edu>


Michael Wilde wrote:
> Its not clear to me whats best here, for 3 reasons:
>
> 1) We should set file.transfers and file.operations to a value that 
> prevents Swift from adversely impacting performance on shared resources.
>
> Since Swift must run on the login node and hits the shared cluster 
> networks, we should test carefully.
>
> 2) Its not clear to me how many concurrent operations the login hosts 
> can sustain before topping out, and how this number depends on file 
> size. Do you know this from the GPFS benchmarks? 
1 node can sustain about 71 file reads/sec (1B each), and 512 nodes can 
sustain 362 reads/sec.  10KB files are similar, 66/sec and 315/sec.  
Read+write for 1B is 31/sec and 79/sec, and for 10KB is 31/sec and 
81/sec.  Does this explain anything?  How fast were stage-ins going?  
Does a stage-in mean copy a file from one place to another?  Then we are 
looking at the 1 node performance of 10KB read+write, which would be 31 
ops/sec.  Would all the stage-in be happening during the 500 second idle 
time?  If yes, then that is about 24 files/sec, which is awfully close 
to the 31 files/sec from our benchmark.  If not, then its pure coincidence.
> And did you measure impact on system response during those benchmarks?
No.

Ioan
>
> I think that the overall system would top out well before 2000 
> concurrent transfers, but I could be wrong. Going much higher than the 
> point where concurrency increases the data rate, it seems, would cause 
> the rate to drop due to contention and context switching.
>
> 3) If I/O operations are fast compared to the job length and 
> completion rate, you dont have to set these values to the same as the 
> max number of input files that can be demanded at once.
>
> I think we want to set the I/O operation concurrency to a value that 
> achieves the highest operation rate we can sustain while keeping 
> overall system performance at some acceptable level (tbd).
>
> So first we need to find the concurrency level that maximizes ops/sec, 
> (which may be filesize dependent) and then possibly back that off to 
> reduce system impact.
>
> It seems to me that finding the right I/O concurrency setting is 
> complex and non-obvious, and I'm interested in what Ben and Mihael 
> suggest here.
>
> - Mike
>
> On 4/13/08 3:51 PM, Ioan Raicu wrote:
>> But we have 2X input files as opposed to number of jobs and CPUs.  We 
>> have 2048 CPUs, shouldn't we set all file I/O operations to at least 
>> 4096... and that means that files won't be ready for the next jobs 
>> once the first ones start completing... so we should really set 
>> things to twice that, so 8192 is the number I'd set on all file 
>> operations for this app on 2K CPUs.
>> Ioan
>>
>> Zhao Zhang wrote:
>>> Hi, Mike
>>>
>>> Michael Wilde wrote:
>>>> Ben, your analysis sounds very good. Some notes below, including 
>>>> questions for Zhao.
>>>>
>>>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>>>
>>>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>>>> *99cy0z4g.log)
>>>>>
>>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>>>
>>>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>>>> initiated quickly, or is Swift doing mostly stage-ins for some 
>>>>>> period?
>>>>>
>>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>>>> falkon) pretty much right as the corresponding stagein completes. 
>>>>> I have no deeper information about when the worker actually starts 
>>>>> to run.
>>>>>
>>>>>> Zhao indicated he saw data indicating there was about a 700 
>>>>>> second lag from
>>>>>> workflow start time till the first Falkon jobs started, if I 
>>>>>> understood
>>>>>> correctly. Do the graphs confirm this or say something different?
>>>>>
>>>>> There is a period of about 500s or so until stuff starts to 
>>>>> happen; I haven't looked at it. That is before stage-ins start 
>>>>> too, though, which means that i think this...
>>>>>
>>>>>> If the 700-second delay figure is true, and stage-in was 
>>>>>> eliminated by copying
>>>>>> input files right to the /tmp workdir rather than first to 
>>>>>> /shared, then we'd
>>>>>> have:
>>>>>>
>>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>>>
>>>>> calculation is not meaningful.
>>>>>
>>>>> I have not looked at what is going on during that 500s startup 
>>>>> time, but I plan to.
>>>>
>>>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper 
>>>> logging problem a few weeks ago. Could that cause such a delay, 
>>>> Ben? It would be very obvious in the swift log.
>>> The version is Swift svn swift-r1780 cog-r1956
>>>>
>>>>>
>>>>>> I assume we're paying the same staging price on the output side?
>>>>>
>>>>> not really - the output stageouts go very fast, and also because 
>>>>> job ending is staggered, they don't happen all at once.
>>>>>
>>>>> This is the same with most of the large runs I've seen (of any 
>>>>> application) - stageout tends not to be a problem (or at least, no 
>>>>> where near the problems of stagein).
>>>>>
>>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>>>> There's rate limiting still on file operations (100 max) and file 
>>>>> transfers (2000 max) which is being hit still.
>>>>
>>>> I thought Zhao set file operations throttle to 2000 as well.  
>>>> Sounds like we can test with the latter higher, and find out what's 
>>>> limiting the former.
>>>>
>>>> Zhao, what are your settings for property throttle.file.operations?
>>>> I assume you have throttle.transfers set to 2000.
>>>>
>>>> If its set right, any chance that Swift or Karajan is limiting it 
>>>> somewhere?
>>> 2000 for sure,
>>> throttle.submit=off
>>> throttle.host.submit=off
>>> throttle.score.job.factor=off
>>> throttle.transfers=2000
>>> throttle.file.operation=2000
>>>>>
>>>>> I think there's two directions to proceed in here that make sense 
>>>>> for actual use on single clusters running falkon (rather than 
>>>>> trying to cut out stuff randomly to push up numbers):
>>>>>
>>>>>  i) use some of the data placement features in falkon, rather than 
>>>>> Swift's
>>>>>     relatively simple data management that was designed more for 
>>>>> running
>>>>>     on the grid.
>>>>
>>>> Long term: we should consider how the Coaster implementation could 
>>>> eventually do a similar data placement approach. In the meantime 
>>>> (mid term) examining what interface changes are needed for Falkon 
>>>> data placement might help prepare for that. Need to discuss if that 
>>>> would be a good step or not.
>>>>
>>>>>
>>>>>  ii) do stage-ins using symlinks rather than file copying. this makes
>>>>>      sense when everything is living in a single filesystem, which 
>>>>> again
>>>>>      is not what Swift's data management was originally optimised 
>>>>> for.
>>>>
>>>> I assume you mean symlinks from shared/ back to the user's input 
>>>> files?
>>>>
>>>> That sounds worth testing: find out if symlink creation is fast on 
>>>> NFS and GPFS.
>>>>
>>>> Is another approach to copy direct from the user's files to the 
>>>> /tmp workdir (ie wrapper.sh pulls the data in)? Measurement will 
>>>> tell if symlinks alone get adequate performance. Symlinks do seem 
>>>> an easier first step.
>>>>
>>>>> I think option ii) is substantially easier to implement (on the 
>>>>> order of days) and is generally useful in the single-cluster, 
>>>>> local-source-data situation that appears to be what people want to 
>>>>> do for running on the BG/P and scicortex (that is, pretty much 
>>>>> ignoring anything grid-like at all).
>>>>
>>>> Grid-like might mean pulling data to the /tmp workdir directly by 
>>>> the wrapper - but that seems like a harder step, and would need 
>>>> measurement and prototyping of such code before attempting. Data 
>>>> transfer clients that the wrapper script can count on might be an 
>>>> obstacle.
>>>>
>>>>>
>>>>> Option i) is much harder (on the order of months), needing a very 
>>>>> different interface between Swift and Falkon than exists at the 
>>>>> moment.
>>>>>
>>>>>
>>>>>
>>>>
>>
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From hategan at mcs.anl.gov  Sun Apr 13 17:41:20 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 13 Apr 2008 17:41:20 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <480295CB.3010806@cs.uchicago.edu>
References: <47FEB489.509@mcs.anl.gov> <47FF7E6A.5010001@uchicago.edu>
	<47FF8ACC.1040603@cs.uchicago.edu>	<47FF8BB1.5000108@uchicago.edu>
	<47FF8CC9.6040903@cs.uchicago.edu>	<47FF8DC1.5040607@mcs.anl.gov>
	<47FF9662.2010205@uchicago.edu>	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du>
	<480280B8.9040605@mcs.anl.! gov>  <4802902F.7050704@uchicago.edu>
	<1208124543.3191.7.camel@blabla.mcs.anl.gov>
	<480295CB.3010806@cs.uchicago.edu>
Message-ID: <1208126480.3732.3.camel@blabla.mcs.anl.gov>


On Sun, 2008-04-13 at 18:22 -0500, Ioan Raicu wrote:
> We are not using GridFTP on the BG/P, where this test was done.  Files 
> are already on GPFS, so the stageins are probably just cp (or ln -s) 
> from one place to another on GPFS.  Is your suggestion still to set that 
> 2000 back down to 100?

I see. So it's the local provider. Well, mileage may vary. 100
concurrent transfers doesn't seem very far from what I'd expect if we're
talking about small files.

> 
> Ioan
> 
> Mihael Hategan wrote:
> > Then my guess is that the system itself (swift + server + FS) cannot
> > sustain a much higher rate than 100 things per second. In principle
> > setting those throttles to 2000 pretty much means that you're trying to
> > start 2000 gridftp connections and hence 2000 gridftp processes on the
> > server.
> >
> > On Sun, 2008-04-13 at 17:58 -0500, Zhao Zhang wrote:
> >   
> >> Hi, Mike
> >>
> >> It is just a typo in the email. I my property file, it is 
> >> "throttle.file.operations=2000". Thanks.
> >>
> >> zhao
> >>
> >> Michael Wilde wrote:
> >>     
> >>>>> If its set right, any chance that Swift or Karajan is limiting it
> >>>>> somewhere?
> >>>>>           
> >>>> 2000 for sure,
> >>>> throttle.submit=off
> >>>> throttle.host.submit=off
> >>>> throttle.score.job.factor=off
> >>>> throttle.transfers=2000
> >>>> throttle.file.operation=2000
> >>>>         
> >>> Looks like a typo in your properties, Zhao - if the text above came 
> >>> from your swift.properties directly:
> >>>
> >>>   throttle.file.operation=2000
> >>>
> >>> vs operations with an s as per the properties doc:
> >>>
> >>> throttle.file.operations=8
> >>> #throttle.file.operations=off
> >>>
> >>> Which doesnt explain why we're seeing 100 when the default is 8 ???
> >>>
> >>> - Mike
> >>>
> >>>
> >>> On 4/13/08 3:39 PM, Zhao Zhang wrote:
> >>>       
> >>>> Hi, Mike
> >>>>
> >>>> Michael Wilde wrote:
> >>>>         
> >>>>> Ben, your analysis sounds very good. Some notes below, including 
> >>>>> questions for Zhao.
> >>>>>
> >>>>> On 4/13/08 2:57 PM, Ben Clifford wrote:
> >>>>>           
> >>>>>>> Ben, can you point me to the graphs for this run? (Zhao's 
> >>>>>>> *99cy0z4g.log)
> >>>>>>>               
> >>>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
> >>>>>>
> >>>>>>             
> >>>>>>> Once stage-ins start to complete, are the corresponding jobs 
> >>>>>>> initiated quickly, or is Swift doing mostly stage-ins for some 
> >>>>>>> period?
> >>>>>>>               
> >>>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
> >>>>>> falkon) pretty much right as the corresponding stagein completes. I 
> >>>>>> have no deeper information about when the worker actually starts to 
> >>>>>> run.
> >>>>>>
> >>>>>>             
> >>>>>>> Zhao indicated he saw data indicating there was about a 700 second 
> >>>>>>> lag from
> >>>>>>> workflow start time till the first Falkon jobs started, if I 
> >>>>>>> understood
> >>>>>>> correctly. Do the graphs confirm this or say something different?
> >>>>>>>               
> >>>>>> There is a period of about 500s or so until stuff starts to happen; 
> >>>>>> I haven't looked at it. That is before stage-ins start too, though, 
> >>>>>> which means that i think this...
> >>>>>>
> >>>>>>             
> >>>>>>> If the 700-second delay figure is true, and stage-in was 
> >>>>>>> eliminated by copying
> >>>>>>> input files right to the /tmp workdir rather than first to 
> >>>>>>> /shared, then we'd
> >>>>>>> have:
> >>>>>>>
> >>>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
> >>>>>>>               
> >>>>>> calculation is not meaningful.
> >>>>>>
> >>>>>> I have not looked at what is going on during that 500s startup 
> >>>>>> time, but I plan to.
> >>>>>>             
> >>>>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper 
> >>>>> logging problem a few weeks ago. Could that cause such a delay, Ben? 
> >>>>> It would be very obvious in the swift log.
> >>>>>           
> >>>> The version is Swift svn swift-r1780 cog-r1956
> >>>>         
> >>>>>>> I assume we're paying the same staging price on the output side?
> >>>>>>>               
> >>>>>> not really - the output stageouts go very fast, and also because 
> >>>>>> job ending is staggered, they don't happen all at once.
> >>>>>>
> >>>>>> This is the same with most of the large runs I've seen (of any 
> >>>>>> application) - stageout tends not to be a problem (or at least, no 
> >>>>>> where near the problems of stagein).
> >>>>>>
> >>>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
> >>>>>> There's rate limiting still on file operations (100 max) and file 
> >>>>>> transfers (2000 max) which is being hit still.
> >>>>>>             
> >>>>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
> >>>>> like we can test with the latter higher, and find out what's 
> >>>>> limiting the former.
> >>>>>
> >>>>> Zhao, what are your settings for property throttle.file.operations?
> >>>>> I assume you have throttle.transfers set to 2000.
> >>>>>
> >>>>> If its set right, any chance that Swift or Karajan is limiting it 
> >>>>> somewhere?
> >>>>>           
> >>>> 2000 for sure,
> >>>> throttle.submit=off
> >>>> throttle.host.submit=off
> >>>> throttle.score.job.factor=off
> >>>> throttle.transfers=2000
> >>>> throttle.file.operation=2000
> >>>>         
> >>>>>> I think there's two directions to proceed in here that make sense 
> >>>>>> for actual use on single clusters running falkon (rather than 
> >>>>>> trying to cut out stuff randomly to push up numbers):
> >>>>>>
> >>>>>>  i) use some of the data placement features in falkon, rather than 
> >>>>>> Swift's
> >>>>>>     relatively simple data management that was designed more for 
> >>>>>> running
> >>>>>>     on the grid.
> >>>>>>             
> >>>>> Long term: we should consider how the Coaster implementation could 
> >>>>> eventually do a similar data placement approach. In the meantime 
> >>>>> (mid term) examining what interface changes are needed for Falkon 
> >>>>> data placement might help prepare for that. Need to discuss if that 
> >>>>> would be a good step or not.
> >>>>>
> >>>>>           
> >>>>>>  ii) do stage-ins using symlinks rather than file copying. this makes
> >>>>>>      sense when everything is living in a single filesystem, which 
> >>>>>> again
> >>>>>>      is not what Swift's data management was originally optimised for.
> >>>>>>             
> >>>>> I assume you mean symlinks from shared/ back to the user's input files?
> >>>>>
> >>>>> That sounds worth testing: find out if symlink creation is fast on 
> >>>>> NFS and GPFS.
> >>>>>
> >>>>> Is another approach to copy direct from the user's files to the /tmp 
> >>>>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
> >>>>> symlinks alone get adequate performance. Symlinks do seem an easier 
> >>>>> first step.
> >>>>>
> >>>>>           
> >>>>>> I think option ii) is substantially easier to implement (on the 
> >>>>>> order of days) and is generally useful in the single-cluster, 
> >>>>>> local-source-data situation that appears to be what people want to 
> >>>>>> do for running on the BG/P and scicortex (that is, pretty much 
> >>>>>> ignoring anything grid-like at all).
> >>>>>>             
> >>>>> Grid-like might mean pulling data to the /tmp workdir directly by 
> >>>>> the wrapper - but that seems like a harder step, and would need 
> >>>>> measurement and prototyping of such code before attempting. Data 
> >>>>> transfer clients that the wrapper script can count on might be an 
> >>>>> obstacle.
> >>>>>
> >>>>>           
> >>>>>> Option i) is much harder (on the order of months), needing a very 
> >>>>>> different interface between Swift and Falkon than exists at the 
> >>>>>> moment.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>             
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>
> >>     
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> >   
> 
> -- 
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
> 
> 


From iraicu at cs.uchicago.edu  Sun Apr 13 18:45:28 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 13 Apr 2008 18:45:28 -0500
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <1208126480.3732.3.camel@blabla.mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov>
	<47FF8CC9.6040903@cs.uchicago.edu>	<47FF8DC1.5040607@mcs.anl.gov>	
	<47FF9662.2010205@uchicago.edu>	<47FF9D65.3040805@mcs.anl.gov>	
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>	
	<47FFA226.3070303@uchicago.edu>	
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>	
	<47FFCD24.2040704@uchicago.edu>	
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>	
	<48008BAD.20501@uchicago.edu>	
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>	
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>	
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>	
	<48025C64.9020502@mcs.anl.gov>	
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>	
	<48026E8B.501@mcs.anl.! gov> <48026F6B.9060300@uchicago.e! du>	
	<480280B8.9040605@mcs.anl.! gov> <4802902F.7050704@uchicago.edu>	
	<1208124543.3191.7.camel@blabla.mcs.anl.gov>	
	<480295CB.3010806@cs.uchicago.edu>
	<1208126480.3732.3.camel@blabla.mcs.anl.gov>
Message-ID: <48029B18.5080900@cs.uchicago.edu>

OK, sounds good.  Zhao, when you get the chance, maybe you could try 100 
instead of 2000 for the file limits?  I don't know if this is a top 
priority right now, but certainly for after SC.

"throttle.file.operations=100"

Ioan

Mihael Hategan wrote:
> On Sun, 2008-04-13 at 18:22 -0500, Ioan Raicu wrote:
>   
>> We are not using GridFTP on the BG/P, where this test was done.  Files 
>> are already on GPFS, so the stageins are probably just cp (or ln -s) 
>> from one place to another on GPFS.  Is your suggestion still to set that 
>> 2000 back down to 100?
>>     
>
> I see. So it's the local provider. Well, mileage may vary. 100
> concurrent transfers doesn't seem very far from what I'd expect if we're
> talking about small files.
>
>   
>> Ioan
>>
>> Mihael Hategan wrote:
>>     
>>> Then my guess is that the system itself (swift + server + FS) cannot
>>> sustain a much higher rate than 100 things per second. In principle
>>> setting those throttles to 2000 pretty much means that you're trying to
>>> start 2000 gridftp connections and hence 2000 gridftp processes on the
>>> server.
>>>
>>> On Sun, 2008-04-13 at 17:58 -0500, Zhao Zhang wrote:
>>>   
>>>       
>>>> Hi, Mike
>>>>
>>>> It is just a typo in the email. I my property file, it is 
>>>> "throttle.file.operations=2000". Thanks.
>>>>
>>>> zhao
>>>>
>>>> Michael Wilde wrote:
>>>>     
>>>>         
>>>>>>> If its set right, any chance that Swift or Karajan is limiting it
>>>>>>> somewhere?
>>>>>>>           
>>>>>>>               
>>>>>> 2000 for sure,
>>>>>> throttle.submit=off
>>>>>> throttle.host.submit=off
>>>>>> throttle.score.job.factor=off
>>>>>> throttle.transfers=2000
>>>>>> throttle.file.operation=2000
>>>>>>         
>>>>>>             
>>>>> Looks like a typo in your properties, Zhao - if the text above came 
>>>>> from your swift.properties directly:
>>>>>
>>>>>   throttle.file.operation=2000
>>>>>
>>>>> vs operations with an s as per the properties doc:
>>>>>
>>>>> throttle.file.operations=8
>>>>> #throttle.file.operations=off
>>>>>
>>>>> Which doesnt explain why we're seeing 100 when the default is 8 ???
>>>>>
>>>>> - Mike
>>>>>
>>>>>
>>>>> On 4/13/08 3:39 PM, Zhao Zhang wrote:
>>>>>       
>>>>>           
>>>>>> Hi, Mike
>>>>>>
>>>>>> Michael Wilde wrote:
>>>>>>         
>>>>>>             
>>>>>>> Ben, your analysis sounds very good. Some notes below, including 
>>>>>>> questions for Zhao.
>>>>>>>
>>>>>>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>>>>>           
>>>>>>>               
>>>>>>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>>>>>>> *99cy0z4g.log)
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>>>>>>
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>>>>>>> initiated quickly, or is Swift doing mostly stage-ins for some 
>>>>>>>>> period?
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>>>>>>> falkon) pretty much right as the corresponding stagein completes. I 
>>>>>>>> have no deeper information about when the worker actually starts to 
>>>>>>>> run.
>>>>>>>>
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>> Zhao indicated he saw data indicating there was about a 700 second 
>>>>>>>>> lag from
>>>>>>>>> workflow start time till the first Falkon jobs started, if I 
>>>>>>>>> understood
>>>>>>>>> correctly. Do the graphs confirm this or say something different?
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>> There is a period of about 500s or so until stuff starts to happen; 
>>>>>>>> I haven't looked at it. That is before stage-ins start too, though, 
>>>>>>>> which means that i think this...
>>>>>>>>
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>> If the 700-second delay figure is true, and stage-in was 
>>>>>>>>> eliminated by copying
>>>>>>>>> input files right to the /tmp workdir rather than first to 
>>>>>>>>> /shared, then we'd
>>>>>>>>> have:
>>>>>>>>>
>>>>>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>> calculation is not meaningful.
>>>>>>>>
>>>>>>>> I have not looked at what is going on during that 500s startup 
>>>>>>>> time, but I plan to.
>>>>>>>>             
>>>>>>>>                 
>>>>>>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper 
>>>>>>> logging problem a few weeks ago. Could that cause such a delay, Ben? 
>>>>>>> It would be very obvious in the swift log.
>>>>>>>           
>>>>>>>               
>>>>>> The version is Swift svn swift-r1780 cog-r1956
>>>>>>         
>>>>>>             
>>>>>>>>> I assume we're paying the same staging price on the output side?
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>> not really - the output stageouts go very fast, and also because 
>>>>>>>> job ending is staggered, they don't happen all at once.
>>>>>>>>
>>>>>>>> This is the same with most of the large runs I've seen (of any 
>>>>>>>> application) - stageout tends not to be a problem (or at least, no 
>>>>>>>> where near the problems of stagein).
>>>>>>>>
>>>>>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>>>>>>> There's rate limiting still on file operations (100 max) and file 
>>>>>>>> transfers (2000 max) which is being hit still.
>>>>>>>>             
>>>>>>>>                 
>>>>>>> I thought Zhao set file operations throttle to 2000 as well.  Sounds 
>>>>>>> like we can test with the latter higher, and find out what's 
>>>>>>> limiting the former.
>>>>>>>
>>>>>>> Zhao, what are your settings for property throttle.file.operations?
>>>>>>> I assume you have throttle.transfers set to 2000.
>>>>>>>
>>>>>>> If its set right, any chance that Swift or Karajan is limiting it 
>>>>>>> somewhere?
>>>>>>>           
>>>>>>>               
>>>>>> 2000 for sure,
>>>>>> throttle.submit=off
>>>>>> throttle.host.submit=off
>>>>>> throttle.score.job.factor=off
>>>>>> throttle.transfers=2000
>>>>>> throttle.file.operation=2000
>>>>>>         
>>>>>>             
>>>>>>>> I think there's two directions to proceed in here that make sense 
>>>>>>>> for actual use on single clusters running falkon (rather than 
>>>>>>>> trying to cut out stuff randomly to push up numbers):
>>>>>>>>
>>>>>>>>  i) use some of the data placement features in falkon, rather than 
>>>>>>>> Swift's
>>>>>>>>     relatively simple data management that was designed more for 
>>>>>>>> running
>>>>>>>>     on the grid.
>>>>>>>>             
>>>>>>>>                 
>>>>>>> Long term: we should consider how the Coaster implementation could 
>>>>>>> eventually do a similar data placement approach. In the meantime 
>>>>>>> (mid term) examining what interface changes are needed for Falkon 
>>>>>>> data placement might help prepare for that. Need to discuss if that 
>>>>>>> would be a good step or not.
>>>>>>>
>>>>>>>           
>>>>>>>               
>>>>>>>>  ii) do stage-ins using symlinks rather than file copying. this makes
>>>>>>>>      sense when everything is living in a single filesystem, which 
>>>>>>>> again
>>>>>>>>      is not what Swift's data management was originally optimised for.
>>>>>>>>             
>>>>>>>>                 
>>>>>>> I assume you mean symlinks from shared/ back to the user's input files?
>>>>>>>
>>>>>>> That sounds worth testing: find out if symlink creation is fast on 
>>>>>>> NFS and GPFS.
>>>>>>>
>>>>>>> Is another approach to copy direct from the user's files to the /tmp 
>>>>>>> workdir (ie wrapper.sh pulls the data in)? Measurement will tell if 
>>>>>>> symlinks alone get adequate performance. Symlinks do seem an easier 
>>>>>>> first step.
>>>>>>>
>>>>>>>           
>>>>>>>               
>>>>>>>> I think option ii) is substantially easier to implement (on the 
>>>>>>>> order of days) and is generally useful in the single-cluster, 
>>>>>>>> local-source-data situation that appears to be what people want to 
>>>>>>>> do for running on the BG/P and scicortex (that is, pretty much 
>>>>>>>> ignoring anything grid-like at all).
>>>>>>>>             
>>>>>>>>                 
>>>>>>> Grid-like might mean pulling data to the /tmp workdir directly by 
>>>>>>> the wrapper - but that seems like a harder step, and would need 
>>>>>>> measurement and prototyping of such code before attempting. Data 
>>>>>>> transfer clients that the wrapper script can count on might be an 
>>>>>>> obstacle.
>>>>>>>
>>>>>>>           
>>>>>>>               
>>>>>>>> Option i) is much harder (on the order of months), needing a very 
>>>>>>>> different interface between Swift and Falkon than exists at the 
>>>>>>>> moment.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>             
>>>>>>>>                 
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>
>>>>     
>>>>         
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>>   
>>>       
>> -- 
>> ===================================================
>> Ioan Raicu
>> Ph.D. Candidate
>> ===================================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ===================================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>> http://dev.globus.org/wiki/Incubator/Falkon
>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>> ===================================================
>> ===================================================
>>
>>
>>     
>
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From benc at hawaga.org.uk  Sun Apr 13 23:26:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 14 Apr 2008 04:26:15 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <480289B7.4050207@mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FF0B1E.9030405@uchicago.edu>
	<47FF2C1F.9030102@uchicago.edu> <47FF7E6A.5010001@uchicago.edu>
	<47FF8ACC.1040603@cs.uchicago.edu> <47FF8BB1.5000108@uchicago.edu>
	<47FF8CC9.6040903@cs.uchicago.edu> <47FF8DC1.5040607@mcs.anl.gov>
	<47FF9662.2010205@uchicago.edu> <47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804132056281.31934@dildano.hawaga.org.uk>
	<480289B7.4050207@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804140424070.31934@dildano.hawaga.org.uk>


On Sun, 13 Apr 2008, Michael Wilde wrote:

> That might be a low rate, but its also not clear why its creating so many
> handles: I thought we only have about 6K jobs here, with 2 files in and 1 file
> out per job:

all the local variables, intermediate values and individual pieces of 
structures count as datasets. 18 for each iteration of the loop seems 
roughly correct.

-- 


From benc at hawaga.org.uk  Sun Apr 13 23:28:19 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 14 Apr 2008 04:28:19 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <1208124543.3191.7.camel@blabla.mcs.anl.gov>
References: <47FEB489.509@mcs.anl.gov> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<48026E8B.501@mcs.anl.!
	gov> <48026F6B.9060300@uchicago.e! du> <480280B8.9040605@mcs.anl.! gov>
	<4802902F.7050704@uchicago.edu>
	<1208124543.3191.7.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804140427350.645@dildano.hawaga.org.uk>


On Sun, 13 Apr 2008, Mihael Hategan wrote:

> Then my guess is that the system itself (swift + server + FS) cannot
> sustain a much higher rate than 100 things per second

The graph shows it maxing out at exactly 100, which is suspicious if its a 
soft load limit.

-- 


From wilde at mcs.anl.gov  Mon Apr 14 13:49:24 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 14 Apr 2008 13:49:24 -0500
Subject: [Swift-devel] Can not reach advertised gridftp server on Abe
Message-ID: <4803A734.9030407@mcs.anl.gov>

Hi Help Team,

When I try to reach the gridftp server on Abe advertised in the Userinfo 
pages at, which is:

   gridftp-abe.ncsa.teragrid.org

on page:

http://www.teragrid.org/userinfo/hardware/resources.php?type=compute&select=single&id=50&PHPSESSID=2379360d3ce483f8f90532609354cd73

then I get the following error:

# my cert is ok:

login$ globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname
honest4.ncsa.uiuc.edu

# this fails:

login$ globus-url-copy file:///etc/passwd 
gsiftp://gridftp-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci

error: globus_ftp_client: the server responded with an error
530 530-Login incorrect. : IPC connection failed.
530-globus_xio_gsi: gss_init_sec_context failed.
530-GSS Major Status: Unexpected Gatekeeper or Service Name
530-globus_gsi_gssapi: Authorization denied: The name of the remote host 
(abe-ipib-gw02.ncsa.uiuc.edu), and the expected name for the remote host 
(abe-gw02) do not match. This happens when the name in the host 
certificate does not match the information obtained from DNS and is 
often a DNS configuration problem.
530 End.


# while gridftp to the GRAM gatekeeper host works:

login$ globus-url-copy file:///etc/passwd 
gsiftp://grid-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci
login$

--

A problem due to multi-homed hosts or multiple hostname aliases?

Is gridftp-abe a beefier data server than grid-abe, and if so should the 
problem above get fixed?

Thanks,

- Mike


From benc at hawaga.org.uk  Mon Apr 14 14:31:41 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 14 Apr 2008 19:31:41 +0000 (GMT)
Subject: [Swift-devel] Re: Another performance comparison of DOCK
In-Reply-To: <Pine.LNX.4.64.0804132056281.31934@dildano.hawaga.org.uk>
References: <47FEB489.509@mcs.anl.gov> <47FEF5CA.60700@cs.uchicago.edu>
	<47FF0B1E.9030405@uchicago.edu> <47FF2C1F.9030102@uchicago.edu>
	<47FF7E6A.5010001@uchicago.edu> <47FF8ACC.1040603@cs.uchicago.edu>
	<47FF8BB1.5000108@uchicago.edu> <47FF8CC9.6040903@cs.uchicago.edu>
	<47FF8DC1.5040607@mcs.anl.gov> <47FF9662.2010205@uchicago.edu>
	<47FF9D65.3040805@mcs.anl.gov>
	<Pine.LNX.4.64.0804111721320.31934@dildano.hawaga.org.uk>
	<47FFA226.3070303@uchicago.edu>
	<Pine.LNX.4.64.0804111740580.31934@dildano.hawaga.org.uk>
	<47FFCD24.2040704@uchicago.edu>
	<Pine.LNX.4.64.0804120841240.31934@dildano.hawaga.org.uk>
	<48008BAD.20501@uchicago.edu>
	<Pine.LNX.4.64.0804121017080.31934@dildano.hawaga.org.uk>
	<4800BCEF.5040206@mcs.anl.gov> <48012EA9.1060102@uchicago.edu>
	<Pine.LNX.4.64.0804131635330.31934@dildano.hawaga.org.uk>
	<48025C64.9020502@mcs.anl.gov>
	<Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804132056281.31934@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804141930290.31934@dildano.hawaga.org.uk>


> What happens in the log file for this period is lots of DSHandle creation 
> (vdl:new) - it creates 115596 datasets in 451 seconds

Swift r1790 introduces constant interning. When constants are used in a 
foreach or iterate loop, this should reduce the number of DSHandles 
created (once per SwiftScript program per constant rather than once per 
iteration per constant in <r1790)

-- 


From help at teragrid.org  Mon Apr 14 15:04:09 2008
From: help at teragrid.org (help at teragrid.org)
Date: Mon, 14 Apr 2008 15:04:09 -0500
Subject: [Swift-devel] Re: Can not reach advertised gridftp server on Abe
In-Reply-To: <4803A734.9030407@mcs.anl.gov>  from Michael Wilde on Mon,
	14 Apr 2008 13:49:24 -0500
Message-ID: <200804142004.m3EK49sM024366@zanamavir.ncsa.uiuc.edu>

FROM: Jackson, Weddie
(Concerning ticket No. 154784)
==============================
Hello Michael,

Our Grid Services Folks are working on the issue, in the meantime you can use 

  login-abe.ncsa.teragrid.org

instead of 

  gridftp-abe.ncsa.teragrid.org


and we will notify you once the issue has been resolved.


We appologize for any inconvenience.


Thanks,
-Weddie
------------------------
Weddie Jackson
NCSA Consulting Services
------------------------


Michael Wilde <help at teragrid.org> writes:
>Hi Help Team,
>
>When I try to reach the gridftp server on Abe advertised in the Userinfo 
>pages at, which is:
>
>   gridftp-abe.ncsa.teragrid.org
>
>on page:
>
>http://www.teragrid.org/userinfo/hardware/resources.php?
type=compute&select=single&id=50&PHPSESSID=2379360d3ce483f8f90532609354cd73
>
>then I get the following error:
>
># my cert is ok:
>
>login$ globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname
>honest4.ncsa.uiuc.edu
>
># this fails:
>
>login$ globus-url-copy file:///etc/passwd 
>gsiftp://gridftp-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci
>
>error: globus_ftp_client: the server responded with an error
>530 530-Login incorrect. : IPC connection failed.
>530-globus_xio_gsi: gss_init_sec_context failed.
>530-GSS Major Status: Unexpected Gatekeeper or Service Name
>530-globus_gsi_gssapi: Authorization denied: The name of the remote host 
>(abe-ipib-gw02.ncsa.uiuc.edu), and the expected name for the remote host 
>(abe-gw02) do not match. This happens when the name in the host 
>certificate does not match the information obtained from DNS and is 
>often a DNS configuration problem.
>530 End.
>
>
># while gridftp to the GRAM gatekeeper host works:
>
>login$ globus-url-copy file:///etc/passwd 
>gsiftp://grid-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci
>login$


From help at teragrid.org  Mon Apr 14 15:38:28 2008
From: help at teragrid.org (help at teragrid.org)
Date: Mon, 14 Apr 2008 15:38:28 -0500
Subject: [Swift-devel] [Fwd: Re: Can not reach advertised gridftp server on
	Abe ]
Message-ID: <200804142038.m3EKcSp6019304@amantadine.ncsa.uiuc.edu>

FROM: Jackson, Weddie
(Concerning ticket No. 154784)
==============================
Hello Michael,

Can you try reach the Abe's GridFTP server "gridftp-abe.ncsa.teragrid.org" 
again, our Grid Folks beleived that they have resolved the issue.

Please let us know whether or not you are still seeing issues when using 
"gridftp-abe.ncsa.teragrid.org".

Thanks,
-Weddie
------------------------
Weddie Jackson
NCSA Consulting Services
------------------------


__________Original Message__________
From: help at teragrid.org
  To: Michael Wilde <wilde at mcs.anl.gov>
Subj: Re: Can not reach advertised gridftp server on Abe
  Cc: swift-devel <swift-devel at ci.uchicago.edu>, 
      Mike Kubal <mikekubal at yahoo.com>


FROM: Jackson, Weddie
(Concerning ticket No. 154784)
==============================
Hello Michael,

Our Grid Services Folks are working on the issue, in the meantime you can use 

  login-abe.ncsa.teragrid.org

instead of 

  gridftp-abe.ncsa.teragrid.org


and we will notify you once the issue has been resolved.


We appologize for any inconvenience.


Thanks,
-Weddie
------------------------
Weddie Jackson
NCSA Consulting Services
------------------------


Michael Wilde <help at teragrid.org> writes:
>Hi Help Team,
>
>When I try to reach the gridftp server on Abe advertised in the Userinfo 
>pages at, which is:
>
>   gridftp-abe.ncsa.teragrid.org
>
>on page:
>
>http://www.teragrid.org/userinfo/hardware/resources.php?
type=compute&select=single&id=50&PHPSESSID=2379360d3ce483f8f90532609354cd73
>
>then I get the following error:
>
># my cert is ok:
>
>login$ globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname
>honest4.ncsa.uiuc.edu
>
># this fails:
>
>login$ globus-url-copy file:///etc/passwd 
>gsiftp://gridftp-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci
>
>error: globus_ftp_client: the server responded with an error
>530 530-Login incorrect. : IPC connection failed.
>530-globus_xio_gsi: gss_init_sec_context failed.
>530-GSS Major Status: Unexpected Gatekeeper or Service Name
>530-globus_gsi_gssapi: Authorization denied: The name of the remote host 
>(abe-ipib-gw02.ncsa.uiuc.edu), and the expected name for the remote host 
>(abe-gw02) do not match. This happens when the name in the host 
>certificate does not match the information obtained from DNS and is 
>often a DNS configuration problem.
>530 End.
>
>
># while gridftp to the GRAM gatekeeper host works:
>
>login$ globus-url-copy file:///etc/passwd 
>gsiftp://grid-abe.ncsa.teragrid.org/cfs/scratch/users/wilde/fromci
>login$


From benc at hawaga.org.uk  Mon Apr 14 18:21:17 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 14 Apr 2008 23:21:17 +0000 (GMT)
Subject: [Swift-devel] hardlinks instead of copies on local file systems
Message-ID: <Pine.LNX.4.64.0804142312430.31934@dildano.hawaga.org.uk>


I hacked up a quick provider which uses unix hard links instead of copying 
in order to transfer files. This is a dirty hack to see if it has any 
performance improvements of copying, and lacks error handling. Most 
notably, Swift will fail in strange ways when: i) an output file already 
exists (other providers tend to overwrite) and ii) when the input data 
file is on a different file system (so hard links cannot work) to the site 
shared working directory.

To try this out:

i) untar http://www.ci.uchicago.edu/~benc/provider-ln-20080414.tar.gz into 
cog/modules/

ii) edit cog/modules/vdsk/dependencies.xml to include a new target 
provider-ln (like the existing karajan, provider-localscheduler and 
provider-dcache targets).

iii) ant redist  in vdsk/

iv) set your sites file to refer to provider-ln, like this:

  <pool handle="localhost">
    <filesystem  provider="ln" />
    <execution provider="local" />
    <workdirectory >/var/tmp</workdirectory>
  </pool>

v) fire!

I've tested this on my laptop. I haven't tested it on GPFS.

I deliberately use hard links rather than symlinks here:

 i) when hard linking, the new link is a first order reference to the 
file, just like the original. deleting the original link does not delete 
the file. this is important for stageout - the output file needs to stay 
on the file system, not be deleted with the site working directory.

 ii) symlinks require access to the original directory, whilst hardlinks 
go straight to the inode without indirecting via the original directory. 
this is probably important for GPFS scalability - it means there is one 
less filesystem object to interact with when opening the file.

-- 


From wilde at mcs.anl.gov  Tue Apr 15 00:27:25 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 15 Apr 2008 00:27:25 -0500
Subject: [Swift-devel] hardlinks instead of copies on local file systems
In-Reply-To: <Pine.LNX.4.64.0804142312430.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804142312430.31934@dildano.hawaga.org.uk>
Message-ID: <48043CBD.5080004@mcs.anl.gov>

Excellent! Hope to try later in the week.

- Mike


On 4/14/08 6:21 PM, Ben Clifford wrote:
> I hacked up a quick provider which uses unix hard links instead of copying 
> in order to transfer files. This is a dirty hack to see if it has any 
> performance improvements of copying, and lacks error handling. Most 
> notably, Swift will fail in strange ways when: i) an output file already 
> exists (other providers tend to overwrite) and ii) when the input data 
> file is on a different file system (so hard links cannot work) to the site 
> shared working directory.
> 
> To try this out:
> 
> i) untar http://www.ci.uchicago.edu/~benc/provider-ln-20080414.tar.gz into 
> cog/modules/
> 
> ii) edit cog/modules/vdsk/dependencies.xml to include a new target 
> provider-ln (like the existing karajan, provider-localscheduler and 
> provider-dcache targets).
> 
> iii) ant redist  in vdsk/
> 
> iv) set your sites file to refer to provider-ln, like this:
> 
>   <pool handle="localhost">
>     <filesystem  provider="ln" />
>     <execution provider="local" />
>     <workdirectory >/var/tmp</workdirectory>
>   </pool>
> 
> v) fire!
> 
> I've tested this on my laptop. I haven't tested it on GPFS.
> 
> I deliberately use hard links rather than symlinks here:
> 
>  i) when hard linking, the new link is a first order reference to the 
> file, just like the original. deleting the original link does not delete 
> the file. this is important for stageout - the output file needs to stay 
> on the file system, not be deleted with the site working directory.
> 
>  ii) symlinks require access to the original directory, whilst hardlinks 
> go straight to the inode without indirecting via the original directory. 
> this is probably important for GPFS scalability - it means there is one 
> less filesystem object to interact with when opening the file.
> 


From duxu at mcs.anl.gov  Tue Apr 15 08:47:48 2008
From: duxu at mcs.anl.gov (Xu Du)
Date: Tue, 15 Apr 2008 08:47:48 -0500
Subject: [Swift-devel] SWIFT INNOVATION FOR BOINC: prototype has been worked
	out
Message-ID: <001001c89eff$59890460$9a01a8c0@karen>

Dear Mike,

By hard working of last week, a prototype has been worked out, of course, we will continue to improve it.  

The following is the last weekly report. 

Any suggestion and comment are welcome. 

Regards,
Xu

--------------------------------------------------------------------------------------
              Weekly Report Mar.7-Apr.13

Done:
1. A prototype has been worked out. Up to now, applications can be dispatched from swift to BOINC normally, and the results can also be returned correctly after the jobs are computed by BOINC.

Issues
2. A lot parameters required by BOINC are not added at this moment, such as deadline, and resource consuming specification, etc. It seems that Swift is not very open to add additional parameters.

To Do:
1. Test and debug the system;
2. Update the design document;
3. Draft a document about how to write and modify providers.

From benc at hawaga.org.uk  Wed Apr 16 14:42:39 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 16 Apr 2008 19:42:39 +0000 (GMT)
Subject: [Swift-devel] Swift 0.5 released.
Message-ID: <Pine.LNX.4.64.0804161938410.31934@dildano.hawaga.org.uk>


Swift 0.5 is now available for download from
http://www.ci.uchicago.edu/swift/packages/vdsk-0.5.tar.gz

This is intended to address a number of bugs that were present in 0.4, 
most notably data channel reuse in GridFTP and a number of problems with 
recent compiler enhancements.

For more information about Swift, visit http://www.ci.uchicago.edu/swift/

-- 


From skenny at uchicago.edu  Thu Apr 17 11:17:20 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Thu, 17 Apr 2008 11:17:20 -0500 (CDT)
Subject: [Swift-devel] cleanup.sh
Message-ID: <20080417111720.BDP20122@m4500-02.uchicago.edu>

hey kids, i've got a simple little script for cleaning
up my project directory after a run and also committing log
files to ben's repository...mike i think you mentioned wanting
to include such a thing with swift (?) 

does anyone want this? if so where should i put it?

sarah


From mikekubal at yahoo.com  Thu Apr 17 11:56:47 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 17 Apr 2008 09:56:47 -0700 (PDT)
Subject: [Swift-devel] syntax for mapping files on remote resource
Message-ID: <98521.23636.qm@web52312.mail.re2.yahoo.com>

Hypothetically, what would the Swift syntax be if I
wanted to map all the files in a directory on the
remote resource, say 
/cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like,
into file fls[] ?

Thanks,

Mike


      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


From benc at hawaga.org.uk  Thu Apr 17 12:54:30 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 17 Apr 2008 17:54:30 +0000 (GMT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <98521.23636.qm@web52312.mail.re2.yahoo.com>
References: <98521.23636.qm@web52312.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0804171753110.645@dildano.hawaga.org.uk>


> Hypothetically, what would the Swift syntax be if I
> wanted to map all the files in a directory on the
> remote resource, say 
> /cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like,
> into file fls[] ?

try something like:

file fls[] 
<filesys_mapper;location="gsiftp://tg-gridftp.uc.teragrid.org/home/mkubal">

I think something like that should work, though I haven't tried it out as 
I'm working on not-Swift this week.

-- 


From wilde at mcs.anl.gov  Thu Apr 17 13:31:33 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 17 Apr 2008 13:31:33 -0500
Subject: [Swift-devel] Re: cleanup.sh
In-Reply-To: <20080417111720.BDP20122@m4500-02.uchicago.edu>
References: <20080417111720.BDP20122@m4500-02.uchicago.edu>
Message-ID: <48079785.900@mcs.anl.gov>

Hi Sarah,

Yes, everyone wants this!

For the moment, make a SwiftTools directory in the Swift SVN at the same 
level as the main directory, just like Ben's log tools.

As soon as the tools are "production ready" we should move them to 
swift/bin.

Keeping them in SwiftTools while theyre young will help us remind users 
that they are preliminary.

We should discuss how to document commands. For the moment, having the 
command emit a man-page-like --help note would be a good start.

- Mike

ps. I'll send you my old version of this function for you to peruse for 
additional things to capture.


On 4/17/08 11:17 AM, skenny at uchicago.edu wrote:
> hey kids, i've got a simple little script for cleaning
> up my project directory after a run and also committing log
> files to ben's repository...mike i think you mentioned wanting
> to include such a thing with swift (?) 
> 
> does anyone want this? if so where should i put it?
> 
> sarah
> 
> 


From mikekubal at yahoo.com  Thu Apr 17 13:51:38 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 17 Apr 2008 11:51:38 -0700 (PDT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <Pine.LNX.4.64.0804171753110.645@dildano.hawaga.org.uk>
Message-ID: <430068.70879.qm@web52306.mail.re2.yahoo.com>

Hi Ben,

Please take a look at the Test_FRED logs in
~mkubal/Swift_for_LigandAtlas on wiggum. I'm
attempting to run on NCSA's Abe.

I tried different variations on the syntax:

file
fls[]<filesys_mapper;location="gsiftp://grid-abe.ncsa.teragrid.org/cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500">;

I get the error:

java.lang.RuntimeException:
org.globus.cog.abstraction.impl.file.FileResourceException:
Could not get list of files in
cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500
from server
Caused by:
        Server refused performing the request. Custom
message:  (error code 1) [Nested exception message: 
Custom message: Unexpected reply: 451 refusing to
store with active mode
org.globus.ftp.exception.DataChannelException:
setPassive() must match store() and setActive() -
retrieve()  (error code 2)
org.globus.ftp.exception.DataChannelException:
setPassive() must match store() and setActive() -
retrieve()  (error code 2)

Thanks,

Mike

--- Ben Clifford <benc at hawaga.org.uk> wrote:

> 
> > Hypothetically, what would the Swift syntax be if
> I
> > wanted to map all the files in a directory on the
> > remote resource, say 
> >
>
/cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like,
> > into file fls[] ?
> 
> try something like:
> 
> file fls[] 
>
<filesys_mapper;location="gsiftp://tg-gridftp.uc.teragrid.org/home/mkubal">
> 
> I think something like that should work, though I
> haven't tried it out as 
> I'm working on not-Swift this week.
> 
> -- 
> 


      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


From benc at hawaga.org.uk  Thu Apr 17 13:55:10 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 17 Apr 2008 18:55:10 +0000 (GMT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <430068.70879.qm@web52306.mail.re2.yahoo.com>
References: <430068.70879.qm@web52306.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0804171852540.31934@dildano.hawaga.org.uk>


> setPassive() must match store() and setActive() -                             
> retrieve()  (error code 2)               

I've seen errors like that with some problems that were fixed with data 
channel reuse for file transfers recently (after 0.4 and before 0.5).

How recent/old is your Swift install? (paste the version line from the 
start of a run)

-- 


From mikekubal at yahoo.com  Thu Apr 17 14:04:39 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 17 Apr 2008 12:04:39 -0700 (PDT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <Pine.LNX.4.64.0804171852540.31934@dildano.hawaga.org.uk>
Message-ID: <929699.91539.qm@web52307.mail.re2.yahoo.com>

Swift svn swift-r1771 cog-r1936

--- Ben Clifford <benc at hawaga.org.uk> wrote:

> 
> > setPassive() must match store() and setActive() - 
>                            
> > retrieve()  (error code 2)               
> 
> I've seen errors like that with some problems that
> were fixed with data 
> channel reuse for file transfers recently (after 0.4
> and before 0.5).
> 
> How recent/old is your Swift install? (paste the
> version line from the 
> start of a run)
> 
> -- 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
>
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 


      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


From benc at hawaga.org.uk  Thu Apr 17 14:51:19 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 17 Apr 2008 19:51:19 +0000 (GMT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <929699.91539.qm@web52307.mail.re2.yahoo.com>
References: <929699.91539.qm@web52307.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0804171949030.31934@dildano.hawaga.org.uk>


On Thu, 17 Apr 2008, Mike Kubal wrote:

> Swift svn swift-r1771 cog-r1936

ok. You need cog at least r1956 to get the fixes in data channel reuse 
that stopped this error message in other situations. Can you get the 
latest swift and cog SVNs and see if those fix this.

-- 


From wilde at mcs.anl.gov  Thu Apr 17 14:53:46 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 17 Apr 2008 14:53:46 -0500
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <Pine.LNX.4.64.0804171949030.31934@dildano.hawaga.org.uk>
References: <929699.91539.qm@web52307.mail.re2.yahoo.com>
	<Pine.LNX.4.64.0804171949030.31934@dildano.hawaga.org.uk>
Message-ID: <4807AACA.7030105@mcs.anl.gov>

Mike, Ive been tied up, but I'll get this installed on Abe etc - 
tomorrow i hope. But feel free to push on ahead of me.

Mike

On 4/17/08 2:51 PM, Ben Clifford wrote:
> On Thu, 17 Apr 2008, Mike Kubal wrote:
> 
>> Swift svn swift-r1771 cog-r1936
> 
> ok. You need cog at least r1956 to get the fixes in data channel reuse 
> that stopped this error message in other situations. Can you get the 
> latest swift and cog SVNs and see if those fix this.
> 


From mikekubal at yahoo.com  Thu Apr 17 15:44:53 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 17 Apr 2008 13:44:53 -0700 (PDT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <Pine.LNX.4.64.0804171949030.31934@dildano.hawaga.org.uk>
Message-ID: <583991.50629.qm@web52312.mail.re2.yahoo.com>

I updated to Swift svn swift-r1791 cog-r1962

but I'm still getting the 

java.lang.RuntimeException:
org.globus.cog.abstraction.impl.file.FileResourceException:
Could not get list of files in
cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500
from server
Caused by:
        Server refused performing the request. Custom
message:  (error code 1) [Nested exception message: 
Custom message: Unexpected reply: 500-Command failed.
: System error in stat: No such file or directory

Thanks,

Mike
--- Ben Clifford <benc at hawaga.org.uk> wrote:

> 
> On Thu, 17 Apr 2008, Mike Kubal wrote:
> 
> > Swift svn swift-r1771 cog-r1936
> 
> ok. You need cog at least r1956 to get the fixes in
> data channel reuse 
> that stopped this error message in other situations.
> Can you get the 
> latest swift and cog SVNs and see if those fix this.
> 
> -- 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
>
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 


      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


From hategan at mcs.anl.gov  Thu Apr 17 15:48:29 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 17 Apr 2008 15:48:29 -0500
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <583991.50629.qm@web52312.mail.re2.yahoo.com>
References: <583991.50629.qm@web52312.mail.re2.yahoo.com>
Message-ID: <1208465309.9676.12.camel@localhost>

Right. You'll need a few more slashes after the host name:
grid-abe....org//cfs/...

or maybe even

grid-abe....org///cfs/....

Mihael

On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal wrote:
> I updated to Swift svn swift-r1791 cog-r1962
> 
> but I'm still getting the 
> 
> java.lang.RuntimeException:
> org.globus.cog.abstraction.impl.file.FileResourceException:
> Could not get list of files in
> cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500
> from server
> Caused by:
>         Server refused performing the request. Custom
> message:  (error code 1) [Nested exception message: 
> Custom message: Unexpected reply: 500-Command failed.
> : System error in stat: No such file or directory
> 
> Thanks,
> 
> Mike
> --- Ben Clifford <benc at hawaga.org.uk> wrote:
> 
> > 
> > On Thu, 17 Apr 2008, Mike Kubal wrote:
> > 
> > > Swift svn swift-r1771 cog-r1936
> > 
> > ok. You need cog at least r1956 to get the fixes in
> > data channel reuse 
> > that stopped this error message in other situations.
> > Can you get the 
> > latest swift and cog SVNs and see if those fix this.
> > 
> > -- 
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> >
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> > 
> 
> 
> 
>       ____________________________________________________________________________________
> Be a better friend, newshound, and 
> know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From benc at hawaga.org.uk  Thu Apr 17 15:49:22 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 17 Apr 2008 20:49:22 +0000 (GMT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <583991.50629.qm@web52312.mail.re2.yahoo.com>
References: <583991.50629.qm@web52312.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0804172048590.31934@dildano.hawaga.org.uk>


try gsiftp://hostname//cfs/scratch/whever with *two* / after the hostname 
and before cfs.

On Thu, 17 Apr 2008, Mike Kubal wrote:

> I updated to Swift svn swift-r1791 cog-r1962
> 
> but I'm still getting the 
> 
> java.lang.RuntimeException:
> org.globus.cog.abstraction.impl.file.FileResourceException:
> Could not get list of files in
> cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500
> from server
> Caused by:
>         Server refused performing the request. Custom
> message:  (error code 1) [Nested exception message: 
> Custom message: Unexpected reply: 500-Command failed.
> : System error in stat: No such file or directory
> 
> Thanks,
> 
> Mike
> --- Ben Clifford <benc at hawaga.org.uk> wrote:
> 
> > 
> > On Thu, 17 Apr 2008, Mike Kubal wrote:
> > 
> > > Swift svn swift-r1771 cog-r1936
> > 
> > ok. You need cog at least r1956 to get the fixes in
> > data channel reuse 
> > that stopped this error message in other situations.
> > Can you get the 
> > latest swift and cog SVNs and see if those fix this.
> > 
> > -- 
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> >
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> > 
> 
> 
> 
>       ____________________________________________________________________________________
> Be a better friend, newshound, and 
> know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> 
> 


From mikekubal at yahoo.com  Thu Apr 17 15:58:16 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 17 Apr 2008 13:58:16 -0700 (PDT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <1208465309.9676.12.camel@localhost>
Message-ID: <699514.86613.qm@web52309.mail.re2.yahoo.com>

I had tried //, so I tired /// but no luck, but at
least a different error with the latest cog and swift:

RunID: 20080417-1555-t9mx93ad
Execution failed:
        java.lang.RuntimeException:
java.lang.NullPointerException
Caused by:
        java.lang.NullPointerException ....


--- Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Right. You'll need a few more slashes after the host
> name:
> grid-abe....org//cfs/...
> 
> or maybe even
> 
> grid-abe....org///cfs/....
> 
> Mihael
> 
> On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal wrote:
> > I updated to Swift svn swift-r1791 cog-r1962
> > 
> > but I'm still getting the 
> > 
> > java.lang.RuntimeException:
> >
>
org.globus.cog.abstraction.impl.file.FileResourceException:
> > Could not get list of files in
> >
>
cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500
> > from server
> > Caused by:
> >         Server refused performing the request.
> Custom
> > message:  (error code 1) [Nested exception
> message: 
> > Custom message: Unexpected reply: 500-Command
> failed.
> > : System error in stat: No such file or directory
> > 
> > Thanks,
> > 
> > Mike
> > --- Ben Clifford <benc at hawaga.org.uk> wrote:
> > 
> > > 
> > > On Thu, 17 Apr 2008, Mike Kubal wrote:
> > > 
> > > > Swift svn swift-r1771 cog-r1936
> > > 
> > > ok. You need cog at least r1956 to get the fixes
> in
> > > data channel reuse 
> > > that stopped this error message in other
> situations.
> > > Can you get the 
> > > latest swift and cog SVNs and see if those fix
> this.
> > > 
> > > -- 
> > > 
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > >
> >
>
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > 
> > > 
> > 
> > 
> > 
> >      
>
____________________________________________________________________________________
> > Be a better friend, newshound, and 
> > know-it-all with Yahoo! Mobile.  Try it now. 
>
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> >
>
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> 
> 


      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


From benc at hawaga.org.uk  Thu Apr 17 15:53:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 17 Apr 2008 20:53:23 +0000 (GMT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <583991.50629.qm@web52312.mail.re2.yahoo.com>
References: <583991.50629.qm@web52312.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0804172052110.31934@dildano.hawaga.org.uk>


> but I'm still getting the 

different error, btw - before it was gridftp error code 451; now its 500. 

not sure which is better ;)

-- 


From hategan at mcs.anl.gov  Thu Apr 17 16:08:24 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 17 Apr 2008 16:08:24 -0500
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <699514.86613.qm@web52309.mail.re2.yahoo.com>
References: <699514.86613.qm@web52309.mail.re2.yahoo.com>
Message-ID: <1208466504.10990.0.camel@localhost>

The log file may have a full stack trace.

On Thu, 2008-04-17 at 13:58 -0700, Mike Kubal wrote:
> I had tried //, so I tired /// but no luck, but at
> least a different error with the latest cog and swift:
> 
> RunID: 20080417-1555-t9mx93ad
> Execution failed:
>         java.lang.RuntimeException:
> java.lang.NullPointerException
> Caused by:
>         java.lang.NullPointerException ....
> 
> 
> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > Right. You'll need a few more slashes after the host
> > name:
> > grid-abe....org//cfs/...
> > 
> > or maybe even
> > 
> > grid-abe....org///cfs/....
> > 
> > Mihael
> > 
> > On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal wrote:
> > > I updated to Swift svn swift-r1791 cog-r1962
> > > 
> > > but I'm still getting the 
> > > 
> > > java.lang.RuntimeException:
> > >
> >
> org.globus.cog.abstraction.impl.file.FileResourceException:
> > > Could not get list of files in
> > >
> >
> cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500
> > > from server
> > > Caused by:
> > >         Server refused performing the request.
> > Custom
> > > message:  (error code 1) [Nested exception
> > message: 
> > > Custom message: Unexpected reply: 500-Command
> > failed.
> > > : System error in stat: No such file or directory
> > > 
> > > Thanks,
> > > 
> > > Mike
> > > --- Ben Clifford <benc at hawaga.org.uk> wrote:
> > > 
> > > > 
> > > > On Thu, 17 Apr 2008, Mike Kubal wrote:
> > > > 
> > > > > Swift svn swift-r1771 cog-r1936
> > > > 
> > > > ok. You need cog at least r1956 to get the fixes
> > in
> > > > data channel reuse 
> > > > that stopped this error message in other
> > situations.
> > > > Can you get the 
> > > > latest swift and cog SVNs and see if those fix
> > this.
> > > > 
> > > > -- 
> > > > 
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > >
> > >
> >
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > >      
> >
> ____________________________________________________________________________________
> > > Be a better friend, newshound, and 
> > > know-it-all with Yahoo! Mobile.  Try it now. 
> >
> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > >
> >
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > 
> > 
> > 
> 
> 
> 
>       ____________________________________________________________________________________
> Be a better friend, newshound, and 
> know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> 


From mikekubal at yahoo.com  Thu Apr 17 16:08:54 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 17 Apr 2008 14:08:54 -0700 (PDT)
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <699514.86613.qm@web52309.mail.re2.yahoo.com>
Message-ID: <635441.69220.qm@web52312.mail.re2.yahoo.com>

It may be something with Abe's ftp server?

--- Mike Kubal <mikekubal at yahoo.com> wrote:

> I had tried //, so I tired /// but no luck, but at
> least a different error with the latest cog and
> swift:
> 
> RunID: 20080417-1555-t9mx93ad
> Execution failed:
>         java.lang.RuntimeException:
> java.lang.NullPointerException
> Caused by:
>         java.lang.NullPointerException ....
> 
> 
> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > Right. You'll need a few more slashes after the
> host
> > name:
> > grid-abe....org//cfs/...
> > 
> > or maybe even
> > 
> > grid-abe....org///cfs/....
> > 
> > Mihael
> > 
> > On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal
> wrote:
> > > I updated to Swift svn swift-r1791 cog-r1962
> > > 
> > > but I'm still getting the 
> > > 
> > > java.lang.RuntimeException:
> > >
> >
>
org.globus.cog.abstraction.impl.file.FileResourceException:
> > > Could not get list of files in
> > >
> >
>
cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500
> > > from server
> > > Caused by:
> > >         Server refused performing the request.
> > Custom
> > > message:  (error code 1) [Nested exception
> > message: 
> > > Custom message: Unexpected reply: 500-Command
> > failed.
> > > : System error in stat: No such file or
> directory
> > > 
> > > Thanks,
> > > 
> > > Mike
> > > --- Ben Clifford <benc at hawaga.org.uk> wrote:
> > > 
> > > > 
> > > > On Thu, 17 Apr 2008, Mike Kubal wrote:
> > > > 
> > > > > Swift svn swift-r1771 cog-r1936
> > > > 
> > > > ok. You need cog at least r1956 to get the
> fixes
> > in
> > > > data channel reuse 
> > > > that stopped this error message in other
> > situations.
> > > > Can you get the 
> > > > latest swift and cog SVNs and see if those fix
> > this.
> > > > 
> > > > -- 
> > > > 
> > > >
> _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > >
> > >
> >
>
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > >      
> >
>
____________________________________________________________________________________
> > > Be a better friend, newshound, and 
> > > know-it-all with Yahoo! Mobile.  Try it now. 
> >
>
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > >
> >
>
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > 
> > 
> > 
> 
> 
> 
>      
>
____________________________________________________________________________________
> Be a better friend, newshound, and 
> know-it-all with Yahoo! Mobile.  Try it now. 
>
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
>
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 


      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


From hategan at mcs.anl.gov  Thu Apr 17 16:14:46 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 17 Apr 2008 16:14:46 -0500
Subject: [Swift-devel] syntax for mapping files on remote resource
In-Reply-To: <635441.69220.qm@web52312.mail.re2.yahoo.com>
References: <635441.69220.qm@web52312.mail.re2.yahoo.com>
Message-ID: <1208466886.10990.10.camel@localhost>

Perhaps, but swift isn't right either in throwing a null pointer
exception.

On Thu, 2008-04-17 at 14:08 -0700, Mike Kubal wrote:
> It may be something with Abe's ftp server?
> 
> --- Mike Kubal <mikekubal at yahoo.com> wrote:
> 
> > I had tried //, so I tired /// but no luck, but at
> > least a different error with the latest cog and
> > swift:
> > 
> > RunID: 20080417-1555-t9mx93ad
> > Execution failed:
> >         java.lang.RuntimeException:
> > java.lang.NullPointerException
> > Caused by:
> >         java.lang.NullPointerException ....
> > 
> > 
> > --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > 
> > > Right. You'll need a few more slashes after the
> > host
> > > name:
> > > grid-abe....org//cfs/...
> > > 
> > > or maybe even
> > > 
> > > grid-abe....org///cfs/....
> > > 
> > > Mihael
> > > 
> > > On Thu, 2008-04-17 at 13:44 -0700, Mike Kubal
> > wrote:
> > > > I updated to Swift svn swift-r1791 cog-r1962
> > > > 
> > > > but I'm still getting the 
> > > > 
> > > > java.lang.RuntimeException:
> > > >
> > >
> >
> org.globus.cog.abstraction.impl.file.FileResourceException:
> > > > Could not get list of files in
> > > >
> > >
> >
> cfs/scratch/users/mkubal/Compound_Databases/Zinc_Drug_Like/zinc.drug_like.chunks3500
> > > > from server
> > > > Caused by:
> > > >         Server refused performing the request.
> > > Custom
> > > > message:  (error code 1) [Nested exception
> > > message: 
> > > > Custom message: Unexpected reply: 500-Command
> > > failed.
> > > > : System error in stat: No such file or
> > directory
> > > > 
> > > > Thanks,
> > > > 
> > > > Mike
> > > > --- Ben Clifford <benc at hawaga.org.uk> wrote:
> > > > 
> > > > > 
> > > > > On Thu, 17 Apr 2008, Mike Kubal wrote:
> > > > > 
> > > > > > Swift svn swift-r1771 cog-r1936
> > > > > 
> > > > > ok. You need cog at least r1956 to get the
> > fixes
> > > in
> > > > > data channel reuse 
> > > > > that stopped this error message in other
> > > situations.
> > > > > Can you get the 
> > > > > latest swift and cog SVNs and see if those fix
> > > this.
> > > > > 
> > > > > -- 
> > > > > 
> > > > >
> > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > >
> > > >
> > >
> >
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > >      
> > >
> >
> ____________________________________________________________________________________
> > > > Be a better friend, newshound, and 
> > > > know-it-all with Yahoo! Mobile.  Try it now. 
> > >
> >
> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > >
> > >
> >
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > 
> > > 
> > > 
> > 
> > 
> > 
> >      
> >
> ____________________________________________________________________________________
> > Be a better friend, newshound, and 
> > know-it-all with Yahoo! Mobile.  Try it now. 
> >
> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> >
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> > 
> 
> 
> 
>       ____________________________________________________________________________________
> Be a better friend, newshound, and 
> know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> 


From hategan at mcs.anl.gov  Fri Apr 18 15:57:16 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 18 Apr 2008 15:57:16 -0500
Subject: [Swift-devel] assignment
Message-ID: <1208552236.5064.6.camel@localhost>

We have to define what it means to make a mapped-var to mapped-var
assignment. And it should probably be a file copy.

Mihael


From benc at hawaga.org.uk  Fri Apr 18 16:24:49 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 18 Apr 2008 21:24:49 +0000 (GMT)
Subject: [Swift-devel] assignment
In-Reply-To: <1208552236.5064.6.camel@localhost>
References: <1208552236.5064.6.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804182124410.645@dildano.hawaga.org.uk>


On Fri, 18 Apr 2008, Mihael Hategan wrote:

> We have to define what it means to make a mapped-var to mapped-var
> assignment. And it should probably be a file copy.

yes.

-- 


From benc at hawaga.org.uk  Sat Apr 19 08:01:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 19 Apr 2008 13:01:46 +0000 (GMT)
Subject: [Swift-devel] r1793 restoring of sites.xml LRC
Message-ID: <Pine.LNX.4.64.0804191255500.31934@dildano.hawaga.org.uk>


r1793 puts back in the sites.xml element. I removed that deliberately as a 
config option that does nothing, has done nothing, and likely will not do 
anything either ever or for a long time.

Its displeasing to my sense of user interface aesthetics to have 
configuration options that (deliberately) do nothing; they take up 
documentation space (should anyone bother documenting them); lead to user 
confusion when a user experiments with changing it to no effect; they lead 
to cruft buildup in the code and in configuration files.

If this config option is going to stay, then I think it should at least 
print a warning indicating that it is ignored when specified rather than 
silently being ignored.

-- 


From hategan at mcs.anl.gov  Sat Apr 19 08:50:55 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 19 Apr 2008 08:50:55 -0500
Subject: [Swift-devel] Re: r1793 restoring of sites.xml LRC
In-Reply-To: <Pine.LNX.4.64.0804191255500.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804191255500.31934@dildano.hawaga.org.uk>
Message-ID: <1208613055.9907.1.camel@localhost>

I forgot we had it before.

The idea was to allow old sites.xml files to be used. I'll remove it.

On Sat, 2008-04-19 at 13:01 +0000, Ben Clifford wrote:
> r1793 puts back in the sites.xml element. I removed that deliberately as a 
> config option that does nothing, has done nothing, and likely will not do 
> anything either ever or for a long time.
> 
> Its displeasing to my sense of user interface aesthetics to have 
> configuration options that (deliberately) do nothing; they take up 
> documentation space (should anyone bother documenting them); lead to user 
> confusion when a user experiments with changing it to no effect; they lead 
> to cruft buildup in the code and in configuration files.
> 
> If this config option is going to stay, then I think it should at least 
> print a warning indicating that it is ignored when specified rather than 
> silently being ignored.
> 


From benc at hawaga.org.uk  Sun Apr 20 12:05:45 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 20 Apr 2008 17:05:45 +0000 (GMT)
Subject: [Swift-devel] CLASSPATH construction order
Message-ID: <Pine.LNX.4.64.0804201701000.31934@dildano.hawaga.org.uk>


The bin/swift wrapper currently constructs a classpath for swift 
automatically, as whatever is already on the classpath in the current 
environment followed by all of the swift classes.

This order seems to cause more problems than it solves - specifically, 
when there are overlapping classes specified in the environment (which I 
have seen with falkon and pegasus users, and potentially is also a problem 
for people with the Globus Toolkit installed).

I think it would be better to construct the classpath the other way round. 
This would remove the ability for people to override internal Swift 
classes by presetting classpath, which, in the above cases, happens 
accidentally and produces obscure errors.

If there is a desire to still be able to override swift classes from the 
environment, which I think there is not, then another Swift specific 
environment variable should be used for the prefix.

-- 


From hategan at mcs.anl.gov  Mon Apr 21 22:12:54 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 21 Apr 2008 22:12:54 -0500
Subject: [Swift-devel] swift 0.5 and gt2
Message-ID: <1208833974.9368.1.camel@localhost>

It may be that if you submit jobs through gt2 with swift 0.5 they may be
run with fork even though other job managers are specified.

This needs to be checked, but that's what my code shows.


From hategan at mcs.anl.gov  Mon Apr 21 22:22:31 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 21 Apr 2008 22:22:31 -0500
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <1208833974.9368.1.camel@localhost>
References: <1208833974.9368.1.camel@localhost>
Message-ID: <1208834551.9368.3.camel@localhost>

This is fixed in cog r1964.

On Mon, 2008-04-21 at 22:12 -0500, Mihael Hategan wrote:
> It may be that if you submit jobs through gt2 with swift 0.5 they may be
> run with fork even though other job managers are specified.
> 
> This needs to be checked, but that's what my code shows.
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From benc at hawaga.org.uk  Mon Apr 21 22:54:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 03:54:07 +0000 (GMT)
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <1208833974.9368.1.camel@localhost>
References: <1208833974.9368.1.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk>


On Mon, 21 Apr 2008, Mihael Hategan wrote:

> It may be that if you submit jobs through gt2 with swift 0.5 they may be
> run with fork even though other job managers are specified.
> 
> This needs to be checked, but that's what my code shows.

That's contrary to what the per-site testin in tests/sites/ shows. I just 
checked again with the 0.5 tarball using tests/sites/tgtacc-lsf-gram2.xml 
and I see jobs going into LSF there.

Did you use a jobmanager specification syntax that differs from the syntax 
used in that file? If so, what?

-- 


From hategan at mcs.anl.gov  Tue Apr 22 07:49:36 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 07:49:36 -0500
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk>
References: <1208833974.9368.1.camel@localhost>
	<Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk>
Message-ID: <1208868576.10512.3.camel@localhost>


On Tue, 2008-04-22 at 03:54 +0000, Ben Clifford wrote:
> 
> On Mon, 21 Apr 2008, Mihael Hategan wrote:
> 
> > It may be that if you submit jobs through gt2 with swift 0.5 they may be
> > run with fork even though other job managers are specified.
> > 
> > This needs to be checked, but that's what my code shows.
> 
> That's contrary to what the per-site testin in tests/sites/ shows. I just 
> checked again with the 0.5 tarball using tests/sites/tgtacc-lsf-gram2.xml 
> and I see jobs going into LSF there.
> 
> Did you use a jobmanager specification syntax that differs from the syntax 
> used in that file? If so, what?

The gt2 provider was using undocumented stuff to read the job manager
from a description. And that undocumented stuff has changed. Can it be
that LSF is the default there?

> 


From benc at hawaga.org.uk  Tue Apr 22 08:16:51 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 13:16:51 +0000 (GMT)
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <1208868576.10512.3.camel@localhost>
References: <1208833974.9368.1.camel@localhost> 
	<Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk>
	<1208868576.10512.3.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804221316020.31934@dildano.hawaga.org.uk>


On Tue, 22 Apr 2008, Mihael Hategan wrote:

> The gt2 provider was using undocumented stuff to read the job manager
> from a description. And that undocumented stuff has changed. Can it be
> that LSF is the default there?

If I use jobmanager-fii instead of jobmanager-lsf, I get this:

 Caused by:
         Cannot submit job
 Caused by:
         The gatekeeper failed to find the requested service


-- 


From hategan at mcs.anl.gov  Tue Apr 22 08:23:31 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 08:23:31 -0500
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <Pine.LNX.4.64.0804221316020.31934@dildano.hawaga.org.uk>
References: <1208833974.9368.1.camel@localhost>
	<Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk>
	<1208868576.10512.3.camel@localhost>
	<Pine.LNX.4.64.0804221316020.31934@dildano.hawaga.org.uk>
Message-ID: <1208870611.11068.0.camel@localhost>

But not in the url. I'm talking about using the jobManager attribute to
<execution> or <jobmanager>. If you put it in the url, it's fine.

On Tue, 2008-04-22 at 13:16 +0000, Ben Clifford wrote:
> On Tue, 22 Apr 2008, Mihael Hategan wrote:
> 
> > The gt2 provider was using undocumented stuff to read the job manager
> > from a description. And that undocumented stuff has changed. Can it be
> > that LSF is the default there?
> 
> If I use jobmanager-fii instead of jobmanager-lsf, I get this:
> 
>  Caused by:
>          Cannot submit job
>  Caused by:
>          The gatekeeper failed to find the requested service
> 
> 


From hategan at mcs.anl.gov  Tue Apr 22 08:37:55 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 08:37:55 -0500
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <1208870611.11068.0.camel@localhost>
References: <1208833974.9368.1.camel@localhost>
	<Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk>
	<1208868576.10512.3.camel@localhost>
	<Pine.LNX.4.64.0804221316020.31934@dildano.hawaga.org.uk>
	<1208870611.11068.0.camel@localhost>
Message-ID: <1208871475.11385.0.camel@localhost>

Maybe we should keep a "known issues" file for each version.

On Tue, 2008-04-22 at 08:23 -0500, Mihael Hategan wrote:
> But not in the url. I'm talking about using the jobManager attribute to
> <execution> or <jobmanager>. If you put it in the url, it's fine.
> 
> On Tue, 2008-04-22 at 13:16 +0000, Ben Clifford wrote:
> > On Tue, 22 Apr 2008, Mihael Hategan wrote:
> > 
> > > The gt2 provider was using undocumented stuff to read the job manager
> > > from a description. And that undocumented stuff has changed. Can it be
> > > that LSF is the default there?
> > 
> > If I use jobmanager-fii instead of jobmanager-lsf, I get this:
> > 
> >  Caused by:
> >          Cannot submit job
> >  Caused by:
> >          The gatekeeper failed to find the requested service
> > 
> > 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From benc at hawaga.org.uk  Tue Apr 22 09:12:22 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 14:12:22 +0000 (GMT)
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <1208870611.11068.0.camel@localhost>
References: <1208833974.9368.1.camel@localhost> 
	<Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk> 
	<1208868576.10512.3.camel@localhost>
	<Pine.LNX.4.64.0804221316020.31934@dildano.hawaga.org.uk>
	<1208870611.11068.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804221411080.31934@dildano.hawaga.org.uk>


On Tue, 22 Apr 2008, Mihael Hategan wrote:

> But not in the url. I'm talking about using the jobManager attribute to
> <execution> or <jobmanager>. If you put it in the url, it's fine.

ok. That is what I was trying to ascertain.

-- 


From hategan at mcs.anl.gov  Tue Apr 22 09:15:39 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 09:15:39 -0500
Subject: [Swift-devel] coasters
Message-ID: <1208873739.12384.7.camel@localhost>

I committed a preliminary coaster code to SVN. It's called
provider-coaster, and it works pretty much like any other provider, with
the following notes:

The job manager is made of 2 or 3 parts:
<coasting-provider>:<remote-provider>[:<remote-job-manager>].

So if you want to, say, start the service on teraport using gt2, then
start workers using gt4 on PBS, you'd say: jobManager="gt2:gt4:pbs". Or
if you wanted to start the service with ssh and then use the local (to
the service) pbs provider to start workers, it would be
jobManager="ssh:pbs".

It's missing a bunch of things. One of them is that the service, once
started, won't shut down by itself, so you should log into the machine
you started it on, and kill it. Another is a better strategy for
allocating workers than "as many as there are jobs". And so on...

org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main() contains some code to test this. Please be careful if running on a cluster: don't submit too many jobs.


From benc at hawaga.org.uk  Tue Apr 22 09:35:17 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 14:35:17 +0000 (GMT)
Subject: [Swift-devel] coasters
In-Reply-To: <1208873739.12384.7.camel@localhost>
References: <1208873739.12384.7.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk>


I tried this with swift, by adding a dependency on provider-coaster and
setting:

    <execution provider="coaster" jobmanager="local:local" />

This is on my os x laptop on UC wireless.

examples/first.swift fails like this:

Execution failed:
        Exception in echo:
Arguments: [Hello, world!]
Host: localhost
Directory: first-20080422-0932-v5v2jxa7/jobs/p/echo-p9ytklri
stderr.txt: 

stdout.txt: 

----

Caused by:
        Could not submit job
Caused by:
        Failed to start channel GSSC-https://:1984
Caused by:
        port out of range:-1


-- 


From hategan at mcs.anl.gov  Tue Apr 22 10:23:52 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 10:23:52 -0500
Subject: [Swift-devel] coasters
In-Reply-To: <Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk>
References: <1208873739.12384.7.camel@localhost>
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk>
Message-ID: <1208877832.13505.3.camel@localhost>

You need a valid IP in your cog.properties.

On Tue, 2008-04-22 at 14:35 +0000, Ben Clifford wrote:
> I tried this with swift, by adding a dependency on provider-coaster and
> setting:
> 
>     <execution provider="coaster" jobmanager="local:local" />
> 
> This is on my os x laptop on UC wireless.
> 
> examples/first.swift fails like this:
> 
> Execution failed:
>         Exception in echo:
> Arguments: [Hello, world!]
> Host: localhost
> Directory: first-20080422-0932-v5v2jxa7/jobs/p/echo-p9ytklri
> stderr.txt: 
> 
> stdout.txt: 
> 
> ----
> 
> Caused by:
>         Could not submit job
> Caused by:
>         Failed to start channel GSSC-https://:1984
> Caused by:
>         port out of range:-1
> 
> 


From benc at hawaga.org.uk  Tue Apr 22 10:41:52 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 15:41:52 +0000 (GMT)
Subject: [Swift-devel] coasters
In-Reply-To: <1208877832.13505.3.camel@localhost>
References: <1208873739.12384.7.camel@localhost> 
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk>
	<1208877832.13505.3.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>


On Tue, 22 Apr 2008, Mihael Hategan wrote:

> You need a valid IP in your cog.properties.

Is there a way for me to set that in a way that doesn't involve fiddling 
with files outside of my install/run tree?

(neither GLOBUS_HOSTNAME env or -ip.addr swift command line parameter 
changes the error)

-- 


From hategan at mcs.anl.gov  Tue Apr 22 10:46:14 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 10:46:14 -0500
Subject: [Swift-devel] coasters
In-Reply-To: <Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
References: <1208873739.12384.7.camel@localhost>
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk>
	<1208877832.13505.3.camel@localhost>
	<Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
Message-ID: <1208879174.13851.2.camel@localhost>

GLOBUS_HOSTNAME should work, so it might be another problem. Send log
file. Also, try invoking the integrated test
(org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main()).

On Tue, 2008-04-22 at 15:41 +0000, Ben Clifford wrote:
> 
> On Tue, 22 Apr 2008, Mihael Hategan wrote:
> 
> > You need a valid IP in your cog.properties.
> 
> Is there a way for me to set that in a way that doesn't involve fiddling 
> with files outside of my install/run tree?
> 
> (neither GLOBUS_HOSTNAME env or -ip.addr swift command line parameter 
> changes the error)
> 


From benc at hawaga.org.uk  Tue Apr 22 10:58:26 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 15:58:26 +0000 (GMT)
Subject: [Swift-devel] coasters
In-Reply-To: <1208879174.13851.2.camel@localhost>
References: <1208873739.12384.7.camel@localhost> 
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk> 
	<1208877832.13505.3.camel@localhost>
	<Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
	<1208879174.13851.2.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>

On Tue, 22 Apr 2008, Mihael Hategan wrote:

> Also, try invoking the integrated test
> (org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main()).

I transplated that class name into the swift wrapper script to get the 
environmental setup instead of running Loader:

$ diff swift coaster-test 
3c3
< EXEC=org.griphyn.vdl.karajan.Loader
---
> 
EXEC=org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler

and I get this error:

$ ./coaster-test 
Started local service: 128.135.199.187:50000
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
not submit job
        at 
org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:81)
        at 
org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submitTask(JobSubmissionTaskHandler.java:229)
        at 
org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main(JobSubmissionTaskHandler.java:238)
Caused by: 
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
not start coaster service
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:80)
        at 
org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:68)
        ... 2 more
Caused by: 
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
not find bootstrap script in classpath
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:163)
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:102)
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:74)
        ... 3 more
Caused by: 
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
not find bootstrap script in classpath
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.loadBootstrapScript(ServiceManager.java:175)
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:143)
        ... 5 more

-- 


From hategan at mcs.anl.gov  Tue Apr 22 11:01:20 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 11:01:20 -0500
Subject: [Swift-devel] coasters
In-Reply-To: <Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>
References: <1208873739.12384.7.camel@localhost>
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk>
	<1208877832.13505.3.camel@localhost>
	<Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
	<1208879174.13851.2.camel@localhost>
	<Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>
Message-ID: <1208880080.14131.0.camel@localhost>

Hmm. Right. Copy libexec/bootstrap.sh to resources/

On Tue, 2008-04-22 at 15:58 +0000, Ben Clifford wrote:
> On Tue, 22 Apr 2008, Mihael Hategan wrote:
> 
> > Also, try invoking the integrated test
> > (org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main()).
> 
> I transplated that class name into the swift wrapper script to get the 
> environmental setup instead of running Loader:
> 
> $ diff swift coaster-test 
> 3c3
> < EXEC=org.griphyn.vdl.karajan.Loader
> ---
> > 
> EXEC=org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler
> 
> and I get this error:
> 
> $ ./coaster-test 
> Started local service: 128.135.199.187:50000
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
> not submit job
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:81)
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submitTask(JobSubmissionTaskHandler.java:229)
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main(JobSubmissionTaskHandler.java:238)
> Caused by: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
> not start coaster service
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:80)
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:68)
>         ... 2 more
> Caused by: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
> not find bootstrap script in classpath
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:163)
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:102)
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:74)
>         ... 3 more
> Caused by: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
> not find bootstrap script in classpath
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.loadBootstrapScript(ServiceManager.java:175)
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:143)
>         ... 5 more
> 


From hategan at mcs.anl.gov  Tue Apr 22 11:04:08 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 11:04:08 -0500
Subject: [Swift-devel] coasters
In-Reply-To: <1208880080.14131.0.camel@localhost>
References: <1208873739.12384.7.camel@localhost>
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk>
	<1208877832.13505.3.camel@localhost>
	<Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
	<1208879174.13851.2.camel@localhost>
	<Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>
	<1208880080.14131.0.camel@localhost>
Message-ID: <1208880248.14131.2.camel@localhost>

Or update to 1970 and re-compile.

On Tue, 2008-04-22 at 11:01 -0500, Mihael Hategan wrote:
> Hmm. Right. Copy libexec/bootstrap.sh to resources/
> 
> On Tue, 2008-04-22 at 15:58 +0000, Ben Clifford wrote:
> > On Tue, 22 Apr 2008, Mihael Hategan wrote:
> > 
> > > Also, try invoking the integrated test
> > > (org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main()).
> > 
> > I transplated that class name into the swift wrapper script to get the 
> > environmental setup instead of running Loader:
> > 
> > $ diff swift coaster-test 
> > 3c3
> > < EXEC=org.griphyn.vdl.karajan.Loader
> > ---
> > > 
> > EXEC=org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler
> > 
> > and I get this error:
> > 
> > $ ./coaster-test 
> > Started local service: 128.135.199.187:50000
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
> > not submit job
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:81)
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submitTask(JobSubmissionTaskHandler.java:229)
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main(JobSubmissionTaskHandler.java:238)
> > Caused by: 
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
> > not start coaster service
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:80)
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:68)
> >         ... 2 more
> > Caused by: 
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
> > not find bootstrap script in classpath
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:163)
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:102)
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:74)
> >         ... 3 more
> > Caused by: 
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
> > not find bootstrap script in classpath
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.loadBootstrapScript(ServiceManager.java:175)
> >         at 
> > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.buildTask(ServiceManager.java:143)
> >         ... 5 more
> > 


From benc at hawaga.org.uk  Tue Apr 22 11:06:38 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 16:06:38 +0000 (GMT)
Subject: [Swift-devel] coasters
In-Reply-To: <Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>
References: <1208873739.12384.7.camel@localhost> 
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk> 
	<1208877832.13505.3.camel@localhost>
	<Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
	<1208879174.13851.2.camel@localhost>
	<Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804221606270.31934@dildano.hawaga.org.uk>

here is a log file for a swift run attempt:

http://www.ci.uchicago.edu/~benc/tmp/first-20080422-1104-491mdor6.log

-- 


From benc at hawaga.org.uk  Tue Apr 22 11:37:13 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 16:37:13 +0000 (GMT)
Subject: [Swift-devel] coasters
In-Reply-To: <1208880248.14131.2.camel@localhost>
References: <1208873739.12384.7.camel@localhost> 
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk> 
	<1208877832.13505.3.camel@localhost>
	<Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
	<1208879174.13851.2.camel@localhost>
	<Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>
	<1208880080.14131.0.camel@localhost>
	<1208880248.14131.2.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804221632210.31934@dildano.hawaga.org.uk>


On Tue, 22 Apr 2008, Mihael Hategan wrote:

> Or update to 1970 and re-compile.

so now that test class outputs the following, veeery slowly. And then 
seems to do sit and do nothing. On my laptop it doesn't seem to be using 
CPU, and on 128.135.125.118, tp-grid1, I see nothing in the PBS queue.
On tp-grid1, I see these java processes running:

 /soft/java-1.5.0_06-sun-r1/bin/java 
-Djava.home=/soft/java-1.5.0_06-sun-r1 
-DX509_USER_PROXY=/home/benc/.globus/job/tp-grid1.ci.uchicago.edu/7656.1208881472/x509_up 
-DGLOBUS_HOSTNAME=tp-grid1.ci.uchicago.
edu -jar bootstrap.lP7914 http://128.135.199.187:50001 
4da44a90a961d5f9f4965b1a8a2ce85e https://128.135.199.187:50000 357791723

 9835 ?        Sl     0:04 /soft/java-1.5.0_06-sun-r1/bin/java 
-DX509_USER_PROXY=/home/benc/.globus/job/tp-grid1.ci.uchicago.edu/7656.1208881472/x509_up 
-DGLOBUS_HOSTNAME=tp-grid1.ci.uchicago.edu -cp 
/home/benc/.globus/coasters/cac
he/cog-provider-coaster-0.1-1139af49204eed1884ffa46465f9704f.jar:/home/benc/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/benc/.globus/coasters/cache/cog-abstraction-common-2.2-a4301bae7
66fd4d64d0fefabb539e3f7.jar:/home/benc/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/benc/.globus/coasters/cache/cog-karajan-0.36-dev-b695c30b273bc90fb60b5b62764cdfe4.jar:/home/benc/.globu
s/coasters/cache/cog-provider-gt2-2.3-cd8fd68d5d520178a507723c4027885b.jar:/home/benc/.globus/coasters/cache/cog-provider-gt4_0_0-2.4-79fe87623a5d5a052d546a1735b09aad.jar:/home/benc/.globus/coasters/cache/cog-provider-local-2.1-9a4
1ac57fae7d518e5ae7fae894c457c.jar:/home/benc/.globus/coasters/cache/cog-provider-localscheduler-0.2-6cf4a8df6e05d1a0547de8a31c2eca7c.jar:/home/benc/.globus/coasters/cache/cog-provider-ssh-2.3-8d82acf0a5048350e7ef89119027890a.jar:/h
ome/benc/.globus/coasters/cache/cog-util-0.92-0e560de7e37434887f39f389de6fac57.jar:/home/benc/.globus/coasters/cache/commons-logging-1.1-6b62417e77b000a87de66ee3935edbf5.jar:/home/benc/.globus/coasters/cache/cryptix-asn1-87c4cf848c
81d102bd29e33681b80e8a.jar:/home/benc/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/benc/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/benc/.globus/coasters/cache/j2ssh-comm
on-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/benc/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/benc/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/benc/
.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/benc/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/benc/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.ja
r:/home/benc/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar 
org.globus.cog.abstraction.coaster.service.CoasterService 
https://128.135.199.187:50000 357791723


Neither of them have any child processes (and one is a child of the 
other).


$ ./coaster-test 
Started local service: 128.135.199.187:50000
Socket bound. URL is http://128.135.199.187:50001
[/128.135.125.118:35262]GET /coaster-bootstrap.jar HTTP/1.0
[/128.135.125.118:35329]GET /list HTTP/1.1
[/128.135.125.118:35334]GET /backport-util-concurrent.jar HTTP/1.1
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
not submit job
        at 
org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:81)
        at 
org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submitTask(JobSubmissionTaskHandler.java:229)
        at 
org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.main(JobSubmissionTaskHandler.java:238)
Caused by: 
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could 
not start coaster service
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:80)
        at 
org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:68)
        ... 2 more
Caused by: java.io.IOException: Timed out waiting for registration for 
357791723
        at 
org.globus.cog.abstraction.coaster.service.local.LocalService.waitForRegistration(LocalService.java:71)
        at 
org.globus.cog.abstraction.coaster.service.local.LocalService.waitForRegistration(LocalService.java:61)
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:104)
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:74)
        ... 3 more
[/128.135.125.118:35508]GET /cog-abstraction-common-2.2.jar HTTP/1.1
[/128.135.125.118:35578]GET /cog-jglobus-dev-080222.jar HTTP/1.1
[/128.135.125.118:35580]GET /cog-karajan-0.36-dev.jar HTTP/1.1
[/128.135.125.118:35583]GET /cog-provider-coaster-0.1.jar HTTP/1.1
[/128.135.125.118:35584]GET /cog-provider-gt2-2.3.jar HTTP/1.1
[/128.135.125.118:35586]GET /cog-provider-gt4_0_0-2.4.jar HTTP/1.1
[/128.135.125.118:35589]GET /cog-provider-local-2.1.jar HTTP/1.1
[/128.135.125.118:35593]GET /cog-provider-localscheduler-0.2.jar HTTP/1.1
[/128.135.125.118:35595]GET /cog-provider-ssh-2.3.jar HTTP/1.1
[/128.135.125.118:35596]GET /cog-util-0.92.jar HTTP/1.1
[/128.135.125.118:35597]GET /commons-logging-1.1.jar HTTP/1.1
[/128.135.125.118:35598]GET /cryptix-asn1.jar HTTP/1.1
[/128.135.125.118:35599]GET /cryptix.jar HTTP/1.1
[/128.135.125.118:35601]GET /cryptix32.jar HTTP/1.1
[/128.135.125.118:35605]GET /j2ssh-common-0.2.2.jar HTTP/1.1
[/128.135.125.118:35609]GET /j2ssh-core-0.2.2-patched.jar HTTP/1.1
[/128.135.125.118:35610]GET /jaxrpc.jar HTTP/1.1
[/128.135.125.118:35611]GET /jce-jdk13-131.jar HTTP/1.1
[/128.135.125.118:35612]GET /jgss.jar HTTP/1.1
[/128.135.125.118:35613]GET /log4j-1.2.8.jar HTTP/1.1
[/128.135.125.118:35617]GET /puretls.jar HTTP/1.1
nullChannel started
Channel id: -2c8466ac:11976f60889:-8000:-7cd08a9f:11976f5f16c:-8000
MetaChannel: 8254578 -> null.bind -> GSSC-null

-- 


From benc at hawaga.org.uk  Tue Apr 22 11:41:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Apr 2008 16:41:54 +0000 (GMT)
Subject: [Swift-devel] coasters
In-Reply-To: <Pine.LNX.4.64.0804221632210.31934@dildano.hawaga.org.uk>
References: <1208873739.12384.7.camel@localhost> 
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk> 
	<1208877832.13505.3.camel@localhost>
	<Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
	<1208879174.13851.2.camel@localhost>
	<Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>
	<1208880080.14131.0.camel@localhost>
	<1208880248.14131.2.camel@localhost>
	<Pine.LNX.4.64.0804221632210.31934@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804221641310.31934@dildano.hawaga.org.uk>

also with r1970, I get the same 'Failed to start channel 
GSSC-https://:1984' error.
-- 


From hategan at mcs.anl.gov  Tue Apr 22 13:13:09 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Apr 2008 13:13:09 -0500
Subject: [Swift-devel] coasters
In-Reply-To: <Pine.LNX.4.64.0804221641310.31934@dildano.hawaga.org.uk>
References: <1208873739.12384.7.camel@localhost>
	<Pine.LNX.4.64.0804221433370.645@dildano.hawaga.org.uk>
	<1208877832.13505.3.camel@localhost>
	<Pine.LNX.4.64.0804221540380.31934@dildano.hawaga.org.uk>
	<1208879174.13851.2.camel@localhost>
	<Pine.LNX.4.64.0804221557220.31934@dildano.hawaga.org.uk>
	<1208880080.14131.0.camel@localhost>
	<1208880248.14131.2.camel@localhost>
	<Pine.LNX.4.64.0804221632210.31934@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804221641310.31934@dildano.hawaga.org.uk>
Message-ID: <1208887989.17457.0.camel@localhost>

1971 is out.

On Tue, 2008-04-22 at 16:41 +0000, Ben Clifford wrote:
> also with r1970, I get the same 'Failed to start channel 
> GSSC-https://:1984' error.


From benc at hawaga.org.uk  Wed Apr 23 16:15:57 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 Apr 2008 21:15:57 +0000 (GMT)
Subject: [Swift-devel] making this list no longer require subscription
	approval
Message-ID: <Pine.LNX.4.64.0804232113530.645@dildano.hawaga.org.uk>

This list has traditionally required subscription approval; I'd like to 
make it so that admins do not have to approve people for subscription any 
more - it takes time and I don't see that there is any value gained by 
this. So I'm going to remove the requirement from the config for this 
list - subscriptions will then act like swift-user subscriptions do now.

-- 


From benc at hawaga.org.uk  Thu Apr 24 09:23:24 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 24 Apr 2008 14:23:24 +0000 (GMT)
Subject: [Swift-devel] today's coaster report
Message-ID: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>


Using cog r1977, the test coaster client will run jobs successfully but 
then issues a warning:

WARN   - Failed to shut down service https://127.0.0.1:55731

and a stack trace beginning:
org.globus.cog.karajan.workflow.service.ProtocolException: 
        at 
org.globus.cog.karajan.workflow.service.commands.Command.execute(Command.java:118)
        at 
org.globus.cog.abstraction.impl.execution.coaster.ServiceManager$ServiceReaper.run(ServiceManager.java:274)


(and indeed, something continues to listen on that port)

Also, GLOBUS_HOSTNAME appears to be respected, but GLOBUS_TCP_PORT_RANGE 
appears to not be in some cases, using a standard cog launcher around the 
test client - some stuff listens on a port in the port range (eg. the 
server where the jar files get downloaded from); but whatever service is 
referred to in:

> WARN   - Failed to shut down service https://127.0.0.1:55751

doesn't listen on that port range.

-- 


From benc at hawaga.org.uk  Thu Apr 24 09:42:33 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 24 Apr 2008 14:42:33 +0000 (GMT)
Subject: [Swift-devel] Re: today's coaster report
In-Reply-To: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804241441350.31934@dildano.hawaga.org.uk>

also, I had to change the path to md5sum that is hard coded in the bowels 
of the coaster code to point to the appropriate md5sum executable on my 
machine.

I have successfully run a job through swift into the coaster mechanism. 
hurrah.

-- 


From hategan at mcs.anl.gov  Thu Apr 24 10:05:23 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Apr 2008 10:05:23 -0500
Subject: [Swift-devel] today's coaster report
In-Reply-To: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>
Message-ID: <1209049523.18858.2.camel@localhost>


On Thu, 2008-04-24 at 14:23 +0000, Ben Clifford wrote:
> Using cog r1977, the test coaster client will run jobs successfully but 
> then issues a warning:
> 
> WARN   - Failed to shut down service https://127.0.0.1:55731
> 
> and a stack trace beginning:
> org.globus.cog.karajan.workflow.service.ProtocolException: 
>         at 
> org.globus.cog.karajan.workflow.service.commands.Command.execute(Command.java:118)
>         at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager$ServiceReaper.run(ServiceManager.java:274)
> 
> 
> 
> (and indeed, something continues to listen on that port)
> 
> Also, GLOBUS_HOSTNAME appears to be respected, but GLOBUS_TCP_PORT_RANGE 
> appears to not be in some cases, using a standard cog launcher around the 
> test client - some stuff listens on a port in the port range (eg. the 
> server where the jar files get downloaded from); but whatever service is 
> referred to in:
> 
> > WARN   - Failed to shut down service https://127.0.0.1:55751
> 
> doesn't listen on that port range.

Right. That's the remote service. The port range does not apply for it.

> 


From benc at hawaga.org.uk  Thu Apr 24 10:06:18 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 24 Apr 2008 15:06:18 +0000 (GMT)
Subject: [Swift-devel] Re: today's coaster report
In-Reply-To: <Pine.LNX.4.64.0804241441350.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804241441350.31934@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0804241505450.645@dildano.hawaga.org.uk>

though the lack of service shutdown makes my laptop happy when I run the 
97 language behaviour tests through coaster... kaboom!
-- 


From benc at hawaga.org.uk  Thu Apr 24 10:10:19 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 24 Apr 2008 15:10:19 +0000 (GMT)
Subject: [Swift-devel] today's coaster report
In-Reply-To: <1209049523.18858.2.camel@localhost>
References: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>
	<1209049523.18858.2.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804241508340.31934@dildano.hawaga.org.uk>


On Thu, 24 Apr 2008, Mihael Hategan wrote:

> Right. That's the remote service. The port range does not apply for it.

In real deployment, the remote service should probably be made to use the 
GLOBUS_TCP_PORT_RANGE that it inherits from the environment on the remote 
side; if the system administrator of the remote site has configured a 
global GLOBUS_TCP_PORT_RANGE in the environment of a job to suit that 
site's firewall configuration, then the remote service part of coasters 
should probably respect that.

-- 


From hategan at mcs.anl.gov  Thu Apr 24 10:14:15 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Apr 2008 10:14:15 -0500
Subject: [Swift-devel] today's coaster report
In-Reply-To: <Pine.LNX.4.64.0804241508340.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>
	<1209049523.18858.2.camel@localhost>
	<Pine.LNX.4.64.0804241508340.31934@dildano.hawaga.org.uk>
Message-ID: <1209050055.18858.12.camel@localhost>


On Thu, 2008-04-24 at 15:10 +0000, Ben Clifford wrote:
> On Thu, 24 Apr 2008, Mihael Hategan wrote:
> 
> > Right. That's the remote service. The port range does not apply for it.
> 
> In real deployment, the remote service should probably be made to use the 
> GLOBUS_TCP_PORT_RANGE that it inherits from the environment on the remote 
> side; if the system administrator of the remote site has configured a 
> global GLOBUS_TCP_PORT_RANGE in the environment of a job to suit that 
> site's firewall configuration, then the remote service part of coasters 
> should probably respect that.

Good point. I'll pass that on to the programmers.

> 


From hategan at mcs.anl.gov  Thu Apr 24 10:16:02 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Apr 2008 10:16:02 -0500
Subject: [Swift-devel] Re: today's coaster report
In-Reply-To: <Pine.LNX.4.64.0804241505450.645@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804241417040.31934@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804241441350.31934@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0804241505450.645@dildano.hawaga.org.uk>
Message-ID: <1209050162.18858.15.camel@localhost>


On Thu, 2008-04-24 at 15:06 +0000, Ben Clifford wrote:
> though the lack of service shutdown makes my laptop happy

happy or unhappy? I reckon 97*2 JVMs can't be a very good thing.

>  when I run the 
> 97 language behaviour tests through coaster... kaboom!


From bugzilla-daemon at mcs.anl.gov  Thu Apr 24 17:34:45 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 24 Apr 2008 17:34:45 -0500 (CDT)
Subject: [Swift-devel] [Bug 110] move OPTIONS out of swift executable
In-Reply-To: <bug-110-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080424223445.7986D164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=110


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from benc at hawaga.org.uk  2008-04-24 17:34 -------
documented as of r1803


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From hategan at mcs.anl.gov  Thu Apr 24 21:27:19 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Apr 2008 21:27:19 -0500
Subject: [Swift-devel] coasters update
Message-ID: <1209090439.4719.4.camel@localhost>

A bunch of fixes and updates went in. I was able to submit 512 jobs to
TGUC (using gt2+local pbs). There are occasional "job was killed because
it exceeded walltime" mails. When shutting down the service, running and
queued workers are killed, but it seems like some fall through the
cracks. I guess that part needs more work.


From benc at hawaga.org.uk  Fri Apr 25 06:23:47 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 25 Apr 2008 11:23:47 +0000 (GMT)
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <1208870611.11068.0.camel@localhost>
References: <1208833974.9368.1.camel@localhost> 
	<Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk> 
	<1208868576.10512.3.camel@localhost>
	<Pine.LNX.4.64.0804221316020.31934@dildano.hawaga.org.uk>
	<1208870611.11068.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804251121360.31934@dildano.hawaga.org.uk>


On Tue, 22 Apr 2008, Mihael Hategan wrote:

> But not in the url. I'm talking about using the jobManager attribute to
> <execution> or <jobmanager>. If you put it in the url, it's fine.

r1806 adds a second test of TGUC+gt2+pbs to the tests/sites/ directory, 
using <execute> and the jobmanager attribute; the other, already existing 
test, uses <jobmanager>

The jobmanager element doesn't take a jobmanager attribute.

-- 


From benc at hawaga.org.uk  Fri Apr 25 06:40:31 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 25 Apr 2008 11:40:31 +0000 (GMT)
Subject: [Swift-devel] swift 0.5 and gt2
In-Reply-To: <1208871475.11385.0.camel@localhost>
References: <1208833974.9368.1.camel@localhost> 
	<Pine.LNX.4.64.0804220352500.645@dildano.hawaga.org.uk> 
	<1208868576.10512.3.camel@localhost>
	<Pine.LNX.4.64.0804221316020.31934@dildano.hawaga.org.uk>
	<1208870611.11068.0.camel@localhost>
	<1208871475.11385.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804251139110.31934@dildano.hawaga.org.uk>


On Tue, 22 Apr 2008, Mihael Hategan wrote:

> Maybe we should keep a "known issues" file for each version.

r1807 puts a release notes file for 0.5 into SVN (the 0.4 release notes 
were kept as an on-webserver file) with this issue listed.

-- 


From mikekubal at yahoo.com  Fri Apr 25 10:08:23 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Fri, 25 Apr 2008 08:08:23 -0700 (PDT)
Subject: [Swift-devel] code to test for file existence in swift
Message-ID: <229543.28080.qm@web52304.mail.re2.yahoo.com>

Could someone suggest swift code for testing for the
existence of a file?

Thanks,

Mike


      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


From wilde at mcs.anl.gov  Fri Apr 25 10:19:34 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 25 Apr 2008 10:19:34 -0500
Subject: [Swift-devel] code to test for file existence in swift
In-Reply-To: <229543.28080.qm@web52304.mail.re2.yahoo.com>
References: <229543.28080.qm@web52304.mail.re2.yahoo.com>
Message-ID: <4811F686.8000504@mcs.anl.gov>

Mike, could you elaborate on the use case for this?

Do you want the swift code to execute a procedure only if a file exists?

Exists on the submit host?

One way is write a tiny shell script that returns 1 or True if the file 
exists and zero otherwise.  You'll need, I think, to use @extractint: 
the return value (from the test) in fact needs to go into a file, then 
extractint can return t/f based on the value of that file:

@extractint(file) will read the specified file, parse an integer from 
the file contents and return that integer.

if, switch or iterate could then be used to act on the 
exists/doesnt-exist condition.

I'm not sure if there's a more elegant way to do this. Depends a bit on 
your actual use case.

- Mike


On 4/25/08 10:08 AM, Mike Kubal wrote:
> Could someone suggest swift code for testing for the
> existence of a file?
> 
> Thanks,
> 
> Mike
> 
> 
>       ____________________________________________________________________________________
> Be a better friend, newshound, and 
> know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 


From hategan at mcs.anl.gov  Fri Apr 25 10:20:13 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 25 Apr 2008 10:20:13 -0500
Subject: [Swift-devel] code to test for file existence in swift
In-Reply-To: <229543.28080.qm@web52304.mail.re2.yahoo.com>
References: <229543.28080.qm@web52304.mail.re2.yahoo.com>
Message-ID: <1209136813.27187.1.camel@localhost>

Can you provide more details about the scenario?
In principle, no such facilities exist (or should exist) in swift.

Mihael

On Fri, 2008-04-25 at 08:08 -0700, Mike Kubal wrote:
> Could someone suggest swift code for testing for the
> existence of a file?
> 
> Thanks,
> 
> Mike
> 
> 
>       ____________________________________________________________________________________
> Be a better friend, newshound, and 
> know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From mikekubal at yahoo.com  Fri Apr 25 10:37:36 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Fri, 25 Apr 2008 08:37:36 -0700 (PDT)
Subject: [Swift-devel] code to test for file existence in swift
In-Reply-To: <1209136813.27187.1.camel@localhost>
Message-ID: <77710.52229.qm@web52309.mail.re2.yahoo.com>

I was running a job that did not finish due to my grid-proxy expiring (I should have set it longer in the first place) that iterates through 670 input files. I wanted to add some code to my swift script that would check to see if the corresponding output file existed so as not to resubmit when iterating through the input files.

if(results_file exist){ print("done already")}
else{run job(input_file)}

Mihael Hategan <hategan at mcs.anl.gov> wrote: Can you provide more details about the scenario?
In principle, no such facilities exist (or should exist) in swift.

Mihael

On Fri, 2008-04-25 at 08:08 -0700, Mike Kubal wrote:
> Could someone suggest swift code for testing for the
> existence of a file?
> 
> Thanks,
> 
> Mike
> 
> 
>       ____________________________________________________________________________________
> Be a better friend, newshound, and 
> know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 

_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


---------------------------------
Be a better friend, newshound, and know-it-all with Yahoo! Mobile.  Try it now.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080425/913fd0d0/attachment.html>

From benc at hawaga.org.uk  Fri Apr 25 10:45:41 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 25 Apr 2008 15:45:41 +0000 (GMT)
Subject: [Swift-devel] code to test for file existence in swift
In-Reply-To: <77710.52229.qm@web52309.mail.re2.yahoo.com>
References: <77710.52229.qm@web52309.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0804251544350.29823@dildano.hawaga.org.uk>


What you might want there is for bug 107 to get fixed, at which point 
there would be a restart mechanism that would do this sort of thing for 
you.

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107


On Fri, 25 Apr 2008, Mike Kubal wrote:

> I was running a job that did not finish due to my grid-proxy expiring (I should have set it longer in the first place) that iterates through 670 input files. I wanted to add some code to my swift script that would check to see if the corresponding output file existed so as not to resubmit when iterating through the input files.
> 
> if(results_file exist){ print("done already")}
> else{run job(input_file)}
> 
> Mihael Hategan <hategan at mcs.anl.gov> wrote: Can you provide more details about the scenario?
> In principle, no such facilities exist (or should exist) in swift.
> 
> Mihael
> 
> On Fri, 2008-04-25 at 08:08 -0700, Mike Kubal wrote:
> > Could someone suggest swift code for testing for the
> > existence of a file?
> > 
> > Thanks,
> > 
> > Mike
> > 
> > 
> >       ____________________________________________________________________________________
> > Be a better friend, newshound, and 
> > know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 
> 
>        
> ---------------------------------
> Be a better friend, newshound, and know-it-all with Yahoo! Mobile.  Try it now.


From wilde at mcs.anl.gov  Fri Apr 25 10:49:14 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 25 Apr 2008 10:49:14 -0500
Subject: [Swift-devel] code to test for file existence in swift
In-Reply-To: <77710.52229.qm@web52309.mail.re2.yahoo.com>
References: <77710.52229.qm@web52309.mail.re2.yahoo.com>
Message-ID: <4811FD7A.2000202@mcs.anl.gov>

Ah - this kind of situation is different: it should be covered by the 
Swift retry mechanism, I think. You should be able to restart the 
workflows after (most) failures and it should pick up where it left off, 
which is I think what you want in this case, right?

In other words, this kind of file existence checking should require no 
code. The state of the workflow - based on the file mapping that were 
done when the workflow still ran - should be retained in the Swift 
recovery log.

See:

http://www.ci.uchicago.edu/swift/guides/userguide.php#restart

At one point I think there was a bug in Swift recovery.  Mihael or Ben, 
do you know what the state of this feature is?

- Mike


On 4/25/08 10:37 AM, Mike Kubal wrote:
> I was running a job that did not finish due to my grid-proxy expiring (I 
> should have set it longer in the first place) that iterates through 670 
> input files. I wanted to add some code to my swift script that would 
> check to see if the corresponding output file existed so as not to 
> resubmit when iterating through the input files.
> 
> if(results_file exist){ print("done already")}
> else{run job(input_file)}
> 
> */Mihael Hategan <hategan at mcs.anl.gov>/* wrote:
> 
>     Can you provide more details about the scenario?
>     In principle, no such facilities exist (or should exist) in swift.
> 
>     Mihael
> 
>     On Fri, 2008-04-25 at 08:08 -0700, Mike Kubal wrote:
>      > Could someone suggest swift code for testing for the
>      > existence of a file?
>      >
>      > Thanks,
>      >
>      > Mike
>      >
>      >
>      >
>     ____________________________________________________________________________________
>      > Be a better friend, newshound, and
>      > know-it-all with Yahoo! Mobile. Try it now.
>     http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
>      > _______________________________________________
>      > Swift-devel mailing list
>      > Swift-devel at ci.uchicago.edu
>      > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>      >
> 
>     _______________________________________________
>     Swift-devel mailing list
>     Swift-devel at ci.uchicago.edu
>     http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 
> ------------------------------------------------------------------------
> Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try 
> it now. 
> <http://us.rd.yahoo.com/evt=51733/*http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ 
>  >
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Mon Apr 28 12:55:17 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 28 Apr 2008 12:55:17 -0500
Subject: [Swift-devel] data errors
Message-ID: <1209405317.28655.0.camel@localhost>

They just look ugly:

Execution failed:
	java.lang.RuntimeException: Data set initialization failed for
org.griphyn.vdl.mapping.RootDataNode identifier tag:benc
@ci.uchicago.edu,2008:swift:dataset:20080428-1253-1p0peih1:720000000006
with no value at dataset=f32 (closed). Missing r
equired field: g


From bugzilla-daemon at mcs.anl.gov  Mon Apr 28 14:58:49 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 28 Apr 2008 14:58:49 -0500 (CDT)
Subject: [Swift-devel] [Bug 107] restarts broken (by generalisation of data
	file handling)
In-Reply-To: <bug-107-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080428195849.2688F164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107


hategan at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


------- Comment #2 from hategan at mcs.anl.gov  2008-04-28 14:58 -------
This should be fixed in r1819.

It now saves symbolic swift variable names (relies on dbgname).

It will break if the swift script is re-compiled. In order to avoid that, I
would need the mappers to support forced file names.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Apr 28 18:48:09 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 28 Apr 2008 18:48:09 -0500 (CDT)
Subject: [Swift-devel] [Bug 107] restarts broken (by generalisation of data
	file handling)
In-Reply-To: <bug-107-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080428234809.BCA1E164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107


------- Comment #3 from benc at hawaga.org.uk  2008-04-28 18:48 -------
r1822 introduces a test for restarts


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From benc at hawaga.org.uk  Mon Apr 28 18:57:12 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 28 Apr 2008 23:57:12 +0000 (GMT)
Subject: [Swift-devel] code to test for file existence in swift
In-Reply-To: <77710.52229.qm@web52309.mail.re2.yahoo.com>
References: <77710.52229.qm@web52309.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0804282354100.29823@dildano.hawaga.org.uk>


> I wanted to add some code to my swift script that would 
> check to see if the corresponding output file existed so as not to 
> resubmit when iterating through the input files.

Restarts (at least to the extent that I have tested them, which is not in 
a huge depth) look like they work now (with recent cog and swift, after 
Mihael's changes today).

When a run fails, you will get a .rlog file for that run in pwd. You can 
restart by saying:

 $ swift -resume foo-99999.rlog foo.swift

-- 


From benc at hawaga.org.uk  Tue Apr 29 07:43:53 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Apr 2008 12:43:53 +0000 (GMT)
Subject: [Swift-devel] per-site scheduler parameters
Message-ID: <Pine.LNX.4.64.0804291236580.31934@dildano.hawaga.org.uk>


I implemented the following primarily for skenny and Xi Li.

Building swift with cog >=r1991 will give two new profile keys that can be 
used in the sites.xml site catalog:

  initialScore - this allows the initial score for a site to be set to 
something other than 0, so that initial job submission rate can be set 
higher without setting the throttle factor higher.

  jobThrottle - this behaves like the job throttle set in 
swift.properties, but affects only the site for which it is set. This 
should allow (for example) local execution to have a throttle of 0, GRAM2 
sites to have a throttle of 0.2, GRAM4 sites to have a throttle of 4 and 
Falkon to have the throttle off, defined in the site definitions for those 
sites rather than having to reconfigure it manually when switching between 
sites (the numbers here being my rough preferred values for each type of 
site).


-- 


From iraicu at cs.uchicago.edu  Tue Apr 29 11:24:10 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 29 Apr 2008 11:24:10 -0500
Subject: [Swift-devel] talk today's talk: Swift Innovation for BOINC
Message-ID: <48174BAA.7040005@cs.uchicago.edu>

Hi all,
Today's talk will be by Xu Du, and is titled "Swift Innovation for BOINC".

The talk abstract is:
Swift is a professional distributed computing platform which performs 
excellently at workflow control. But the previous swift system can only 
work with Grid sites. However, besides of Grid, volunteer distributed 
computing systems are also important computing resources. Among those 
volunteer distributed computing systems, BOINC is the most outstanding 
one. So it is significant that Swift can not only work with Grid but 
also work with BOINC. Swift Innovation for BOINC is such a project that 
innovates Swift so that it can work with BOINC. The main task of the 
project is to design and implement a ?BOINC provider? in SWIFT and a 
?SWIFT adapter? in BOINC, by which Swift can submit jobs to BOINC and 
get back result after the jobs are executed. Up to now, a prototype has 
been worked out. This talk will first introduce the design and implement 
of Swift Innovation for BOINC, and then demo the prototype.

See you at 4:30PM in RI405!

Cheers,
Ioan
http://dsl-wiki.cs.uchicago.edu/index.php/Wiki:ScheduleSpring08

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From benc at hawaga.org.uk  Tue Apr 29 18:39:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Apr 2008 23:39:23 +0000 (GMT)
Subject: [Swift-devel] coaster wget dependency
Message-ID: <Pine.LNX.4.64.0804292336470.31934@dildano.hawaga.org.uk>

in the excitement of a new OS install, I have discovered that coaster has 
a wget dependency somewhere on the service side. that is common but not 
always installed everywhere (especially in places that prefer curl...).

the swift+coaster docs (if/when they appear) should probably document 
additional dependencies like this (and like GNU md5sum) that are 
potentially not installed places.

-- 


From hategan at mcs.anl.gov  Tue Apr 29 18:42:07 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Apr 2008 18:42:07 -0500
Subject: [Swift-devel] coaster wget dependency
In-Reply-To: <Pine.LNX.4.64.0804292336470.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804292336470.31934@dildano.hawaga.org.uk>
Message-ID: <1209512527.30126.1.camel@localhost>


On Tue, 2008-04-29 at 23:39 +0000, Ben Clifford wrote:
> in the excitement of a new OS install, I have discovered that coaster has 
> a wget dependency somewhere on the service side. that is common but not 
> always installed everywhere (especially in places that prefer curl...).

The bootstrap script can be made to use curl instead of wget.

> 
> the swift+coaster docs (if/when they appear) should probably document 
> additional dependencies like this (and like GNU md5sum) that are 
> potentially not installed places.

It's unfortunate, but some means to download a jar and some means to do
and md5 sum must exist. If not, the service would need to be started
manually on the target site.

> 


From benc at hawaga.org.uk  Tue Apr 29 18:46:36 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Apr 2008 23:46:36 +0000 (GMT)
Subject: [Swift-devel] coaster wget dependency
In-Reply-To: <1209512527.30126.1.camel@localhost>
References: <Pine.LNX.4.64.0804292336470.31934@dildano.hawaga.org.uk>
	<1209512527.30126.1.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804292345440.31934@dildano.hawaga.org.uk>


> > the swift+coaster docs (if/when they appear) should probably document 
> > additional dependencies like this (and like GNU md5sum) that are 
> > potentially not installed places.
> 
> It's unfortunate, but some means to download a jar and some means to do
> and md5 sum must exist. If not, the service would need to be started
> manually on the target site.

right. its fine to have those as prerequisites, I think, but its nicer to 
discover them in a README than in a process of of "log into remote site, 
figure out where is log file, read error message"

-- 


From hategan at mcs.anl.gov  Tue Apr 29 20:13:31 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Apr 2008 20:13:31 -0500
Subject: [Swift-devel] coaster wget dependency
In-Reply-To: <Pine.LNX.4.64.0804292345440.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804292336470.31934@dildano.hawaga.org.uk>
	<1209512527.30126.1.camel@localhost>
	<Pine.LNX.4.64.0804292345440.31934@dildano.hawaga.org.uk>
Message-ID: <1209518011.30623.1.camel@localhost>


On Tue, 2008-04-29 at 23:46 +0000, Ben Clifford wrote:
> > > the swift+coaster docs (if/when they appear) should probably document 
> > > additional dependencies like this (and like GNU md5sum) that are 
> > > potentially not installed places.
> > 
> > It's unfortunate, but some means to download a jar and some means to do
> > and md5 sum must exist. If not, the service would need to be started
> > manually on the target site.
> 
> right. its fine to have those as prerequisites, I think, but its nicer to 
> discover them in a README than in a process of of "log into remote site, 
> figure out where is log file, read error message"

Or... it could be an error message saying "no wget or curl were found".
As long as we keep that script below the max argv character length of
the various pieces involved.

> 


From benc at hawaga.org.uk  Tue Apr 29 21:20:06 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Apr 2008 02:20:06 +0000 (GMT)
Subject: [Swift-devel] coaster->tg-ncsa
Message-ID: <Pine.LNX.4.64.0804300216560.31934@dildano.hawaga.org.uk>


I just tried to run coasters on TG NCSA (my first attempt to run them 
outside of my laptop). I tried with gt2 and with gt4 to submit:

I put the two sites files in tests/sites/coaster/ - the two filenames 
should be apparent upon observing that directory with ls.

Using gt2 and pbs, I get this error:

Caused by:
        Could not submit job
Caused by:
        Could not start coaster service
Caused by:
        Cannot parse the given RSL
Caused by:
        Problems while creating a Gass Server
Caused by:
        Could not determine this host's IP address. Please set an IP 
address in cog.properties

Its not clear from that message which host 'this host' is. There are 
potentially at least three involved. A hostname or some other useful 
information might be nice there.

For the attempt with gt4, I get this failure:

Caused by:
        Could not submit job
Caused by:
        Could not start coaster service
Caused by:
        The gt4.0.0 provider does not support redirection

-- 


From hategan at mcs.anl.gov  Tue Apr 29 21:33:58 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Apr 2008 21:33:58 -0500
Subject: [Swift-devel] Re: coaster->tg-ncsa
In-Reply-To: <Pine.LNX.4.64.0804300216560.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804300216560.31934@dildano.hawaga.org.uk>
Message-ID: <1209522838.702.8.camel@localhost>


On Wed, 2008-04-30 at 02:20 +0000, Ben Clifford wrote:
> I just tried to run coasters on TG NCSA (my first attempt to run them 
> outside of my laptop). I tried with gt2 and with gt4 to submit:
> 
> I put the two sites files in tests/sites/coaster/ - the two filenames 
> should be apparent upon observing that directory with ls.
> 
> Using gt2 and pbs, I get this error:
> 
> Caused by:
>         Could not submit job
> Caused by:
>         Could not start coaster service
> Caused by:
>         Cannot parse the given RSL
> Caused by:
>         Problems while creating a Gass Server
> Caused by:
>         Could not determine this host's IP address. Please set an IP 
> address in cog.properties
> 
> Its not clear from that message which host 'this host' is.

Uhm... yes... it's 127.0.0.1 :)

Although given that it's not a remote exception, it looks like your
machine.

>  There are 
> potentially at least three involved. A hostname or some other useful 
> information might be nice there.
> 
> For the attempt with gt4, I get this failure:
> 
> Caused by:
>         Could not submit job
> Caused by:
>         Could not start coaster service
> Caused by:
>         The gt4.0.0 provider does not support redirection

Grr. Right. I removed the redirections (r1992). Unfortunately it is now
harder to troubleshoot. Maybe the ws-gram provider should only print a
warning if redirection is requested instead of failing.

> 


From benc at hawaga.org.uk  Tue Apr 29 21:47:58 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Apr 2008 02:47:58 +0000 (GMT)
Subject: [Swift-devel] Re: coaster->tg-ncsa
In-Reply-To: <1209522838.702.8.camel@localhost>
References: <Pine.LNX.4.64.0804300216560.31934@dildano.hawaga.org.uk>
	<1209522838.702.8.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804300245070.31934@dildano.hawaga.org.uk>

I set GLOBUS_HOSTNAME and ran with gram2 again. It got further... sits and 
waits for quite a long time and then the below error (because I 
deliberately don't have a default project on teragrid sites, instead 
setting it via a profile entry in the appropriate sites file to whichever 
of the three I'm in)

Caused by:


ERROR: you must either specify an account (project=xxx) or log in to
       the system to set a default account. Here are your accounts:

NCSA                                   Days     Service Units   Avail for
Proj  TG Account    Type     Status  Remaining    Remaining     Batch Jobs
----  ----------    ----     ------  ---------  -------------   ---------
bdd  TG-CDA070002   prb      Active        1         28387         yes
brn  TG-CCR080001   prb      Active      155         28970         yes
brv  TG-CCR080002   prb      Active      155         29559         yes

 *** Job not submitted ***

null
org.globus.gram.GramException: The job failed when the job manager 
attempted to run it
        at 
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:476)
        at org.globus.gram.GramJob.setStatus(GramJob.java:184)
        at 
org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
        at java.lang.Thread.run(Thread.java:534)

-- 


From hategan at mcs.anl.gov  Tue Apr 29 21:51:01 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Apr 2008 21:51:01 -0500
Subject: [Swift-devel] Re: coaster->tg-ncsa
In-Reply-To: <Pine.LNX.4.64.0804300245070.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804300216560.31934@dildano.hawaga.org.uk>
	<1209522838.702.8.camel@localhost>
	<Pine.LNX.4.64.0804300245070.31934@dildano.hawaga.org.uk>
Message-ID: <1209523861.1278.0.camel@localhost>

Odd. Attributes should be copied from the original task. I'll look into
that.

On Wed, 2008-04-30 at 02:47 +0000, Ben Clifford wrote:
> I set GLOBUS_HOSTNAME and ran with gram2 again. It got further... sits and 
> waits for quite a long time and then the below error (because I 
> deliberately don't have a default project on teragrid sites, instead 
> setting it via a profile entry in the appropriate sites file to whichever 
> of the three I'm in)
> 
> Caused by:
> 
> 
> ERROR: you must either specify an account (project=xxx) or log in to
>        the system to set a default account. Here are your accounts:
> 
> NCSA                                   Days     Service Units   Avail for
> Proj  TG Account    Type     Status  Remaining    Remaining     Batch Jobs
> ----  ----------    ----     ------  ---------  -------------   ---------
> bdd  TG-CDA070002   prb      Active        1         28387         yes
> brn  TG-CCR080001   prb      Active      155         28970         yes
> brv  TG-CCR080002   prb      Active      155         29559         yes
> 
>  *** Job not submitted ***
> 
> null
> org.globus.gram.GramException: The job failed when the job manager 
> attempted to run it
>         at 
> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:476)
>         at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>         at 
> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>         at java.lang.Thread.run(Thread.java:534)
> 


From benc at hawaga.org.uk  Tue Apr 29 21:55:17 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Apr 2008 02:55:17 +0000 (GMT)
Subject: [Swift-devel] Re: coaster->tg-ncsa
In-Reply-To: <1209523861.1278.0.camel@localhost>
References: <Pine.LNX.4.64.0804300216560.31934@dildano.hawaga.org.uk> 
	<1209522838.702.8.camel@localhost>
	<Pine.LNX.4.64.0804300245070.31934@dildano.hawaga.org.uk>
	<1209523861.1278.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0804300254030.31934@dildano.hawaga.org.uk>


On Tue, 29 Apr 2008, Mihael Hategan wrote:

> Odd. Attributes should be copied from the original task. I'll look into
> that.

also, this using <execution provider="coaster" 
url="grid-hg.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />, which I 
modified from SVN to use gt2:gt2:pbs rather than gt2:pbs as is in SVN now.

If properties should be copied across then I guess either of those should 
have worked if the project key is propagated.

-- 


From hategan at mcs.anl.gov  Tue Apr 29 21:55:49 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Apr 2008 21:55:49 -0500
Subject: [Swift-devel] Re: coaster->tg-ncsa
In-Reply-To: <Pine.LNX.4.64.0804300254030.31934@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0804300216560.31934@dildano.hawaga.org.uk>
	<1209522838.702.8.camel@localhost>
	<Pine.LNX.4.64.0804300245070.31934@dildano.hawaga.org.uk>
	<1209523861.1278.0.camel@localhost>
	<Pine.LNX.4.64.0804300254030.31934@dildano.hawaga.org.uk>
Message-ID: <1209524149.1377.0.camel@localhost>


On Wed, 2008-04-30 at 02:55 +0000, Ben Clifford wrote:
> On Tue, 29 Apr 2008, Mihael Hategan wrote:
> 
> > Odd. Attributes should be copied from the original task. I'll look into
> > that.
> 
> also, this using <execution provider="coaster" 
> url="grid-hg.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />, which I 
> modified from SVN to use gt2:gt2:pbs rather than gt2:pbs as is in SVN now.
> 
> If properties should be copied across then I guess either of those should 
> have worked if the project key is propagated.

Right. So it must mean they are not.

>