From wilde at mcs.anl.gov  Tue Apr  1 10:56:28 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 01 Apr 2008 10:56:28 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
	continue on failure?
In-Reply-To: <Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
Message-ID: <47F25B2C.4090005@mcs.anl.gov>

Ben, thanks - these patches sound great. Can the use of /tmp be 
controlled by a property, ideally on a per-application basis in tc.data, 
and these changes committed to svn?

Seems like wrapper-tmp-log-locally could be done for all apps as the 
default, and only turned off for certain debugging scenarios.
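
A minimal sketch of what such a switch might look like. Everything named
here is hypothetical: SWIFT_WRAPPER_TMP is not an existing Swift property,
and run_job and the directory layout are not taken from the real wrapper.sh
or from the patches. It only illustrates the three ideas together (scratch
directory under a configurable root, cp instead of ln -s, log written
locally and copied back at the end):

```shell
#!/bin/sh
# Illustrative only: SWIFT_WRAPPER_TMP, run_job and the layout below are
# assumptions, not Swift's actual wrapper.sh.

# run_job SHARED_DIR JOB_ID EXECUTABLE [ARGS...]
run_job() {
    shared=$1; jobid=$2; exe=$3; shift 3

    # Root for per-job scratch space; defaults to node-local /tmp.
    tmproot=${SWIFT_WRAPPER_TMP:-/tmp}
    jobdir=$tmproot/$jobid
    mkdir -p "$jobdir" "$shared/logs" || return 1

    # Copy the executable to the worker instead of symlinking it.
    cp "$exe" "$jobdir/app" && chmod +x "$jobdir/app" || return 1

    # Write the log on local disk while the job runs.
    log=$jobdir/wrapper.log
    echo "start $(date +%s)" > "$log"
    ( cd "$jobdir" && ./app "$@" ) >> "$log" 2>&1
    rc=$?
    echo "end $(date +%s) rc=$rc" >> "$log"

    # Copy the log back to the shared filesystem once, at the end.
    cp "$log" "$shared/logs/$jobid.log"
    rm -rf "$jobdir"
    return $rc
}
```

Setting the variable per application (for example from a tc.data profile)
would then be a matter of exporting it before the wrapper runs; how that
would be wired into Swift is left open here.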

Can you do application caching as well, in a general manner?

We'll measure over the next few days and report back.

- Mike


On 3/31/08 2:34 AM, Ben Clifford wrote:
> On Mon, 31 Mar 2008, Ben Clifford wrote:
> 
>> This temporary directory handling is pretty ugly - it should be a couple 
>> lines change to wrapper.sh to get similar functionality using the existing 
>> swift temporary directory handling - change the path to /tmp and use cp 
>> instead of ln -s. That way you can take advantage of Swift's existing 
>> unique job IDs and error handling too.
> 
> Attached are three patches that will apply against svn r1775:
> 
> The first puts temporary directories in /tmp rather than on shared fs.
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp
> 
> The second copies the application file to the worker in each job execution 
> (though doesn't do any worker-node caching of such between jobs)
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable
> 
> The third creates the worker node log on /tmp and copies it at the end.
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally
> 
> The three modify all wrapper.sh and should be applied in the above order.
> 
> With the first two patches, the timestamps in the usual info logs will 
> provide information about how long the copies take, in the same way that 
> they usually indicate times for other execution stages.
> 


From benc at hawaga.org.uk  Tue Apr  1 20:33:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Apr 2008 01:33:27 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
 continue on failure?
In-Reply-To: <47F25B2C.4090005@mcs.anl.gov>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F25B2C.4090005@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0804020126310.9854@dildano.hawaga.org.uk>


On Tue, 1 Apr 2008, Michael Wilde wrote:

> Can you do application caching as well, in a general manner?

Applications 'in general' consist of a lot more than their base executable 
- even echo hello world seems to attempt to read 9 different files on my 
Linux box. So, 'no'.
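
A rough way to see this on your own machine, assuming Linux with strace
installed. count_opens is a hypothetical helper name; the count varies
between systems and includes open attempts that fail:

```shell
#!/bin/sh
# count_opens: count open()/openat() lines in strace output read from stdin.
# (Hypothetical helper; it simply counts matching lines.)
count_opens() {
    grep -cE '^(open|openat)\('
}

# "trace=file" limits the trace to system calls that take a file name.
# strace writes its trace to stderr, so send stderr into the pipe and
# discard the traced command's own stdout.
if command -v strace >/dev/null 2>&1; then
    strace -e trace=file echo hello world 2>&1 >/dev/null | count_opens || true
fi
```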

-- 


From iraicu at cs.uchicago.edu  Tue Apr  1 21:14:37 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 01 Apr 2008 21:14:37 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
	continue on failure?
In-Reply-To: <Pine.LNX.4.64.0804020126310.9854@dildano.hawaga.org.uk>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F25B2C.4090005@mcs.anl.gov>
	<Pine.LNX.4.64.0804020126310.9854@dildano.hawaga.org.uk>
Message-ID: <47F2EC0D.3040704@cs.uchicago.edu>

If the applications are statically compiled, is the problem more tractable?

Ioan

Ben Clifford wrote:
> On Tue, 1 Apr 2008, Michael Wilde wrote:
>
>   
>> Can you do application caching as well, in a general manner?
>>     
>
> Applications 'in general' consist of a lot more than their base executable 
> - even echo hello world seems to attempt to read 9 different files on my 
> Linux box. so 'no'.
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================



From iraicu at cs.uchicago.edu  Wed Apr  2 15:17:17 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 02 Apr 2008 15:17:17 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
	continue on failure?
In-Reply-To: <Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
Message-ID: <47F3E9CD.9090507@cs.uchicago.edu>

Hi Ben,
Thanks again for the patches; they made a huge difference, increasing 
efficiency from 21% to 81%!

Here are the numbers:

Per-task time (s)     1 Node Perf    Falkon      Swift+Falkon    Swift+Falkon (patched)
Min                   63.618         53.782      169.139         58.538
Average               64.76          65.47253    309.1945        80.21246
Median                64.74072       64.774      313.5535        76.5245
Max                   65.863         94.447      605.654         115.237
Standard Deviation    0.488984       3.863944    52.13821        10.95652
Efficiency            100%           99%         21%             81%


The first column shows the per-task statistics when running on 1 node (4 
CPUs) through Falkon.  The second column shows the statistics for running 
the application at large scale, on 2048 CPUs.  The third column is 
Swift+Falkon (both from SVN) on 256 CPUs.  The fourth column is 
Swift+Falkon with the 3 patches applied to Swift.  Essentially, the 
average per-task execution time dropped from 309 seconds to 80 seconds, 
where the ideal would have been 64 seconds.  That brought the efficiency 
from 21% to 81% for this particular workload.  This looks fantastic! 
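
The efficiency figures are consistent with dividing the 1-node average by
each configuration's average. That formula is an assumption, checked only
against the averages in the table rather than taken from a stated
definition. A quick sketch (efficiency is a hypothetical helper name):

```shell
#!/bin/sh
# efficiency IDEAL OBSERVED: print IDEAL/OBSERVED as a whole-number percentage.
# Assumed definition: ideal per-task time divided by observed per-task time.
efficiency() {
    awk -v ideal="$1" -v observed="$2" \
        'BEGIN { printf "%.0f%%\n", 100 * ideal / observed }'
}

efficiency 64.76 65.47253    # Falkon, 2048 CPUs
efficiency 64.76 309.1945    # Swift+Falkon, unpatched
efficiency 64.76 80.21246    # Swift+Falkon, patched
```

With the averages from the table this prints 99%, 21% and 81%, matching
the Efficiency row.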

We'll have to verify that we can maintain this 81% efficiency at higher 
numbers of CPUs.  In the meantime, if you can think of anything else that 
we could do to keep pushing the 81% efficiency number higher, let us know.

Thanks again,
Ioan

Ben Clifford wrote:
> On Mon, 31 Mar 2008, Ben Clifford wrote:
>
>   
>> This temporary directory handling is pretty ugly - it should be a couple 
>> lines change to wrapper.sh to get similar functionality using the existing 
>> swift temporary directory handling - change the path to /tmp and use cp 
>> instead of ln -s. That way you can take advantage of Swift's existing 
>> unique job IDs and error handling too.
>>     
>
> Attached are three patches that will apply against svn r1775:
>
> The first puts temporary directories in /tmp rather than on shared fs.
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp
>
> The second copies the application file to the worker in each job execution 
> (though doesn't do any worker-node caching of such between jobs)
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable
>
> The third creates the worker node log on /tmp and copies it at the end.
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally
>
> The three modify all wrapper.sh and should be applied in the above order.
>
> With the first two patches, the timestamps in the usual info logs will 
> provide information about how long the copies take, in the same way that 
> they usually indicate times for other execution stages.
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================



From benc at hawaga.org.uk  Wed Apr  2 15:36:18 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Apr 2008 20:36:18 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
 continue on failure?
In-Reply-To: <47F3E9CD.9090507@cs.uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F3E9CD.9090507@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804022035560.9854@dildano.hawaga.org.uk>

Any chance you can test the patches separately to see how much each 
contributes to this change?

On Wed, 2 Apr 2008, Ioan Raicu wrote:

> Hi Ben,
> Thanks again for the patches, they made a huge difference, increased
> efficiency from 21% to 81%!
> 
> Here are the numbers:
> 
> 	1 Node Perf 	Falkon 	Swift+Falkon 	Swift+Falkon (patched)
> Min 	63.618 	53.782 	169.139 	58.538
> Average 	64.76 	65.47253 	309.1945 	80.21246
> Median 	64.74072 	64.774 	313.5535 	76.5245
> Max 	65.863 	94.447 	605.654 	115.237
> Standard Deviation 	0.488984 	3.863944 	52.13821 	10.95652
> Efficiency 	100% 	99% 	21% 	81%
> 
> 
> The first column shows the per task statistic when running on 1 node (4 CPUs)
> through Falkon.  The second column are the statistics for running the
> application at large scale, on 2048 CPUs.  The 3rd column is running
> Swift+Falkon (both from SVN) on 256 CPUs.  The 4th column is Swift+Falkon, but
> Swift has the 3 patches applied.  Essentially, the per task execution time was
> reduced from 309 seconds to 80 seconds, where the ideal would have been 64
> seconds.  It brought the efficiency from 21% to 81% for this particular
> workload.  This looks fantastic! 
> We'll have to verify that we can maintain this 81% efficiency to higher number
> of CPUs.  In the meantime, if you can think of anything else that we could do
> to keep pushing the 81% efficiency number higher, let us know.
> 
> Thanks again,
> Ioan
> 
> Ben Clifford wrote:
> > On Mon, 31 Mar 2008, Ben Clifford wrote:
> > 
> >   
> > > This temporary directory handling is pretty ugly - it should be a couple
> > > lines change to wrapper.sh to get similar functionality using the existing
> > > swift temporary directory handling - change the path to /tmp and use cp
> > > instead of ln -s. That way you can take advantage of Swift's existing
> > > unique job IDs and error handling too.
> > >     
> > 
> > Attached are three patches that will apply against svn r1775:
> > 
> > The first puts temporary directories in /tmp rather than on shared fs.
> > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp
> > 
> > The second copies the application file to the worker in each job execution
> > (though doesn't do any worker-node caching of such between jobs)
> > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable
> > 
> > The third creates the worker node log on /tmp and copies it at the end.
> > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally
> > 
> > The three modify all wrapper.sh and should be applied in the above order.
> > 
> > With the first two patches, the timestamps in the usual info logs will
> > provide information about how long the copies take, in the same way that
> > they usually indicate times for other execution stages.
> > 
> >   
> 
> 


From zhoujianghua1017 at 163.com  Wed Apr  2 22:09:41 2008
From: zhoujianghua1017 at 163.com (jezhee)
Date: Thu, 3 Apr 2008 11:09:41 +0800
Subject: [Swift-user] How to patch work to SSH node?
Message-ID: <200804031109308300133@163.com>

Hi, ladies and gentlemen,

Sorry to trouble you. Last week I tried to dispatch tasks to an SSH server with Swift, but I ran into some problems. Could you give me some advice?

At first, I tried to run Swift on Windows. I used puttygen to generate an SSH2 RSA key pair. Then I changed the file sites.xml according to
the User Guide and created auth.defaults in my user home directory. But when I ran Swift, this error occurred:

    Execution failed:
        Could not initialize shared directory on sshsvr
    Caused by:
        org.globus.cog.abstraction.impl.file.FileResourceException: Error while communicating with the SSH server on 192.168.88.17:22
    Caused by:
        SSH Connection failed: null

In fact, when I used the F-SSH client to log on to the Linux server with the public key method, that did not succeed either.

So I recompiled the project on Linux to test whether it would work there. I used the command "ssh-keygen -t rsa" to generate the key pair,
then transferred rsa.pub to another Linux server. After that, I could log in to the server without a password. I then changed the
Swift configuration and ran the sample script. Unfortunately, the same error appeared. Both Linux PCs run a 2.6 kernel. I used F-SSH as
the remote login tool.

I also tried changing auth.defaults to the following:

    192.168.88.246.type=password
    192.168.88.246.username=root
    192.168.88.246.password=***
    192.168.88.246.passphrase=

I got the same error.

Could you help me find out whether anything in my configuration is wrong?
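
One thing that may be worth ruling out, as a sketch only: whether
auth.defaults actually has entries for the host Swift is contacting. The
helper below (has_auth_entry, a hypothetical name) assumes nothing beyond
the host.property=value layout shown above. The file path follows the
description in this message and may need adjusting to wherever your
installation expects the file:

```shell
#!/bin/sh
# has_auth_entry FILE HOST PROPERTY
# Succeeds if FILE contains a line of the form HOST.PROPERTY=...
# Leading whitespace is tolerated; dots in HOST are matched literally.
has_auth_entry() {
    file=$1
    key=$(printf '%s.%s' "$2" "$3" | sed 's/\./\\./g')
    grep -Eq "^[[:space:]]*${key}=" "$file"
}

# Path as described in this message (an assumption; adjust as needed).
AUTH_DEFAULTS=${AUTH_DEFAULTS:-$HOME/auth.defaults}

# Report which properties are set for a given host.
for prop in type username password; do
    if has_auth_entry "$AUTH_DEFAULTS" 192.168.88.246 "$prop" 2>/dev/null; then
        echo "192.168.88.246.$prop is set"
    fi
done
```

Running the same check with the address from the error message
(192.168.88.17) shows whether that host has entries as well.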
     
Besides, it seems that some of the Swift source code is not open, but is provided only as jar libraries. I noticed that support for the "filesystem"
element was added in Nov. 2007, but I could not find any handling of this keyword in the source code. Our own development, a boinc provider, is based on SSH and
provides more parameters to fit the BOINC task format. Obviously, just replacing ssh with boinc is not enough, even though the CoG kit module
supports the boinc provider. So I would like to ask: how do I add a customized provider to Swift?
    
     Thanks a lot.	

Regards.
							2008-04-03
//////////////////////////////////////////
// Zhou Jianghua zhoujianghua1017 at 163.com
// EI Dep, Huazhong Uni of Sci & Tech
// Internet Technology and Engineering R&D Center
// http://www.itec.org.cn
// 
// Tel: (86)27-87792139
// Fax: (86)27-87540745
// Zipcode: 430074
// Address: Luoyu Road 1037, Wuhan, Hubei, China
/////////////////////////////////////////

From benc at hawaga.org.uk  Wed Apr  2 22:35:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 03:35:15 +0000 (GMT)
Subject: [Swift-user] How to patch work to SSH node?
In-Reply-To: <200804031109308300133@163.com>
References: <200804031109308300133@163.com>
Message-ID: <Pine.LNX.4.64.0804030329490.5372@dildano.hawaga.org.uk>


Hi. I don't really know anything about the ssh provider, so I can't help 
you there. But to answer your other questions:

>      Besides, it seems that some of the Swift source code is not open, 
> but provided in jar library.

You should be able to get the original source used to generate all of the 
jar files, from various places. Are there particular jars that you want to 
look at the source for?

> I noticed that the support to element 
> "filesystem" is added in Nov. 2007, but I didn't find any disposal to 
> this keyword in the source code.

All of the sites.xml elements are defined in libexec/vdl-sc.k. The 
filesystem element was added in commit r1490 at 2007-11-23 21:30:36 +0000. 
It is at line 35 in that file at the moment.

> Our innovation - boinc provider is based on SSH, and provides more 
> parameters to adapt the BOINC task format. Obviously, just replacing the 
> ssh with boinc is not usable even CoGkit module has supported boinc 
> provider. So, I want to ask you how to add a customized provider to 
> Swift?

To add a customised provider to swift, add a new directory into cog that 
looks like one of the existing provider-* directories. If you want an 
example, look at the provider-deef add-on, which you can get with this 
command:

  svn co https://svn.ci.uchicago.edu/svn/vdl2/provider-deef

This is an example of a provider for a different execution system (falkon).

-- 


From benc at hawaga.org.uk  Thu Apr  3 04:43:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 09:43:27 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
 continue on failure?
In-Reply-To: <47F3E9CD.9090507@cs.uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F3E9CD.9090507@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804030941510.9854@dildano.hawaga.org.uk>


On Wed, 2 Apr 2008, Ioan Raicu wrote:

> 	1 Node Perf 	Falkon 	Swift+Falkon 	Swift+Falkon (patched)
> Min 	63.618 	53.782 	169.139 	58.538
> Average 	64.76 	65.47253 	309.1945 	80.21246
> Median 	64.74072 	64.774 	313.5535 	76.5245
> Max 	65.863 	94.447 	605.654 	115.237
> Standard Deviation 	0.488984 	3.863944 	52.13821 	10.95652
> Efficiency 	100% 	99% 	21% 	81%
> 
> 
> The first column shows the per task statistic when running on 1 node (4 CPUs)
> through Falkon.  The second column are the statistics for running the
> application at large scale, on 2048 CPUs.  The 3rd column is running
> Swift+Falkon (both from SVN) on 256 CPUs.  The 4th column is Swift+Falkon, but
> Swift has the 3 patches applied.  Essentially, the per task execution time was
> reduced from 309 seconds to 80 seconds, where the ideal would have been 64
> seconds.  It brought the efficiency from 21% to 81% for this particular
> workload.  This looks fantastic! 

The standard deviation is quite large for the patched-swift values. I'd be 
interested to see the -info files for all of these runs so I can see what 
they are doing. Can you put them somewhere for me?
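
To put the spread in context, each standard deviation can be compared
with its average (the coefficient of variation). A small sketch using the
numbers quoted above; rel_stddev is a hypothetical helper name:

```shell
#!/bin/sh
# rel_stddev STDDEV MEAN: print STDDEV as a percentage of MEAN.
rel_stddev() {
    awk -v sd="$1" -v mean="$2" \
        'BEGIN { printf "%.1f%%\n", 100 * sd / mean }'
}

rel_stddev 0.488984 64.76       # 1 node
rel_stddev 3.863944 65.47253    # Falkon, 2048 CPUs
rel_stddev 52.13821 309.1945    # Swift+Falkon, unpatched
rel_stddev 10.95652 80.21246    # Swift+Falkon, patched
```

On the table's numbers this gives roughly 0.8%, 5.9%, 16.9% and 13.7%.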

-- 


From zhaozhang at uchicago.edu  Thu Apr  3 04:51:04 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 03 Apr 2008 04:51:04 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
	continue on failure?
In-Reply-To: <Pine.LNX.4.64.0804030941510.9854@dildano.hawaga.org.uk>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F3E9CD.9090507@cs.uchicago.edu>
	<Pine.LNX.4.64.0804030941510.9854@dildano.hawaga.org.uk>
Message-ID: <47F4A888.1000705@uchicago.edu>

Hi, Ben

Check this
login.ci.uchicago.edu:/home/zzhang/info.tar

zhao

Ben Clifford wrote:
> On Wed, 2 Apr 2008, Ioan Raicu wrote:
>
>   
>> 	1 Node Perf 	Falkon 	Swift+Falkon 	Swift+Falkon (patched)
>> Min 	63.618 	53.782 	169.139 	58.538
>> Average 	64.76 	65.47253 	309.1945 	80.21246
>> Median 	64.74072 	64.774 	313.5535 	76.5245
>> Max 	65.863 	94.447 	605.654 	115.237
>> Standard Deviation 	0.488984 	3.863944 	52.13821 	10.95652
>> Efficiency 	100% 	99% 	21% 	81%
>>
>>
>> The first column shows the per task statistic when running on 1 node (4 CPUs)
>> through Falkon.  The second column are the statistics for running the
>> application at large scale, on 2048 CPUs.  The 3rd column is running
>> Swift+Falkon (both from SVN) on 256 CPUs.  The 4th column is Swift+Falkon, but
>> Swift has the 3 patches applied.  Essentially, the per task execution time was
>> reduced from 309 seconds to 80 seconds, where the ideal would have been 64
>> seconds.  It brought the efficiency from 21% to 81% for this particular
>> workload.  This looks fantastic! 
>>     
>
> The standard deviation is quite large for the patched-swift values. I'd be 
> interested to see the -info files for all of these runs so I can see what 
> they are doing. Can you put them somewhere for me?
>
>   

From benc at hawaga.org.uk  Thu Apr  3 05:05:20 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 10:05:20 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
 continue on failure?
In-Reply-To: <47F4A888.1000705@uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F3E9CD.9090507@cs.uchicago.edu>
	<Pine.LNX.4.64.0804030941510.9854@dildano.hawaga.org.uk>
	<47F4A888.1000705@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804031005090.9854@dildano.hawaga.org.uk>

Do you have the corresponding swift run log file to go with it?

On Thu, 3 Apr 2008, Zhao Zhang wrote:

> Hi, Ben
> 
> Check this
> login.ci.uchicago.edu:/home/zzhang/info.tar
> 
> zhao
> 
> Ben Clifford wrote:
> > On Wed, 2 Apr 2008, Ioan Raicu wrote:
> > 
> >   
> > > 	1 Node Perf 	Falkon 	Swift+Falkon 	Swift+Falkon (patched)
> > > Min 	63.618 	53.782 	169.139 	58.538
> > > Average 	64.76 	65.47253 	309.1945 	80.21246
> > > Median 	64.74072 	64.774 	313.5535 	76.5245
> > > Max 	65.863 	94.447 	605.654 	115.237
> > > Standard Deviation 	0.488984 	3.863944 	52.13821 	10.95652
> > > Efficiency 	100% 	99% 	21% 	81%
> > > 
> > > 
> > > The first column shows the per task statistic when running on 1 node (4
> > > CPUs)
> > > through Falkon.  The second column are the statistics for running the
> > > application at large scale, on 2048 CPUs.  The 3rd column is running
> > > Swift+Falkon (both from SVN) on 256 CPUs.  The 4th column is Swift+Falkon,
> > > but
> > > Swift has the 3 patches applied.  Essentially, the per task execution time
> > > was
> > > reduced from 309 seconds to 80 seconds, where the ideal would have been 64
> > > seconds.  It brought the efficiency from 21% to 81% for this particular
> > > workload.  This looks fantastic!     
> > 
> > The standard deviation is quite large for the patched-swift values. I'd be
> > interested to see the -info files for all of these runs so I can see what
> > they are doing. Can you put them somewhere for me?
> > 
> >   


From benc at hawaga.org.uk  Thu Apr  3 05:06:26 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 10:06:26 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
 continue on failure?
In-Reply-To: <47F3E9CD.9090507@cs.uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F3E9CD.9090507@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804031005560.9854@dildano.hawaga.org.uk>


I just asked zhao for the log files (both swift and -info) for the patched 
run; but I think I'd like to see the unpatched run logs too.

On Wed, 2 Apr 2008, Ioan Raicu wrote:

> Hi Ben,
> Thanks again for the patches, they made a huge difference, increased
> efficiency from 21% to 81%!
> 
> Here are the numbers:
> 
> 	1 Node Perf 	Falkon 	Swift+Falkon 	Swift+Falkon (patched)
> Min 	63.618 	53.782 	169.139 	58.538
> Average 	64.76 	65.47253 	309.1945 	80.21246
> Median 	64.74072 	64.774 	313.5535 	76.5245
> Max 	65.863 	94.447 	605.654 	115.237
> Standard Deviation 	0.488984 	3.863944 	52.13821 	10.95652
> Efficiency 	100% 	99% 	21% 	81%
> 
> 
> The first column shows the per task statistic when running on 1 node (4 CPUs)
> through Falkon.  The second column are the statistics for running the
> application at large scale, on 2048 CPUs.  The 3rd column is running
> Swift+Falkon (both from SVN) on 256 CPUs.  The 4th column is Swift+Falkon, but
> Swift has the 3 patches applied.  Essentially, the per task execution time was
> reduced from 309 seconds to 80 seconds, where the ideal would have been 64
> seconds.  It brought the efficiency from 21% to 81% for this particular
> workload.  This looks fantastic! 
> We'll have to verify that we can maintain this 81% efficiency to higher number
> of CPUs.  In the meantime, if you can think of anything else that we could do
> to keep pushing the 81% efficiency number higher, let us know.
> 
> Thanks again,
> Ioan
> 
> Ben Clifford wrote:
> > On Mon, 31 Mar 2008, Ben Clifford wrote:
> > 
> >   
> > > This temporary directory handling is pretty ugly - it should be a couple
> > > lines change to wrapper.sh to get similar functionality using the existing
> > > swift temporary directory handling - change the path to /tmp and use cp
> > > instead of ln -s. That way you can take advantage of Swift's existing
> > > unique job IDs and error handling too.
> > >     
> > 
> > Attached are three patches that will apply against svn r1775:
> > 
> > The first puts temporary directories in /tmp rather than on shared fs.
> > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp
> > 
> > The second copies the application file to the worker in each job execution
> > (though doesn't do any worker-node caching of such between jobs)
> > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable
> > 
> > The third creates the worker node log on /tmp and copies it at the end.
> > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally
> > 
> > The three modify all wrapper.sh and should be applied in the above order.
> > 
> > With the first two patches, the timestamps in the usual info logs will
> > provide information about how long the copies take, in the same way that
> > they usually indicate times for other execution stages.
> > 
> >   
> 
> 


From zhaozhang at uchicago.edu  Thu Apr  3 06:45:14 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 03 Apr 2008 06:45:14 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
	continue on failure?
In-Reply-To: <Pine.LNX.4.64.0804031005560.9854@dildano.hawaga.org.uk>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F3E9CD.9090507@cs.uchicago.edu>
	<Pine.LNX.4.64.0804031005560.9854@dildano.hawaga.org.uk>
Message-ID: <47F4C34A.4020703@uchicago.edu>

Sorry, Ben.

I didn't save the swift log file. If you really need the old -info file, 
I could redo the test, and try to send them to you.
But for now, I have several urgent issues.

zhao

Ben Clifford wrote:
> I just asked zhao for the log files (both swift and -info) for the patched 
> run; but I think I'd like to see the unpatched run logs too.
>
> On Wed, 2 Apr 2008, Ioan Raicu wrote:
>
>   
>> Hi Ben,
>> Thanks again for the patches, they made a huge difference, increased
>> efficiency from 21% to 81%!
>>
>> Here are the numbers:
>>
>> 	1 Node Perf 	Falkon 	Swift+Falkon 	Swift+Falkon (patched)
>> Min 	63.618 	53.782 	169.139 	58.538
>> Average 	64.76 	65.47253 	309.1945 	80.21246
>> Median 	64.74072 	64.774 	313.5535 	76.5245
>> Max 	65.863 	94.447 	605.654 	115.237
>> Standard Deviation 	0.488984 	3.863944 	52.13821 	10.95652
>> Efficiency 	100% 	99% 	21% 	81%
>>
>>
>> The first column shows the per task statistic when running on 1 node (4 CPUs)
>> through Falkon.  The second column are the statistics for running the
>> application at large scale, on 2048 CPUs.  The 3rd column is running
>> Swift+Falkon (both from SVN) on 256 CPUs.  The 4th column is Swift+Falkon, but
>> Swift has the 3 patches applied.  Essentially, the per task execution time was
>> reduced from 309 seconds to 80 seconds, where the ideal would have been 64
>> seconds.  It brought the efficiency from 21% to 81% for this particular
>> workload.  This looks fantastic! 
>> We'll have to verify that we can maintain this 81% efficiency to higher number
>> of CPUs.  In the meantime, if you can think of anything else that we could do
>> to keep pushing the 81% efficiency number higher, let us know.
>>
>> Thanks again,
>> Ioan
>>
>> Ben Clifford wrote:
>>     
>>> On Mon, 31 Mar 2008, Ben Clifford wrote:
>>>
>>>   
>>>       
>>>> This temporary directory handling is pretty ugly - it should be a couple
>>>> lines change to wrapper.sh to get similar functionality using the existing
>>>> swift temporary directory handling - change the path to /tmp and use cp
>>>> instead of ln -s. That way you can take advantage of Swift's existing
>>>> unique job IDs and error handling too.
>>>>     
>>>>         
>>> Attached are three patches that will apply against svn r1775:
>>>
>>> The first puts temporary directories in /tmp rather than on shared fs.
>>> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp
>>>
>>> The second copies the application file to the worker in each job execution
>>> (though doesn't do any worker-node caching of such between jobs)
>>> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable
>>>
>>> The third creates the worker node log on /tmp and copies it at the end.
>>> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally
>>>
>>> The three modify all wrapper.sh and should be applied in the above order.
>>>
>>> With the first two patches, the timestamps in the usual info logs will
>>> provide information about how long the copies take, in the same way that
>>> they usually indicate times for other execution stages.
>>>
>>>   
>>>       
>>     
>
>   

From benc at hawaga.org.uk  Thu Apr  3 14:45:22 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 19:45:22 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
 continue on failure?
In-Reply-To: <47F4C34A.4020703@uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F3E9CD.9090507@cs.uchicago.edu>
	<Pine.LNX.4.64.0804031005560.9854@dildano.hawaga.org.uk>
	<47F4C34A.4020703@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0804031931490.5372@dildano.hawaga.org.uk>


It's fine for now.

There's a convention for storing log files - put the .log file and the 
whole .d directory somewhere in ~benc/swift-logs/ in CI NFS space.

Most simply, put files directly in there; for a more structured layout see 
how Mike has organised his stuff under ~benc/swift-logs/wilde/
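
As a sketch of that convention: the helper below copies a run's .log file
and its .d directory to a destination you pass in. save_run_logs is a
hypothetical name, and it assumes the .log and .d share one basename:

```shell
#!/bin/sh
# save_run_logs RUN_BASENAME DEST
# Copies RUN_BASENAME.log and the whole RUN_BASENAME.d directory into DEST.
# Assumption: the log file and the .d directory share the same basename.
save_run_logs() {
    run=$1; dest=$2
    [ -f "$run.log" ] && [ -d "$run.d" ] || {
        echo "missing $run.log or $run.d" >&2
        return 1
    }
    mkdir -p "$dest" && cp "$run.log" "$dest/" && cp -r "$run.d" "$dest/"
}

# Usage (the run name is hypothetical):
#   save_run_logs myworkflow-20080403-1234 ~benc/swift-logs/
```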

On Thu, 3 Apr 2008, Zhao Zhang wrote:

> Sorry, Ben.
> 
> I didn't save the swift log file. If you really need the old -info file, I
> could redo the test, and try to send them to you.
> But for now, I have several urgent issues.
> 
> zhao
> 
> Ben Clifford wrote:
> > I just asked zhao for the log files (both swift and -info) for the patched
> > run; but I think I'd like to see the unpatched run logs too.
> > 
> > On Wed, 2 Apr 2008, Ioan Raicu wrote:
> > 
> >   
> > > Hi Ben,
> > > Thanks again for the patches, they made a huge difference, increased
> > > efficiency from 21% to 81%!
> > > 
> > > Here are the numbers:
> > > 
> > > >                     1 Node Perf  Falkon    Swift+Falkon  Swift+Falkon (patched)
> > > > Min                 63.618       53.782    169.139       58.538
> > > > Average             64.76        65.47253  309.1945      80.21246
> > > > Median              64.74072     64.774    313.5535      76.5245
> > > > Max                 65.863       94.447    605.654       115.237
> > > > Standard Deviation  0.488984     3.863944  52.13821      10.95652
> > > > Efficiency          100%         99%       21%           81%
> > > 
> > > 
> > > > The first column shows the per-task statistics when running on 1 node
> > > > (4 CPUs) through Falkon.  The second column shows the statistics for
> > > > running the application at large scale, on 2048 CPUs.  The 3rd column
> > > > is running Swift+Falkon (both from SVN) on 256 CPUs.  The 4th column is
> > > > Swift+Falkon, but Swift has the 3 patches applied.  Essentially, the
> > > > per-task execution time was reduced from 309 seconds to 80 seconds,
> > > > where the ideal would have been 64 seconds.  It brought the efficiency
> > > > from 21% to 81% for this particular workload.  This looks fantastic!
> > > > We'll have to verify that we can maintain this 81% efficiency at higher
> > > > numbers of CPUs.  In the meantime, if you can think of anything else
> > > > that we could do to keep pushing the 81% efficiency number higher, let
> > > > us know.
> > > 
> > > Thanks again,
> > > Ioan
> > > 
> > > > Ben Clifford wrote:
> > > > > On Mon, 31 Mar 2008, Ben Clifford wrote:
> > > > >
> > > > > > This temporary directory handling is pretty ugly - it should be a
> > > > > > couple lines change to wrapper.sh to get similar functionality
> > > > > > using the existing swift temporary directory handling - change the
> > > > > > path to /tmp and use cp instead of ln -s. That way you can take
> > > > > > advantage of Swift's existing unique job IDs and error handling too.
> > > > >
> > > > > Attached are three patches that will apply against svn r1775:
> > > > >
> > > > > The first puts temporary directories in /tmp rather than on shared fs.
> > > > > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp
> > > > >
> > > > > The second copies the application file to the worker in each job
> > > > > execution (though doesn't do any worker-node caching of such between
> > > > > jobs)
> > > > > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable
> > > > >
> > > > > The third creates the worker node log on /tmp and copies it at the end.
> > > > > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally
> > > > >
> > > > > All three modify wrapper.sh and should be applied in the above order.
> > > > >
> > > > > With the first two patches, the timestamps in the usual info logs will
> > > > > provide information about how long the copies take, in the same way that
> > > > > they usually indicate times for other execution stages.
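
The Efficiency row in the quoted table looks like the single-node average task time divided by each configuration's average task time. That definition is an inference from the numbers (it reproduces all four percentages), not something stated in the thread. A quick check, with the averages copied from the table:

```python
# Average per-task times in seconds, copied from the quoted table.
IDEAL_AVERAGE = 64.76  # "1 Node Perf" column

AVERAGES = {
    "Falkon": 65.47253,
    "Swift+Falkon": 309.1945,
    "Swift+Falkon (patched)": 80.21246,
}


def efficiency(ideal, observed):
    """Assumed definition: ideal task time over observed task time."""
    return ideal / observed


for name, observed in AVERAGES.items():
    print(f"{name}: {efficiency(IDEAL_AVERAGE, observed):.0%}")
```

Under that assumption, the remaining gap in the patched run is about 15 seconds of overhead per task (80.2 s observed against 64.8 s ideal).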


From zhaozhang at uchicago.edu  Thu Apr  3 14:47:04 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 03 Apr 2008 14:47:04 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2)
	continue on failure?
In-Reply-To: <Pine.LNX.4.64.0804031931490.5372@dildano.hawaga.org.uk>
References: <47F02A00.6090203@cs.uchicago.edu>
	<Pine.LNX.4.64.0803310206080.5372@dildano.hawaga.org.uk>
	<47F04E38.60207@uchicago.edu>
	<Pine.LNX.4.64.0803310423240.9854@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0803310723010.9854@dildano.hawaga.org.uk>
	<47F3E9CD.9090507@cs.uchicago.edu>
	<Pine.LNX.4.64.0804031005560.9854@dildano.hawaga.org.uk>
	<47F4C34A.4020703@uchicago.edu>
	<Pine.LNX.4.64.0804031931490.5372@dildano.hawaga.org.uk>
Message-ID: <47F53438.3070401@uchicago.edu>

Thanks, Ben

zhao

Ben Clifford wrote:
> It's fine for now.
>
> There's a convention for storing log files - put the .log file and the
> whole .d directory somewhere in ~benc/swift-logs/ in CI NFS space.
>
> Most simply, put files directly in there; for a more structured layout see
> how Mike has organised his stuff under ~benc/swift-logs/wilde/

From wilde at mcs.anl.gov  Mon Apr 14 14:39:39 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 14 Apr 2008 14:39:39 -0500
Subject: [Swift-user] sites.xml entry for Abe TeraGrid site
Message-ID: <4803B2FB.8040201@mcs.anl.gov>

Mike, I think this is pretty close to what you need, but I did not test it:

<pool handle="abe" sysinfo="INTEL32::LINUX">
     <gridftp url="gsiftp://grid-abe.ncsa.teragrid.org"/>
     <jobmanager universe="vanilla"
        url="grid-abe.ncsa.teragrid.org/jobmanager-pbs" major="2"/>
     <workdirectory>/cfs/scratch/users/mkubal/swiftwork</workdirectory>
   -or-
     <workdirectory>/u/ac/mkubal/swiftwork</workdirectory>
   - be sure to create these swiftwork dirs first!
</pool>


What you should do:

- Create the swiftwork dirs listed above. The first is for large scratch
  space; the second is for your persistent user space.

- Remove the -comments- above and use only one workdirectory. I think you
  can mainly use the scratch one for now.

- Test submitting a simple command via globus-job-run (first to the
  default jobmanager, then to jobmanager-pbs).

- Test copying a short file to the work dirs using globus-url-copy.

- Then try a simple workflow.
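
Once the -comments- are removed, the pool entry should parse as XML and contain exactly one workdirectory. A minimal sketch of that check is below; it looks only at the pool fragment on its own (the rest of a real sites.xml is not shown in this message), and the helper name check_pool is hypothetical.

```python
import xml.etree.ElementTree as ET

# The entry from this message with the scratch workdirectory kept and the
# -or- alternative and reminder line removed.
POOL_ENTRY = """\
<pool handle="abe" sysinfo="INTEL32::LINUX">
     <gridftp url="gsiftp://grid-abe.ncsa.teragrid.org"/>
     <jobmanager universe="vanilla"
        url="grid-abe.ncsa.teragrid.org/jobmanager-pbs" major="2"/>
     <workdirectory>/cfs/scratch/users/mkubal/swiftwork</workdirectory>
</pool>
"""


def check_pool(xml_text):
    """Return (handle, workdirectory); raise if the fragment looks wrong."""
    # ET.fromstring raises ParseError if leftover comment text broke the XML
    # structure; stray text between elements is not detected by this check.
    pool = ET.fromstring(xml_text)
    workdirs = [(w.text or "").strip() for w in pool.findall("workdirectory")]
    if len(workdirs) != 1:
        raise ValueError(f"expected one workdirectory, found {len(workdirs)}")
    return pool.get("handle"), workdirs[0]


print(check_pool(POOL_ENTRY))
```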

Ben, Sarah or Mihael may be able to help you find out if WS-GRAM is 
available and working on Abe.  If so, you should switch to that to avoid 
overrunning Abe's gatekeeper. And use the throttling properties that you 
and Ben worked out.

- Mike


From wilde at mcs.anl.gov  Mon Apr 14 15:25:01 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 14 Apr 2008 15:25:01 -0500
Subject: [Swift-user] Teragrid info for WS-GRAM and pre-WS_GRAM
Message-ID: <4803BD9D.3010205@mcs.anl.gov>

The following table (if accurate) seems to have all the info needed for 
sites.xml entries for all GRAM versions on all TG sites:

   http://www.teragrid.org/userinfo/jobs/gram.php

If there are any discrepancies or issues with this config info, we (and
users) should contact help at teragrid.org.

A link to this should be added to the Swift Users Guide sections 15 
and/or 16.


From benc at hawaga.org.uk  Wed Apr 16 14:42:39 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 16 Apr 2008 19:42:39 +0000 (GMT)
Subject: [Swift-user] Swift 0.5 released.
Message-ID: <Pine.LNX.4.64.0804161938410.31934@dildano.hawaga.org.uk>


Swift 0.5 is now available for download from
http://www.ci.uchicago.edu/swift/packages/vdsk-0.5.tar.gz

This is intended to address a number of bugs that were present in 0.4, 
most notably data channel reuse in GridFTP and a number of problems with 
recent compiler enhancements.

For more information about Swift, visit http://www.ci.uchicago.edu/swift/

--