From wilde at mcs.anl.gov Tue Apr 1 10:56:28 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 01 Apr 2008 10:56:28 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To:
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu>
Message-ID: <47F25B2C.4090005@mcs.anl.gov>

Ben, thanks - these patches sound great.

Can the use of /tmp be controlled by a property, ideally on a
per-application basis in tc.data, and these changes committed to svn?

Seems like wrapper-tmp-log-locally could be done for all apps as the
default, and only turned off for certain debugging scenarios.

Can you do application caching as well, in a general manner?

We'll measure over the next few days and report back.

- Mike

On 3/31/08 2:34 AM, Ben Clifford wrote:
> On Mon, 31 Mar 2008, Ben Clifford wrote:
>
>> This temporary directory handling is pretty ugly - it should be a couple
>> lines change to wrapper.sh to get similar functionality using the existing
>> swift temporary directory handling - change the path to /tmp and use cp
>> instead of ln -s. That way you can take advantage of Swift's existing
>> unique job IDs and error handling too.
>
> Attached are three patches that will apply against svn r1775:
>
> The first puts temporary directories in /tmp rather than on shared fs.
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp
>
> The second copies the application file to the worker in each job execution
> (though doesn't do any worker-node caching of such between jobs)
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable
>
> The third creates the worker node log on /tmp and copies it at the end.
> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally
>
> The three all modify wrapper.sh and should be applied in the above order.
>
> With the first two patches, the timestamps in the usual info logs will
> provide information about how long the copies take, in the same way that
> they usually indicate times for other execution stages.

From benc at hawaga.org.uk Tue Apr 1 20:33:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Apr 2008 01:33:27 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To: <47F25B2C.4090005@mcs.anl.gov>
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F25B2C.4090005@mcs.anl.gov>
Message-ID:

On Tue, 1 Apr 2008, Michael Wilde wrote:

> Can you do application caching as well, in a general manner?

Applications 'in general' consist of a lot more than their base executable
- even echo hello world seems to attempt to read 9 different files on my
Linux box. So 'no'.

--

From iraicu at cs.uchicago.edu Tue Apr 1 21:14:37 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 01 Apr 2008 21:14:37 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To:
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F25B2C.4090005@mcs.anl.gov>
Message-ID: <47F2EC0D.3040704@cs.uchicago.edu>

If the applications are statically compiled, is the problem more tractable?

Ioan

--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
     http://dev.globus.org/wiki/Incubator/Falkon
     http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================

From iraicu at cs.uchicago.edu Wed Apr 2 15:17:17 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 02 Apr 2008 15:17:17 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To:
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu>
Message-ID: <47F3E9CD.9090507@cs.uchicago.edu>

Hi Ben,
Thanks again for the patches. They made a huge difference, increasing
efficiency from 21% to 81%!

Here are the numbers (per-task times in seconds):

                    1 Node Perf  Falkon    Swift+Falkon  Swift+Falkon (patched)
Min                 63.618       53.782    169.139       58.538
Average             64.76        65.47253  309.1945      80.21246
Median              64.74072     64.774    313.5535      76.5245
Max                 65.863       94.447    605.654       115.237
Standard Deviation  0.488984     3.863944  52.13821      10.95652
Efficiency          100%         99%       21%           81%

The first column shows the per-task statistics when running on 1 node
(4 CPUs) through Falkon. The second column shows the statistics for running
the application at large scale, on 2048 CPUs. The third column is
Swift+Falkon (both from SVN) on 256 CPUs. The fourth column is Swift+Falkon
with the 3 patches applied to Swift. Essentially, the per-task execution
time was reduced from 309 seconds to 80 seconds, where the ideal would have
been 64 seconds. That brought the efficiency from 21% to 81% for this
particular workload. This looks fantastic!

We'll have to verify that we can maintain this 81% efficiency at higher
numbers of CPUs.
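The Efficiency row is consistent with dividing the 1-node average (64.76 s) by each column's average. The thread does not state the formula, so this is inferred from the numbers; the function name below is only illustrative. A minimal sketch of the arithmetic:

```shell
#!/bin/sh
# Per-task averages in seconds, copied from the table above.
ideal=64.76

# efficiency IDEAL MEASURED: print 100 * IDEAL / MEASURED as a whole percent.
# (Assumed formula; it reproduces the Efficiency row of the table.)
efficiency() {
    awk -v ideal="$1" -v measured="$2" \
        'BEGIN { printf "%.0f\n", 100 * ideal / measured }'
}

eff_falkon=$(efficiency "$ideal" 65.47253)
eff_swift=$(efficiency "$ideal" 309.1945)
eff_patched=$(efficiency "$ideal" 80.21246)

echo "Falkon:                 ${eff_falkon}%"
echo "Swift+Falkon:           ${eff_swift}%"
echo "Swift+Falkon (patched): ${eff_patched}%"
```

On this reading, the remaining gap to 100% in the patched run is about 15 seconds of per-task overhead (80.2 s against 64.8 s).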
In the meantime, if you can think of anything else that we could do to keep
pushing the 81% efficiency number higher, let us know.

Thanks again,
Ioan

From benc at hawaga.org.uk Wed Apr 2 15:36:18 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Apr 2008 20:36:18 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To: <47F3E9CD.9090507@cs.uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F3E9CD.9090507@cs.uchicago.edu>
Message-ID:

Any chance you can test the patches separately to see how they each
contribute to this change?

From zhoujianghua1017 at 163.com Wed Apr 2 22:09:41 2008
From: zhoujianghua1017 at 163.com (jezhee)
Date: Thu, 3 Apr 2008 11:09:41 +0800
Subject: [Swift-user] How to patch work to SSH node?
Message-ID: <200804031109308300133@163.com>

Hi, ladies and gentlemen,

Excuse me for troubling you. Last week I tried to patch tasks to an SSH
server with Swift, but I encountered some problems. Could you give me some
advice?

At first, I tried to run Swift on Windows. I used puttygen to generate an
SSH2 RSA key pair. Then I changed the file sites.xml according to the User
Guide and created auth.defaults in my user home directory. But when I ran
Swift, an error occurred:

Execution failed: Could not initialize shared directory on sshsvr
Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error while communicating with the SSH server on 192.168.88.17:22
Caused by: SSH Connection failed: null

Actually, when I used the F-SSH client to log on to the Linux server with the
public key method, I did not succeed either. So I recompiled the project on
Linux to test whether that would work. I used the command "ssh-keygen -t rsa"
to generate the key pair, then transferred rsa.pub to another Linux server.
After that, I could log in to the server without a password. So I changed
the Swift configuration and ran the sample script. Unfortunately, the same
error appeared. Both Linux PCs run a 2.6 kernel. I used F-SSH as the remote
login tool.

I also tried changing auth.defaults to the following:

192.168.88.246.type=password
192.168.88.246.username=root
192.168.88.246.password=***
192.168.88.246.passphrase=

I got the same error. Could you help me find out whether anything in the
configuration is wrong?

Besides, it seems that some of the Swift source code is not open, but
provided as jar libraries. I noticed that support for the element
"filesystem" was added in Nov. 2007, but I did not find any handling of
this keyword in the source code.

Our innovation, the boinc provider, is based on SSH and provides more
parameters to fit the BOINC task format. Obviously, just replacing ssh
with boinc is not usable, even if the CoG kit module supported a boinc
provider. So I want to ask you how to add a customized provider to Swift.

Thanks a lot.
Regards.
2008-04-03
//////////////////////////////////////////
// Zhou Jianghua zhoujianghua1017 at 163.com
// EI Dep, Huazhong Uni of Sci & Tech
// Internet Technology and Engineering R&D Center
// http://www.itec.org.cn
//
// Tel: (86)27-87792139
// Fax: (86)27-87540745
// Zipcode: 430074
// Address: Luoyu Road 1037, Wuhan, Hubei, China
/////////////////////////////////////////

From benc at hawaga.org.uk Wed Apr 2 22:35:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 03:35:15 +0000 (GMT)
Subject: [Swift-user] How to patch work to SSH node?
In-Reply-To: <200804031109308300133@163.com>
References: <200804031109308300133@163.com>
Message-ID:

Hi.

I don't really know anything about the ssh provider, so I can't help you
there. But to answer your other questions:

> Besides, it seems that some of the Swift source code is not open,
> but provided in jar library.

You should be able to get the original source used to generate all of the
jar files, from various places. Are there particular jars that you want to
look at the source for?

> I noticed that the support to element
> "filesystem" is added in Nov. 2007, but I didn't find any disposal to
> this keyword in the source code.

All of the sites.xml elements are defined in libexec/vdl-sc.k. The
filesystem element was added in commit r1490 at 2007-11-23 21:30:36 +0000.
It is at line 35 in that file at the moment.

> Our innovation - boinc provider is based on SSH, and provides more
> parameters to adapt the BOINC task format. Obviously, just replacing the
> ssh with boinc is not usable even CoGkit module has supported boinc
> provider. So, I want to ask you how to add a customized provider to
> Swift?

To add a customised provider to swift, add a new directory into cog that
looks like one of the existing provider-* directories. If you want an
example, look at the provider-deef add on, which you can get with this
command:

svn co https://svn.ci.uchicago.edu/svn/vdl2/provider-deef

This is an example of a provider for a different execution system (falkon).

--

From benc at hawaga.org.uk Thu Apr 3 04:43:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 09:43:27 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To: <47F3E9CD.9090507@cs.uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F3E9CD.9090507@cs.uchicago.edu>
Message-ID:

On Wed, 2 Apr 2008, Ioan Raicu wrote:

>                     1 Node Perf  Falkon    Swift+Falkon  Swift+Falkon (patched)
> Min                 63.618       53.782    169.139       58.538
> Average             64.76        65.47253  309.1945      80.21246
> Median              64.74072     64.774    313.5535      76.5245
> Max                 65.863       94.447    605.654       115.237
> Standard Deviation  0.488984     3.863944  52.13821      10.95652
> Efficiency          100%         99%       21%           81%

The standard deviation is quite large for the patched-swift values. I'd be
interested to see the -info files for all of these runs so I can see what
they are doing. Can you put them somewhere for me?

--

From zhaozhang at uchicago.edu Thu Apr 3 04:51:04 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 03 Apr 2008 04:51:04 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To:
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F3E9CD.9090507@cs.uchicago.edu>
Message-ID: <47F4A888.1000705@uchicago.edu>

Hi, Ben

Check this
login.ci.uchicago.edu:/home/zzhang/info.tar

zhao

From benc at hawaga.org.uk Thu Apr 3 05:05:20 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 10:05:20 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To: <47F4A888.1000705@uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F3E9CD.9090507@cs.uchicago.edu> <47F4A888.1000705@uchicago.edu>
Message-ID:

do you have the corresponding swift run log file to go with it?

From benc at hawaga.org.uk Thu Apr 3 05:06:26 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 10:06:26 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To: <47F3E9CD.9090507@cs.uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F3E9CD.9090507@cs.uchicago.edu>
Message-ID:

I just asked zhao for the log files (both swift and -info) for the patched
run; but I think I'd like to see the unpatched run logs too.

From zhaozhang at uchicago.edu Thu Apr 3 06:45:14 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 03 Apr 2008 06:45:14 -0500
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To:
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F3E9CD.9090507@cs.uchicago.edu>
Message-ID: <47F4C34A.4020703@uchicago.edu>

Sorry, Ben.

I didn't save the swift log file. If you really need the old -info file, I
could redo the test, and try to send them to you.
But for now, I have several urgent issues.

zhao

From benc at hawaga.org.uk Thu Apr 3 14:45:22 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Apr 2008 19:45:22 +0000 (GMT)
Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
In-Reply-To: <47F4C34A.4020703@uchicago.edu>
References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F3E9CD.9090507@cs.uchicago.edu> <47F4C34A.4020703@uchicago.edu>
Message-ID:

It's fine for now.

There's a convention for storing log files - put the .log file and the
whole .d directory somewhere in ~benc/swift-logs/ in CI NFS space.

Most simply, put files directly in there; for a more structured layout see
how mike has organised his stuff under ~benc/swift-logs/wilde/

On Thu, 3 Apr 2008, Zhao Zhang wrote:

> Sorry, Ben.
>
> I didn't save the swift log file. If you really need the old -info file, I
> could redo the test, and try to send them to you.
> But for now, I have several urgent issues.
>
> zhao
>
> Ben Clifford wrote:
> > I just asked zhao for the log files (both swift and -info) for the patched
> > run; but I think I'd like to see the unpatched run logs too.
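
The log-storage convention described in this message (keep the .log file together with the whole .d directory) could be followed with a small helper along these lines. This is only a sketch: the function name is made up, and it assumes the .d directory shares its base name with the .log file, which the thread does not spell out.

```shell
#!/bin/sh
# Hypothetical helper, not part of Swift.
# archive_run_logs RUN DEST
#   RUN  = path to a run without extension (expects RUN.log, optionally RUN.d)
#   DEST = directory that should receive the copies
archive_run_logs() {
    run=$1
    dest=$2
    # Refuse to continue if the run log itself is missing.
    [ -f "${run}.log" ] || { echo "missing ${run}.log" >&2; return 1; }
    mkdir -p "$dest" || return 1
    cp "${run}.log" "$dest/" || return 1
    # Copy the whole .d directory, if the run produced one.
    if [ -d "${run}.d" ]; then
        cp -R "${run}.d" "$dest/" || return 1
    fi
}
```

For example, `archive_run_logs myrun ~benc/swift-logs/` would copy myrun.log and myrun.d into the shared log area, where "myrun" stands in for the actual run name.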
> > > > On Wed, 2 Apr 2008, Ioan Raicu wrote: > > > > > > > Hi Ben, > > > Thanks again for the patches, they made a huge difference, increased > > > efficiency from 21% to 81%! > > > > > > Here are the numbers: > > > > > > 1 Node Perf Falkon Swift+Falkon Swift+Falkon (patched) > > > Min 63.618 53.782 169.139 58.538 > > > Average 64.76 65.47253 309.1945 80.21246 > > > Median 64.74072 64.774 313.5535 76.5245 > > > Max 65.863 94.447 605.654 115.237 > > > Standard Deviation 0.488984 3.863944 52.13821 > > > 10.95652 > > > Efficiency 100% 99% 21% 81% > > > > > > > > > The first column shows the per task statistic when running on 1 node (4 > > > CPUs) > > > through Falkon. The second column are the statistics for running the > > > application at large scale, on 2048 CPUs. The 3rd column is running > > > Swift+Falkon (both from SVN) on 256 CPUs. The 4th column is Swift+Falkon, > > > but > > > Swift has the 3 patches applied. Essentially, the per task execution time > > > was > > > reduced from 309 seconds to 80 seconds, where the ideal would have been 64 > > > seconds. It brought the efficiency from 21% to 81% for this particular > > > workload. This looks fantastic! We'll have to verify that we can maintain > > > this 81% efficiency to higher number > > > of CPUs. In the meantime, if you can think of anything else that we could > > > do > > > to keep pushing the 81% efficiency number higher, let us know.4 > > > > > > Thanks again, > > > Ioan > > > > > > Ben Clifford wrote: > > > > > > > On Mon, 31 Mar 2008, Ben Clifford wrote: > > > > > > > > > > > > > This temporary directory handling is pretty ugly - it should be a > > > > > couple > > > > > lines change to wrapper.sh to get similar functionality using the > > > > > existing > > > > > swift temporary direcotry handling - change the path to /tmp and use > > > > > cp > > > > > instead of ln -s. That way you can take advantage of Swift's existing > > > > > unique job IDs and error handling too. 
> > > > > > > > > Attached are three patches that will apply against svn r1775: > > > > > > > > The first puts temporary directories in /tmp rather than on shared fs. > > > > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp > > > > > > > > The second copies the application file to the worker in each job > > > > execution > > > > (though doesn't do any worker-node caching of such between jobs) > > > > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable > > > > > > > > The third creates the worker node log on /tmp and copies it at the end. > > > > http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally > > > > > > > > The three modify all wrapper.sh and should be applied in the above > > > > order. > > > > > > > > With the first two patches, the timestamps in the usual info logs will > > > > provide information about how long the copies take, in the same way that > > > > they usually indicate times for other execution stages. > > > > > > > > > > > > > > > From zhaozhang at uchicago.edu Thu Apr 3 14:47:04 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 03 Apr 2008 14:47:04 -0500 Subject: [Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure? In-Reply-To: References: <47F02A00.6090203@cs.uchicago.edu> <47F04E38.60207@uchicago.edu> <47F3E9CD.9090507@cs.uchicago.edu> <47F4C34A.4020703@uchicago.edu> Message-ID: <47F53438.3070401@uchicago.edu> Thanks, Ben zhao Ben Clifford wrote: > its fine for now. > > There's a convention for storing log files - put the .log file and the > whole .d director somewhere in ~benc/swift-logs/ in CI NFS space. > > Most simply, put files directly in there; for a more structured layout see > how mike has organised his stuff under ~benc/swift-logs/wilde/ > > On Thu, 3 Apr 2008, Zhao Zhang wrote: > > >> Sorry, Ben. >> >> I didn't save the swift log file. If you really need the old -info file, I >> could redo the test, and try to send them to you. 
>> But for now, I have several urgent issues.
>>
>> zhao
>>
>> Ben Clifford wrote:
>>
>>> I just asked zhao for the log files (both swift and -info) for the patched
>>> run; but I think I'd like to see the unpatched run logs too.
From wilde at mcs.anl.gov Mon Apr 14 14:39:39 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 14 Apr 2008 14:39:39 -0500
Subject: [Swift-user] sites.xml entry for Abe teragrd site
Message-ID: <4803B2FB.8040201@mcs.anl.gov>

Mike,

I think this is pretty close to what you need, but I did not test it:

/cfs/scratch/users/mkubal/swiftwork
-or-
/u/ac/mkubal/swiftwork
- be sure to create these swiftwork dirs first!

What you should do:

create the swiftwork dirs listed above; the first is for large scratch space, the second is for your persistent user space

remove the -comments- above and use only one workdirectory. I think you can use mainly the scratch one for now

test submitting a simple command via globus-job-run (first to the default for jobmanager, then to jobmanager-pbs)

test copying a short file to the work dirs using globus-url-copy

then try a simple workflow

Ben, Sarah or Mihael may be able to help you find out if WS-GRAM is available and working on Abe. If so, you should switch to that to avoid overrunning Abe's gatekeeper. And use the throttling properties that you and Ben worked out.

- Mike

From wilde at mcs.anl.gov Mon Apr 14 15:25:01 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 14 Apr 2008 15:25:01 -0500
Subject: [Swift-user] Teragrid info for WS-GRAM and pre-WS_GRAM
Message-ID: <4803BD9D.3010205@mcs.anl.gov>

The following table (if accurate) seems to have all the info needed for sites.xml entries for all GRAM versions on all TG sites:

http://www.teragrid.org/userinfo/jobs/gram.php

If there are any discrepancies or issues with this config info, we (and users) should contact help at teragrid.org.

A link to this should be added to the Swift Users Guide sections 15 and/or 16.

From benc at hawaga.org.uk Wed Apr 16 14:42:39 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 16 Apr 2008 19:42:39 +0000 (GMT)
Subject: [Swift-user] Swift 0.5 released.
Message-ID:

Swift 0.5 is now available for download from
http://www.ci.uchicago.edu/swift/packages/vdsk-0.5.tar.gz

This is intended to address a number of bugs that were present in 0.4,
most notably data channel reuse in GridFTP and a number of problems with
recent compiler enhancements.

For more information about Swift, visit http://www.ci.uchicago.edu/swift/

--