[Swift-user] Re: [falkon-user] 1) disable retry mechanism and 2) continue on failure?
Zhao Zhang
zhaozhang at uchicago.edu
Thu Apr 3 14:47:04 CDT 2008
Thanks, Ben
zhao
Ben Clifford wrote:
> it's fine for now.
>
> There's a convention for storing log files - put the .log file and the
> whole .d directory somewhere in ~benc/swift-logs/ in CI NFS space.
>
> Most simply, put files directly in there; for a more structured layout see
> how mike has organised his stuff under ~benc/swift-logs/wilde/
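
A minimal sketch of the log-storage convention described above, for illustration only. The function name archive_swift_logs is hypothetical, and the sketch assumes the .d directory sits next to the .log file and shares its base name (run.log and run.d); adjust if your run lays them out differently.

```shell
#!/bin/sh
# Hypothetical helper, not part of Swift.
# archive_swift_logs LOG DEST
# Copies LOG and its sibling directory (LOG with .log replaced by .d) into DEST.
archive_swift_logs() {
    log=$1
    dest=$2
    ddir="${log%.log}.d"          # assumed naming: foo.log pairs with foo.d

    mkdir -p "$dest" || return 1
    cp "$log" "$dest/" || return 1

    # The .d directory is copied whole, but only if it exists next to the log.
    if [ -d "$ddir" ]; then
        cp -r "$ddir" "$dest/" || return 1
    fi
}
```

Usage would look something like `archive_swift_logs myrun.log ~benc/swift-logs/`, where myrun.log is a placeholder name.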
>
> On Thu, 3 Apr 2008, Zhao Zhang wrote:
>
>
>> Sorry, Ben.
>>
>> I didn't save the swift log file. If you really need the old -info file, I
>> could redo the test, and try to send them to you.
>> But for now, I have several urgent issues.
>>
>> zhao
>>
>> Ben Clifford wrote:
>>
>>> I just asked zhao for the log files (both swift and -info) for the patched
>>> run; but I think I'd like to see the unpatched run logs too.
>>>
>>> On Wed, 2 Apr 2008, Ioan Raicu wrote:
>>>
>>>
>>>
>>>> Hi Ben,
>>>> Thanks again for the patches; they made a huge difference, increasing
>>>> efficiency from 21% to 81%!
>>>>
>>>> Here are the numbers:
>>>>
>>>>                     1 Node Perf  Falkon     Swift+Falkon  Swift+Falkon (patched)
>>>> Min                 63.618       53.782     169.139       58.538
>>>> Average             64.76        65.47253   309.1945      80.21246
>>>> Median              64.74072     64.774     313.5535      76.5245
>>>> Max                 65.863       94.447     605.654       115.237
>>>> Standard Deviation  0.488984     3.863944   52.13821      10.95652
>>>> Efficiency          100%         99%        21%           81%
>>>>
>>>>
>>>> The first column shows the per-task statistics when running on 1 node
>>>> (4 CPUs) through Falkon. The second column shows the statistics for running
>>>> the application at large scale, on 2048 CPUs. The 3rd column is running
>>>> Swift+Falkon (both from SVN) on 256 CPUs. The 4th column is Swift+Falkon,
>>>> but with the 3 patches applied to Swift. Essentially, the average per-task
>>>> execution time was reduced from 309 seconds to 80 seconds, where the ideal
>>>> would have been 64 seconds. That brought the efficiency from 21% to 81% for
>>>> this particular workload. This looks fantastic! We'll have to verify that we
>>>> can maintain this 81% efficiency at higher numbers of CPUs. In the meantime,
>>>> if you can think of anything else that we could do to keep pushing the 81%
>>>> efficiency number higher, let us know.
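
The Efficiency figures quoted above appear consistent with dividing the ideal (1-node) average task time by the observed average task time. That formula is inferred from the numbers, not stated in the message. A small sketch, with efficiency_pct as a hypothetical helper name:

```shell
#!/bin/sh
# efficiency_pct IDEAL OBSERVED
# Prints IDEAL/OBSERVED as a whole-number percentage (assumed definition of
# "efficiency", inferred from the quoted averages).
efficiency_pct() {
    awk -v ideal="$1" -v observed="$2" \
        'BEGIN { printf "%.0f\n", 100 * ideal / observed }'
}

efficiency_pct 64.76 65.47253   # Falkon alone, 2048 CPUs
efficiency_pct 64.76 309.1945   # Swift+Falkon, unpatched
efficiency_pct 64.76 80.21246   # Swift+Falkon, patched
```

With the averages from the message this prints 99, 21 and 81, which lines up with the reported 99%, 21% and 81%.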
>>>>
>>>> Thanks again,
>>>> Ioan
>>>>
>>>> Ben Clifford wrote:
>>>>
>>>>
>>>>> On Mon, 31 Mar 2008, Ben Clifford wrote:
>>>>>
>>>>>
>>>>>
>>>>>> This temporary directory handling is pretty ugly - it should be a change
>>>>>> of a couple of lines to wrapper.sh to get similar functionality using the
>>>>>> existing swift temporary directory handling - change the path to /tmp and
>>>>>> use cp instead of ln -s. That way you can take advantage of Swift's
>>>>>> existing unique job IDs and error handling too.
>>>>>>
>>>>>>
>>>>> Attached are three patches that will apply against svn r1775:
>>>>>
>>>>> The first puts temporary directories in /tmp rather than on shared fs.
>>>>> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-on-tmp
>>>>>
>>>>> The second copies the application file to the worker in each job execution
>>>>> (though it doesn't do any worker-node caching of such files between jobs)
>>>>> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-dirs-mv-executable
>>>>>
>>>>> The third creates the worker node log on /tmp and copies it at the end.
>>>>> http://www.ci.uchicago.edu/~benc/tmp/wrapper-tmp-log-locally
>>>>>
>>>>> All three modify wrapper.sh and should be applied in the above order.
>>>>>
>>>>> With the first two patches, the timestamps in the usual info logs will
>>>>> provide information about how long the copies take, in the same way that
>>>>> they usually indicate times for other execution stages.
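
For readers without the patches to hand, here is a rough sketch of the three ideas described above: scratch directory on node-local /tmp, cp of the executable instead of ln -s, and a log written locally and copied back once at the end. This is not the actual patch content and not Swift's wrapper.sh; the function name run_in_local_tmp, its arguments, and the log file layout are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch only; names and layout are assumptions.
# run_in_local_tmp JOBID SHARED_LOG_DIR APP [ARGS...]
run_in_local_tmp() {
    jobid=$1
    shared=$2
    app=$3
    shift 3

    # 1. Per-job scratch directory on node-local /tmp, not on the shared fs.
    jobdir=$(mktemp -d "${TMPDIR:-/tmp}/job-$jobid.XXXXXX") || return 1
    log="$jobdir/wrapper.log"

    # 2. Copy the executable to the worker (cp rather than ln -s).
    if ! cp "$app" "$jobdir/app" || ! chmod +x "$jobdir/app"; then
        rm -rf "$jobdir"
        return 1
    fi

    # 3. Write the log locally while the job runs, with timestamps.
    echo "start $(date +%s)" >>"$log"
    ( cd "$jobdir" && ./app "$@" ) >>"$log" 2>&1
    rc=$?
    echo "end $(date +%s) rc=$rc" >>"$log"

    # Copy the log to shared storage once, at the end, then clean up.
    mkdir -p "$shared" && cp "$log" "$shared/$jobid.log"
    rm -rf "$jobdir"
    return $rc
}
```

The point of the design is that the shared filesystem is touched only for the initial copy of the executable and the final copy of the log, while everything in between stays on local disk.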
>>>>>