[Swift-devel] bug 53

Ioan Raicu iraicu at cs.uchicago.edu
Sat Sep 15 18:09:25 CDT 2007


Hi Nika, all,
I have just committed my changes for the last week or two (R1221):
- added a GUI monitor to view the service state remotely
- added support for data caching, including modifying the service WSDL
  to support the meta-data associated with the data caching; if data
  caching is not needed, the old clients using the old WSDL should still
  work just fine. The data-aware scheduler has not been tested much yet,
  so use it with caution
- added some new options to the two config files for the service and
  workers
- added some new scripts for all components (clients, service, workers)
- added various sample workloads to test the data caching code
- addressed BUG53 from Swift: added support for capturing and
  identifying certain hardware errors (e.g. stale NFS file handle), and
  retrying those tasks up to a certain number of retry attempts; also
  added support for the scheduler to suspend, for some time, workers
  that have failed some number of tasks (for now, only the known
  hardware errors count toward this; application errors should not
  suspend any workers)
- other minor stuff

Regarding BUG53, here are more details (service/etc/Falkon.config):
#settings for task retries upon known errors (i.e. stale NFS file
#handle) and max number of known failures per node before suspending the
#corresponding node
maxNumErrorsPerTask=10
maxNumErrorsPerExecutor=3
suspendTimeoutInterval_ms=15000

These settings are probably a good place to start, but we can change
them as we learn more about how this new feature behaves when there
are errors.  For example, we might want to increase the suspend timeout
from 15 sec to, say, 60 sec, and possibly reduce the number of errors
per executor from 3 to 1, as it's likely that when one error occurs on
a node, the next two tasks sent there will fail as well.
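
As a rough illustration of how these three settings might interact,
here is a minimal Python sketch. It is not the actual Falkon code: the
class and function names are hypothetical, and it assumes that
application errors are simply reported as failures without a retry,
that an executor is suspended once its count reaches
maxNumErrorsPerExecutor, and that the count resets after a suspension.

```python
from dataclasses import dataclass, field

# Values copied from service/etc/Falkon.config as quoted in this message.
MAX_NUM_ERRORS_PER_TASK = 10
MAX_NUM_ERRORS_PER_EXECUTOR = 3
SUSPEND_TIMEOUT_INTERVAL_MS = 15000

# Hypothetical marker list; the message names only the stale NFS case.
KNOWN_ERROR_MARKERS = ("Stale NFS file handle",)


def is_known_error(message):
    """True if the failure text matches a known hardware error."""
    return any(marker in message for marker in KNOWN_ERROR_MARKERS)


@dataclass
class RetryPolicy:
    """Hypothetical bookkeeping for task retries and executor suspension."""
    task_errors: dict = field(default_factory=dict)
    executor_errors: dict = field(default_factory=dict)
    suspended_until_ms: dict = field(default_factory=dict)

    def is_suspended(self, executor, now_ms):
        return now_ms < self.suspended_until_ms.get(executor, 0)

    def on_failure(self, task, executor, message, now_ms):
        """Return 'retry' or 'fail' for a task that just failed."""
        if not is_known_error(message):
            # Assumption: application errors are not retried here and
            # never count against the executor.
            return "fail"
        count = self.executor_errors.get(executor, 0) + 1
        if count >= MAX_NUM_ERRORS_PER_EXECUTOR:
            # Suspend the executor and reset its count (assumption).
            self.suspended_until_ms[executor] = (
                now_ms + SUSPEND_TIMEOUT_INTERVAL_MS)
            count = 0
        self.executor_errors[executor] = count
        self.task_errors[task] = self.task_errors.get(task, 0) + 1
        if self.task_errors[task] < MAX_NUM_ERRORS_PER_TASK:
            return "retry"
        return "fail"
```

With these values, a node that hits three known errors would receive no
new tasks for 15 seconds, and a task would be given up on at its tenth
known error.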

Nika, feel free to update Falkon (the provider should not need an
update), and try another 244-molecule run with MolDyn.  You have an
account to charge against at ANL/UC, right?  You can try to do the whole
run yourself, and see how it works.  Let us know if the workflow
completes!  I am also still interested in comparison numbers for MolDyn
running directly over GRAM with the larger number of molecules, so keep
us posted on that progress as well.

BTW, Catalin is still waiting on instructions on how to compile the
entire MolDyn app.  If you do not have time for this now, should we
move to a different app for running Swift+Falkon in a virtual cluster?

Ioan

Veronika Nefedova wrote:
> Ioan, how is your work on that 'avoiding bad node' thing progressing?
> You seem to be more interested in running my workflow on a virtual
> cluster rather than working on a new feature that would enable
> MolDyn to run reliably on TG... I apologize if I am wrong - the lack
> of information led me to this conclusion; please provide me
> with the relevant information and an estimate of when I can expect
> Falkon to be ready for a new round of tests.
>
> Thanks,
>
> Nika
>
>
>
> On Sep 13, 2007, at 5:48 PM, Ioan Raicu wrote:
>
>> It would be good to have some comparison numbers, so I think it's
>> worth doing to see if the workflow will complete, and to see what
>> performance it gets!
>> Ioan
>>
>> Veronika Nefedova wrote:
>>> Thanks, Mihael! I could try submitting some 20 molecules to
>>> tg-uc now (directly to GRAM) -- just to be on the safe side. If no
>>> GRAM problems are reported, I'll increase the number to 244.
>>> Of course, the performance will suffer greatly -- but I hope it
>>> would enable the whole workflow to go through. Are there any
>>> throttles that could be set to increase the performance a bit (given
>>> that I set the maxSubmitRate to 0.2)?
>>>
>>> Nika
>>>
>>> On Sep 13, 2007, at 4:41 PM, Mihael Hategan wrote:
>>>
>>>> Ok, so there's something in.
>>>> There are some discussions that can be had on certain aesthetic
>>>> topics.
>>>> In any event, in sites.xml, you can add, for a site, something like
>>>> this:
>>>>
>>>> <profile namespace="karajan" key="maxSubmitRate">0.1</profile>
>>>>
>>>> The rate is in jobs per second. The above would mean one job every ten
>>>> seconds.
>>>>
>>>> Mihael
>>>>
>>>> On Thu, 2007-09-13 at 15:23 +0000, Ben Clifford wrote:
>>>>> Yes?
>>>>>
>>>>> On Thu, 13 Sep 2007, Mihael Hategan wrote:
>>>>>
>>>>>> May I still fix that bug though?
>>>>>>
>>>>>> On Thu, 2007-09-13 at 09:54 -0500, Ioan Raicu wrote:
>>>>>>> Hi,
>>>>>>> I am still working on the new feature for Falkon to avoid
>>>>>>> submitting tasks to known bad nodes, and to perhaps do its own
>>>>>>> retries for failed jobs with certain known errors (e.g. stale NFS
>>>>>>> handle).  I should have that ready for next week to try out.  Once
>>>>>>> this new feature is in, we could try MolDyn again to see how it
>>>>>>> behaves.
>>>>>>>
>>>>>>> About avoiding Falkon for MolDyn, I recall something about the
>>>>>>> scalability/policies of GRAM/PBS in handling many concurrent jobs,
>>>>>>> having to throttle job submissions to something around 1 job every
>>>>>>> 10 seconds (for sustained periods of time; short bursts could send
>>>>>>> faster), and the fact that only a few tens of nodes would be used
>>>>>>> concurrently, even though the sites that it was running on had more
>>>>>>> free nodes.  I also think that MolDyn through GRAM/PBS was running
>>>>>>> only 1 job per node, in essence only using 1 processor of the 2 per
>>>>>>> node.  I think the largest workflow Nika was able to run over
>>>>>>> GRAM/PBS was 5 molecules, 421 jobs (but only 340 jobs in the large
>>>>>>> stage).
>>>>>>> Nika, were there other problems you encountered?
>>>>>>>
>>>>>>> Ioan
>>>>>>>
>>>>>>> Mihael Hategan wrote:
>>>>>>>> Very well Sir. I shall see to the priority of the issue being
>>>>>>>> raised.
>>>>>>>>
>>>>>>>> On Thu, 2007-09-13 at 14:09 +0000, Ben Clifford wrote:
>>>>>>>>
>>>>>>>>> I think one of the main impediments to moldyn running with GRAM
>>>>>>>>> directly is bug 53, which is a request for submission rate
>>>>>>>>> limiting.
>>>>>>>>>
>>>>>>>>> It might be relatively easy to implement that and see how the
>>>>>>>>> MolDyn workflow behaves then.
>>>>>>>>>
>>>>>>>>> I'm interested to see if Falkon can be avoided for this workflow.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Swift-devel mailing list
>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================
