[Swift-devel] Re: GRAM and Swift discussion this week?

Ioan Raicu iraicu at cs.uchicago.edu
Tue May 22 14:34:07 CDT 2007


See below:

Ben Clifford wrote:
> On Tue, 22 May 2007, Ian Foster wrote:
>
>   
>> Are there WS-GRAM issues that are causing problems for Swift?
>>     
>
> No one uses WS-GRAM with Swift, so we aren't really uncovering issues 
> there.
>
>
>   
>> Is advance reservation important for Swift?
>>     
>
> We haven't really talked about that. I'm not sure how it would fit in, but 
> if people want it, it would be nice to accommodate it somehow.
>
>
>   
>> Swift is increasingly using Falkon to handle submissions, which reduces 
>> the number of GRAM operations performed significantly.
>>     
>
> At the high/experimental end, yes. However, if we have any expectation of 
> people downloading and using it by themselves without us providing 
> professional services-style consultancy, then those users won't be going 
> anywhere near Falkon any time soon.
>   
We have learned quite a bit about setting up Falkon at different sites 
across the TG.  The caveats that we have to watch out for are:

   1. platform-specific JVM location: it is not set correctly in the
      remote machine's environment and differs from site to site;
      this remains an issue that needs to be addressed per site
   2. some sites require the project be explicitly specified; this has
      been fixed
   3. expired-credential errors are not propagated to the user's
      screen; they are simply written to the logs...
   4. some sites (ANL) support GRAM4 extensions, while other sites do
      not; we now support both RSL formats
   5. the many logs that we generate are quite hard for people to
      follow, and it is hard to keep track of what each one contains;
      we fixed this by developing a GUI that can connect to the GT4
      container remotely and display relevant information!
   6. TG machines have an old kernel that does not support changing the
      thread stack size
          * this has implications on the number of threads a JVM can
            create before running out of memory
          * we have observed that we can create about 100 to 200 threads
            per JVM on most TG nodes
          * the GT4 container operates on a pool of threads for
            everything it does, so the max number of threads it will
            create is bounded!
          * the provisioner currently creates a new thread for every job
            (resource allocation) it sends to GRAM4
                o depending on which allocation strategy is used, this
                  may or may not be a problem on TG nodes
                o in theory, we don't want more than 100 or so GRAM4
                  jobs running in parallel, but if we choose the policy
                  in which each job allocates a single machine, then we
                  can easily surpass 100 jobs in parallel; all the
                  other policies would be able to allocate 1K+, even
                  10K+ machines with fewer than 100 jobs in parallel, so
                  it could work perfectly fine even with the current
                  implementation; in the long run, the provisioner could
                  be changed to use a pool of threads!
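
To illustrate the thread-pool idea in item 6, here is a minimal sketch. Python is used purely for illustration; this is not the provisioner's code. The names `submit_allocation`, `provision`, and `MAX_PROVISIONER_THREADS` are hypothetical, and the cap of 100 is an assumption taken from the 100 to 200 threads per JVM observed above.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Assumed cap, chosen to stay under the observed per-JVM thread limit.
MAX_PROVISIONER_THREADS = 100


def submit_allocation(job_id):
    """Hypothetical stand-in for sending one resource-allocation job to GRAM4.

    It only records which worker thread handled the job.
    """
    return (job_id, threading.current_thread().name)


def provision(job_ids, max_threads=MAX_PROVISIONER_THREADS):
    """Run every allocation through a fixed-size pool instead of one thread per job."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() queues all jobs but never runs more than max_threads at once,
        # and it returns results in submission order.
        return list(pool.map(submit_allocation, job_ids))


if __name__ == "__main__":
    results = provision(range(1000), max_threads=8)
    workers = {name for _, name in results}
    print(len(results), "jobs handled by", len(workers), "threads")
```

With a pool like this, the one-machine-per-job policy could queue many more than 100 allocations while the number of live threads stays bounded by the pool size.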

The things that I believe are needed for it to be more friendly to 
new/existing users outside of the core developers:

   1. A suite of tests that will ensure everything is set up correctly
      before using Falkon
          * we could check against grid-proxy-info in a script
          * make sure GRAM4 works at the particular site by using
            globusrun-ws
          * check the JAVA_HOME and java commands from within a GRAM4
            submitted job
          * check if ANT is installed; this is needed to recompile the
            Falkon service
   2. get more of the Falkon configuration parameters into config files,
      rather than scripts or code!
   3. clean up the scripts, and make them more robust and user friendly
   4. make an interface into the provisioning component and Falkon to
      allow the live configuration of Falkon without requiring restarts
   5. Documentation well beyond the current 1 page readme that is only
      sufficient if everything works!
   6. There is no documentation on how to set up the needed security if
      a user wants to enable security in Falkon; the default is no security

Maybe there are others that I missed, but I don't think we are that far 
from people being able to use it without us taking them by the hand the 
entire way.  These improvements are not at the top of my to-do list, 
but in time, I'll get them done.  If anyone wants to help with these, 
I would not refuse anyone's help.

Ioan
> --
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================



