[Swift-devel] coasters-hosts.pl script

Michael Wilde wilde at mcs.anl.gov
Fri Mar 2 11:27:51 CST 2012


This all seems a bit brittle. I think what we did in Falkon was to use the Zoid init script that runs on the IOP to add the worker IPs:

http://wiki.mcs.anl.gov/zeptoos/index.php/ZOID#User_script

This script can find the subnet of the workers, and the worker IPs on that subnet are fixed.

You still have the issue of waiting for all the IPs to report back. Each worker could create a file in a directory. But you'd be less at the mercy of the worker.pl scripts and log4j to get the IP info you need, perhaps?
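
To make the file-per-worker idea concrete, here is a rough sketch in Python. It is not what Falkon or the ZOID script actually does. The function names, the one-file-per-IP directory layout, and the timeout values are all made up for illustration. Each worker drops a file named after its IP, and the collector polls until the expected number of files has shown up or a timeout expires.

```python
import os
import tempfile
import time


def report_ip(directory, ip):
    """Worker side (hypothetical): record this worker's IP as one small file."""
    os.makedirs(directory, exist_ok=True)
    # Write under a hidden temporary name, then rename, so the collector
    # never counts a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    with os.fdopen(fd, "w") as f:
        f.write(ip + "\n")
    os.replace(tmp_path, os.path.join(directory, ip))


def wait_for_worker_ips(directory, expected, timeout=60.0, poll=0.5):
    """Collector side (hypothetical): block until `expected` workers reported.

    Returns the reported IPs sorted as plain strings. Raises TimeoutError
    if fewer than `expected` files appear within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Skip hidden names, which are temporary files still being written.
            names = [n for n in os.listdir(directory) if not n.startswith(".")]
        except FileNotFoundError:
            names = []  # no worker has created the directory yet
        if len(names) >= expected:
            return sorted(names)
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "only %d of %d workers reported" % (len(names), expected)
            )
        time.sleep(poll)
```

The rename step assumes the temporary file and the final file live on the same filesystem, which holds here because both are in the same directory. Whether a directory visible to both the workers and the collector exists on the machine in question is a separate assumption.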

- Mike

----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Justin M Wozniak" <wozniak at mcs.anl.gov>
> Cc: "swift-devel at ci.uchicago.edu Devel" <swift-devel at ci.uchicago.edu>, emalayan at ece.ubc.ca
> Sent: Friday, March 2, 2012 11:15:03 AM
> Subject: Re: [Swift-devel] coasters-hosts.pl script
> That fix still did not work. I had moved it to the same spot. It is
> still waiting for the worker-init.pl script to finish before the IP
> addresses are printed to the cps log. Those IP addresses are what
> the coasters-hosts.pl script needs in order to finish. If I create an
> empty file for the coasters-hosts.pl script to read, then the work
> continues and the IP addresses show up in the cps log.
> 
> Why is log4j waiting until the worker-init.pl script has finished
> before it adds those lines to the cps log?
> 
> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote:
> 
> > Thanks, in my copy I thought I had moved the reconnect to before the
> > init-cmd and it still wasn't working. I will test with your change.
> > I just verified that it was indeed waiting for the worker-init.pl
> > script to finish. I created an empty file for the script to read and
> > it finished connecting and the IP addresses I needed were added to
> > the cps log. I will also be testing your fix.
> >
> > On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote:
> >
> >>
> >> Yes, I must have tested this with a different log file. I just
> >> checked in and installed in ~wozniak/Public a fix for this that
> >> launches WORKER_INIT_CMD after the reconnect(). I am a little
> >> worried about timeouts but it works so far. I will continue
> >> testing...
> >> 	Justin
> >>
> >> On Thu, 1 Mar 2012, Jonathan Monette wrote:
> >>
> >>> Justin,
> >>> So I have been trying to help Emalayan get the host list file for
> >>> the worker-init.pl script. It seems the cps log file is not
> >>> providing the IP addresses for the coasters-hosts.pl script. I
> >>> thought this was maybe because we did not have the correct log4j
> >>> setting, but we have the Coaster service Cpu logger set to DEBUG.
> >>> So for some reason the workers are not connecting to the service.
> >>> When I comment out the export WORKER_ENVIRONMENT="…" line in the
> >>> coaster-service.conf file I see the workers connect and the cps
> >>> log file shows their IP addresses. However, when this line is set,
> >>> they do not seem to connect.
> >>>
> >>> Emalayan thought there might be some sort of circular dependency
> >>> between the host-list file and the worker. The worker requires
> >>> the host-list file so that it can run the worker-init.pl script
> >>> and then connect, but the host-list file cannot be generated
> >>> because the workers cannot connect. I noticed in your swift-test
> >>> directory the cps files did have the IP addresses set and
> >>> coasters-hosts.pl found the IP addresses and reported them. Did
> >>> you try that test with the WORKER_ENVIRONMENT variable set in
> >>> the coaster-service.conf file? Any idea what may be happening? The
> >>> job shows as running in cqstat.
> >>>
> >>> A side note: At the mosaswift site, your example talks about
> >>> running coasters-hosts.pl on the cps log, but the example you
> >>> provide runs it on logs/coasters.log. This may need to be changed.
> >>> Also, the page should provide the log4j setting that is required
> >>> to generate the Cpu line with the worker IP address, just to
> >>> clarify that this line must be set for this script to work.
> >>>
> >>> For reference, this line:
> >>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG
> >>
> >> --
> >> Justin M Wozniak
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



