[Swift-devel] coasters-hosts.pl script
Jonathan Monette
jonmon at mcs.anl.gov
Sat Mar 3 17:21:47 CST 2012
Sure. I can help him debug and get the ip addresses he needs.
On Mar 3, 2012, at 9:40 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> Jon, thanks - I missed your note when I signed off last night. We did a few more tests and verified that on the default zeptoos kernel, hostname was returning a numeric dotted IP address like 172.nnn.nnn.nnn, while on the special kernel fixed for Mosa it was returning "(none)". So it seems like some config issue in that kernel profile.
>
> Emalayan was going to try to patch the call to hostname with a script that pulls the IP address from ifconfig or some other source.
>
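A minimal sketch of such a hostname replacement, in case it is a useful starting point. The function names (first_ipv4, is_ipv4, node_address) are made up for illustration, and it assumes ifconfig is available on the compute node and prints "inet addr:..." or "inet ..." lines. It takes the first non-loopback address, which may or may not be the right interface on these nodes.

```shell
#!/bin/sh
# Hypothetical helpers; none of this is taken from worker.pl.

# Read ifconfig-style text on stdin and print the first non-loopback
# IPv4 address. Accepts both "inet addr:172.18.1.19" and "inet 172.18.1.19".
first_ipv4() {
    sed -n 's/^[[:space:]]*inet \(addr:\)\{0,1\}\([0-9][0-9.]*\).*/\2/p' |
        grep -v '^127\.' | head -n 1
}

# Succeed only if $1 looks like a dotted-quad IPv4 address.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# Print what hostname returns if it is an IP address; otherwise
# (for example "(none)") fall back to the interface list.
node_address() {
    name=$(hostname 2>/dev/null)
    if is_ipv4 "$name"; then
        printf '%s\n' "$name"
    else
        ifconfig 2>/dev/null | first_ipv4
    fi
}
```

The worker could then call node_address wherever it currently relies on hostname returning the dotted IP address.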
> He was also going to try the compute node login procedure (ssh-telnet), and report if it still doesn't work, as that would be handy for debugging in this case.
>
> For the moment we were debugging this with cqsub jobs. I'm going to drop out and leave this to you and Emalayan. (I just happened to be online when he reported this problem.)
>
> Thanks,
>
> - Mike
>
>
> ----- Original Message -----
>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>> Cc: "Emalayan Vairavanathan" <svemalayan at yahoo.com>, "swift-devel at ci.uchicago.edu Devel"
>> <swift-devel at ci.uchicago.edu>, emalayan at ece.ubc.ca, "MosaStore" <mosastore at googlegroups.com>, "Justin M Wozniak"
>> <wozniak at mcs.anl.gov>
>> Sent: Friday, March 2, 2012 9:46:07 PM
>> Subject: Re: [Swift-devel] coasters-hosts.pl script
>> I was just about to send a note with this exact information. :)
>>
>> I do not think that hostname not being in the PATH would cause "none"
>> to appear. hostname has to be configured to return what you want it
>> to. If it wasn't in the path I think there would be more problems with
>> the worker since it would return an error for the binary not being
>> found, but I could be wrong there.
>>
>> But I echo Mike's debugging suggestions to diagnose the problems. Try
>> sshing/telnetting to the compute node to check out the environment
>> that the worker sees while it is running.
>>
>> On Mar 2, 2012, at 9:31 PM, Michael Wilde wrote:
>>
>>> Emalayan,
>>>
>>> The problem may be due to the hostname command returning something
>>> unexpected (perhaps null) on the worker nodes when booted under that
>>> kernel profile.
>>>
>>> These lines are in worker.pl:
>>>
>>> my $myhost=`hostname`;
>>> $myhost =~ s/\s+$//;
>>> ...
>>> wlog(INFO, "Running on node $myhost\n");
>>>
>>> To debug this, it seems useful to be able to log in to the compute
>>> nodes via the ssh-telnet procedure. That seemed not to work the last
>>> time we tried - perhaps that should be debugged.
>>>
>>> Also, you could run simple test jobs with cqsub to print the output
>>> (and location) of the hostname command.
>>>
>>> Perhaps hostname is not in the PATH for worker nodes booted under
>>> that kernel???
>>>
>>> I recall in the distant past we used to need to do various PATH and
>>> LD_LIBRARY_PATH initialization steps to get the right /bin and
>>> /usr/bin for the ZeptoOS nodes. Maybe something is broken in that
>>> regard for the alternate kernel profile?
>>>
>>> - Mike
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Emalayan Vairavanathan" <svemalayan at yahoo.com>
>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>, "Justin M Wozniak"
>>>> <wozniak at mcs.anl.gov>
>>>> Cc: "swift-devel at ci.uchicago.edu Devel"
>>>> <swift-devel at ci.uchicago.edu>, emalayan at ece.ubc.ca, "MosaStore"
>>>> <mosastore at googlegroups.com>
>>>> Sent: Friday, March 2, 2012 9:13:30 PM
>>>> Subject: Re: [Swift-devel] coasters-hosts.pl script
>>>> Hi Jon,
>>>>
>>>>
>>>> Thank you again for your time and the very quick fix. I tested the fix
>>>> with the zeptoos and zepto-vn-eval/mosatest kernel profiles. The fix only
>>>> worked with zeptoos. It did not work with zepto-vn-eval/mosatest (this
>>>> profile contains some MosaStore-related bug fixes, so we need it to run
>>>> MosaStore).
>>>>
>>>>
>>>>
>>>> The reason is:
>>>>
>>>>
>>>> I can generate worker-hosts.txt only with zeptoos; it did not work
>>>> with zepto-vn-eval/mosatest. This is because coasters-hosts.pl extracts
>>>> the worker IP addresses from the worker log files, but with
>>>> zepto-vn-eval/mosatest the worker log files did not contain the IP
>>>> address. Please see the log messages attached below.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Do you have any idea? It took me a few hours to narrow down the
>>>> problem and find out that the issue is with the kernel profile. I hope
>>>> this information will help you.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thank you
>>>> Emalayan
>>>>
>>>>
>>>>
>>>>
>>>> With zeptoos :
>>>>
>>>>
>>>> 2012/03/03 02:41:01.492 INFO - 2012.0303.023831.32459 Logging started: Sat Mar 3 02:41:01 2012
>>>> 2012/03/03 02:41:01.492 INFO - 2012.0303.023831.32459 Logging started: Sat Mar 3 02:41:01 2012
>>>> 2012/03/03 02:41:01.494 INFO - Running on node 172.18.1.19
>>>> 2012/03/03 02:41:01.494 DEBUG - uri=http://172.17.3.12:22346
>>>> 2012/03/03 02:41:01.494 DEBUG - scheme=http
>>>> 2012/03/03 02:41:01.495 DEBUG - host=172.17.3.12
>>>> 2012/03/03 02:41:01.495 DEBUG - port=22346
>>>> 2012/03/03 02:41:01.495 DEBUG - blockid=2012.0303.023831.32459
>>>>
>>>>
>>>> With zepto-vn-eval/mosatest:
>>>>
>>>>
>>>> 2012/03/03 02:50:40.667 INFO - 2012.0303.024814.15474 Logging started: Sat Mar 3 02:50:40 2012
>>>> 2012/03/03 02:50:40.683 INFO - Running on node (none)
>>>> 2012/03/03 02:50:40.684 DEBUG - uri=http://172.17.3.12:22346
>>>> 2012/03/03 02:50:40.684 DEBUG - scheme=http
>>>> 2012/03/03 02:50:40.685 DEBUG - host=172.17.3.12
>>>> 2012/03/03 02:50:40.685 DEBUG - port=22346
>>>> 2012/03/03 02:50:40.686 DEBUG - blockid=2012.0303.024814.15474
>>>>
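To make the difference between the two logs concrete, here is a rough sketch of the kind of matching involved. It is not the actual coasters-hosts.pl; worker_hosts and write_worker_hosts are hypothetical names, and the only assumption about the log format is the "Running on node <address>" line shown above.

```shell
#!/bin/sh
# Hypothetical helpers, not the real coasters-hosts.pl.

# Print the unique IPv4 addresses from "Running on node" lines in the
# given worker logs. A line such as "Running on node (none)" is skipped
# because it does not end in a dotted-quad address.
worker_hosts() {
    sed -n 's/.*Running on node \([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\)[[:space:]]*$/\1/p' "$@" |
        sort -u
}

# Write the host list to $1 from the remaining log arguments, and fail
# loudly if no address was found instead of leaving an empty file unnoticed.
write_worker_hosts() {
    out=$1
    shift
    worker_hosts "$@" > "$out"
    if [ ! -s "$out" ]; then
        echo "no worker IP addresses found in: $*" >&2
        return 1
    fi
}
```

Run against the zeptoos log this would print 172.18.1.19; against the zepto-vn-eval/mosatest log it finds nothing, which matches the empty worker-hosts.txt.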
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> From: Jonathan Monette <jonmon at mcs.anl.gov>
>>>> To: Justin M Wozniak <wozniak at mcs.anl.gov>
>>>> Cc: "swift-devel at ci.uchicago.edu Devel"
>>>> <swift-devel at ci.uchicago.edu>;
>>>> emalayan at ece.ubc.ca
>>>> Sent: Friday, 2 March 2012 2:21 PM
>>>> Subject: Re: [Swift-devel] coasters-hosts.pl script
>>>>
>>>> Emalayan,
>>>> We believe we have fixed the issue. You can copy the new
>>>> coasters-hosts.pl script from
>>>> ~jonmon/surveyor/worker-init-test/coasters-hosts.pl
>>>>
>>>> This script reads the worker logs located in the logs directory. The
>>>> steps to run are as follows:
>>>> start-coaster-service
>>>> <wait for workers to start>
>>>> ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt
>>>>
>>>> You MUST clean out the worker logs before you start a new coaster
>>>> service to make sure the script searches the right worker log files.
>>>> This may not be ideal at the moment, but it will help get you started.
>>>> If you have any other questions, feel free to ask. We will need to
>>>> update the mosaswift site with the new information; we will do this
>>>> soon.
>>>>
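The "<wait for workers to start>" step in the instructions above could be scripted along these lines. This is only a sketch: workers_reporting and wait_for_workers are hypothetical helpers, the 5-second poll interval is arbitrary, and it assumes a worker counts as started once its log contains a "Running on node" line.

```shell
#!/bin/sh
# Hypothetical helpers for the wait step; not part of the coaster scripts.

# Count the worker logs in directory $1 that contain a "Running on node" line.
workers_reporting() {
    grep -l 'Running on node' "$1"/worker-*.log 2>/dev/null | wc -l | tr -d ' '
}

# Poll every 5 seconds until at least $1 workers have reported, giving up
# after roughly $2 seconds. The log directory $3 defaults to "logs".
wait_for_workers() {
    want=$1
    timeout=$2
    logdir=${3:-logs}
    waited=0
    while [ "$(workers_reporting "$logdir")" -lt "$want" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 5
        waited=$((waited + 5))
    done
    return 0
}

# Intended sequence, using the commands from the steps above
# (the worker count and timeout here are placeholders):
#   rm -f logs/worker-*.log
#   start-coaster-service
#   wait_for_workers 4 600 && ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt
```

Removing the old logs first is what keeps the count, and the host list, limited to the current coaster service.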
>>>> On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote:
>>>>
>>>>> Can we match this line from the worker log:
>>>>>   2012/03/02 17:16:04.712 INFO - Running on node 172.18.1.83
>>>>> instead of this line from the cps log?
>>>>>   2012-03-02 17:21:25,214+0000 DEBUG Cpu worker started: block=2012.0302.171344.704 host=172.18.1.83 id=0
>>>>>
>>>>> They both provide the same ip addresses. And the worker log always
>>>>> has that ip address before the cps log does.
>>>>>
>>>>> On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote:
>>>>>
>>>>>> That fix still did not work. I had moved it to the same spot. It is
>>>>>> still waiting for the worker-init.pl script to finish before the IP
>>>>>> addresses are printed to the cps log. Those IP addresses are what
>>>>>> the coasters-hosts.pl script needs in order to finish. If I create an
>>>>>> empty file for the coasters-hosts.pl script to read, then the work
>>>>>> continues and the IP addresses show up in the cps log.
>>>>>>
>>>>>> Why is log4j waiting to add those lines to the cps log after the
>>>>>> worker-init.pl script is finished?
>>>>>>
>>>>>> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote:
>>>>>>
>>>>>>> Thanks, in my copy I thought I had moved the reconnect to before
>>>>>>> the init-cmd and it still wasn't working. I will test with your
>>>>>>> change. I just verified that it was indeed waiting for the
>>>>>>> worker-init.pl script to finish. I created an empty file for the
>>>>>>> script to read and it finished connecting and the ip addresses I
>>>>>>> needed were added to the cps log. I will also be testing your
>>>>>>> fix.
>>>>>>>
>>>>>>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Yes- I must have tested this with a different log file. I just
>>>>>>>> checked in and installed in ~wozniak/Public a fix for this that
>>>>>>>> launches WORKER_INIT_CMD after the reconnect(). I am a little
>>>>>>>> worried about time outs but it works so far. I will continue
>>>>>>>> testing...
>>>>>>>> Justin
>>>>>>>>
>>>>>>>> On Thu, 1 Mar 2012, Jonathan Monette wrote:
>>>>>>>>
>>>>>>>>> Justin,
>>>>>>>>> So I have been trying to help Emalayan get the host list file
>>>>>>>>> for the worker-init.pl script. It seems the cps log file is not
>>>>>>>>> providing the IP addresses for the coasters-hosts.pl script. I
>>>>>>>>> thought this was maybe because we did not have the correct log4j
>>>>>>>>> setting, but we have the Coaster service Cpu logger set to DEBUG.
>>>>>>>>> So for some reason the workers are not connecting to the service.
>>>>>>>>> When I comment out the export WORKER_ENVIRONMENT="…" line in the
>>>>>>>>> coaster-service.conf file, I see the workers connect and the cps
>>>>>>>>> log file shows their IP addresses. However, when this line is
>>>>>>>>> set, it seems they are not connecting.
>>>>>>>>>
>>>>>>>>> Emalayan thought there might be some sort of circular dependency
>>>>>>>>> between the host-list file and the worker. The worker
>>>>>>>>> requires the host-list file so that it can run the
>>>>>>>>> worker-init.pl script and then connect, but the host-list file
>>>>>>>>> cannot be generated because the workers cannot connect. I
>>>>>>>>> noticed in your swift-test directory that the cps files did have
>>>>>>>>> the IP addresses, and coasters-hosts.pl found and reported them.
>>>>>>>>> Did you try that test with the WORKER_ENVIRONMENT variable set
>>>>>>>>> in the coaster-service.conf file? Any idea what may be happening?
>>>>>>>>> The job shows as running under cqstat.
>>>>>>>>>
>>>>>>>>> A side note: at the mosaswift site, your example talks about
>>>>>>>>> running coasters-hosts.pl on the cps log, but the example you
>>>>>>>>> provide runs it on logs/coasters.log. This may need to be
>>>>>>>>> changed. Also, it should give the log4j setting that is required
>>>>>>>>> to generate the Cpu line with the worker IP address, to make
>>>>>>>>> clear that this setting is needed for the script to work.
>>>>>>>>>
>>>>>>>>> For reference, this line:
>>>>>>>>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG
>>>>>>>>
>>>>>>>> --
>>>>>>>> Justin M Wozniak
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Swift-devel mailing list
>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>>
>>> --
>>> Michael Wilde
>>> Computation Institute, University of Chicago
>>> Mathematics and Computer Science Division
>>> Argonne National Laboratory
>>>