[Swift-devel] Coaster worker connect problem and control over coaster logging?

Michael Wilde wilde at mcs.anl.gov
Thu Oct 7 21:31:45 CDT 2010


Found the worker connect problem: the interfaces were passed in an order where the first one failed, and the reconnect logic did not clear the failure flag correctly. Fixed in cog 2903.
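
To illustrate the kind of bug described above, here is a minimal sketch of an address-fallback loop that resets its failure flag before every attempt. It is not the code from worker.pl or the change in cog 2903; try_connect, connect_any, CONNECTED_ADDR, and the 192.0.2.x addresses are all hypothetical names chosen for the example.

```shell
#!/bin/sh
# Hypothetical stand-in for a real connect call: only 192.0.2.3 "accepts".
try_connect() {
    [ "$1" = "192.0.2.3:35151" ]
}

# Try each address in order; on the first success, record it and return 0.
connect_any() {
    CONNECTED_ADDR=""
    for addr in "$@"; do
        failed=0    # reset the flag for every attempt, so an earlier
                    # failure cannot leak into the next address
        try_connect "$addr" || failed=1
        if [ "$failed" -eq 0 ]; then
            CONNECTED_ADDR="$addr"
            return 0
        fi
        echo "Connection failed for $addr. Trying other addresses" >&2
    done
    return 1
}

connect_any 192.0.2.1:35151 192.0.2.2:35151 192.0.2.3:35151
echo "connected to: $CONNECTED_ADDR"
```

If the flag were set once outside the loop and never cleared, a failure on the first address would make every later attempt look like a failure as well, even when the address is reachable.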

The logging problem still needs to be investigated.
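
On the logging question quoted below, this small sketch shows the general shell behavior in play: a plain assignment is visible only in the current shell, while an exported variable is inherited by child processes. It uses sh -c as a stand-in child process (an assumption made for the example; the real child is the perl worker, which would see the same environment through %ENV).

```shell
#!/bin/sh
# Start from a clean slate in case the variable is already in the environment.
unset WORKER_LOGGING_ENABLED

# Plain assignment: visible in this shell only, not in child processes.
WORKER_LOGGING_ENABLED=true
unexported=$(sh -c 'echo "${WORKER_LOGGING_ENABLED:-unset}"')

# After export, child processes inherit the variable.
export WORKER_LOGGING_ENABLED
exported=$(sh -c 'echo "${WORKER_LOGGING_ENABLED:-unset}"')

echo "before export: $unexported"
echo "after export:  $exported"
```

This prints "unset" for the first case and "true" for the second, which is consistent with the worker not seeing the setting when the submit file only assigns it.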

- Mike


----- wilde at mcs.anl.gov wrote:

> I'm debugging a problem I just started seeing in coasters on PADS. I'm
> getting "error code 29" returned from a simple 1-cat job.
> 
> What's happening is that the coaster worker is failing to connect. It's
> possible I broke it in a recent commit; I hope to know soon. I see
> this in the worker log:
> 
> 1286490899.535 DEBUG - Trying 169.254.95.119:35151...
> 1286490899.571 DEBUG - Connection failed: Connection refused. Trying other addresses
> 1286490899.571 DEBUG - Trying 172.5.86.5:35151...
> 1286490899.573 DEBUG - Connection failed: Illegal seek. Trying other addresses
> 1286490899.573 DEBUG - Trying 192.5.86.5:35151...
> 1286490899.574 DEBUG - Connection failed: Illegal seek. Trying other addresses
> 1286490899.574 ERROR - Connection failed for all addresses. Retrying in 1 seconds
> 
> A second question here is about control of the worker log. I see the
> env variable WORKER_LOGGING_ENABLED getting set in the coaster pbs
> submit file. But as far as I can tell, this will not be picked up by
> the worker unless it's exported.
> 
> Has this always been set this way? Is anyone actually *getting* worker
> logs in their ~/.globus/coasters directory using trunk?
> 
> I will look further into this; in the meantime I'm forcing TRACE
> logging on in worker.pl (which is how I finally got the messages
> above).
> 
> Sarah: this is an interesting and challenging case in error reporting.
> Diagnosing this involves tracking the error from swift stdout to the
> pbs stderr file (with debug=true in the etc/provider-pbs.properties
> file) to the coaster worker log (with elevated logging levels). Let's
> discuss how the defaults in all three of these places could be better, and
> how the relevant files could be better coalesced for the user, and
> perhaps integrated by some post-processing diagnostic tool.
> 
> - Mike
> 
> login1$ cat *54.submit
> #PBS -S /bin/sh
> #PBS -N Block-1007-340807-000000
> #PBS -m n
> #PBS -l nodes=1
> #PBS -l walltime=01:00:00
> #PBS -q short
> #PBS -o /home/wilde/.globus/scripts/PBS3642641579913160354.submit.stdout
> #PBS -e /home/wilde/.globus/scripts/PBS3642641579913160354.submit.stderr
> WORKER_LOGGING_ENABLED=true
> cd / && /usr/bin/perl /home/wilde/.globus/coasters/cscript4619936640935778716.pl http://169.254.95.119:37300,http://172.5.86.5:37300,http://192.5.86.5:37300 1007-340807-000000 /home/wilde/.globus/coasters
> /bin/echo $? >/home/wilde/.globus/scripts/PBS3642641579913160354.submit.exitcode
> login1$ 
> 
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



