[Swift-devel] Coaster worker connect problem and control over coaster logging?

Michael Wilde wilde at mcs.anl.gov
Fri Oct 8 08:53:09 CDT 2010


I've filed the coaster logging issue as bug 226.

Sarah, I think this would make a very good first code change to work on. Can you take a look at it and start a Bugzilla thread to discuss it?

Thanks,

- Mike



----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:

> Found the worker connect problem: the order of interfaces passed was
> such that the first failed, and the reconnect logic did not clear the
> failure flag correctly.  Fixed in cog 2903.
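
For anyone following along, here is a minimal sketch of that class of bug and its fix. It is not the actual cog 2903 change (the real worker is Perl); try_connect, REACHABLE and connect_any are hypothetical names made up for this illustration. The point is only that the failure flag is reset before each address is tried, so a failure on the first interface cannot leak into the later attempts.

```shell
# Hypothetical stand-in for the real connect call: succeeds only for
# the address named in $REACHABLE (an assumption made for this sketch).
try_connect() {
    [ "$1" = "$REACHABLE" ]
}

# Try each address in order. The failure flag is cleared at the top of
# every iteration, so one failed address does not poison the next try.
connect_any() {
    for addr in "$@"; do
        failed=0
        try_connect "$addr" || failed=1
        if [ "$failed" -eq 0 ]; then
            echo "connected $addr"
            return 0
        fi
        echo "failed $addr" >&2
    done
    return 1
}
```

If failed were set once outside the loop and never cleared, every address after the first failure would be reported as failed, even a reachable one.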
> 
> The logging problem still needs to be investigated.
> 
> - Mike
> 
> 
> ----- wilde at mcs.anl.gov wrote:
> 
> > I'm debugging a problem I just started seeing in coasters on PADS. I'm
> > getting "error code 29" returned from a simple 1-cat job.
> >
> > What's happening is that the coaster worker is failing to connect. It's
> > possible I broke it in a recent commit; I hope to know soon. I see
> > this in the worker log:
> >
> > 1286490899.535 DEBUG - Trying 169.254.95.119:35151...
> > 1286490899.571 DEBUG - Connection failed: Connection refused. Trying other addresses
> > 1286490899.571 DEBUG - Trying 172.5.86.5:35151...
> > 1286490899.573 DEBUG - Connection failed: Illegal seek. Trying other addresses
> > 1286490899.573 DEBUG - Trying 192.5.86.5:35151...
> > 1286490899.574 DEBUG - Connection failed: Illegal seek. Trying other addresses
> > 1286490899.574 ERROR - Connection failed for all addresses. Retrying in 1 seconds
> >
> > A second question here is about control of the worker log. I see the
> > env variable WORKER_LOGGING_ENABLED getting set in the coaster pbs
> > submit file. But as far as I can tell, this will not be picked up by
> > the worker unless it's exported.
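
To make the export point concrete, here is a small sketch of standard /bin/sh behavior. The child_view helper is a hypothetical name used only for this illustration; it stands in for the worker process and reports what a child actually sees.

```shell
# Report what a child process sees for WORKER_LOGGING_ENABLED.
child_view() {
    sh -c 'echo "${WORKER_LOGGING_ENABLED:-unset}"'
}

# Start clean so an inherited value does not hide the effect.
unset WORKER_LOGGING_ENABLED

# Plain assignment: the variable lives only in this shell.
WORKER_LOGGING_ENABLED=true
child_view    # prints "unset"

# After export, it is placed in the environment of child processes.
export WORKER_LOGGING_ENABLED
child_view    # prints "true"
```

So, assuming nothing else exports the variable first, the submit file would need either "export WORKER_LOGGING_ENABLED=true" or the assignment placed as a prefix on the perl command itself for the worker to see it.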
> >
> > Has this always been set this way? Is anyone actually *getting* worker
> > logs in their ~/.globus/coasters directory using trunk?
> >
> > I will look further into this; in the meantime I'm forcing TRACE
> > logging on in worker.pl (which is how I finally got the messages
> > above).
> >
> > Sarah: this is an interesting and challenging case in error reporting.
> > Diagnosing this involves tracking the error from swift stdout to the
> > pbs stderr file (with debug=true in the etc/provider-pbs.properties
> > file) to the coaster worker log (with elevated logging levels). Let's
> > discuss how the defaults in all three of these places could be better,
> > how the relevant files could be better coalesced for the user, and
> > perhaps integrated by some post-processing diagnostic tool.
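
As a starting point for that discussion, here is a rough sketch of what a post-processing step might look like. It is only an illustration, not an existing Swift tool: collect_errors is a hypothetical name, and the caller supplies the files, so nothing here assumes Swift's directory layout.

```shell
# Print every line mentioning "error" or "failed" (case-insensitive)
# from the files given, each prefixed with the file it came from.
# Unreadable or missing files are skipped silently.
collect_errors() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        awk -v f="$f" 'tolower($0) ~ /error|failed/ { print f ": " $0 }' "$f"
    done
}
```

It could be pointed at the pbs stderr files and the worker logs in one call, for example collect_errors ~/.globus/scripts/*.stderr ~/.globus/coasters/*.log, where the *.log pattern for worker logs is an assumption on my part.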
> >
> > - Mike
> >
> > login1$ cat *54.submit
> > #PBS -S /bin/sh
> > #PBS -N Block-1007-340807-000000
> > #PBS -m n
> > #PBS -l nodes=1
> > #PBS -l walltime=01:00:00
> > #PBS -q short
> > #PBS -o /home/wilde/.globus/scripts/PBS3642641579913160354.submit.stdout
> > #PBS -e /home/wilde/.globus/scripts/PBS3642641579913160354.submit.stderr
> > WORKER_LOGGING_ENABLED=true
> > cd / && /usr/bin/perl /home/wilde/.globus/coasters/cscript4619936640935778716.pl http://169.254.95.119:37300,http://172.5.86.5:37300,http://192.5.86.5:37300 1007-340807-000000 /home/wilde/.globus/coasters
> > /bin/echo $? >/home/wilde/.globus/scripts/PBS3642641579913160354.submit.exitcode
> > login1$
> >
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



