[Swift-devel] Coaster worker connect problem and control over coaster logging?
wilde at mcs.anl.gov
Thu Oct 7 17:55:02 CDT 2010
I'm debugging a problem I just started seeing with coasters on PADS. I'm getting "error code 29" returned from a simple 1-cat job.
What's happening is that the coaster worker is failing to connect. It's possible I broke it in a recent commit; I hope to know soon. I see this in the worker log:
1286490899.535 DEBUG - Trying 169.254.95.119:35151...
1286490899.571 DEBUG - Connection failed: Connection refused. Trying other addresses
1286490899.571 DEBUG - Trying 172.5.86.5:35151...
1286490899.573 DEBUG - Connection failed: Illegal seek. Trying other addresses
1286490899.573 DEBUG - Trying 192.5.86.5:35151...
1286490899.574 DEBUG - Connection failed: Illegal seek. Trying other addresses
1286490899.574 ERROR - Connection failed for all addresses. Retrying in 1 seconds
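The log shows the worker walking through every address it was given and, when all of them fail, waiting a second and starting over. Below is a minimal /bin/sh sketch of that control flow for readers following along. It is not the real implementation: worker.pl is Perl and may differ in detail, `try_connect` and `connect_any` are hypothetical names, `try_connect` is a stub that always fails, and the round limit is an assumption, since the log does not show whether the worker ever gives up.

```shell
#!/bin/sh
# Sketch only: the messages mirror the worker log above.
# try_connect is a hypothetical stub; a real version would open a TCP
# connection to "$1" (host:port) and return 0 on success.
try_connect() {
    return 1
}

# connect_any PORT MAX_ROUNDS ADDR...
# Tries every address once per round and sleeps 1 second between rounds.
# MAX_ROUNDS is an assumption added so that the sketch terminates.
connect_any() {
    port=$1
    max_rounds=$2
    shift 2
    round=1
    while :; do
        for addr in "$@"; do
            echo "Trying $addr:$port..."
            if try_connect "$addr:$port"; then
                echo "Connected to $addr:$port"
                return 0
            fi
            echo "Connection failed. Trying other addresses"
        done
        if [ "$round" -ge "$max_rounds" ]; then
            echo "Connection failed for all addresses. Giving up"
            return 1
        fi
        echo "Connection failed for all addresses. Retrying in 1 seconds"
        sleep 1
        round=$((round + 1))
    done
}

connect_any 35151 2 169.254.95.119 172.5.86.5 192.5.86.5 || echo "worker would exit here"
```

The addresses and port are the ones from the log. Swapping in a real `try_connect` would reproduce the same sequence of messages.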
A second question concerns control of the worker log. I see the environment variable WORKER_LOGGING_ENABLED being set in the coaster PBS submit file, but as far as I can tell, the worker will not pick it up unless it's exported.
Has this always been set this way? Is anyone actually *getting* worker logs in their ~/.globus/coasters directory using trunk?
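This is standard POSIX shell behavior. A plain `NAME=value` assignment creates a shell variable, and only exported variables are copied into the environment of child processes, such as the `/usr/bin/perl` that runs the worker script. Here is a minimal /bin/sh demonstration using the variable name from the submit file. The child `sh -c` stands in for the worker and is an assumption of this example.

```shell
#!/bin/sh
# Start clean: a variable already present in the inherited environment
# would stay exported after reassignment and hide the effect.
unset WORKER_LOGGING_ENABLED

# Plain assignment, as in the submit file: visible to this shell only.
WORKER_LOGGING_ENABLED=true
sh -c 'echo "child sees: ${WORKER_LOGGING_ENABLED:-<unset>}"'

# After export, every child process inherits it in its environment.
export WORKER_LOGGING_ENABLED
sh -c 'echo "child sees: ${WORKER_LOGGING_ENABLED:-<unset>}"'
```

The first child prints `child sees: <unset>` and the second prints `child sees: true`. In the submit file, `export WORKER_LOGGING_ENABLED=true` (valid in a POSIX sh) would make the setting visible to the worker, assuming worker.pl reads it from its environment.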
I'll look into this further. In the meantime I'm forcing TRACE logging on in worker.pl (which is how I finally got the messages above).
Sarah: this is an interesting and challenging case for error reporting. Diagnosing it means following the error from Swift's stdout, to the PBS stderr file (which requires debug=true in the etc/provider-pbs.properties file), to the coaster worker log (which requires an elevated logging level). Let's discuss how the defaults in all three of these places could be improved, how the relevant files could be gathered together for the user, and whether they could be integrated by some post-processing diagnostic tool.
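A first step toward such a tool could be a script that gathers the files the PBS provider already writes. Below is a minimal sketch. It assumes only the naming visible in the submit file shown below: `.submit.stdout`, `.submit.stderr` and `.submit.exitcode` files sitting next to the submit file in ~/.globus/scripts. `show_pbs_job` and `show_recent_worker_logs` are hypothetical names. The worker log file naming isn't shown here, so the second function only lists the newest entries in ~/.globus/coasters.

```shell
#!/bin/sh
# Hypothetical helpers; only the file suffixes come from the
# #PBS -o / -e directives and the exit-code line in the submit file.

# show_pbs_job SUBMIT_FILE
# Prints the exit code, stderr, and stdout files that sit next to a
# PBS submit file, marking any that are missing or unreadable.
show_pbs_job() {
    submit=$1
    for suffix in exitcode stderr stdout; do
        f="$submit.$suffix"
        echo "==> $f <=="
        if [ -r "$f" ]; then
            cat "$f"
        else
            echo "(missing)"
        fi
    done
}

# show_recent_worker_logs [DIR] [COUNT]
# Lists the newest entries in the coaster directory (default
# $HOME/.globus/coasters) without assuming how worker logs are named.
show_recent_worker_logs() {
    dir=${1:-$HOME/.globus/coasters}
    count=${2:-5}
    echo "==> newest entries in $dir <=="
    ls -t "$dir" 2>/dev/null | head -n "$count"
}
```

For the job below, you would run `show_pbs_job /home/wilde/.globus/scripts/PBS3642641579913160354.submit` followed by `show_recent_worker_logs`.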
- Mike
login1$ cat *54.submit
#PBS -S /bin/sh
#PBS -N Block-1007-340807-000000
#PBS -m n
#PBS -l nodes=1
#PBS -l walltime=01:00:00
#PBS -q short
#PBS -o /home/wilde/.globus/scripts/PBS3642641579913160354.submit.stdout
#PBS -e /home/wilde/.globus/scripts/PBS3642641579913160354.submit.stderr
WORKER_LOGGING_ENABLED=true
cd / && /usr/bin/perl /home/wilde/.globus/coasters/cscript4619936640935778716.pl http://169.254.95.119:37300,http://172.5.86.5:37300,http://192.5.86.5:37300 1007-340807-000000 /home/wilde/.globus/coasters
/bin/echo $? >/home/wilde/.globus/scripts/PBS3642641579913160354.submit.exitcode
login1$
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory