[Swift-devel] clustering problem:

Michael Wilde wilde at mcs.anl.gov
Tue Nov 13 04:30:48 CST 2007


I suspect a problem in clustering.

I had the following entries in tc.data:

UC        angle  /home/wilde/angle32/bin/angle.multiarch.sh  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=20;
sdsc      angle  /users/ux454325/angle/bin/angle.sh          INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=20;
tungsten  angle  /u/ac/wilde/angle/bin/angle.sh              INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=20;
teraport  angle  /home/wilde/angle/bin/angle.sh              INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=20;
mercury   angle  /home/ncsa/wilde/angle/bin/angle.sh         INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=20;

and the following swift.properties:

kickstart.always.transfer=true

clustering.enabled=true
clustering.queue.delay=15
clustering.min.time=12000

throttle.transfers=64

sitedir.keep=true

lazy.errors=true

--
With these settings, a batch of 100 jobs caused job manager failures and no
jobs started. The server-side jobs, inf and status dirs were empty.

No jobs would show up in the PBS queue.

I found the following in the server-side GRAM logs:

gram_job_mgr_1000.log:11/13 03:36:04 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE:This job will be charged to account: brn (TG-CCR080001) qsub: Illegal attribute or resource value for Resource_List.walltime
gram_job_mgr_1000.log:11/13 03:36:04 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17

--
When I changed maxwalltime to "00:05:00" and the properties to:

clustering.queue.delay=30
clustering.min.time=1200
throttle.transfers=16

things worked, and all 100 jobs finished smoothly.

I suspect that something in my previous parameters is causing an invalid
walltime to be sent to PBS. I am still digging into this but need help.
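
As a way to test that suspicion, here is a minimal Python sketch that normalizes a walltime value into the hh:mm:ss form before it would be handed to qsub. None of this is taken from the Swift or GRAM source: the function name to_pbs_walltime is hypothetical, and the sketch assumes that a bare integer such as 20 means minutes and that PBS is happy with hh:mm:ss. It only illustrates the kind of check that would reject a malformed value early.

```python
import re

# Assumed input form: "hh:mm:ss" with minutes and seconds in 00-59.
_HMS = re.compile(r"^([0-9]+):([0-5][0-9]):([0-5][0-9])$")


def to_pbs_walltime(value):
    """Return value as "HH:MM:SS", or raise ValueError if it is not recognized.

    Hypothetical helper: a bare integer is assumed to be minutes,
    anything else must already look like hh:mm:ss.
    """
    text = str(value).strip()
    if re.fullmatch(r"[0-9]+", text):
        total_seconds = int(text) * 60
    else:
        match = _HMS.match(text)
        if match is None:
            raise ValueError("unrecognized walltime: %r" % text)
        hours, minutes, seconds = (int(g) for g in match.groups())
        total_seconds = hours * 3600 + minutes * 60 + seconds
    if total_seconds <= 0:
        raise ValueError("walltime must be positive: %r" % text)
    # Re-split the total so the output is always zero-padded HH:MM:SS.
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return "%02d:%02d:%02d" % (hours, minutes, seconds)
```

Under these assumptions, to_pbs_walltime("20") gives "00:20:00" and to_pbs_walltime("00:05:00") is returned unchanged, while a value with a stray character such as "20;" raises ValueError instead of reaching the scheduler.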







