[Swift-devel] clustering problem:
Michael Wilde
wilde at mcs.anl.gov
Tue Nov 13 04:30:48 CST 2007
I suspect a problem in clustering.
I had the following entries in tc.data:
UC angle /home/wilde/angle32/bin/angle.multiarch.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20;
sdsc angle /users/ux454325/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20;
tungsten angle /u/ac/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20;
teraport angle /home/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20;
mercury angle /home/ncsa/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20;
and the following swift.properties:
kickstart.always.transfer=true
clustering.enabled=true
clustering.queue.delay=15
clustering.min.time=12000
throttle.transfers=64
sitedir.keep=true
lazy.errors=true
--
When I ran a batch of 100 jobs with these settings, the job managers
failed and no jobs started. The server-side jobs, inf and status
directories were empty, and no jobs showed up in the PBS queue.
I found the following in the server-side GRAM logs:
gram_job_mgr_1000.log:11/13 03:36:04 JM: GT3 extended error message:
GRAM_SCRIPT_GT3_FAILURE_MESSAGE:This job will be charged to account: brn
(TG-CCR080001) qsub: Illegal attribute or resource value for
Resource_List.walltime
gram_job_mgr_1000.log:11/13 03:36:04 JMI: while return_buf =
GRAM_SCRIPT_ERROR = 17
--
When I changed maxwalltime to "00:05:00" and the properties to:
clustering.queue.delay=30
clustering.min.time=1200
throttle.transfers=16
things worked, and all 100 jobs finished smoothly.
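
To make the comparison explicit, here is a minimal Python sketch that lists what differs between the failing and the working run. The values are copied from this message; the dictionary layout and the helper name changed_settings are only illustrative and are not part of Swift.

```python
# Settings from the failing and the working run, copied from this message.
# Only the settings the message says were changed are listed.
failing = {
    "maxwalltime": "20",
    "clustering.queue.delay": "15",
    "clustering.min.time": "12000",
    "throttle.transfers": "64",
}
working = {
    "maxwalltime": "00:05:00",
    "clustering.queue.delay": "30",
    "clustering.min.time": "1200",
    "throttle.transfers": "16",
}


def changed_settings(before, after):
    """Return {key: (old, new)} for every key whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


diff = changed_settings(failing, working)
for key, (old, new) in diff.items():
    print(f"{key}: {old} -> {new}")
```

All four settings changed between the two runs, so this comparison alone cannot say which one is responsible; changing them back one at a time would narrow it down.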
I suspect that something in my previous parameters caused an invalid
walltime to be sent to PBS. I am still digging into this but need help.
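
As an aid for this kind of digging, the following Python sketch pulls the maxwalltime values out of tc.data-style text and reports whether each is a plain integer or an hh:mm:ss string. It is only an inspection helper: the function names and the regular expressions are my own assumptions, and it does not say which forms Swift, GRAM, or PBS actually accept.

```python
import re

# Matches strings such as 00:05:00 (hours:minutes:seconds).
HMS = re.compile(r"^\d+:[0-5]\d:[0-5]\d$")


def maxwalltime_values(tc_text):
    """Return every maxwalltime value found in tc.data-style text."""
    return re.findall(r"maxwalltime=([^;\s]+)", tc_text)


def walltime_form(value):
    """Classify a walltime value as 'integer', 'hh:mm:ss' or 'unrecognized'."""
    value = value.strip().strip('"')
    if value.isdigit():
        return "integer"
    if HMS.match(value):
        return "hh:mm:ss"
    return "unrecognized"


# One entry from this message, plus the same entry with the later value.
sample = (
    "UC angle /home/wilde/angle32/bin/angle.multiarch.sh "
    "INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20;\n"
    "UC angle /home/wilde/angle32/bin/angle.multiarch.sh "
    'INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00";\n'
)

for value in maxwalltime_values(sample):
    print(value, "->", walltime_form(value))
```

Running the same check over the walltime that finally reaches qsub (if it can be captured from the generated submit script) would show at which layer the value becomes invalid.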