[Swift-devel] Returning GRAM errors to swift user
Michael Wilde
wilde at mcs.anl.gov
Sun Feb 15 17:50:18 CST 2009
I'm assuming that Swift and the CoG provider return as much information
about GRAM errors to the user as they have. But for jobs that fail to
start, e.g., due to an invalid project code, the underlying error never
makes it back to the user (but *is* present in the GRAM log).
In this case, can the message below, from the GRAM log,
"GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid
account\n" be made available in the GRAM API so it can be sent to the user?
I'm assuming this particular issue is well known to users experienced
with TeraGrid sites, like Sarah, but is perhaps worth pointing out in a
troubleshooting section. If there's a chance that some of this GRAM
error info can be returned but is not currently, I can file this in
bugzilla.
It seems like a few errors, such as account/project errors, or other
invalid job specs (like time/queue mismatches?) are similarly not passed
back. Is that the case?
Relevant snips from the logs are below.
Also interesting to note: On the UC teragrid site, a project specified
in sites.xml via the globus profile does *not* override a default
project set by the tgprojects command. In my case, I had an invalid
(old) project set via tgprojects, which took precedence over the one in
my sites.xml. When I set the default project to "None" in tgprojects,
then the sites.xml project was accepted and the job ran.
- Mike
In swift .log file:
2009-02-15 16:59:27,408-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=uname-0gpb1o6j - Application exception: The job failed when the
job manager attempted to run it
Caused by: org.globus.gram.GramException: The job failed when the job
manager attempted to run it
Messages on swift stdout/err:
===============================
Swift svn swift-r2532 cog-r2300
RunID: 20090215-1733-0x4ksmd8
Progress:
Progress: Stage in:1
Progress: Active:1
Failed to transfer wrapper log from un-20090215-1733-0x4ksmd8/info/k on uc32
Progress: Failed:1
Execution failed:
Exception in uname:
Arguments: [-a]
Host: uc32
Directory: un-20090215-1733-0x4ksmd8/jobs/k/uname-kl1p2o6j
stderr.txt:
stdout.txt:
----
Caused by:
The job failed when the job manager attempted to run it
===============================
But the following useful info is in the gram log (on the server side),
which did not make it to the swift logs above:
Sun Feb 15 17:33:58 2009 JM_SCRIPT: submitting job --
/soft/torque/bin/qsub <
/home/wilde/.globus/job/tg-grid1.uc.teragrid.org/14326.1234740838/scheduler_pbs_job_script
2>/home/wilde/.globus/job/tg-grid1.uc.teragrid.org/14326.1234740838/scheduler_pbs_submit_stderr
Sun Feb 15 17:33:58 2009 JM_SCRIPT: qsub returned
Sun Feb 15 17:33:58 2009 JM_SCRIPT: qsub stderr qsub: Invalid Account
MSG=invalid account
2/15 17:33:58 JM: GT3 extended error message:
GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE =
qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
2/15 17:33:58 JM: in globus_gram_job_manager_reporting_file_create()
2/15 17:33:58 JM: not reporting job information
2/15 17:33:58 JM: in globus_gram_job_manager_history_file_create()
2/15 17:33:58 JM: NOT empty client callback list.
2/15 17:33:58 JM: sending callback of status 4 (failure code 17) to
https://128.135.125.17:50000/1234740837636.
2/15 17:33:58 JMI: testing job manager scripts for type pbs exist and
permissions are ok.
2/15 17:33:58 JMI: completed script validation: job manager type is pbs.
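
Until something like this is surfaced through the API, here is a rough sketch of how the extended message and the callback status/failure code could be pulled out of a GRAM log excerpt like the one above. It is an illustration only: the function name extract_gram_failure and both regular expressions are my own assumptions, derived solely from the log lines quoted in this message, and other GRAM versions may format these lines differently.

```python
import re

# Patterns are based only on the log lines quoted above (assumption);
# other GRAM versions may format these lines differently.
FAILURE_RE = re.compile(r"GRAM_SCRIPT_GT3_FAILURE_MESSAGE:(.*)")
CALLBACK_RE = re.compile(r"sending callback of status (\d+) \(failure code (\d+)\)")


def extract_gram_failure(log_text):
    """Return (message, status, failure_code) found in a GRAM log excerpt.

    Any field that is not present in the text is returned as None.
    Hypothetical helper, not part of Swift, CoG, or Globus.
    """
    message = status = code = None
    for line in log_text.splitlines():
        m = FAILURE_RE.search(line)
        if m and message is None:
            text = m.group(1).strip()
            # The log writes the trailing newline as the two characters "\n".
            if text.endswith("\\n"):
                text = text[:-2]
            message = text.strip()
        m = CALLBACK_RE.search(line)
        if m and status is None:
            status, code = int(m.group(1)), int(m.group(2))
    return message, status, code


sample = "\n".join([
    "2/15 17:33:58 JM: GT3 extended error message:",
    r"GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid account\n",
    "2/15 17:33:58 JM: sending callback of status 4 (failure code 17) to",
])
print(extract_gram_failure(sample))
```

This only helps someone who can read the server-side log; it does not change what GRAM sends back in the callback.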