[Swift-devel] Returning GRAM errors to swift user

Michael Wilde wilde at mcs.anl.gov
Sun Feb 15 17:50:18 CST 2009


I'm assuming that Swift and the CoG provider return as much information 
about GRAM errors to the user as they have. But for jobs that fail to 
start, e.g., due to an invalid project code, the underlying error never 
makes it back to the user (although it *is* present in the GRAM log).

In this case, can the message below, from the GRAM log, 
"GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid 
account\n" be made available in the GRAM API so it can be sent to the user?

I'm assuming this particular issue is well known to users experienced 
with TeraGrid sites, like Sarah, but is perhaps worth pointing out in a 
troubleshooting section. If there's a chance that some of this GRAM 
error info can be returned but is not currently, I can file this in 
bugzilla.

It seems like a few errors, such as account/project errors, or other 
invalid job specs (like time/queue mismatches?) are similarly not passed 
back. Is that the case?

Relevant snips from the logs are below.

Also interesting to note: On the UC TeraGrid site, a project specified 
in sites.xml via the globus profile does *not* override a default 
project set by the tgprojects command. In my case, I had an invalid 
(old) project set via tgprojects, which took precedence over the one in 
my sites.xml. When I set the default project to "None" in tgprojects, 
the sites.xml project was accepted and the job ran.

- Mike

In swift .log file:

2009-02-15 16:59:27,408-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
jobid=uname-0gpb1o6j - Application exception: The job failed when the 
job manager attempted to run it
Caused by: org.globus.gram.GramException: The job failed when the job 
manager attempted to run it

Messages on swift stdout/err:

===============================
Swift svn swift-r2532 cog-r2300

RunID: 20090215-1733-0x4ksmd8
Progress:
Progress:  Stage in:1
Progress:  Active:1
Failed to transfer wrapper log from un-20090215-1733-0x4ksmd8/info/k on uc32
Progress:  Failed:1
Execution failed:
         Exception in uname:
Arguments: [-a]
Host: uc32
Directory: un-20090215-1733-0x4ksmd8/jobs/k/uname-kl1p2o6j
stderr.txt:

stdout.txt:

----

Caused by:
         The job failed when the job manager attempted to run it

===============================


But the following useful info is in the gram log (on the server side), 
which did not make it to the swift logs above:


Sun Feb 15 17:33:58 2009 JM_SCRIPT: submitting job -- 
/soft/torque/bin/qsub < 
/home/wilde/.globus/job/tg-grid1.uc.teragrid.org/14326.1234740838/scheduler_pbs_job_script 
2>/home/wilde/.globus/job/tg-grid1.uc.teragrid.org/14326.1234740838/scheduler_pbs_submit_stderr
Sun Feb 15 17:33:58 2009 JM_SCRIPT: qsub returned
Sun Feb 15 17:33:58 2009 JM_SCRIPT: qsub stderr qsub: Invalid Account 
MSG=invalid account

2/15 17:33:58 JM: GT3 extended error message: 
GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = 
qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
2/15 17:33:58 JM: in globus_gram_job_manager_reporting_file_create()
2/15 17:33:58 JM: not reporting job information
2/15 17:33:58 JM: in globus_gram_job_manager_history_file_create()
2/15 17:33:58 JM: NOT empty client callback list.
2/15 17:33:58 JM: sending callback of status 4 (failure code 17) to 
https://128.135.125.17:50000/1234740837636.
2/15 17:33:58 JMI: testing job manager scripts for type pbs exist and 
permissions are ok.
2/15 17:33:58 JMI: completed script validation: job manager type is pbs.
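
Until this message is returned through the API, it can at least be pulled 
out of the GRAM log on the server side. Below is a minimal sketch in 
Python, assuming the log is readable by the user and that the extended 
message appears on a single line in the "GRAM_SCRIPT_GT3_FAILURE_MESSAGE:" 
form shown above (the lines in this mail are wrapped by the mail client). 
The function name summarize_gram_log is hypothetical and is not part of 
Swift, CoG, or Globus.

```python
import re

# Patterns taken from the log excerpt in this mail; other GRAM versions
# may format these lines differently (assumption).
FAILURE_RE = re.compile(r"GRAM_SCRIPT_GT3_FAILURE_MESSAGE:(.*)")
ERROR_RE = re.compile(r"GRAM_SCRIPT_ERROR = (\d+)")


def summarize_gram_log(log_text):
    """Return (messages, codes) found in the text of a GRAM job manager log.

    messages: distinct extended failure messages, in order of appearance.
    codes:    every GRAM_SCRIPT_ERROR value, as integers.
    """
    messages = []
    for match in FAILURE_RE.finditer(log_text):
        msg = match.group(1).strip()
        # The log writes a literal backslash-n at the end of the message.
        if msg.endswith("\\n"):
            msg = msg[:-2].rstrip()
        if msg and msg not in messages:
            messages.append(msg)
    codes = [int(code) for code in ERROR_RE.findall(log_text)]
    return messages, codes


# Two lines from the excerpt above, unwrapped onto single lines.
sample = r"""
2/15 17:33:58 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
"""

messages, codes = summarize_gram_log(sample)
for msg in messages:
    print("GRAM failure message:", msg)
print("GRAM error codes:", codes)
```

To use it on a real log, read the file's contents into a string and pass 
that to summarize_gram_log in place of the sample text.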



