[Swift-devel] swift jobs hanging on ranger
skenny at uchicago.edu
Wed Jan 21 15:16:09 CST 2009
Looks like the way maxwalltime is parsed affects whether the job
gets into the scheduler. With maxwalltime=240 the submission fails:
[skenny at gwynn check_env]$ cog-job-submit -p gt2 -jm SGE -a project=TG-DBS090006,maxwalltime=240 -e /bin/hostname -s gatekeeper.ranger.tacc.teragrid.org
Forcing redirection because the SGE JM is broken.
Job failed:
org.globus.gram.GramException: The job manager detected an invalid script response
at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:534)
at org.globus.gram.GramJob.setStatus(GramJob.java:184)
at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
at java.lang.Thread.run(Thread.java:595)
whereas this succeeds:
[skenny at gwynn check_env]$ cog-job-submit -p gt2 -jm SGE -a project=TG-DBS090006,maxwalltime=60 -e /bin/hostname -s gatekeeper.ranger.tacc.teragrid.org
The problem seems to lie in SGE's inconsistent handling of
maxwalltime: a value of 60 is accepted, a value of 240 is not.
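
To narrow down where the cutoff is, the two submissions above could be repeated over a range of maxwalltime values. The following is only a rough sketch, not something that was run on Ranger: the helper names (build_command, job_failed, probe) are made up for illustration, and it assumes that a failed submission can be recognised by the "Job failed:" line seen in the output above. The cog-job-submit flags are copied unchanged from the commands in this message.

```python
import subprocess


def build_command(maxwalltime, project="TG-DBS090006",
                  host="gatekeeper.ranger.tacc.teragrid.org"):
    """Build the cog-job-submit argv from this thread, varying only maxwalltime."""
    return [
        "cog-job-submit", "-p", "gt2", "-jm", "SGE",
        "-a", "project={},maxwalltime={}".format(project, maxwalltime),
        "-e", "/bin/hostname",
        "-s", host,
    ]


def job_failed(output):
    """Assumption: a failed run prints 'Job failed:' as in the log above."""
    return "Job failed:" in output


def run_and_capture(argv):
    """Run one submission and return stdout and stderr as a single string."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    return proc.stdout


def probe(values, run=run_and_capture):
    """Submit once per maxwalltime value; map each value to True if no failure was seen."""
    return {v: not job_failed(run(build_command(v))) for v in values}
```

On the submit host, calling probe([60, 120, 180, 240]) would submit four jobs and report which values went through. The run argument is there so the logic can be exercised with a stand-in function instead of real submissions.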
---- Original message ----
>Date: Wed, 21 Jan 2009 13:37:39 -0600 (CST)
>From: <skenny at uchicago.edu>
>Subject: Re: [Swift-devel] swift jobs hanging on ranger
>To: Ben Clifford <benc at hawaga.org.uk>
>Cc: swift-devel at ci.uchicago.edu
>
>>try swift without using coasters and see if that goes through;
>
>still doesn't go thru and gives the same error in the gram
>log. but it doesn't hang indefinitely on the submit host. it
>gives:
>
>Progress: Stage in:1 Failed but can retry:2
>Failed to transfer wrapper log from
>env-20090121-1320-obxi57t5/info/4 on RANGER
>env failed
>Execution failed:
> Exception in env:
>Arguments: []
>Host: RANGER
>Directory: env-20090121-1320-obxi57t5/jobs/4/env-4vv5nh5j
>stderr.txt:
>
>stdout.txt:
>
>----
>
>Caused by:
> The job manager detected an invalid script response
>
>
>>cog-job-submit using coasters and see if that fails.
>
>[skenny at gwynn check_env]$ cog-job-submit -p coaster -jm
>gt2:gt2:SGE -a project=TG-DBS090006 -e /bin/hostname -s
>gatekeeper.ranger.tacc.teragrid.org
>Started local service: 128.135.92.83:50004
>Socket bound. URL is http://gwynn.bsd.uchicago.edu:50005
>[/129.114.50.163:34018] GET /coaster-bootstrap.jar HTTP/1.0
>[/129.114.50.163:34023] GET /list?serviceId=1487248510 HTTP/1.1
>GSSSChannel-null(0): Disabling heartbeats (config is null)
>Initialized connection handler
>Multiplexer 0 started
>(1) Scheduling GSSSChannel-null(1) for addition
>nullChannel started
>Connection handler started
>Multiplexer 1 started
>GSSSChannel-null(1) REQ: Handler(CHANNELCONFIG)
>Channel id: 341bc107:11efaac5124:-8000:-60989cff:11efaac517b:-8000
>MetaChannel: 11644607 -> null: Disabling heartbeats (disabled
>in config)
>MetaChannel: 11644607 -> null.bind -> GSSSChannel-null(1)
>GSSSChannel-null(1) REQ: Handler(REGISTER)
>Trying to re-bind current channel
>Sending Command(1, SUBMITJOB) on GSSSChannel-null(1)
>Command(1, SUBMITJOB) CMD: Command(1, SUBMITJOB)
>GSSSChannel-null(1) REPL: Command(1, SUBMITJOB)
>Submitted task Task(type=JOB_SUBMISSION,
>identity=urn:cog-1232566229197). Job id:
>urn:1232566229197-1232566244068-1232566244069
>Unregistering Command(1, SUBMITJOB)
>GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>Job completed
>
>
>********logs for this are here:
>
>/home/skenny/logs
>
>