[Swift-devel] Problems running coaster
Michael Wilde
wilde at mcs.anl.gov
Mon Jul 28 08:27:51 CDT 2008
I tried jobManager="gt2:gt2:pbs" but still get the same error.
Note that each time this fails I also see this in the log:
--
2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task
Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated.
Removing service.
2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not
appear to be registered with this manager
--
Does that indicate a problem?
Other notes:
There was no coaster log in my home dir on the submit host
(communicado). Should there be, for remote execution? Or will that log
show up on the remote site, where the coaster service is run?
There was no gram log on abe to indicate that a job was started there.
It seems like the initial job that should run on abe to start the
coaster service is failing. What piece of code creates that job?
Full logs are on CI net at ~wilde/coast/run5
sites.xml was:
<config>
  <pool handle="abe">
    <execution provider="coaster" url="grid-abe.ncsa.teragrid.org"
               jobManager="gt2:gt2:pbs" />
    <profile namespace="karajan" key="jobThrottle">4</profile>
    <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
    <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
    <profile namespace="globus" key="project">TG-MCA01S018</profile>
    <!--altworkdirectory>/cfs/scratch/users/wilde/swiftwork</altworkdirectory-->
    <!--SwiftDACprofile namespace="globus" key="project">TG-CCR080002N</SwiftDACprofile-->
  </pool>
</config>
The error was the same:
2008-07-28 08:22:46,936-0500 INFO vdl:dostagein START
jobid=echo-28cx26xi - Staging in files
2008-07-28 08:22:46,936-0500 INFO vdl:dostagein END jobid=echo-28cx26xi
- Staging in finished
2008-07-28 08:22:46,937-0500 DEBUG vdl:execute2 JOB_START
jobid=echo-28cx26xi tr=echo arguments=[the string is, s000]
tmpdir=ctest-20080728-0822-23q64s0d/jobs/2/echo-28cx26xi host=abe
2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler
multiplyScore(abe:0.000(1.000):1/5 overload: 0, -0.2)
2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler Old score:
0.000, new score: -0.200
2008-07-28 08:22:46,957-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:0-1-1-1217251364677) setting status to Submitting
2008-07-28 08:22:46,971-0500 INFO LocalService Started local service:
128.135.125.17:50000
2008-07-28 08:22:46,979-0500 INFO BootstrapService Socket bound. URL is
http://128.135.125.17:50001
2008-07-28 08:22:47,008-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1217251364678) setting status to Submitting
2008-07-28 08:22:48,032-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1217251364678) setting status to Submitted
2008-07-28 08:22:48,389-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1217251364678) setting status to Active
2008-07-28 08:22:58,853-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1217251364678) setting status to Completed
2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task
Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated.
Removing service.
2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not
appear to be registered with this manager
2008-07-28 08:22:59,055-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:0-1-1-1217251364677) setting status to Submitted
2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Submission
time for Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677):
12098ms. Score delta: -0.05947692307692308
2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler
multiplyScore(abe:-0.200(0.889):1/4 overload: 0, -0.05947692307692308)
2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Old score:
-0.200, new score: -0.259
2008-07-28 08:22:59,056-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:0-1-1-1217251364677) setting status to Active
2008-07-28 08:22:59,057-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:0-1-1-1217251364677) setting status to Failed Could not
submit job
2008-07-28 08:22:59,061-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=echo-28cx26xi - Application exception: Could not submit job
vdl:execute @ vdl-int.k, line: 395
sys:sequential @ vdl-int.k, line: 387
...
rlog:restartlog @ ctest.kml, line: 66
kernel:project @ ctest.kml, line: 2
ctest-20080728-0822-23q64s0d
Caused by:
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Could not submit job
Caused by:
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Could not start coaster service
Caused by:
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Task ended before registration was received.
STDOUT: This node is in dedicated user mode.
STDERR: null
On 7/27/08 11:27 PM, Ben Clifford wrote:
> On Mon, 28 Jul 2008, Ben Clifford wrote:
>
>>> jobManager="gt2:pbs" />
>> Try gt2:gt2:pbs
>
> In more detail: this field, in the case of the coaster provider, encodes a lot of
> information in a not-so-obvious way:
>
> a:b[:c]
>
> a = cog provider to use to submit the remote headnode job
> b = cog provider that the remote headnode job will use to submit workers
> c = jobmanager to be used by cog provider b
>
> There happens to be a cog provider called pbs that doesn't work so well.
> That is what you were specifying.
>
> gt2:gt2:pbs specifies gt2 for both situations, using the pbs jobmanager for
> the worker node submissions in gram2.
>
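For reference, the a:b[:c] layout described above can be written out as a small sketch. This is only an illustration of how the field reads according to that description, not the code the coaster provider actually uses to parse it; the class, field, and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class CoasterJobManager(NamedTuple):
    head_provider: str                # a: cog provider that submits the remote headnode job
    worker_provider: str              # b: cog provider the headnode job uses to submit workers
    worker_jobmanager: Optional[str]  # c: jobmanager used by provider b (optional)


def parse_job_manager(value: str) -> CoasterJobManager:
    """Split a jobManager string of the form a:b[:c] into its parts."""
    parts = value.split(":")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"expected a:b[:c], got {value!r}")
    third = parts[2] if len(parts) == 3 else None
    return CoasterJobManager(parts[0], parts[1], third)
```

Read this way, "gt2:pbs" means a=gt2 and b=pbs (the pbs cog provider, with no jobmanager given), while "gt2:gt2:pbs" means a=gt2, b=gt2, c=pbs.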