[Swift-devel] Problems running coaster

Michael Wilde wilde at mcs.anl.gov
Mon Jul 28 08:27:51 CDT 2008


I tried jobManager="gt2:gt2:pbs" but still get the same error.

Note that each time this fails I also see this in the log:
--
2008-07-28 08:22:58,853-0500 INFO  ServiceManager Service task Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. Removing service.
2008-07-28 08:22:58,853-0500 INFO  ServiceManager Service does not appear to be registered with this manager
--

Does that indicate a problem?

Other notes:

There was no coaster log in my home dir on the submit host 
(communicado). Should there be, for remote execution? Or will that log 
show up on the remote site, where the coaster service is run?

There was no gram log on abe to indicate that a job was started there.
It seems like the initial job that should run on abe to start the 
coaster service is failing. What piece of code creates that job?

Full logs are on CI net at ~wilde/coast/run5

sites.xml was:

<config>
<pool handle="abe" >
   <execution provider="coaster" url="grid-abe.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />
   <profile namespace="karajan" key="jobThrottle">4</profile>
   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
   <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
   <profile namespace="globus" key="project">TG-MCA01S018</profile>

 
   <!--altworkdirectory>/cfs/scratch/users/wilde/swiftwork</altworkdirectory-->
   <!--SwiftDACprofile namespace="globus" key="project">TG-CCR080002N</SwiftDACprofile-->

</pool>
</config>

The error was the same:

2008-07-28 08:22:46,936-0500 INFO  vdl:dostagein START jobid=echo-28cx26xi - Staging in files
2008-07-28 08:22:46,936-0500 INFO  vdl:dostagein END jobid=echo-28cx26xi - Staging in finished
2008-07-28 08:22:46,937-0500 DEBUG vdl:execute2 JOB_START jobid=echo-28cx26xi tr=echo arguments=[the string is, s000] tmpdir=ctest-20080728-0822-23q64s0d/jobs/2/echo-28cx26xi host=abe
2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler multiplyScore(abe:0.000(1.000):1/5 overload: 0, -0.2)
2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.200
2008-07-28 08:22:46,957-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Submitting
2008-07-28 08:22:46,971-0500 INFO  LocalService Started local service: 128.135.125.17:50000
2008-07-28 08:22:46,979-0500 INFO  BootstrapService Socket bound. URL is http://128.135.125.17:50001
2008-07-28 08:22:47,008-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Submitting
2008-07-28 08:22:48,032-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Submitted
2008-07-28 08:22:48,389-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Active
2008-07-28 08:22:58,853-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Completed
2008-07-28 08:22:58,853-0500 INFO  ServiceManager Service task Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. Removing service.
2008-07-28 08:22:58,853-0500 INFO  ServiceManager Service does not appear to be registered with this manager
2008-07-28 08:22:59,055-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Submitted
2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Submission time for Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677): 12098ms. Score delta: -0.05947692307692308
2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler multiplyScore(abe:-0.200(0.889):1/4 overload: 0, -0.05947692307692308)
2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Old score: -0.200, new score: -0.259
2008-07-28 08:22:59,056-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Active
2008-07-28 08:22:59,057-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Failed Could not submit job
2008-07-28 08:22:59,061-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=echo-28cx26xi - Application exception: Could not submit job
         vdl:execute @ vdl-int.k, line: 395
         sys:sequential @ vdl-int.k, line: 387
...
         rlog:restartlog @ ctest.kml, line: 66
         kernel:project @ ctest.kml, line: 2
         ctest-20080728-0822-23q64s0d
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
STDOUT: This node is in dedicated user mode.

STDERR: null
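
To make the ordering in the log above easier to follow, here is a rough Python sketch that collects the status transitions per task identity. It is only an illustration: the function name `status_history` and the regular expression are my own assumptions, based on the "setting status to" lines shown in this log, not on any documented Swift or CoG log format. The pattern tolerates a line break between `JOB_SUBMISSION,` and `identity=`.

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the TaskImpl lines in the log above.
STATUS_RE = re.compile(
    r"Task\(type=JOB_SUBMISSION,\s+identity=(\S+?)\)"
    r"\s+setting\s+status\s+to\s+(\w+)"
)

def status_history(log_text):
    """Return {task identity: [statuses in the order they were logged]}."""
    history = defaultdict(list)
    for identity, status in STATUS_RE.findall(log_text):
        history[identity].append(status)
    return dict(history)

# A few lines copied from the log above (one of them wrapped on purpose).
SAMPLE_LOG = """\
2008-07-28 08:22:47,008-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1217251364678) setting status to Submitting
2008-07-28 08:22:48,032-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Submitted
2008-07-28 08:22:48,389-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Active
2008-07-28 08:22:58,853-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Completed
2008-07-28 08:22:59,057-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Failed Could not submit job
"""

if __name__ == "__main__":
    for identity, statuses in status_history(SAMPLE_LOG).items():
        print(identity, "->", " > ".join(statuses))
```

Read this way, the task urn:cog-1217251364678 goes from Active to Completed in about ten seconds, and only afterwards does the echo job's task (urn:0-1-1-1217251364677) fail with "Could not submit job".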


On 7/27/08 11:27 PM, Ben Clifford wrote:
> On Mon, 28 Jul 2008, Ben Clifford wrote:
> 
>>> jobManager="gt2:pbs" />
>> Try gt2:gt2:pbs
> 
> In more detail: this field, in the case of coaster, encodes a lot of 
> information in a not-so-obvious way:
> 
>    a:b[:c]
> 
> a = cog provider to use to submit the remote headnode job
> b = cog provider that the remote headnode job will use to submit workers
> c = jobmanager to be used by cog provider b
> 
> There happens to be a cog provider called pbs that doesn't work so well. 
> That is what you were specifying.
> 
> gt2:gt2:pbs specifies gt2 for both situations, using the pbs jobmanager for 
> the worker node submissions in gram2.
> 
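
As a small illustration of the a:b[:c] layout described above, here is a Python sketch that splits a jobManager value into its parts. The names `CoasterJobManager` and `parse_job_manager` are hypothetical and are not part of Swift or CoG; the sketch only mirrors the description in this message and says nothing about how the coaster provider actually parses the field.

```python
from typing import NamedTuple, Optional

class CoasterJobManager(NamedTuple):
    head_provider: str                 # a: provider that submits the head-node job
    worker_provider: str               # b: provider the head-node job uses for workers
    worker_jobmanager: Optional[str]   # c: job manager used by provider b, if given

def parse_job_manager(value: str) -> CoasterJobManager:
    """Split a jobManager string of the form a:b[:c] (hypothetical helper)."""
    parts = value.split(":")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"expected a:b[:c], got {value!r}")
    c = parts[2] if len(parts) == 3 else None
    return CoasterJobManager(parts[0], parts[1], c)

if __name__ == "__main__":
    print(parse_job_manager("gt2:gt2:pbs"))
    print(parse_job_manager("gt2:pbs"))
```

Under this reading, "gt2:pbs" puts pbs in position b, so it names the pbs cog provider rather than a PBS job manager, which matches the explanation above.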


