[Swift-devel] Problems running coaster

Mihael Hategan hategan at mcs.anl.gov
Mon Jul 28 09:17:06 CDT 2008


On Mon, 2008-07-28 at 08:27 -0500, Michael Wilde wrote:
> I tried jobManager="gt2:gt2:pbs" but still get the same error.

The error seems to come from the fork job that tries to start the
coaster service on abe. Do they disallow fork jobs there?

> 
> Note that each time this fails I also see this in the log:
> --
> 2008-07-28 08:22:58,853-0500 INFO  ServiceManager Service task 
> Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. 
> Removing service.
> 2008-07-28 08:22:58,853-0500 INFO  ServiceManager Service does not 
> appear to be registered with this manager
> --
> 
> Does that indicate a problem?
> 
> Other notes:
> 
> There was no coaster log in my home dir on the submit host 
> (communicado). Should there be, for remote execution? Or will that log 
> show up on the remote site, where the coaster service is run?
> 
> There was no gram log on abe to indicate that a job was started there.
> It seems like the initial job that should run on abe to start the 
> coaster service is failing. What piece of code creates that job?
> 
> Full logs are on CI net at ~wilde/coast/run5
> 
> sites.xml was:
> 
> <config>
> <pool handle="abe" >
>    <execution provider="coaster" url="grid-abe.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />
>    <profile namespace="karajan" key="jobThrottle">4</profile>
>    <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
>    <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
>    <profile namespace="globus" key="project">TG-MCA01S018</profile>
> 
>    <!--altworkdirectory>/cfs/scratch/users/wilde/swiftwork</altworkdirectory-->
>    <!--SwiftDACprofile namespace="globus" key="project">TG-CCR080002N</SwiftDACprofile-->
> 
> </pool>
> </config>
> 
> error was same:
> 
> 2008-07-28 08:22:46,936-0500 INFO  vdl:dostagein START 
> jobid=echo-28cx26xi - Staging in files
> 2008-07-28 08:22:46,936-0500 INFO  vdl:dostagein END jobid=echo-28cx26xi 
> - Staging in finished
> 2008-07-28 08:22:46,937-0500 DEBUG vdl:execute2 JOB_START 
> jobid=echo-28cx26xi tr=echo arguments=[the string is, s000] 
> tmpdir=ctest-20080728-0822-23q64s0d/jobs/2/echo-28cx26xi host=abe
> 2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler 
> multiplyScore(abe:0.000(1.000):1/5 overload: 0, -0.2)
> 2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler Old score: 
> 0.000, new score: -0.200
> 2008-07-28 08:22:46,957-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, 
> identity=urn:0-1-1-1217251364677) setting status to Submitting
> 2008-07-28 08:22:46,971-0500 INFO  LocalService Started local service: 
> 128.135.125.17:50000
> 2008-07-28 08:22:46,979-0500 INFO  BootstrapService Socket bound. URL is 
> http://128.135.125.17:50001
> 2008-07-28 08:22:47,008-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, 
> identity=urn:cog-1217251364678) setting status to Submitting
> 2008-07-28 08:22:48,032-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, 
> identity=urn:cog-1217251364678) setting status to Submitted
> 2008-07-28 08:22:48,389-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, 
> identity=urn:cog-1217251364678) setting status to Active
> 2008-07-28 08:22:58,853-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, 
> identity=urn:cog-1217251364678) setting status to Completed
> 2008-07-28 08:22:58,853-0500 INFO  ServiceManager Service task 
> Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. 
> Removing service.
> 2008-07-28 08:22:58,853-0500 INFO  ServiceManager Service does not 
> appear to be registered with this manager
> 2008-07-28 08:22:59,055-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, 
> identity=urn:0-1-1-1217251364677) setting status to Submitted
> 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Submission 
> time for Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677): 
> 12098ms. Score delta: -0.05947692307692308
> 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler 
> multiplyScore(abe:-0.200(0.889):1/4 overload: 0, -0.05947692307692308)
> 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Old score: 
> -0.200, new score: -0.259
> 2008-07-28 08:22:59,056-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, 
> identity=urn:0-1-1-1217251364677) setting status to Active
> 2008-07-28 08:22:59,057-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, 
> identity=urn:0-1-1-1217251364677) setting status to Failed Could not 
> submit job
> 2008-07-28 08:22:59,061-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
> jobid=echo-28cx26xi - Application exception: Could not submit job
>          vdl:execute @ vdl-int.k, line: 395
>          sys:sequential @ vdl-int.k, line: 387
> ...
>          rlog:restartlog @ ctest.kml, line: 66
>          kernel:project @ ctest.kml, line: 2
>          ctest-20080728-0822-23q64s0d
> Caused by: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
> Could not submit job
> Caused by: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
> Could not start coaster service
> Caused by: 
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
> Task ended before registration was received.
> STDOUT: This node is in dedicated user mode.
> 
> STDERR: null
> 
> 
> On 7/27/08 11:27 PM, Ben Clifford wrote:
> > On Mon, 28 Jul 2008, Ben Clifford wrote:
> > 
> >>> jobManager="gt2:pbs" />
> >> Try gt2:gt2:pbs
> > 
> > In more detail: this field, in the case of coasters, encodes a lot of 
> > information in a not-so-obvious way:
> > 
> >    a:b[:c]
> > 
> > a = cog provider to use to submit the remote headnode job
> > b = cog provider that the remote headnode job will use to submit workers
> > c = jobmanager to be used by cog provider b
> > 
> > There happens to be a cog provider called pbs that doesn't work so well. 
> > That is what you were specifying.
> > 
> > gt2:gt2:pbs specifies gt2 for both situations, using the pbs jobmanager for 
> > the worker node submissions in gram2.
> > 
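To make the a:b[:c] layout described above concrete, here is a minimal
sketch of how such a value could be split into its three parts. It is an
illustration only, not the code Swift or CoG actually uses; the names
parse_coaster_job_manager and CoasterJobManager are hypothetical.

```python
from typing import NamedTuple, Optional


class CoasterJobManager(NamedTuple):
    """Hypothetical container for the parts of an a:b[:c] jobManager value."""
    headnode_provider: str            # a: provider that submits the headnode job
    worker_provider: str              # b: provider the headnode job uses for workers
    worker_jobmanager: Optional[str]  # c: jobmanager used by provider b, if given


def parse_coaster_job_manager(value: str) -> CoasterJobManager:
    """Split an a:b[:c] string; raise ValueError if it has another shape."""
    parts = value.split(":")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"expected a:b[:c], got {value!r}")
    third = parts[2] if len(parts) == 3 else None
    return CoasterJobManager(parts[0], parts[1], third)


if __name__ == "__main__":
    for spec in ("gt2:pbs", "gt2:gt2:pbs"):
        print(spec, "->", parse_coaster_job_manager(spec))
```

Read this way, "gt2:pbs" selects the pbs cog provider for worker submission
with no jobmanager, while "gt2:gt2:pbs" selects gt2 for both submissions and
pbs as the jobmanager for the workers.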



