[Swift-devel] Problems running coaster
Mihael Hategan
hategan at mcs.anl.gov
Mon Jul 28 09:17:06 CDT 2008
On Mon, 2008-07-28 at 08:27 -0500, Michael Wilde wrote:
> I tried jobManager="gt2:gt2:pbs" but still get the same error.
The error seems to be related to the fork job that tries to start the
service. Do they disallow fork jobs?
>
> Note that each time this fails I also see this in the log:
> --
> 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task
> Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated.
> Removing service.
> 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not
> appear to be registered with this manager
> --
>
> Does that indicate a problem?
>
> Other notes:
>
> There was no coaster log in my home dir on the submit host
> (communicado). Should there be, for remote execution? Or will that log
> show up on the remote site, where the coaster service is run?
>
> There was no gram log on abe to indicate that a job was started there.
> It seems like the initial job that should run on abe to start the
> coaster service is failing. What piece of code creates that job?
>
> Full logs are on CI net at ~wilde/coast/run5
>
> sites.xml was:
>
> <config>
> <pool handle="abe" >
> <execution provider="coaster" url="grid-abe.ncsa.teragrid.org"
> jobManager="gt2:gt2:pbs" />
> <profile namespace="karajan" key="jobThrottle">4</profile>
> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
> <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
> <profile namespace="globus" key="project">TG-MCA01S018</profile>
>
>
> <!--altworkdirectory>/cfs/scratch/users/wilde/swiftwork</altworkdirectory-->
> <!--SwiftDACprofile namespace="globus"
> key="project">TG-CCR080002N</SwiftDACprofile-->
>
> </pool>
> </config>
>
> error was same:
>
> 2008-07-28 08:22:46,936-0500 INFO vdl:dostagein START
> jobid=echo-28cx26xi - Staging in files
> 2008-07-28 08:22:46,936-0500 INFO vdl:dostagein END jobid=echo-28cx26xi
> - Staging in finished
> 2008-07-28 08:22:46,937-0500 DEBUG vdl:execute2 JOB_START
> jobid=echo-28cx26xi tr=echo arguments=[the string is, s000]
> tmpdir=ctest-20080728-0822-23q64s0d/jobs/2/echo-28cx26xi host=abe
> 2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler
> multiplyScore(abe:0.000(1.000):1/5 overload: 0, -0.2)
> 2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler Old score:
> 0.000, new score: -0.200
> 2008-07-28 08:22:46,957-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:0-1-1-1217251364677) setting status to Submitting
> 2008-07-28 08:22:46,971-0500 INFO LocalService Started local service:
> 128.135.125.17:50000
> 2008-07-28 08:22:46,979-0500 INFO BootstrapService Socket bound. URL is
> http://128.135.125.17:50001
> 2008-07-28 08:22:47,008-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:cog-1217251364678) setting status to Submitting
> 2008-07-28 08:22:48,032-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:cog-1217251364678) setting status to Submitted
> 2008-07-28 08:22:48,389-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:cog-1217251364678) setting status to Active
> 2008-07-28 08:22:58,853-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:cog-1217251364678) setting status to Completed
> 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task
> Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated.
> Removing service.
> 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not
> appear to be registered with this manager
> 2008-07-28 08:22:59,055-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:0-1-1-1217251364677) setting status to Submitted
> 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Submission
> time for Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677):
> 12098ms. Score delta: -0.05947692307692308
> 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler
> multiplyScore(abe:-0.200(0.889):1/4 overload: 0, -0.05947692307692308)
> 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Old score:
> -0.200, new score: -0.259
> 2008-07-28 08:22:59,056-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:0-1-1-1217251364677) setting status to Active
> 2008-07-28 08:22:59,057-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:0-1-1-1217251364677) setting status to Failed Could not
> submit job
> 2008-07-28 08:22:59,061-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> jobid=echo-28cx26xi - Application exception: Could not submit job
> vdl:execute @ vdl-int.k, line: 395
> sys:sequential @ vdl-int.k, line: 387
> ...
> rlog:restartlog @ ctest.kml, line: 66
> kernel:project @ ctest.kml, line: 2
> ctest-20080728-0822-23q64s0d
> Caused by:
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Could not submit job
> Caused by:
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Could not start coaster service
> Caused by:
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Task ended before registration was received.
> STDOUT: This node is in dedicated user mode.
>
> STDERR: null
>
>
> On 7/27/08 11:27 PM, Ben Clifford wrote:
> > On Mon, 28 Jul 2008, Ben Clifford wrote:
> >
> >>> jobManager="gt2:pbs" />
> >> Try gt2:gt2:pbs
> >
> > In more detail: this field, in the case of coasters, encodes a lot of
> > information in a not-so-obvious way:
> >
> > a:b[:c]
> >
> > a = cog provider to use to submit the remote headnode job
> > b = cog provider that the remote headnode job will use to submit workers
> > c = jobmanager to be used by cog provider b
> >
> > There happens to be a cog provider called pbs that doesn't work so well.
> > That is what you were specifying.
> >
> > gt2:gt2:pbs specifies gt2 for both situations, using the pbs jobmanager for
> > the worker node submissions in gram2.
> >
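
For illustration, here is a minimal Python sketch of how the a:b[:c] field
described above breaks down into its parts. The names CoasterJobManager and
parse_job_manager are hypothetical and are not part of Swift or the CoG kit;
this only shows the split, not how the coaster provider actually parses the
value.

```python
from typing import NamedTuple, Optional


class CoasterJobManager(NamedTuple):
    """Hypothetical container for the three parts of a:b[:c]."""
    head_provider: str                # a: provider that submits the remote headnode job
    worker_provider: str              # b: provider the headnode job uses to submit workers
    worker_jobmanager: Optional[str]  # c: jobmanager used by provider b (optional)


def parse_job_manager(value: str) -> CoasterJobManager:
    """Split a jobManager string of the form a:b[:c] into its parts."""
    parts = value.split(":")
    # Exactly two or three non-empty fields are accepted.
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError("expected a:b[:c], got %r" % value)
    jobmanager = parts[2] if len(parts) == 3 else None
    return CoasterJobManager(parts[0], parts[1], jobmanager)


if __name__ == "__main__":
    # "gt2:pbs" selects the pbs *provider* for worker submission,
    # while "gt2:gt2:pbs" selects the gt2 provider with the pbs jobmanager.
    print(parse_job_manager("gt2:pbs"))
    print(parse_job_manager("gt2:gt2:pbs"))
```

Read this way, the difference between the two values in this thread is visible
at a glance: in "gt2:pbs" the second field names a provider, so pbs is not
being used as a GRAM2 jobmanager at all.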