[Swift-user] Kickstart executable not found

Mihael Hategan hategan at mcs.anl.gov
Fri Aug 31 14:49:09 CDT 2007


On Fri, 2007-08-31 at 14:35 -0500, Jing Tie wrote:
> Hi Mihael,
> 
> The OSG troubleshooting group would like to help me with some issues
> running jobs on OSG sites. Is it possible for me to see the submit file
> that Swift generated?

If you're referring to a condor submit file, then no, because Swift
doesn't use those. It makes a direct GRAM call.

You can however see the RSL specs that are submitted by adding the
following to etc/log4j.properties:
log4j.logger.org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler=DEBUG

The relevant information will then be in the log file. Grep for "RSL:".
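
As a rough sketch of the grep step, the following Python snippet collects
the lines containing "RSL:" from a log file. The function name
find_rsl_lines and the log path passed on the command line are assumptions
for illustration; use whichever log file your Swift run actually wrote.

```python
import sys


def find_rsl_lines(log_path):
    """Return every line of the log file that contains the marker "RSL:"."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log if "RSL:" in line]


# Usage (hypothetical file name): python find_rsl.py my-run.log
if __name__ == "__main__" and len(sys.argv) > 1:
    for rsl_line in find_rsl_lines(sys.argv[1]):
        print(rsl_line)
```

This is equivalent to running grep "RSL:" on the same file; it is only
convenient if you want to post-process the matches.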

You can also try the following incantation for the OSG troubleshooting
group: "An empty string argument is not the same as no argument. Please
make sure empty string arguments make it to the executable."
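
To illustrate the difference (a minimal Python sketch, not the actual Swift
or Condor jobmanager code; the flag names and file name below are
hypothetical), compare joining an argument list with and without quoting.
Without quoting, the empty string disappears and every later argument
shifts one position to the left:

```python
import shlex


def join_unquoted(args):
    """Paste arguments together with spaces and no quoting (the buggy way)."""
    return " ".join(args)


def join_quoted(args):
    """Quote each argument so that empty strings survive a shell-style split."""
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical argument list: the second element is an empty string.
args = ["-kickstart", "", "-out", "result.Rdata"]

# The receiving side splits the command line back into arguments.
print(shlex.split(join_unquoted(args)))  # ['-kickstart', '-out', 'result.Rdata']
print(shlex.split(join_quoted(args)))    # ['-kickstart', '', '-out', 'result.Rdata']
```

A shift of this kind would be consistent with an output file name showing
up where a Kickstart executable path was expected, though the snippet only
demonstrates the general mechanism.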

Mihael


> 
> Thanks,
> Jing
> 
> On 8/31/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > On Fri, 2007-08-31 at 13:10 -0500, Jing Tie wrote:
> > > Hi Mihael,
> > >
> > > You said that this problem is caused by a bug in Condor. But the GLOW
> > > site (see below) runs the job successfully with the condor jobmanager.
> > > Could you explain this?
> >
> > I can't. Perhaps this site has the problem fixed in some way.
> >
> > Mihael
> >
> > >
> > > Many thanks,
> > > Jing
> > >
> > > On 8/20/07, Jing Tie <tiejing at gmail.com> wrote:
> > > > Hi,
> > > >
> > > > There is one site running the application successfully with
> > > > jobmanager-condor:
> > > >
> > > > site: GLOW
> > > > gatekeeper: cmsgrid01.hep.wisc.edu
> > > > app_dir: /afs/hep.wisc.edu/osg/app
> > > > data_dir: /afs/hep.wisc.edu/osg/data
> > > > condor_dir: /condor/bin
> > > > R_dir: /afs/hep.wisc.edu/osg/app/R-2.5.1/bin/R
> > > >
> > > > Maybe it has some special configurations or arguments.
> > > >
> > > > Jing
> > > >
> > > >
> > > >  On 8/20/07, Jing Tie <tiejing at gmail.com> wrote:
> > > > > > Right, the problem is with Condor. After replacing jobmanager-condor
> > > > > > with jobmanager, the job finished successfully.
> > > > >
> > > > > Thanks,
> > > > > Jing
> > > > >
> > > > > On 8/20/07, Mihael Hategan < hategan at mcs.anl.gov> wrote:
> > > > > > Right. The condor job manager has a bug. It does not properly quote
> > > > > > arguments. So you'll see strange things like this if you use it.
> > > > > >
> > > > > > Mihael
> > > > > >
> > > > > > On Mon, 2007-08-20 at 00:43 -0500, Jing Tie wrote:
> > > > > > > Sure.
> > > > > > >
> > > > > > > On 8/20/07, Mihael Hategan < hategan at mcs.anl.gov> wrote:
> > > > > > > > It puzzles me. Can you attach that file?
> > > > > > > >
> > > > > > > > On Sun, 2007-08-19 at 21:37 -0500, Jing Tie wrote:
> > > > > > > > > in $SWIFT_HOME/etc/swift.properties
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Jing
> > > > > > > > >
> > > > > > > > > On 8/19/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > > > > > > On Sat, 2007-08-18 at 18:24 -0500, Jing Tie wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > > I am working on the SID application now. Job cwtsmall runs the
> > > > > > > > > > > > script wavelet.sh on the AGLT2 site. In wavelet.sh, R runs
> > > > > > > > > > > > runWaveletsAvg.R on the input data 101_FB-epochs.Rdata and
> > > > > > > > > > > > should output 28 files, 101-FBchannel1_cwt-avgResults.Rdata
> > > > > > > > > > > > through 101-FBchannel28_cwt-avgResults.Rdata.
> > > > > > > > > > >
> > > > > > > > > > > > But when I ran the Swift client with kickstart.enabled = false,
> > > > > > > > > >
> > > > > > > > > > Where did you set this?
> > > > > > > > > >
> > > > > > > > > > Mihael
> > > > > > > > > >
> > > > > > > > > > > >  it failed with exit code 1024, and stderr.txt said: Kickstart
> > > > > > > > > > > > executable (101-FBchannel18_cwt-avgResults.Rdata) not found.
> > > > > > > > > > > > Details below:
> > > > > > > > > > >
> > > > > > > > > > > site: AGLT2
> > > > > > > > > > > gatekeeper: gate01.aglt2.org
> > > > > > > > > > > app_dir: /atlas/data08/OSG/APP/SIDGrid
> > > > > > > > > > > data_dir: /atlas/data08/OSG/DATA
> > > > > > > > > > > condor_dir: /opt/condor/bin
> > > > > > > > > > > R_dir: /atlas/data08/OSG/APP/R-2.5.1/bin/R
> > > > > > > > > > >
> > > > > > > > > > > output:
> > > > > > > > > > > > Application exception: Job cwtsmall failed with an exit code of 1024
> > > > > > > > > > >         sys:throw @ vdl-int.k, line: 109
> > > > > > > > > > >         vdl:checkexitcode @ vdl-int.k, line: 370
> > > > > > > > > > >         vdl:execute2 @ execute-default.k , line: 22
> > > > > > > > > > >         vdl:execute @ sid-wf1.kml, line: 20
> > > > > > > > > > >         wavelettransf @ sid-wf1.kml, line: 362
> > > > > > > > > > >         batchtrials @ sid-wf1.kml, line: 402
> > > > > > > > > > >         vdl:mains @ sid-wf1.kml, line: 399
> > > > > > > > > > > cwtsmall failed
> > > > > > > > > > > Provenance graph saved in sid-wf1-8cnxmo0qetg10.dot
> > > > > > > > > > > The following errors have occurred:
> > > > > > > > > > > > 1. Application "cwtsmall" failed (Job cwtsmall failed with an exit code of 1024)
> > > > > > > > > > > >         Arguments: "scripts/runWaveletsAvg.R, 101, FB"
> > > > > > > > > > > >         Host: NWICG_NotreDame
> > > > > > > > > > > >         Directory: sid-wf1-8cnxmo0qetg10/cwtsmall-zeb72rfi
> > > > > > > > > > > >         STDERR: Kickstart executable (101-FBchannel18_cwt-avgResults.Rdata) not found
> > > > > > > > > > >         STDOUT:
> > > > > > > > > > > Errors detected. Cleanup not done.
> > > > > > > > > > > Execution completed with errors
> > > > > > > > > > >         sys:throw @ vdl.k, line: 140
> > > > > > > > > > >         vdl:mains @ sid-wf1.kml, line: 399
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:413)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:417)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:28)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:334)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:123)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:97)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:172)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:298)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:37)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:239)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:280)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:392)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:331)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:123)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:97)
> > > > > > > > > > > >         at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> > > > > > > > > > >
> > > > > > > > > > > > I found that about 8 OSG sites have this problem.
> > > > > > > > > > >
> > > > > > > > > > > Many thanks,
> > > > > > > > > > > Jing
> > > > > > > > > > >



