[Swift-user] follow up of kickstart executable not found problem

Jing Tie tiejing at gmail.com
Mon Sep 10 13:46:49 CDT 2007


Sure.

On 9/10/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Mon, 2007-09-10 at 12:27 -0500, Jing Tie wrote:
> > Hi,
> >
> > The test job works fine with jobmanager-condor. But the SID program throws
> > an exception:
> >
> > cwtsmall failed
> > The following errors have occurred:
> > 1. Application "cwtsmall" failed (Exit code 126)
> >         Arguments: "scripts/runWaveletsAvg.R, 102, FB"
> >         Host: NWICG_NotreDame
> >         Directory: sid-wf1-5blglq655nj21/cwtsmall-7hdhm1hi
> >         STDERR: shared/wrapper.sh: line 164:
> > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh: Text file busy
> >         STDOUT:
> >
> > log file is attached.
> >
> > I checked that wavelet.sh is in good condition under
> > /dscratch/osg/app/osg/jtie/SIDGrid/. No wrapper.log was generated.
>
> The wrapper log is no more. Relevant information about a job (including
> what its wrapper does) can be found in info/<jobid>-info. In this case
> info/cwtsmall-7hdhm1hi-info.
>
> Can you post that file?
>
> Mihael
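
A small sketch of the naming rule described above: the info file is named after the job ID, which in this thread matches the last component of the job's "Directory" field. The helper name info_file_for is hypothetical, and the result is relative to wherever the run keeps its info directory, which the message does not spell out.

```python
from pathlib import PurePosixPath


def info_file_for(job_directory: str) -> str:
    """Map a job directory from the error report to its info file path.

    Assumption: the job ID is the last path component of the "Directory"
    field, as it is for the cwtsmall job reported in this thread.
    """
    job_id = PurePosixPath(job_directory).name
    return f"info/{job_id}-info"


print(info_file_for("sid-wf1-5blglq655nj21/cwtsmall-7hdhm1hi"))
# prints: info/cwtsmall-7hdhm1hi-info
```
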
>
>
>
> >
> > Thanks,
> > Jing
> >
> > On 9/10/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > Can you try it with Condor? It should work now.
> > >
> > > On Mon, 2007-09-10 at 11:26 -0500, Jing Tie wrote:
> > > > Hi,
> > > >
> > > > It works fine now (log is attached). I'll try sid program next.
> > > >
> > > > Many thanks,
> > > > Jing
> > > >
> > > > On 9/10/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > You need to do an SVN update for both CoG and Swift.
> > > > >
> > > > > On Mon, 2007-09-10 at 11:03 -0500, Jing Tie wrote:
> > > > > > Hi,
> > > > > >
> > > > > > It has the same exception. Log is attached.
> > > > > >
> > > > > > Thanks,
> > > > > > Jing
> > > > > >
> > > > > > On 9/10/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > > > On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote:
> > > > > > > > > It's definitely a bug in the code I put in a few days ago. But I don't
> > > > > > > > > quite see how it happens. Such simple code. Yet how complex. I'll have
> > > > > > > > > to get back to you on it.
> > > > > > >
> > > > > > > Fixed, I think. Can you try again?
> > > > > > >
> > > > > > > Mihael
> > > > > > >
> > > > > > > >
> > > > > > > > On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote:
> > > > > > > > > Sure.
> > > > > > > > >
> > > > > > > > > On 9/9/07, Mihael Hategan < hategan at mcs.anl.gov> wrote:
> > > > > > > > > > Please post the whole workflow and the whole log.
> > > > > > > > > >
> > > > > > > > > > On Sun, 2007-09-09 at 11:43 -0500, Jing Tie wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > > I tried the duplicate job again using the latest Swift with
> > > > > > > > > > > > "log4j.logger.org.globus.cog.abstraction=DEBUG".
> > > > > > > > > > > >
> > > > > > > > > > > > It didn't generate a duplicate-*** directory under the simple-wf-***
> > > > > > > > > > > > directory. Details:
> > > > > > > > > > > > Resource (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl@cbc2d3)
> > > > > > > > > > > > successfully released
> > > > > > > > > > > > Task(type=4, identity=urn:0-0-1189348296362) setting status to Failed
> > > > > > > > > > > > Could not set current directory to "null"
> > > > > > > > > > > > duplicate failed
> > > > > > > > > > > > The following errors have occurred:
> > > > > > > > > > > > 1. Could not initialize shared directory on NWICG_NotreDame
> > > > > > > > > > > Caused by:
> > > > > > > > > > >         Could not set current directory to "null"
> > > > > > > > > > > Caused by:
> > > > > > > > > > >         Required argument missing
> > > > > > > > > > >
> > > > > > > > > > > in simple-wf-fc06kzz28d880.log :
> > > > > > > > > > > > 2007-09-09 09:31:39,911 DEBUG FileResourceCache Resource (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl@cbc2d3) successfully released
> > > > > > > > > > > > 2007-09-09 09:31:39,911 DEBUG TaskImpl Task(type=4, identity=urn:0-0-1189348296362) setting status to Failed Could not set current directory to "null"
> > > > > > > > > > > > 2007-09-09 09:31:39,938 INFO  vdl:mains Errors detected. Cleanup not done.
> > > > > > > > > > > > 2007-09-09 09:31:39,963 DEBUG VDL2ExecutionContext Execution completed with errors
> > > > > > > > > > > Execution completed with errors
> > > > > > > > > > >
> > > > > > > > > > > > There is nothing under
> > > > > > > > > > > > osg.hpcc.nd.edu/dscratch/osg/data/osg/jtie/simple-wf-fc06kzz28d880
> > > > > > > > > > > > except an empty shared directory.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jing
> > > > > > > > > > >
> > > > > > > > > > > On 9/7/07, Mihael Hategan < hategan at mcs.anl.gov> wrote:
> > > > > > > > > > > > > Never mind that. It's eating the empty stdin argument. This doesn't make
> > > > > > > > > > > > > sense. Fork used to behave.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Can you add this to log4j.properties, run again, and post the log?
> > > > > > > > > > > >
> > > > > > > > > > > > log4j.logger.org.globus.cog.abstraction=DEBUG
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, 2007-09-07 at 23:02 -0500, Mihael Hategan wrote:
> > > > > > > > > > > > > Can you post the workflow?
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, 2007-09-07 at 21:56 -0500, Jing Tie wrote:
> > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I checked the procedure, and found that the filenames of the output
> > > > > > > > > > > > > > > files of "myapp" are the same as "File f". So the first option should
> > > > > > > > > > > > > > > not be the problem.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Then I tried a very simple swift script that doesn't need R:
> > > > > > > > > > > > > > > app function: add some lines to the input file to generate the output file
> > > > > > > > > > > > > > > input file name: simpleFile.txt
> > > > > > > > > > > > > > > output file name: simpleFile.output
> > > > > > > > > > > > > > > application script location: $OSG_APP/osg/jtie/duplicate.sh
> > > > > > > > > > > > > > jobmanager: jobmanager-fork
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > duplicate failed
> > > > > > > > > > > > > > The following errors have occurred:
> > > > > > > > > > > > > > > 1. Application "duplicate" failed (Failed to link input file
> > > > > > > > > > > > > > > /dscratch/osg/app/osg/jtie/duplicate.sh)
> > > > > > > > > > > > > > >         Arguments: "simpleFile.txt"
> > > > > > > > > > > > > > >         Host: NWICG_NotreDame
> > > > > > > > > > > > > > >         Directory: simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi
> > > > > > > > > > > > > >         STDERR:
> > > > > > > > > > > > > >         STDOUT:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > After execution, 3 empty directories (not files, and with odd names)
> > > > > > > > > > > > > > > were generated:
> > > > > > > > > > > > > > duplicate-gt7hyvgi-simpleFile.output
> > > > > > > > > > > > > > duplicate-ht7hyvgi-simpleFile.output
> > > > > > > > > > > > > > duplicate-it7hyvgi-simpleFile.output
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > wrapper.log:
> > > > > > > > > > > > > > DIR=duplicate-gt7hyvgi
> > > > > > > > > > > > > > STDOUT=simpleFile.output
> > > > > > > > > > > > > > STDERR=stderr.txt
> > > > > > > > > > > > > > DIRS=simpleFile.output
> > > > > > > > > > > > > > LINKS=/dscratch/osg/app/osg/jtie/duplicate.sh
> > > > > > > > > > > > > > OUTS=simpleFile.txt
> > > > > > > > > > > > > > > ln: creating symbolic link
> > > > > > > > > > > > > > > `duplicate-gt7hyvgi//dscratch/osg/app/osg/jtie/duplicate.sh' to
> > > > > > > > > > > > > > > `/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared//dscratch/osg/app/osg/jtie/duplicate.sh':
> > > > > > > > > > > > > > > No such file or directory
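
The doubled slashes in the ln message above are what one would expect if an absolute path were appended to a directory name as though it were a relative file name. The following is only an illustrative sketch of that string handling, not the actual wrapper.sh logic; the function name naive_link_paths is hypothetical.

```python
def naive_link_paths(job_dir: str, shared_dir: str, name: str) -> tuple[str, str]:
    """Build (link name, link target) by plain concatenation with '/'.

    Assumption: 'name' is expected to be a file name relative to the
    shared directory. Passing an absolute path yields the '//' paths
    seen in the wrapper.log excerpt.
    """
    link_name = f"{job_dir}/{name}"
    link_target = f"{shared_dir}/{name}"
    return link_name, link_target


link_name, link_target = naive_link_paths(
    "duplicate-gt7hyvgi",
    "/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared",
    "/dscratch/osg/app/osg/jtie/duplicate.sh",  # absolute path from LINKS=
)
print(link_name)
print(link_target)
```

Both printed paths match the ones in the ln error. Neither the intermediate directories under the job directory nor the file under shared/ exist, which is consistent with the "No such file or directory" result.
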
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared:
> > > > > > > > > > > > > > > seq.sh, simpleFile.txt, wrapper.sh
> > > > > > > > > > > > > > > under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi:
> > > > > > > > > > > > > > > one directory - simpleFile.output
> > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi/simpleFile.output:
> > > > > > > > > > > > > > > empty
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think the swift program is right, since I ran it successfully on
> > > > > > > > > > > > > > > localhost with duplicate.sh.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Many thanks,
> > > > > > > > > > > > > > Jing
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > > > > > > > > > > > > So when the output files are not created, there can be two reasons:
> > > > > > > > > > > > > > > > 1. The specification of what files should be created is broken. This is,
> > > > > > > > > > > > > > > > at this time, done by looking at the filenames of the return values from
> > > > > > > > > > > > > > > > the atomic procedure. Normally one passes those file names to the
> > > > > > > > > > > > > > > > application as output file parameters. Example:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (File f) proc(...) {
> > > > > > > > > > > > > > >   app {
> > > > > > > > > > > > > > >     myapp ... "-o" @filename(f);
> > > > > > > > > > > > > > >   }
> > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. The specification is correct, but the application doesn't behave.
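
To tell the two cases above apart by hand, one can compare the file names the procedure declares against what actually exists in the job directory after the run. The sketch below is a hypothetical helper, not Swift's own check; it treats a directory with the expected name as missing, since the thread shows directories being created where output files were expected.

```python
import os
import tempfile


def missing_outputs(job_dir: str, expected: list[str]) -> list[str]:
    """Return the expected output names that are not regular files in job_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(job_dir, name))
    ]


# Demonstration in a throwaway directory standing in for a job directory.
with tempfile.TemporaryDirectory() as job_dir:
    open(os.path.join(job_dir, "a.out"), "w").close()  # a real output file
    os.mkdir(os.path.join(job_dir, "b.out"))           # a directory, not a file
    print(missing_outputs(job_dir, ["a.out", "b.out", "c.out"]))
    # prints: ['b.out', 'c.out']
```

If the list is empty but the error persists, the declared names are the more likely culprit (case 1); if files are missing, the application's behavior is worth checking first (case 2).
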
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Mihael
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 12:35 -0500, Jing Tie wrote:
> > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I tried jobmanager-fork instead of jobmanager-condor on the osg.hpcc.nd.edu site:
> > > > > > > > > > > > > > > > > <recall that the site used to have the exception - kickstart executable
> > > > > > > > > > > > > > > > > (101-FBchannel18_cwt-avgResults.Rdata) not found>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > jobmanager-fork:
> > > > > > > > > > > > > > > > ------------------------
> > > > > > > > > > > > > > > > > Application exception: The following output files were not created by
> > > > > > > > > > > > > > > > > the application: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/cwtsmall-0bhnnvgi
> > > > > > > > > > > > > > > > > lrwxrwxrwx  1 osg osgusers   93 Sep  7 12:42 101-FBchannel10_cwt-avgResults.Rdata -> /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared/101-FBchannel10_cwt-avgResults.Rdata
> > > > > > > > > > > > > > > > > ... (total 28 links, the same number as the number of the expected output files)
> > > > > > > > > > > > > > > > > drwxr-xr-x  2 osg osgusers 4096 Sep  7 12:42 101_FB-epochs.Rdata
> > > > > > > > > > > > > > > > > drwxr-xr-x  3 osg osgusers 4096 Sep  7 12:42 scripts
> > > > > > > > > > > > > > > > > -rw-r--r--  1 osg osgusers   58 Sep  7 12:42 stderr.txt
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared
> > > > > > > > > > > > > > > > > -rw-r--r--  1 osg osgusers 4109752 Sep  7 12:41 101_FB-epochs.Rdata
> > > > > > > > > > > > > > > > > drwxr-xr-x  2 osg osgusers    4096 Sep  7 12:41 scripts
> > > > > > > > > > > > > > > > > -rw-r--r--  1 osg osgusers     571 Sep  7 12:41 seq.sh
> > > > > > > > > > > > > > > > > -rw-r--r--  1 osg osgusers    3278 Sep  7 12:41 wrapper.sh
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > empty kickstart directory
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/cat /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/status/cwtsmall-0bhnnvgi-error
> > > > > > > > > > > > > > > > > The following output files were not created by the application:
> > > > > > > > > > > > > > > > > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > jobmanager-condor:
> > > > > > > > > > > > > > > > -------------------------------
> > > > > > > > > > > > > > > > > Application exception: The following output files were not created by
> > > > > > > > > > > > > > > > > the application: 101-FBchannel20_cwt-avgResults.Rdata
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi
> > > > > > > > > > > > > > > > > lrwxrwxrwx  1 osg osgusers   76 Sep  7 13:01 101_FB-epochs.Rdata -> /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB-epochs.Rdata
> > > > > > > > > > > > > > > > > drwxr-xr-x  3 osg osgusers 4096 Sep  7 13:01 scripts
> > > > > > > > > > > > > > > > > -rw-r--r--  1 osg osgusers   70 Sep  7 13:01 stderr.txt
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared
> > > > > > > > > > > > > > > > > -rw-r--r--  1 osg osgusers 4109752 Sep  7 13:00 101_FB-epochs.Rdata
> > > > > > > > > > > > > > > > > drwxr-xr-x  2 osg osgusers    4096 Sep  7 13:00 scripts
> > > > > > > > > > > > > > > > > -rw-r--r--  1 osg osgusers     571 Sep  7 13:00 seq.sh
> > > > > > > > > > > > > > > > > -rw-r--r--  1 osg osgusers    3278 Sep  7 13:00 wrapper.sh
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think the descriptions of the exceptions are all right now. The
> > > > > > > > > > > > > > > > > difference between fork and condor was that fork created the output
> > > > > > > > > > > > > > > > > links to the shared directory, but condor didn't. But the essential
> > > > > > > > > > > > > > > > > problem is that the output files are not being created. I will do more
> > > > > > > > > > > > > > > > > experiments to see whether the problem is in the file system or the
> > > > > > > > > > > > > > > > > application.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Jing
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > > > > > > Swift-user mailing list
> > > > > > > > > > > > > > > > > Swift-user at ci.uchicago.edu
> > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> > >
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cwtsmall-7hdhm1hi-info
Type: application/octet-stream
Size: 8245 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20070910/f9fa0738/attachment.obj>

