From michael.mccracken at gmail.com Wed Sep 5 17:21:16 2007
From: michael.mccracken at gmail.com (Michael McCracken)
Date: Wed, 5 Sep 2007 15:21:16 -0700
Subject: [Swift-user] Swift for large-scale batch jobs?
Message-ID:

Hi,

I'm studying the workflow of large-scale, 'capability'-class
experiments, specifically using workflow time prediction to support
planning and system choice for experiments that are too large for test
runs.

I'm wondering if the kinds of workflows I'm studying could be
implemented in swift. I'm a bit under-informed of the state of globus
and related technologies, so please excuse me if this should be
obvious:

Can swift scripts control submission of jobs to a batch queue (local
or remote), either sequentially linked or independent?

The experiments I'm working with involve many sequentially-dependent
full-system runs (thousands of processors), with large enough data to
require transfer to archive or secondary storage between runs. Has
anyone tried something like this in swift?

What I'd like to do, if swift can support experiments like that, is
build a tool to read swift scripts (or the appropriate intermediate
form) and generate task descriptions for my current tool, which
simulates large scale experiments to predict total time to solution
(including queue wait and network transfer). Any advice on which point
in the swift tool chain to start at would be helpful. I've scanned the
code, but it is a lot to digest.

Thanks,
-mike

--
Michael McCracken
UCSD CSE PhD Candidate
research: http://www.cse.ucsd.edu/~mmccrack/
misc: http://michael-mccracken.net/wp/

From benc at hawaga.org.uk Thu Sep 6 03:16:44 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 6 Sep 2007 08:16:44 +0000 (GMT)
Subject: [Swift-user] Swift for large-scale batch jobs?
In-Reply-To:
References:
Message-ID:

On Wed, 5 Sep 2007, Michael McCracken wrote:

> Can swift scripts control submission of jobs to a batch queue (local
> or remote), either sequentially linked or independent?

Yes. In terms of linking jobs together, the idea is that those
relations are expressed by how they share data. So rather than saying
'job A runs before job B', you'd say 'job A generates file X, and job B
needs file X as an input'.

> The experiments I'm working with involve many sequentially-dependent
> full-system runs (thousands of processors), with large enough data to
> require transfer to archive or secondary storage between runs. Has
> anyone tried something like this in swift?

I don't know what the statistics for our recent large runs are -
someone else on this list might comment. What you say is within the
scope of what we're trying to do, though.

> What I'd like to do, if swift can support experiments like that, is
> build a tool to read swift scripts ( or the appropriate intermediate
> form ) and generate task descriptions for my current tool, which
> simulates large scale experiments to predict total time to solution
> (including queue wait and network transfer). Any advice on which point
> in the swift tool chain to start at would be helpful. I've scanned the
> code, but it is a lot to digest.

A couple of ideas:

i) Swift submits jobs for execution through the Java CoG Kit execution
providers. There are various providers available by default, such as
one to run programs on the local machine, one to submit the job to
globus, and one to submit directly to the PBS batch queueing system.
Execution providers can be written for other systems. For example, our
group has a research project called Falkon which does job submission
and execution; this ties into swift through a specially written
execution provider.

If you follow the 'building swift' instructions on the swift download
page, the provider source code for the various default providers lives
in cog/modules/provider-*

Perhaps you could write your own provider which, rather than executing
the task it is given, instead performs a simulation of that task.

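As a rough illustration of the two ideas above (ordering that falls out
of shared files rather than explicit 'A before B' statements, and
simulating a task instead of executing it), here is a minimal sketch in
plain Python. It is not Swift or CoG Kit code and does not reflect the
provider API; the Task class, the schedule and simulate functions, the
file names and the run times are all hypothetical, and the total assumes
the jobs run strictly one after another.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description: files read, files written, predicted run time."""
    name: str
    inputs: set
    outputs: set
    seconds: float


def schedule(tasks, available):
    """Order tasks so each one runs only after all of its input files exist."""
    done = set(available)      # files that exist so far
    pending = list(tasks)
    order = []
    while pending:
        ready = [t for t in pending if t.inputs <= done]
        if not ready:
            names = ", ".join(t.name for t in pending)
            raise ValueError("inputs can never be satisfied for: " + names)
        for t in ready:
            order.append(t)
            done |= t.outputs  # simulate: record the outputs instead of running anything
            pending.remove(t)
    return order


def simulate(tasks, available):
    """Return task names in dependency order and the summed (sequential) run time."""
    order = schedule(tasks, available)
    return [t.name for t in order], sum(t.seconds for t in order)


# 'job A generates file X, and job B needs file X as an input'
tasks = [
    Task("B", {"X"}, {"Y"}, 7200.0),
    Task("A", {"input.dat"}, {"X"}, 3600.0),
]
names, total = simulate(tasks, {"input.dat"})
print(names, total)
```

Note that B is listed first but still runs second: nothing states the
order, it follows from B needing the file that A produces. Queue wait
and transfer time could be added to each task's predicted seconds in the
same way.
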
ii) There are a couple of options, -typecheck and -dryrun, which cause
normal execution to be replaced by other code at the karajan runtime
layer.

The code that is changed here is in:

cog/modules/vdsk/libexec/execute-*.k

By default, execute-default.k is used, which deals with actual
execution. The much simpler execute-typecheck.k and execute-dryrun.k
replace that execution code with different behaviour. You could plug in
at that point.

In case i) you could write the code in Java, but I think you would have
to do a good job convincing swift that you really had produced output
files and the like.

In case ii) you'd have to write some code in the Karajan language, which
you are likely less familiar with, but you would have (I think) less to
do in terms of simulating fake execution of your jobs.

Both of these approaches would use a large part of Swift as-is, so
you'd be able to re-use a large part of our codebase (all the language
parsing, etc).

--

From tiejing at gmail.com Fri Sep 7 12:35:03 2007
From: tiejing at gmail.com (Jing Tie)
Date: Fri, 7 Sep 2007 12:35:03 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
Message-ID:

Hi,

I tried jobmanager-fork instead of jobmanager-condor on osg.hpcc.nd.edu site:
(101-FBchannel18_cwt-avgResults.Rdata) not found>

jobmanager-fork:
------------------------
Application exception: The following output files were not created by
the application: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh

globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/cwtsmall-0bhnnvgi
lrwxrwxrwx 1 osg osgusers 93 Sep 7 12:42 101-FBchannel10_cwt-avgResults.Rdata -> /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared/101-FBchannel10_cwt-avgResults.Rdata
... (total 28 links, the same number as the number of the expected output files)
drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:42 101_FB-epochs.Rdata
drwxr-xr-x 3 osg osgusers 4096 Sep 7 12:42 scripts
-rw-r--r-- 1 osg osgusers 58 Sep 7 12:42 stderr.txt

globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared
-rw-r--r-- 1 osg osgusers 4109752 Sep 7 12:41 101_FB-epochs.Rdata
drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:41 scripts
-rw-r--r-- 1 osg osgusers 571 Sep 7 12:41 seq.sh
-rw-r--r-- 1 osg osgusers 3278 Sep 7 12:41 wrapper.sh

empty kickstart directory

globus-job-run osg.hpcc.nd.edu/jobmanager /bin/cat /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/status/cwtsmall-0bhnnvgi-error
The following output files were not created by the application:
/dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh

jobmanager-condor:
-------------------------------
Application exception: The following output files were not created by
the application: 101-FBchannel20_cwt-avgResults.Rdata

globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi
lrwxrwxrwx 1 osg osgusers 76 Sep 7 13:01 101_FB-epochs.Rdata -> /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB-epochs.Rdata
drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 scripts
-rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 stderr.txt

globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared
-rw-r--r-- 1 osg osgusers 4109752 Sep 7 13:00 101_FB-epochs.Rdata
drwxr-xr-x 2 osg osgusers 4096 Sep 7 13:00 scripts
-rw-r--r-- 1 osg osgusers 571 Sep 7 13:00 seq.sh
-rw-r--r-- 1 osg osgusers 3278 Sep 7 13:00 wrapper.sh

I think the descriptions of the exceptions are all right now. The
difference between fork and condor was that fork created the output
links to the shared directory, but condor didn't. But the essential
problem is the output files not being created.
I will do more experiments to see whether the problem lies in the file
system or in the application.

Thanks,
Jing

From hategan at mcs.anl.gov Fri Sep 7 12:40:32 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 07 Sep 2007 12:40:32 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To:
References:
Message-ID: <1189186832.28276.4.camel@blabla.mcs.anl.gov>

So when the output files are not created, there can be two reasons:
1. The specification of what files should be created is broken. This is,
at this time, done by looking at the filenames of the return values from
the atomic procedure. Normally one passes those file names to the
application as output file parameters. Example:

(File f) proc(...) {
  app {
    myapp ... "-o" @filename(f);
  }
}

2. The specification is correct, but the application doesn't behave.

Mihael

From tiejing at gmail.com Fri Sep 7 21:56:31 2007
From: tiejing at gmail.com (Jing Tie)
Date: Fri, 7 Sep 2007 21:56:31 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To: <1189186832.28276.4.camel@blabla.mcs.anl.gov>
References: <1189186832.28276.4.camel@blabla.mcs.anl.gov>
Message-ID:

Thanks!

I checked the procedure, and found that the filenames of the output
files of "myapp" are the same as "File f". So the first option should
not be the problem.

Then I tried a very simple swift script that doesn't need R:
app function: add some lines to the input file to generate the output file
input file name: simpleFile.txt
output file name: simpleFile.output
application script location: $OSG_APP/osg/jtie/duplicate.sh
jobmanager: jobmanager-fork

duplicate failed
The following errors have occurred:
1. Application "duplicate" failed (Failed to link input file
/dscratch/osg/app/osg/jtie/duplicate.sh)
        Arguments: "simpleFile.txt"
        Host: NWICG_NotreDame
        Directory: simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi
        STDERR:
        STDOUT:

After execution, 3 empty directories (not files, and with odd names)
were generated:
duplicate-gt7hyvgi-simpleFile.output
duplicate-ht7hyvgi-simpleFile.output
duplicate-it7hyvgi-simpleFile.output

wrapper.log:
DIR=duplicate-gt7hyvgi
STDOUT=simpleFile.output
STDERR=stderr.txt
DIRS=simpleFile.output
LINKS=/dscratch/osg/app/osg/jtie/duplicate.sh
OUTS=simpleFile.txt
ln: creating symbolic link
`duplicate-gt7hyvgi//dscratch/osg/app/osg/jtie/duplicate.sh' to
`/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared//dscratch/osg/app/osg/jtie/duplicate.sh':
No such file or directory

under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared:
seq.sh, simpleFile.txt, wrapper.sh
under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi:
one directory - simpleFile.output
/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi/simpleFile.output:
empty

I think the swift program is right, since I ran it successfully using
localhost duplicate.sh.

Many thanks,
Jing

From hategan at mcs.anl.gov Fri Sep 7 23:02:00 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 07 Sep 2007 23:02:00 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To:
References: <1189186832.28276.4.camel@blabla.mcs.anl.gov>
Message-ID: <1189224120.4369.0.camel@blabla.mcs.anl.gov>

Can you post the workflow?

From hategan at mcs.anl.gov Fri Sep 7 23:41:58 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 07 Sep 2007 23:41:58 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To: <1189224120.4369.0.camel@blabla.mcs.anl.gov>
References: <1189186832.28276.4.camel@blabla.mcs.anl.gov> <1189224120.4369.0.camel@blabla.mcs.anl.gov>
Message-ID: <1189226518.5306.4.camel@blabla.mcs.anl.gov>

Nevermind that. It's eating the empty stdin argument. This doesn't make
sense. Fork used to behave.

Can you add this to log4j.properties, run again, and post the log?

log4j.logger.org.globus.cog.abstraction=DEBUG

From tiejing at gmail.com Sun Sep 9 11:43:58 2007
From: tiejing at gmail.com (Jing Tie)
Date: Sun, 9 Sep 2007 11:43:58 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To: <1189226518.5306.4.camel@blabla.mcs.anl.gov>
References: <1189186832.28276.4.camel@blabla.mcs.anl.gov> <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov>
Message-ID:

Hi,

I tried the duplicate job again using the latest swift with
"log4j.logger.org.globus.cog.abstraction=DEBUG". It didn't generate a
duplicate-*** directory under the simple-wf-*** directory.

Details:

Resource (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) successfully released
Task(type=4, identity=urn:0-0-1189348296362) setting status to Failed Could not set current directory to "null"
duplicate failed
The following errors have occurred:
1. Could not initialize shared directory on NWICG_NotreDame
Caused by:
        Could not set current directory to "null"
Caused by:
        Required argument missing

in simple-wf-fc06kzz28d880.log:
2007-09-09 09:31:39,911 DEBUG FileResourceCache Resource (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) successfully released
2007-09-09 09:31:39,911 DEBUG TaskImpl Task(type=4, identity=urn:0-0-1189348296362) setting status to Failed Could not set current directory to "null"
2007-09-09 09:31:39,938 INFO vdl:mains Errors detected. Cleanup not done.
2007-09-09 09:31:39,963 DEBUG VDL2ExecutionContext Execution completed with errors
Execution completed with errors

There is nothing under
osg.hpcc.nd.edu/dscratch/osg/data/osg/jtie/simple-wf-fc06kzz28d880
except an empty shared directory.

Thanks,
Jing
/dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > jobmanager-condor: > > > > > ------------------------------- > > > > > Application exception: The following output files were not created by > > > > > the application: 101-FBchannel20_cwt-avgResults.Rdata > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi > > > > > lrwxrwxrwx 1 osg osgusers 76 Sep 7 13:01 101_FB-epochs.Rdata -> > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB-epochs.Rdata > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 scripts > > > > > -rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 stderr.txt > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 13:00 101_FB-epochs.Rdata > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 13:00 scripts > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 13:00 seq.sh > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 13:00 wrapper.sh > > > > > > > > > > I think the descriptions of exception are all right now. The > > > > > difference between fork and condor was that fork created the output > > > > > links to the shared directory, but condor didn't. But the essential > > > > > problem is the output files not being created. I will do more > > > > > experiments to see whether the problem of file system or application. 
> > > > >
> > > > > Thanks,
> > > > > Jing

From hategan at mcs.anl.gov Sun Sep 9 12:03:11 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 09 Sep 2007 12:03:11 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To:
References: <1189186832.28276.4.camel@blabla.mcs.anl.gov> <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov>
Message-ID: <1189357391.21155.0.camel@blabla.mcs.anl.gov>

Please post the whole workflow and the whole log.

From tiejing at gmail.com Sun Sep 9 12:09:34 2007
From: tiejing at gmail.com (Jing Tie)
Date: Sun, 9 Sep 2007 12:09:34 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To: <1189357391.21155.0.camel@blabla.mcs.anl.gov>
References: <1189186832.28276.4.camel@blabla.mcs.anl.gov> <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov>
Message-ID:

Sure.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple-wf.dtm
Type: application/octet-stream
Size: 305 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple-wf-fc06kzz28d880.log Type: application/octet-stream Size: 8450 bytes Desc: not available URL: From hategan at mcs.anl.gov Sun Sep 9 12:29:36 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 09 Sep 2007 12:29:36 -0500 Subject: [Swift-user] follow up of kickstart executable not found problem In-Reply-To: References: <1189186832.28276.4.camel@blabla.mcs.anl.gov> <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov> Message-ID: <1189358977.21560.1.camel@blabla.mcs.anl.gov> It's definitely a bug in the code I put in a few days ago. But I don't quite see how it happens. Such simple code. Yet how complex. I'll have to get back to you on it. On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote: > Sure. > > On 9/9/07, Mihael Hategan wrote: > > Please post the whole workflow and the whole log. > > > > On Sun, 2007-09-09 at 11:43 -0500, Jing Tie wrote: > > > Hi, > > > > > > I tried duplicate job again using the latest swift with > > > "log4j.logger.org.globus.cog.abstraction=DEBUG". > > > > > > It didn't generated duplicate-*** directory under simple-wf-*** > > > directory. Details: > > > Resource (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > successfully released > > > Task(type=4, identity=urn:0-0-1189348296362) setting status to Failed > > > Could not set current directory to "null" > > > duplicate failed > > > The following errors have occurred: > > > 1. 
Could not initialize shared directory on NWICG_NotreDame > > > Caused by: > > > Could not set current directory to "null" > > > Caused by: > > > Required argument missing > > > > > > in simple-wf-fc06kzz28d880.log: > > > 2007-09-09 09:31:39,911 DEBUG FileResourceCache Resource > > > (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > successfully released > > > 2007-09-09 09:31:39,911 DEBUG TaskImpl Task(type=4, > > > identity=urn:0-0-1189348296362) setting status to Failed Could not set > > > current directory to "null" > > > 2007-09-09 09:31:39,938 INFO vdl:mains Errors detected. Cleanup not done. > > > 2007-09-09 09:31:39,963 DEBUG VDL2ExecutionContext Execution completed > > > with errors > > > Execution completed with errors > > > > > > There is nothing under > > > osg.hpcc.nd.edu/dscratch/osg/data/osg/jtie/simple-wf-fc06kzz28d880 > > > except empty shared directory. > > > > > > Thanks, > > > Jing > > > > > > On 9/7/07, Mihael Hategan wrote: > > > > Nevermind that. It's eating the empty stdin argument. This doesn't make > > > > sense. Fork used to behave. > > > > > > > > Can you add this to log4j.properties, run again, and post the log? > > > > > > > > log4j.logger.org.globus.cog.abstraction=DEBUG > > > > > > > > > > > > On Fri, 2007-09-07 at 23:02 -0500, Mihael Hategan wrote: > > > > > Can you post the workflow? > > > > > > > > > > On Fri, 2007-09-07 at 21:56 -0500, Jing Tie wrote: > > > > > > Thanks! > > > > > > > > > > > > I checked the procedure, and found that the filenames of the output > > > > > > files of "myapp" are the same as "File f". So the first option should > > > > > > not be the problem. 
> > > > > > > > > > > > Then I tried a very simple swift script that doesn't need R: > > > > > > app function: add some lines to the input file to generate the output file > > > > > > input file name: simpleFile.txt > > > > > > output file name: simpleFile.output > > > > > > application script location: $OSG_APP/osg/jtie/duplicate.sh > > > > > > jobmanager: jobmanager-fork > > > > > > > > > > > > duplicate failed > > > > > > The following errors have occurred: > > > > > > 1. Application "duplicate" failed (Failed to link input file > > > > > > /dscratch/osg/app/osg/jtie/duplicate.sh) > > > > > > Arguments: "simpleFile.txt" > > > > > > Host: NWICG_NotreDame > > > > > > Directory: simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi > > > > > > STDERR: > > > > > > STDOUT: > > > > > > > > > > > > after execution, 3 empty directories (not files, and also weird name) > > > > > > were generated: > > > > > > duplicate-gt7hyvgi-simpleFile.output > > > > > > duplicate-ht7hyvgi-simpleFile.output > > > > > > duplicate-it7hyvgi-simpleFile.output > > > > > > > > > > > > wrapper.log: > > > > > > DIR=duplicate-gt7hyvgi > > > > > > STDOUT=simpleFile.output > > > > > > STDERR=stderr.txt > > > > > > DIRS=simpleFile.output > > > > > > LINKS=/dscratch/osg/app/osg/jtie/duplicate.sh > > > > > > OUTS=simpleFile.txt > > > > > > ln: creating symbolic link > > > > > > `duplicate-gt7hyvgi//dscratch/osg/app/osg/jtie/duplicate.sh' to > > > > > > `/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared//dscratch/osg/app/osg/jtie/duplicate.sh': > > > > > > No such file or directory > > > > > > > > > > > > under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared: > > > > > > seq.sh, simpleFile.txt, wrapper.sh > > > > > > under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi: > > > > > > one directory - simpleFile.output > > > > > > /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi/simpleFile.output: > > > > > > empty > > > > > > > > > > > > I 
think the swift program is right, since I run it successfully using > > > > > > localhost duplicate.sh. > > > > > > > > > > > > > > > > > > Many thanks, > > > > > > Jing > > > > > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan wrote: > > > > > > > So when the output files are not created, there can be two reasons: > > > > > > > 1. The specification of what files should be created is broken. This is, > > > > > > > at this time, done by looking at the filenames of the return values from > > > > > > > the atomic procedure. Normally one passes those file names to the > > > > > > > application as output file parameters. Example: > > > > > > > > > > > > > > (File f) proc(...) { > > > > > > > app { > > > > > > > myapp ... "-o" @filename(f); > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > 2. The specification is correct, but the application doesn't behave. > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > On Fri, 2007-09-07 at 12:35 -0500, Jing Tie wrote: > > > > > > > > Hi, > > > > > > > > > > > > > > > > I tried jobmanger-fork instead of jobmanager-condor on osg.hpcc.nd.edu site: > > > > > > > > > > > > > > > (101-FBchannel18_cwt-avgResults.Rdata) not found> > > > > > > > > > > > > > > > > jobmanager-fork: > > > > > > > > ------------------------ > > > > > > > > Application exception: The following output files were not created by > > > > > > > > the application: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/cwtsmall-0bhnnvgi > > > > > > > > lrwxrwxrwx 1 osg osgusers 93 Sep 7 12:42 > > > > > > > > 101-FBchannel10_cwt-avgResults.Rdata -> > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared/101-FBchannel10_cwt-avgResults.Rdata > > > > > > > > ... 
(total 28 links, the same number as the number of the expected output files) > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:42 101_FB-epochs.Rdata > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 12:42 scripts > > > > > > > > -rw-r--r-- 1 osg osgusers 58 Sep 7 12:42 stderr.txt > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 12:41 101_FB-epochs.Rdata > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:41 scripts > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 12:41 seq.sh > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 12:41 wrapper.sh > > > > > > > > > > > > > > > > empty kickstart directory > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/cat > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/status/cwtsmall-0bhnnvgi-error > > > > > > > > The following output files were not created by the application: > > > > > > > > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-condor: > > > > > > > > ------------------------------- > > > > > > > > Application exception: The following output files were not created by > > > > > > > > the application: 101-FBchannel20_cwt-avgResults.Rdata > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi > > > > > > > > lrwxrwxrwx 1 osg osgusers 76 Sep 7 13:01 101_FB-epochs.Rdata -> > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB-epochs.Rdata > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 scripts > > > > > > > > -rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 stderr.txt > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > 
/dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 13:00 101_FB-epochs.Rdata > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 13:00 scripts > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 13:00 seq.sh > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 13:00 wrapper.sh > > > > > > > > > > > > > > > > I think the descriptions of exception are all right now. The > > > > > > > > difference between fork and condor was that fork created the output > > > > > > > > links to the shared directory, but condor didn't. But the essential > > > > > > > > problem is the output files not being created. I will do more > > > > > > > > experiments to see whether the problem of file system or application. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Jing > > > > > > > > _______________________________________________ > > > > > > > > Swift-user mailing list > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-user mailing list > > > > > Swift-user at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > From hategan at mcs.anl.gov Mon Sep 10 10:47:24 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 10 Sep 2007 10:47:24 -0500 Subject: [Swift-user] follow up of kickstart executable not found problem In-Reply-To: <1189358977.21560.1.camel@blabla.mcs.anl.gov> References: <1189186832.28276.4.camel@blabla.mcs.anl.gov> <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> Message-ID: <1189439244.11819.0.camel@blabla.mcs.anl.gov> On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote: > It's definitely a 
bug in the code I put in a few days ago. But I don't > quite see how it happens. Such simple code. Yet how complex. I'll have > to get back to you on it. Fixed, I think. Can you try again? Mihael > > On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote: > > Sure. > > > > On 9/9/07, Mihael Hategan wrote: > > > Please post the whole workflow and the whole log. > > > > > > On Sun, 2007-09-09 at 11:43 -0500, Jing Tie wrote: > > > > Hi, > > > > > > > > I tried the duplicate job again using the latest swift with > > > > "log4j.logger.org.globus.cog.abstraction=DEBUG". > > > > > > > > It didn't generate a duplicate-*** directory under the simple-wf-*** > > > > directory. Details: > > > > Resource (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl@cbc2d3) > > > > successfully released > > > > Task(type=4, identity=urn:0-0-1189348296362) setting status to Failed > > > > Could not set current directory to "null" > > > > duplicate failed > > > > The following errors have occurred: > > > > 1. Could not initialize shared directory on NWICG_NotreDame > > > > Caused by: > > > > Could not set current directory to "null" > > > > Caused by: > > > > Required argument missing > > > > > > > > in simple-wf-fc06kzz28d880.log: > > > > 2007-09-09 09:31:39,911 DEBUG FileResourceCache Resource > > > > (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl@cbc2d3) > > > > successfully released > > > > 2007-09-09 09:31:39,911 DEBUG TaskImpl Task(type=4, > > > > identity=urn:0-0-1189348296362) setting status to Failed Could not set > > > > current directory to "null" > > > > 2007-09-09 09:31:39,938 INFO vdl:mains Errors detected. Cleanup not done. > > > > 2007-09-09 09:31:39,963 DEBUG VDL2ExecutionContext Execution completed > > > > with errors > > > > Execution completed with errors > > > > > > > > There is nothing under > > > > osg.hpcc.nd.edu/dscratch/osg/data/osg/jtie/simple-wf-fc06kzz28d880 > > > > except an empty shared directory.
> > > > > > > > Thanks, > > > > Jing > > > > > > > > On 9/7/07, Mihael Hategan wrote: > > > > > Nevermind that. It's eating the empty stdin argument. This doesn't make > > > > > sense. Fork used to behave. > > > > > > > > > > Can you add this to log4j.properties, run again, and post the log? > > > > > > > > > > log4j.logger.org.globus.cog.abstraction=DEBUG > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 23:02 -0500, Mihael Hategan wrote: > > > > > > Can you post the workflow? > > > > > > > > > > > > On Fri, 2007-09-07 at 21:56 -0500, Jing Tie wrote: > > > > > > > Thanks! > > > > > > > > > > > > > > I checked the procedure, and found that the filenames of the output > > > > > > > files of "myapp" are the same as "File f". So the first option should > > > > > > > not be the problem. > > > > > > > > > > > > > > Then I tried a very simple swift script that doesn't need R: > > > > > > > app function: add some lines to the input file to generate the output file > > > > > > > input file name: simpleFile.txt > > > > > > > output file name: simpleFile.output > > > > > > > application script location: $OSG_APP/osg/jtie/duplicate.sh > > > > > > > jobmanager: jobmanager-fork > > > > > > > > > > > > > > duplicate failed > > > > > > > The following errors have occurred: > > > > > > > 1. 
Application "duplicate" failed (Failed to link input file > > > > > > > /dscratch/osg/app/osg/jtie/duplicate.sh) > > > > > > > Arguments: "simpleFile.txt" > > > > > > > Host: NWICG_NotreDame > > > > > > > Directory: simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi > > > > > > > STDERR: > > > > > > > STDOUT: > > > > > > > > > > > > > > after execution, 3 empty directories (not files, and also weird name) > > > > > > > were generated: > > > > > > > duplicate-gt7hyvgi-simpleFile.output > > > > > > > duplicate-ht7hyvgi-simpleFile.output > > > > > > > duplicate-it7hyvgi-simpleFile.output > > > > > > > > > > > > > > wrapper.log: > > > > > > > DIR=duplicate-gt7hyvgi > > > > > > > STDOUT=simpleFile.output > > > > > > > STDERR=stderr.txt > > > > > > > DIRS=simpleFile.output > > > > > > > LINKS=/dscratch/osg/app/osg/jtie/duplicate.sh > > > > > > > OUTS=simpleFile.txt > > > > > > > ln: creating symbolic link > > > > > > > `duplicate-gt7hyvgi//dscratch/osg/app/osg/jtie/duplicate.sh' to > > > > > > > `/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared//dscratch/osg/app/osg/jtie/duplicate.sh': > > > > > > > No such file or directory > > > > > > > > > > > > > > under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared: > > > > > > > seq.sh, simpleFile.txt, wrapper.sh > > > > > > > under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi: > > > > > > > one directory - simpleFile.output > > > > > > > /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi/simpleFile.output: > > > > > > > empty > > > > > > > > > > > > > > I think the swift program is right, since I run it successfully using > > > > > > > localhost duplicate.sh. > > > > > > > > > > > > > > > > > > > > > Many thanks, > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan wrote: > > > > > > > > So when the output files are not created, there can be two reasons: > > > > > > > > 1. 
The specification of what files should be created is broken. This is, > > > > > > > > at this time, done by looking at the filenames of the return values from > > > > > > > > the atomic procedure. Normally one passes those file names to the > > > > > > > > application as output file parameters. Example: > > > > > > > > > > > > > > > > (File f) proc(...) { > > > > > > > > app { > > > > > > > > myapp ... "-o" @filename(f); > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > > > > > 2. The specification is correct, but the application doesn't behave. > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 12:35 -0500, Jing Tie wrote: > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > I tried jobmanger-fork instead of jobmanager-condor on osg.hpcc.nd.edu site: > > > > > > > > > > > > > > > > > (101-FBchannel18_cwt-avgResults.Rdata) not found> > > > > > > > > > > > > > > > > > > jobmanager-fork: > > > > > > > > > ------------------------ > > > > > > > > > Application exception: The following output files were not created by > > > > > > > > > the application: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/cwtsmall-0bhnnvgi > > > > > > > > > lrwxrwxrwx 1 osg osgusers 93 Sep 7 12:42 > > > > > > > > > 101-FBchannel10_cwt-avgResults.Rdata -> > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared/101-FBchannel10_cwt-avgResults.Rdata > > > > > > > > > ... 
(total 28 links, the same number as the number of the expected output files) > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:42 101_FB-epochs.Rdata > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 12:42 scripts > > > > > > > > > -rw-r--r-- 1 osg osgusers 58 Sep 7 12:42 stderr.txt > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 12:41 101_FB-epochs.Rdata > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:41 scripts > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 12:41 seq.sh > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 12:41 wrapper.sh > > > > > > > > > > > > > > > > > > empty kickstart directory > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/cat > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/status/cwtsmall-0bhnnvgi-error > > > > > > > > > The following output files were not created by the application: > > > > > > > > > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-condor: > > > > > > > > > ------------------------------- > > > > > > > > > Application exception: The following output files were not created by > > > > > > > > > the application: 101-FBchannel20_cwt-avgResults.Rdata > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi > > > > > > > > > lrwxrwxrwx 1 osg osgusers 76 Sep 7 13:01 101_FB-epochs.Rdata -> > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB-epochs.Rdata > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 scripts > > > > > > > > > -rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 stderr.txt > > > > > > > > > > > > > > > > > > globus-job-run 
osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 13:00 101_FB-epochs.Rdata > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 13:00 scripts > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 13:00 seq.sh > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 13:00 wrapper.sh > > > > > > > > > > > > > > > > > > I think the descriptions of exception are all right now. The > > > > > > > > > difference between fork and condor was that fork created the output > > > > > > > > > links to the shared directory, but condor didn't. But the essential > > > > > > > > > problem is the output files not being created. I will do more > > > > > > > > > experiments to see whether the problem of file system or application. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Jing > > > > > > > > > _______________________________________________ > > > > > > > > > Swift-user mailing list > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-user mailing list > > > > > > Swift-user at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From tiejing at gmail.com Mon Sep 10 11:03:35 2007 From: tiejing at gmail.com (Jing Tie) Date: Mon, 10 Sep 2007 11:03:35 -0500 Subject: [Swift-user] follow up of kickstart executable not found problem In-Reply-To: <1189439244.11819.0.camel@blabla.mcs.anl.gov> References: <1189186832.28276.4.camel@blabla.mcs.anl.gov> 
<1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> Message-ID: Hi, It has the same exception. Log is attached. Thanks, Jing On 9/10/07, Mihael Hategan wrote: > On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote: > > It's definitely a bug in the code I put in a few days ago. But I don't > > quite see how it happens. Such simple code. Yet how complex. I'll have > > to get back to you on it. > > Fixed, I think. Can you try again? > > Mihael
-------------- next part -------------- A non-text attachment was scrubbed...
Name: simple-wf-x1pa6aqy08cl1.log Type: application/octet-stream Size: 8483 bytes Desc: not available URL:
From tiejing at gmail.com Mon Sep 10 11:07:47 2007 From: tiejing at gmail.com (Jing Tie) Date: Mon, 10 Sep 2007 11:07:47 -0500 Subject: [Swift-user] follow up of kickstart executable not found problem In-Reply-To: References: <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> Message-ID: Sorry. I run on a problem site. Please wait for a second. Thanks, Jing On 9/10/07, Jing Tie wrote: > Hi, > > It has the same exception. Log is attached. > > Thanks, > Jing
From hategan at mcs.anl.gov Mon Sep 10 11:22:40 2007 From: hategan at
mcs.anl.gov (Mihael Hategan) Date: Mon, 10 Sep 2007 11:22:40 -0500 Subject: [Swift-user] follow up of kickstart executable not found problem In-Reply-To: References: <1189186832.28276.4.camel@blabla.mcs.anl.gov> <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> Message-ID: <1189441360.14271.0.camel@blabla.mcs.anl.gov> You need to do an SVN update for both CoG and Swift. On Mon, 2007-09-10 at 11:03 -0500, Jing Tie wrote: > Hi, > > It has the same exception. Log is attached. > > Thanks, > Jing > > On 9/10/07, Mihael Hategan wrote: > > On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote: > > > It's definitely a bug in the code I put in a few days ago. But I don't > > > quite see how it happens. Such simple code. Yet how complex. I'll have > > > to get back to you on it. > > > > Fixed, I think. Can you try again? > > > > Mihael > > > > > > > > On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote: > > > > Sure. > > > > > > > > On 9/9/07, Mihael Hategan wrote: > > > > > Please post the whole workflow and the whole log. > > > > > > > > > > On Sun, 2007-09-09 at 11:43 -0500, Jing Tie wrote: > > > > > > Hi, > > > > > > > > > > > > I tried duplicate job again using the latest swift with > > > > > > "log4j.logger.org.globus.cog.abstraction=DEBUG". > > > > > > > > > > > > It didn't generated duplicate-*** directory under simple-wf-*** > > > > > > directory. Details: > > > > > > Resource (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > > > > successfully released > > > > > > Task(type=4, identity=urn:0-0-1189348296362) setting status to Failed > > > > > > Could not set current directory to "null" > > > > > > duplicate failed > > > > > > The following errors have occurred: > > > > > > 1. 
Could not initialize shared directory on NWICG_NotreDame > > > > > > Caused by: > > > > > > Could not set current directory to "null" > > > > > > Caused by: > > > > > > Required argument missing > > > > > > > > > > > > in simple-wf-fc06kzz28d880.log: > > > > > > 2007-09-09 09:31:39,911 DEBUG FileResourceCache Resource > > > > > > (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > > > > successfully released > > > > > > 2007-09-09 09:31:39,911 DEBUG TaskImpl Task(type=4, > > > > > > identity=urn:0-0-1189348296362) setting status to Failed Could not set > > > > > > current directory to "null" > > > > > > 2007-09-09 09:31:39,938 INFO vdl:mains Errors detected. Cleanup not done. > > > > > > 2007-09-09 09:31:39,963 DEBUG VDL2ExecutionContext Execution completed > > > > > > with errors > > > > > > Execution completed with errors > > > > > > > > > > > > There is nothing under > > > > > > osg.hpcc.nd.edu/dscratch/osg/data/osg/jtie/simple-wf-fc06kzz28d880 > > > > > > except empty shared directory. > > > > > > > > > > > > Thanks, > > > > > > Jing > > > > > > > > > > > > On 9/7/07, Mihael Hategan wrote: > > > > > > > Nevermind that. It's eating the empty stdin argument. This doesn't make > > > > > > > sense. Fork used to behave. > > > > > > > > > > > > > > Can you add this to log4j.properties, run again, and post the log? > > > > > > > > > > > > > > log4j.logger.org.globus.cog.abstraction=DEBUG > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 23:02 -0500, Mihael Hategan wrote: > > > > > > > > Can you post the workflow? > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 21:56 -0500, Jing Tie wrote: > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > I checked the procedure, and found that the filenames of the output > > > > > > > > > files of "myapp" are the same as "File f". So the first option should > > > > > > > > > not be the problem. 
> > > > > > > > > > > > > > > > > > Then I tried a very simple swift script that doesn't need R: > > > > > > > > > app function: add some lines to the input file to generate the output file > > > > > > > > > input file name: simpleFile.txt > > > > > > > > > output file name: simpleFile.output > > > > > > > > > application script location: $OSG_APP/osg/jtie/duplicate.sh > > > > > > > > > jobmanager: jobmanager-fork > > > > > > > > > > > > > > > > > > duplicate failed > > > > > > > > > The following errors have occurred: > > > > > > > > > 1. Application "duplicate" failed (Failed to link input file > > > > > > > > > /dscratch/osg/app/osg/jtie/duplicate.sh) > > > > > > > > > Arguments: "simpleFile.txt" > > > > > > > > > Host: NWICG_NotreDame > > > > > > > > > Directory: simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi > > > > > > > > > STDERR: > > > > > > > > > STDOUT: > > > > > > > > > > > > > > > > > > after execution, 3 empty directories (not files, and also weird name) > > > > > > > > > were generated: > > > > > > > > > duplicate-gt7hyvgi-simpleFile.output > > > > > > > > > duplicate-ht7hyvgi-simpleFile.output > > > > > > > > > duplicate-it7hyvgi-simpleFile.output > > > > > > > > > > > > > > > > > > wrapper.log: > > > > > > > > > DIR=duplicate-gt7hyvgi > > > > > > > > > STDOUT=simpleFile.output > > > > > > > > > STDERR=stderr.txt > > > > > > > > > DIRS=simpleFile.output > > > > > > > > > LINKS=/dscratch/osg/app/osg/jtie/duplicate.sh > > > > > > > > > OUTS=simpleFile.txt > > > > > > > > > ln: creating symbolic link > > > > > > > > > `duplicate-gt7hyvgi//dscratch/osg/app/osg/jtie/duplicate.sh' to > > > > > > > > > `/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared//dscratch/osg/app/osg/jtie/duplicate.sh': > > > > > > > > > No such file or directory > > > > > > > > > > > > > > > > > > under dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared: > > > > > > > > > seq.sh, simpleFile.txt, wrapper.sh > > > > > > > > > under dir 
/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi: > > > > > > > > > one directory - simpleFile.output > > > > > > > > > /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi/simpleFile.output: > > > > > > > > > empty > > > > > > > > > > > > > > > > > > I think the swift program is right, since I run it successfully using > > > > > > > > > localhost duplicate.sh. > > > > > > > > > > > > > > > > > > > > > > > > > > > Many thanks, > > > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan wrote: > > > > > > > > > > So when the output files are not created, there can be two reasons: > > > > > > > > > > 1. The specification of what files should be created is broken. This is, > > > > > > > > > > at this time, done by looking at the filenames of the return values from > > > > > > > > > > the atomic procedure. Normally one passes those file names to the > > > > > > > > > > application as output file parameters. Example: > > > > > > > > > > > > > > > > > > > > (File f) proc(...) { > > > > > > > > > > app { > > > > > > > > > > myapp ... "-o" @filename(f); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > 2. The specification is correct, but the application doesn't behave. 
> > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 12:35 -0500, Jing Tie wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > I tried jobmanger-fork instead of jobmanager-condor on osg.hpcc.nd.edu site: > > > > > > > > > > > > > > > > > > > > > (101-FBchannel18_cwt-avgResults.Rdata) not found> > > > > > > > > > > > > > > > > > > > > > > jobmanager-fork: > > > > > > > > > > > ------------------------ > > > > > > > > > > > Application exception: The following output files were not created by > > > > > > > > > > > the application: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/cwtsmall-0bhnnvgi > > > > > > > > > > > lrwxrwxrwx 1 osg osgusers 93 Sep 7 12:42 > > > > > > > > > > > 101-FBchannel10_cwt-avgResults.Rdata -> > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared/101-FBchannel10_cwt-avgResults.Rdata > > > > > > > > > > > ... 
(total 28 links, the same number as the number of the expected output files) > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:42 101_FB-epochs.Rdata > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 12:42 scripts > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 58 Sep 7 12:42 stderr.txt > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 12:41 101_FB-epochs.Rdata > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:41 scripts > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 12:41 seq.sh > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 12:41 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > empty kickstart directory > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/cat > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/status/cwtsmall-0bhnnvgi-error > > > > > > > > > > > The following output files were not created by the application: > > > > > > > > > > > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-condor: > > > > > > > > > > > ------------------------------- > > > > > > > > > > > Application exception: The following output files were not created by > > > > > > > > > > > the application: 101-FBchannel20_cwt-avgResults.Rdata > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi > > > > > > > > > > > lrwxrwxrwx 1 osg osgusers 76 Sep 7 13:01 101_FB-epochs.Rdata -> > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB-epochs.Rdata > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 scripts > > > > > > > > > > > 
-rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 stderr.txt > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 13:00 101_FB-epochs.Rdata > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 13:00 scripts > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 13:00 seq.sh > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 13:00 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > I think the descriptions of exception are all right now. The > > > > > > > > > > > difference between fork and condor was that fork created the output > > > > > > > > > > > links to the shared directory, but condor didn't. But the essential > > > > > > > > > > > problem is the output files not being created. I will do more > > > > > > > > > > > experiments to see whether the problem of file system or application. > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > Jing > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > Swift-user mailing list > > > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > Swift-user mailing list > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-user mailing list > > > Swift-user at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > From tiejing at gmail.com Mon Sep 10 11:26:12 2007 From: tiejing at gmail.com 
(Jing Tie) Date: Mon, 10 Sep 2007 11:26:12 -0500 Subject: [Swift-user] follow up of kickstart executable not found problem In-Reply-To: <1189441360.14271.0.camel@blabla.mcs.anl.gov> References: <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> <1189441360.14271.0.camel@blabla.mcs.anl.gov> Message-ID: Hi, It works fine now (log is attached). I'll try sid program next. Many thanks, Jing On 9/10/07, Mihael Hategan wrote: > You need to do an SVN update for both CoG and Swift. -------------- next part -------------- A non-text attachment was scrubbed... Name: simple-wf-7bf5sd95dccz1.log Type: application/octet-stream Size: 24973 bytes Desc: not available URL:
From foster at mcs.anl.gov Mon Sep 10 11:27:41 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 10 Sep 2007 11:27:41 -0500 Subject: [Swift-user] follow up of kickstart executable not found problem In-Reply-To: References: <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> <1189441360.14271.0.camel@blabla.mcs.anl.gov> Message-ID: <46E5707D.1070203@mcs.anl.gov> great! Jing Tie wrote: > Hi, > > It works fine now (log is attached). I'll try sid program next.
> > Many thanks, > Jing > > On 9/10/07, Mihael Hategan > wrote: > > You need to do an SVN update for both CoG and Swift. > > > > On Mon, 2007-09-10 at 11:03 -0500, Jing Tie wrote: > > > Hi, > > > > > > It has the same exception. Log is attached. > > > > > > Thanks, > > > Jing > > > > > > On 9/10/07, Mihael Hategan > wrote: > > > > On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote: > > > > > It's definitely a bug in the code I put in a few days ago. But > I don't > > > > > quite see how it happens. Such simple code. Yet how complex. > I'll have > > > > > to get back to you on it. > > > > > > > > Fixed, I think. Can you try again? > > > > > > > > Mihael > > > > > > > > > > > > > > On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote: > > > > > > Sure. > > > > > > > > > > > > On 9/9/07, Mihael Hategan < hategan at mcs.anl.gov > > wrote: > > > > > > > Please post the whole workflow and the whole log. > > > > > > > > > > > > > > On Sun, 2007-09-09 at 11:43 -0500, Jing Tie wrote: > > > > > > > > Hi, > > > > > > > > > > > > > > > > I tried duplicate job again using the latest swift with > > > > > > > > "log4j.logger.org.globus.cog.abstraction=DEBUG ". > > > > > > > > > > > > > > > > It didn't generated duplicate-*** directory under > simple-wf-*** > > > > > > > > directory. Details: > > > > > > > > Resource ( > org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > > > > > > successfully released > > > > > > > > Task(type=4, identity=urn:0-0-1189348296362) setting > status to Failed > > > > > > > > Could not set current directory to "null" > > > > > > > > duplicate failed > > > > > > > > The following errors have occurred: > > > > > > > > 1. 
Could not initialize shared directory on NWICG_NotreDame > > > > > > > > Caused by: > > > > > > > > Could not set current directory to "null" > > > > > > > > Caused by: > > > > > > > > Required argument missing > > > > > > > > > > > > > > > > in simple-wf-fc06kzz28d880.log : > > > > > > > > 2007-09-09 09:31:39,911 DEBUG FileResourceCache Resource > > > > > > > > > (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > > > > > > successfully released > > > > > > > > 2007-09-09 09:31:39,911 DEBUG TaskImpl Task(type=4, > > > > > > > > identity=urn:0-0-1189348296362) setting status to Failed > Could not set > > > > > > > > current directory to "null" > > > > > > > > 2007-09-09 09:31:39,938 INFO vdl:mains Errors detected. > Cleanup not done. > > > > > > > > 2007-09-09 09:31:39,963 DEBUG VDL2ExecutionContext > Execution completed > > > > > > > > with errors > > > > > > > > Execution completed with errors > > > > > > > > > > > > > > > > There is nothing under > > > > > > > > > osg.hpcc.nd.edu/dscratch/osg/data/osg/jtie/simple-wf-fc06kzz28d880 > > > > > > > > > except empty shared directory. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Jing > > > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan < hategan at mcs.anl.gov > > wrote: > > > > > > > > > Nevermind that. It's eating the empty stdin argument. > This doesn't make > > > > > > > > > sense. Fork used to behave. > > > > > > > > > > > > > > > > > > Can you add this to log4j.properties, run again, and > post the log? > > > > > > > > > > > > > > > > > > log4j.logger.org.globus.cog.abstraction=DEBUG > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 23:02 -0500, Mihael Hategan wrote: > > > > > > > > > > Can you post the workflow? > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 21:56 -0500, Jing Tie wrote: > > > > > > > > > > > Thanks! 
> > > > > > > > > > > > > > > > > > > > > > I checked the procedure, and found that the > filenames of the output > > > > > > > > > > > files of "myapp" are the same as "File f". So the > first option should > > > > > > > > > > > not be the problem. > > > > > > > > > > > > > > > > > > > > > > Then I tried a very simple swift script that > doesn't need R: > > > > > > > > > > > app function: add some lines to the input file to > generate the output file > > > > > > > > > > > input file name: simpleFile.txt > > > > > > > > > > > output file name: simpleFile.output > > > > > > > > > > > application script location: > $OSG_APP/osg/jtie/duplicate.sh > > > > > > > > > > > jobmanager: jobmanager-fork > > > > > > > > > > > > > > > > > > > > > > duplicate failed > > > > > > > > > > > The following errors have occurred: > > > > > > > > > > > 1. Application "duplicate" failed (Failed to link > input file > > > > > > > > > > > /dscratch/osg/app/osg/jtie/duplicate.sh) > > > > > > > > > > > Arguments: "simpleFile.txt" > > > > > > > > > > > Host: NWICG_NotreDame > > > > > > > > > > > Directory: > simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi > > > > > > > > > > > STDERR: > > > > > > > > > > > STDOUT: > > > > > > > > > > > > > > > > > > > > > > after execution, 3 empty directories (not files, > and also weird name) > > > > > > > > > > > were generated: > > > > > > > > > > > duplicate-gt7hyvgi-simpleFile.output > > > > > > > > > > > duplicate-ht7hyvgi-simpleFile.output > > > > > > > > > > > duplicate-it7hyvgi-simpleFile.output > > > > > > > > > > > > > > > > > > > > > > wrapper.log: > > > > > > > > > > > DIR=duplicate-gt7hyvgi > > > > > > > > > > > STDOUT=simpleFile.output > > > > > > > > > > > STDERR=stderr.txt > > > > > > > > > > > DIRS=simpleFile.output > > > > > > > > > > > LINKS=/dscratch/osg/app/osg/jtie/duplicate.sh > > > > > > > > > > > OUTS=simpleFile.txt > > > > > > > > > > > ln: creating symbolic link > > > > > > > > > > > > 
`duplicate-gt7hyvgi//dscratch/osg/app/osg/jtie/duplicate.sh' to > > > > > > > > > > > > `/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared//dscratch/osg/app/osg/jtie/duplicate.sh': > > > > > > > > > > > > No such file or directory > > > > > > > > > > > > > > > > > > > > > > under dir > /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared: > > > > > > > > > > > seq.sh, simpleFile.txt, wrapper.sh > > > > > > > > > > > under dir > /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi: > > > > > > > > > > > one directory - simpleFile.output > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi/simpleFile.output: > > > > > > > > > > > > empty > > > > > > > > > > > > > > > > > > > > > > I think the swift program is right, since I run it > successfully using > > > > > > > > > > > localhost duplicate.sh. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Many thanks, > > > > > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan < hategan at mcs.anl.gov > > wrote: > > > > > > > > > > > > So when the output files are not created, there > can be two reasons: > > > > > > > > > > > > 1. The specification of what files should be > created is broken. This is, > > > > > > > > > > > > at this time, done by looking at the filenames > of the return values from > > > > > > > > > > > > the atomic procedure. Normally one passes those > file names to the > > > > > > > > > > > > application as output file parameters. Example: > > > > > > > > > > > > > > > > > > > > > > > > (File f) proc(...) { > > > > > > > > > > > > app { > > > > > > > > > > > > myapp ... "-o" @filename(f); > > > > > > > > > > > > } > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > 2. The specification is correct, but the > application doesn't behave. 
> > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 12:35 -0500, Jing Tie wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > > > I tried jobmanger-fork instead of > jobmanager-condor on osg.hpcc.nd.edu site: > > > > > > > > > > > > > kickstart executable > > > > > > > > > > > > > (101-FBchannel18_cwt- avgResults.Rdata) not found> > > > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-fork: > > > > > > > > > > > > > ------------------------ > > > > > > > > > > > > > Application exception: The following output > files were not created by > > > > > > > > > > > > > the application: > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager > /bin/ls -al > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/cwtsmall-0bhnnvgi > > > > > > > > > > > > > lrwxrwxrwx 1 osg osgusers 93 Sep 7 12:42 > > > > > > > > > > > > > 101-FBchannel10_cwt-avgResults.Rdata -> > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared/101-FBchannel10_cwt- > avgResults.Rdata > > > > > > > > > > > > > ... 
(total 28 links, the same number as the > number of the expected output files) > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:42 > 101_FB- epochs.Rdata > > > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 12:42 > scripts > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 58 Sep 7 12:42 > stderr.txt > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager > /bin/ls -al > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 > 12:41 101_FB- epochs.Rdata > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 > 12:41 scripts > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 > 12:41 seq.sh > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 > 12:41 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > > > > > empty kickstart directory > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager > /bin/cat > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/status/cwtsmall-0bhnnvgi-error > > > > > > > > > > > > > The following output files were not created by > the application: > > > > > > > > > > > > > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-condor: > > > > > > > > > > > > > ------------------------------- > > > > > > > > > > > > > Application exception: The following output > files were not created by > > > > > > > > > > > > > the application: > 101-FBchannel20_cwt-avgResults.Rdata > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager > /bin/ls -al > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi > > > > > > > > > > > > > lrwxrwxrwx 1 osg osgusers 76 Sep 7 13:01 > 101_FB-epochs.Rdata -> > > > > > > > > > > > > > > 
/dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB- > epochs.Rdata > > > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 > scripts > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 > stderr.txt > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run osg.hpcc.nd.edu/jobmanager > /bin/ls -al > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 > 13:00 101_FB- epochs.Rdata > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 > 13:00 scripts > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 > 13:00 seq.sh > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 > 13:00 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > > > > > I think the descriptions of exception are all > right now. The > > > > > > > > > > > > > difference between fork and condor was that > fork created the output > > > > > > > > > > > > > links to the shared directory, but condor > didn't. But the essential > > > > > > > > > > > > > problem is the output files not being created. > I will do more > > > > > > > > > > > > > experiments to see whether the problem of file > system or application. 
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Jing
> > > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > > Swift-user mailing list
> > > > > > > > > > > > > Swift-user at ci.uchicago.edu
> > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user

--
Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
Globus Alliance: www.globus.org.
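Editorial note on the wrapper.log excerpt quoted in this thread: the `ln` error there shows the absolute path /dscratch/osg/app/osg/jtie/duplicate.sh appended to both the job directory and the shared directory, which gives the doubled slash in the message. The following is a minimal sketch of that failure mode, not the actual wrapper.sh logic. The temporary directories and the helper `try_link` are hypothetical stand-ins introduced only for illustration.

```shell
#!/bin/sh
# Sketch only: temporary directories stand in for the site's shared and job
# directories; none of these names come from Swift itself.
work=$(mktemp -d)
shared="$work/shared"
jobdir="$work/duplicate-job"
mkdir -p "$shared" "$jobdir"

# try_link NAME: link $shared/NAME into $jobdir/NAME and print the outcome.
try_link() {
    if ln -s "$shared/$1" "$jobdir/$1" 2>/dev/null; then
        echo "linked"
    else
        echo "failed"
    fi
}

# A relative input name links cleanly into the job directory.
touch "$shared/simpleFile.txt"
relative_result=$(try_link "simpleFile.txt")

# An absolute path used as an input name produces a doubled slash, and the
# link's parent directory ($jobdir/dscratch/...) does not exist, so ln fails.
absolute_input="/dscratch/osg/app/osg/jtie/duplicate.sh"
absolute_link="$jobdir/$absolute_input"
absolute_result=$(try_link "$absolute_input")

echo "relative input: $relative_result"
echo "absolute input: $absolute_result ($absolute_link)"

# Clean up the temporary tree.
rm -rf "$work"
```

The sketch only illustrates why `ln` reports "No such file or directory" when the link name points into a directory that was never created. It does not say which part of the Swift or CoG code passed the absolute path as an input file.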
From hategan at mcs.anl.gov Mon Sep 10 11:35:06 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 10 Sep 2007 11:35:06 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To:
References: <1189224120.4369.0.camel@blabla.mcs.anl.gov> <1189226518.5306.4.camel@blabla.mcs.anl.gov> <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> <1189441360.14271.0.camel@blabla.mcs.anl.gov>
Message-ID: <1189442106.14271.3.camel@blabla.mcs.anl.gov>

Can you try it with Condor? It should work now.

On Mon, 2007-09-10 at 11:26 -0500, Jing Tie wrote:
> Hi,
>
> It works fine now (log is attached). I'll try sid program next.
>
> Many thanks,
> Jing
From tiejing at gmail.com Mon Sep 10 12:27:30 2007
From: tiejing at gmail.com (Jing Tie)
Date: Mon, 10 Sep 2007 12:27:30 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To: <1189442106.14271.3.camel@blabla.mcs.anl.gov>
References: <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> <1189441360.14271.0.camel@blabla.mcs.anl.gov> <1189442106.14271.3.camel@blabla.mcs.anl.gov>
Message-ID:

Hi,

The test job works fine with jobmanager-condor. But SID program has an
exception:

cwtsmall failed
The following errors have occurred:
1. Application "cwtsmall" failed (Exit code 126)
        Arguments: "scripts/runWaveletsAvg.R, 102, FB"
        Host: NWICG_NotreDame
        Directory: sid-wf1-5blglq655nj21/cwtsmall-7hdhm1hi
        STDERR: shared/wrapper.sh: line 164: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh: Text file busy
        STDOUT:

log file is attached. I checked that wavelet.sh is in good condition
under /dscratch/osg/app/osg/jtie/SIDGrid/. No wrapper.log was
generated.

Thanks,
Jing

On 9/10/07, Mihael Hategan wrote:
> Can you try it with Condor? It should work now.
13:01 > > 101_FB-epochs.Rdata -> > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB- > > epochs.Rdata > > > > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 > > scripts > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 > > stderr.txt > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 > > 13:00 101_FB- epochs.Rdata > > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 > > 13:00 scripts > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 > > 13:00 seq.sh > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 > > 13:00 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the descriptions of exception are all > > right now. The > > > > > > > > > > > > > > difference between fork and condor was that > > fork created the output > > > > > > > > > > > > > > links to the shared directory, but condor > > didn't. But the essential > > > > > > > > > > > > > > problem is the output files not being created. > > I will do more > > > > > > > > > > > > > > experiments to see whether the problem of file > > system or application. 
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Jing
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sid-wf1-5blglq655nj21.log
Type: application/octet-stream
Size: 60860 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Mon Sep 10 13:37:45 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 10 Sep 2007 13:37:45 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To:
References: <1189357391.21155.0.camel@blabla.mcs.anl.gov> <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> <1189441360.14271.0.camel@blabla.mcs.anl.gov> <1189442106.14271.3.camel@blabla.mcs.anl.gov>
Message-ID: <1189449465.16438.1.camel@blabla.mcs.anl.gov>

On Mon, 2007-09-10 at 12:27 -0500, Jing Tie wrote:
> Hi,
>
> The test job works fine with jobmanager-condor. But SID program has an
> exception:
>
> cwtsmall failed
> The following errors have occurred:
> 1. Application "cwtsmall" failed (Exit code 126)
> Arguments: "scripts/runWaveletsAvg.R, 102, FB"
> Host: NWICG_NotreDame
> Directory: sid-wf1-5blglq655nj21/cwtsmall-7hdhm1hi
> STDERR: shared/wrapper.sh: line 164: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh: Text file busy
> STDOUT:
>
> log file is attached.
>
> I checked that wavelet.sh is in good condition under
> /dscratch/osg/app/osg/jtie/SIDGrid/. No wrapper.log was generated.

The wrapper log is no more. Relevant information about a job (including
what its wrapper does) can be found in info/-info. In this case
info/cwtsmall-7hdhm1hi-info.

Can you post that file?

Mihael

>
> Thanks,
> Jing
>
> On 9/10/07, Mihael Hategan wrote:
> > Can you try it with Condor? It should work now.
> >
> > On Mon, 2007-09-10 at 11:26 -0500, Jing Tie wrote:
> > > Hi,
> > >
> > > It works fine now (log is attached). I'll try sid program next.
> > >
> > > Many thanks,
> > > Jing
> > >
> > > On 9/10/07, Mihael Hategan wrote:
> > > > You need to do an SVN update for both CoG and Swift.
> > > >
> > > > On Mon, 2007-09-10 at 11:03 -0500, Jing Tie wrote:
> > > > > Hi,
> > > > >
> > > > > It has the same exception. Log is attached.
> > > > >
> > > > > Thanks,
> > > > > Jing
> > > > >
> > > > > On 9/10/07, Mihael Hategan wrote:
> > > > > > On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote:
> > > > > > > It's definitely a bug in the code I put in a few days ago. But I don't
> > > > > > > quite see how it happens. Such simple code. Yet how complex. I'll have
> > > > > > > to get back to you on it.
> > > > > >
> > > > > > Fixed, I think. Can you try again?
> > > > > >
> > > > > > Mihael
> > > > > >
> > > > > > >
> > > > > > > On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote:
> > > > > > > > Sure.
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Jing

From tiejing at gmail.com Mon Sep 10 13:46:49 2007
From: tiejing at gmail.com (Jing Tie)
Date: Mon, 10 Sep 2007 13:46:49 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To: <1189449465.16438.1.camel@blabla.mcs.anl.gov>
References: <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> <1189441360.14271.0.camel@blabla.mcs.anl.gov> <1189442106.14271.3.camel@blabla.mcs.anl.gov> <1189449465.16438.1.camel@blabla.mcs.anl.gov>
Message-ID:

Sure.

On 9/10/07, Mihael Hategan wrote:
> On Mon, 2007-09-10 at 12:27 -0500, Jing Tie wrote:
> > Hi,
> >
> > The test job works fine with jobmanager-condor. But SID program has an
> > exception:
> >
> > cwtsmall failed
> > The following errors have occurred:
> > 1. Application "cwtsmall" failed (Exit code 126)
> > Arguments: "scripts/runWaveletsAvg.R, 102, FB"
> > Host: NWICG_NotreDame
> > Directory: sid-wf1-5blglq655nj21/cwtsmall-7hdhm1hi
> > STDERR: shared/wrapper.sh: line 164: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh: Text file busy
> > STDOUT:
> >
> > log file is attached.
> >
> > I checked that wavelet.sh is in good condition under
> > /dscratch/osg/app/osg/jtie/SIDGrid/. No wrapper.log was generated.
>
> The wrapper log is no more. Relevant information about a job (including
> what its wrapper does) can be found in info/-info. In this case
> info/cwtsmall-7hdhm1hi-info.
>
> Can you post that file?
>
> Mihael
>
> >
> > Thanks,
> > Jing
> >
> > On 9/10/07, Mihael Hategan wrote:
> > > Can you try it with Condor? It should work now.
> > >
> > > On Mon, 2007-09-10 at 11:26 -0500, Jing Tie wrote:
> > > > Hi,
> > > >
> > > > It works fine now (log is attached). I'll try sid program next.
> > > >
> > > > Many thanks,
> > > > Jing
> > > >
> > > > On 9/10/07, Mihael Hategan wrote:
> > > > > You need to do an SVN update for both CoG and Swift.
> > > > >
> > > > > On Mon, 2007-09-10 at 11:03 -0500, Jing Tie wrote:
> > > > > > Hi,
> > > > > >
> > > > > > It has the same exception. Log is attached.
> > > > > >
> > > > > > Thanks,
> > > > > > Jing
> > > > > >
> > > > > > On 9/10/07, Mihael Hategan wrote:
> > > > > > > On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote:
> > > > > > > > It's definitely a bug in the code I put in a few days ago. But I don't
> > > > > > > > quite see how it happens. Such simple code. Yet how complex. I'll have
> > > > > > > > to get back to you on it.
> > > > > > >
> > > > > > > Fixed, I think. Can you try again?
> > > > > > >
> > > > > > > Mihael
> > > > > > >
> > > > > > > >
> > > > > > > > On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote:
> > > > > > > > > Sure.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > > > Swift-user mailing list > > > > > > > > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > Swift-user mailing list > > > > > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > Swift-user mailing list > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: cwtsmall-7hdhm1hi-info Type: application/octet-stream Size: 8245 bytes Desc: not available URL: From tiejing at gmail.com Mon Sep 10 13:48:31 2007 From: tiejing at gmail.com (Jing Tie) Date: Mon, 10 Sep 2007 13:48:31 -0500 Subject: [Swift-user] follow up of kickstart executable not found problem In-Reply-To: <1189449465.16438.1.camel@blabla.mcs.anl.gov> References: <1189358977.21560.1.camel@blabla.mcs.anl.gov> <1189439244.11819.0.camel@blabla.mcs.anl.gov> <1189441360.14271.0.camel@blabla.mcs.anl.gov> <1189442106.14271.3.camel@blabla.mcs.anl.gov> <1189449465.16438.1.camel@blabla.mcs.anl.gov> Message-ID: Sure. 
On 9/10/07, Mihael Hategan wrote:
> On Mon, 2007-09-10 at 12:27 -0500, Jing Tie wrote:
> > Hi,
> >
> > The test job works fine with jobmanager-condor. But SID program has an
> > exception:
> >
> > cwtsmall failed
> > The following errors have occurred:
> > 1. Application "cwtsmall" failed (Exit code 126)
> > Arguments: "scripts/runWaveletsAvg.R, 102, FB"
> > Host: NWICG_NotreDame
> > Directory: sid-wf1-5blglq655nj21/cwtsmall-7hdhm1hi
> > STDERR: shared/wrapper.sh: line 164:
> > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh: Text file busy
> > STDOUT:
> >
> > log file is attached.
> >
> > I checked that wavelet.sh is in good condition under
> > /dscratch/osg/app/osg/jtie/SIDGrid/. No wrapper.log was generated.
>
> The wrapper log is no more. Relevant information about a job (including
> what its wrapper does) can be found in info/-info. In this case
> info/cwtsmall-7hdhm1hi-info.
>
> Can you post that file?
>
> Mihael
>
> >
> > Thanks,
> > Jing
> >
> > On 9/10/07, Mihael Hategan wrote:
> > > Can you try it with Condor? It should work now.
> > >
> > > On Mon, 2007-09-10 at 11:26 -0500, Jing Tie wrote:
> > > > Hi,
> > > >
> > > > It works fine now (log is attached). I'll try sid program next.
> > > >
> > > > Many thanks,
> > > > Jing
> > > >
> > > > On 9/10/07, Mihael Hategan wrote:
> > > > > You need to do an SVN update for both CoG and Swift.
> > > > >
> > > > > On Mon, 2007-09-10 at 11:03 -0500, Jing Tie wrote:
> > > > > > Hi,
> > > > > >
> > > > > > It has the same exception. Log is attached.
> > > > > >
> > > > > > Thanks,
> > > > > > Jing
> > > > > >
> > > > > > On 9/10/07, Mihael Hategan wrote:
> > > > > > > On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote:
> > > > > > > > It's definitely a bug in the code I put in a few days ago. But I don't
> > > > > > > > quite see how it happens. Such simple code. Yet how complex. I'll have
> > > > > > > > to get back to you on it.
> > > > > > >
> > > > > > > Fixed, I think. Can you try again?
> > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote: > > > > > > > > > Sure. > > > > > > > > > > > > > > > > > > On 9/9/07, Mihael Hategan < hategan at mcs.anl.gov> wrote: > > > > > > > > > > Please post the whole workflow and the whole log. > > > > > > > > > > > > > > > > > > > > On Sun, 2007-09-09 at 11:43 -0500, Jing Tie wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > I tried duplicate job again using the latest swift with > > > > > > > > > > > "log4j.logger.org.globus.cog.abstraction=DEBUG ". > > > > > > > > > > > > > > > > > > > > > > It didn't generated duplicate-*** directory under > > > > simple-wf-*** > > > > > > > > > > > directory. Details: > > > > > > > > > > > Resource > > > > ( org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > > > > > > > > > successfully released > > > > > > > > > > > Task(type=4, identity=urn:0-0-1189348296362) setting > > > > status to Failed > > > > > > > > > > > Could not set current directory to "null" > > > > > > > > > > > duplicate failed > > > > > > > > > > > The following errors have occurred: > > > > > > > > > > > 1. 
Could not initialize shared directory on > > > > NWICG_NotreDame > > > > > > > > > > > Caused by: > > > > > > > > > > > Could not set current directory to "null" > > > > > > > > > > > Caused by: > > > > > > > > > > > Required argument missing > > > > > > > > > > > > > > > > > > > > > > in simple-wf-fc06kzz28d880.log : > > > > > > > > > > > 2007-09-09 09:31:39,911 DEBUG FileResourceCache Resource > > > > > > > > > > > > > > > (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > > > > > > > > > successfully released > > > > > > > > > > > 2007-09-09 09:31:39,911 DEBUG TaskImpl Task(type=4, > > > > > > > > > > > identity=urn:0-0-1189348296362) setting status to Failed > > > > Could not set > > > > > > > > > > > current directory to "null" > > > > > > > > > > > 2007-09-09 09:31:39,938 INFO vdl:mains Errors detected. > > > > Cleanup not done. > > > > > > > > > > > 2007-09-09 09:31:39,963 DEBUG VDL2ExecutionContext > > > > Execution completed > > > > > > > > > > > with errors > > > > > > > > > > > Execution completed with errors > > > > > > > > > > > > > > > > > > > > > > There is nothing under > > > > > > > > > > > > > > > osg.hpcc.nd.edu/dscratch/osg/data/osg/jtie/simple-wf-fc06kzz28d880 > > > > > > > > > > > except empty shared directory. > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan < hategan at mcs.anl.gov> wrote: > > > > > > > > > > > > Nevermind that. It's eating the empty stdin argument. > > > > This doesn't make > > > > > > > > > > > > sense. Fork used to behave. > > > > > > > > > > > > > > > > > > > > > > > > Can you add this to log4j.properties, run again, and > > > > post the log? 
> > > > > > > > > > > > > > > > > > > > > > > > log4j.logger.org.globus.cog.abstraction=DEBUG > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 23:02 -0500, Mihael Hategan > > > > wrote: > > > > > > > > > > > > > Can you post the workflow? > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 21:56 -0500, Jing Tie wrote: > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > > > > > > > > > > > I checked the procedure, and found that the > > > > filenames of the output > > > > > > > > > > > > > > files of "myapp" are the same as "File f". So the > > > > first option should > > > > > > > > > > > > > > not be the problem. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Then I tried a very simple swift script that > > > > doesn't need R: > > > > > > > > > > > > > > app function: add some lines to the input file to > > > > generate the output file > > > > > > > > > > > > > > input file name: simpleFile.txt > > > > > > > > > > > > > > output file name: simpleFile.output > > > > > > > > > > > > > > application script location: > > > > $OSG_APP/osg/jtie/duplicate.sh > > > > > > > > > > > > > > jobmanager: jobmanager-fork > > > > > > > > > > > > > > > > > > > > > > > > > > > > duplicate failed > > > > > > > > > > > > > > The following errors have occurred: > > > > > > > > > > > > > > 1. 
Application "duplicate" failed (Failed to link > > > > input file > > > > > > > > > > > > > > /dscratch/osg/app/osg/jtie/duplicate.sh) > > > > > > > > > > > > > > Arguments: "simpleFile.txt" > > > > > > > > > > > > > > Host: NWICG_NotreDame > > > > > > > > > > > > > > Directory: > > > > simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi > > > > > > > > > > > > > > STDERR: > > > > > > > > > > > > > > STDOUT: > > > > > > > > > > > > > > > > > > > > > > > > > > > > after execution, 3 empty directories (not files, > > > > and also weird name) > > > > > > > > > > > > > > were generated: > > > > > > > > > > > > > > duplicate-gt7hyvgi-simpleFile.output > > > > > > > > > > > > > > duplicate-ht7hyvgi-simpleFile.output > > > > > > > > > > > > > > duplicate-it7hyvgi-simpleFile.output > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrapper.log: > > > > > > > > > > > > > > DIR=duplicate-gt7hyvgi > > > > > > > > > > > > > > STDOUT=simpleFile.output > > > > > > > > > > > > > > STDERR=stderr.txt > > > > > > > > > > > > > > DIRS=simpleFile.output > > > > > > > > > > > > > > LINKS=/dscratch/osg/app/osg/jtie/duplicate.sh > > > > > > > > > > > > > > OUTS=simpleFile.txt > > > > > > > > > > > > > > ln: creating symbolic link > > > > > > > > > > > > > > > > > > `duplicate-gt7hyvgi//dscratch/osg/app/osg/jtie/duplicate.sh' to > > > > > > > > > > > > > > > > > > `/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared//dscratch/osg/app/osg/jtie/duplicate.sh': > > > > > > > > > > > > > > No such file or directory > > > > > > > > > > > > > > > > > > > > > > > > > > > > under > > > > dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared: > > > > > > > > > > > > > > seq.sh, simpleFile.txt, wrapper.sh > > > > > > > > > > > > > > under > > > > dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi: > > > > > > > > > > > > > > one directory - simpleFile.output > > > > > > > > > > > > > > > > > > 
/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi/simpleFile.output:
> > > > > > > > > > > > > > empty
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think the swift program is right, since I run it successfully using
> > > > > > > > > > > > > > localhost duplicate.sh.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Many thanks,
> > > > > > > > > > > > > > Jing
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 9/7/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > > > > > > > > > > > So when the output files are not created, there can be two reasons:
> > > > > > > > > > > > > > > 1. The specification of what files should be created is broken. This is,
> > > > > > > > > > > > > > > at this time, done by looking at the filenames of the return values from
> > > > > > > > > > > > > > > the atomic procedure. Normally one passes those file names to the
> > > > > > > > > > > > > > > application as output file parameters. Example:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (File f) proc(...) {
> > > > > > > > > > > > > > >     app {
> > > > > > > > > > > > > > >         myapp ... "-o" @filename(f);
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. The specification is correct, but the application doesn't behave.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 12:35 -0500, Jing Tie > > > > wrote: > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I tried jobmanger-fork instead of > > > > jobmanager-condor on osg.hpcc.nd.edu site: > > > > > > > > > > > > > > > > > > > kickstart executable > > > > > > > > > > > > > > > > (101-FBchannel18_cwt- avgResults.Rdata) not > > > > found> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-fork: > > > > > > > > > > > > > > > > ------------------------ > > > > > > > > > > > > > > > > Application exception: The following output > > > > files were not created by > > > > > > > > > > > > > > > > the > > > > application: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/cwtsmall-0bhnnvgi > > > > > > > > > > > > > > > > lrwxrwxrwx 1 osg osgusers 93 Sep 7 12:42 > > > > > > > > > > > > > > > > 101-FBchannel10_cwt-avgResults.Rdata -> > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared/101-FBchannel10_cwt- avgResults.Rdata > > > > > > > > > > > > > > > > ... 
(total 28 links, the same number as the > > > > number of the expected output files) > > > > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:42 > > > > 101_FB- epochs.Rdata > > > > > > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 12:42 > > > > scripts > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 58 Sep 7 12:42 > > > > stderr.txt > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 > > > > 12:41 101_FB- epochs.Rdata > > > > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 > > > > 12:41 scripts > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 > > > > 12:41 seq.sh > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 > > > > 12:41 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > empty kickstart directory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > osg.hpcc.nd.edu/jobmanager /bin/cat > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/status/cwtsmall-0bhnnvgi-error > > > > > > > > > > > > > > > > The following output files were not created by > > > > the application: > > > > > > > > > > > > > > > > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-condor: > > > > > > > > > > > > > > > > ------------------------------- > > > > > > > > > > > > > > > > Application exception: The following output > > > > files were not created by > > > > > > > > > > > > > > > > the application: > > > > 101-FBchannel20_cwt-avgResults.Rdata > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > osg.hpcc.nd.edu/jobmanager /bin/ls 
-al > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi > > > > > > > > > > > > > > > > lrwxrwxrwx 1 osg osgusers 76 Sep 7 13:01 > > > > 101_FB-epochs.Rdata -> > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB- > > > > epochs.Rdata > > > > > > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 > > > > scripts > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 > > > > stderr.txt > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 > > > > 13:00 101_FB- epochs.Rdata > > > > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 > > > > 13:00 scripts > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 > > > > 13:00 seq.sh > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 > > > > 13:00 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the descriptions of exception are all > > > > right now. The > > > > > > > > > > > > > > > > difference between fork and condor was that > > > > fork created the output > > > > > > > > > > > > > > > > links to the shared directory, but condor > > > > didn't. But the essential > > > > > > > > > > > > > > > > problem is the output files not being created. > > > > I will do more > > > > > > > > > > > > > > > > experiments to see whether the problem of file > > > > system or application. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > > > Swift-user mailing list > > > > > > > > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > Swift-user mailing list > > > > > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > Swift-user mailing list > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cwtsmall-7hdhm1hi-info
Type: application/octet-stream
Size: 8245 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Mon Sep 10 14:07:07 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 10 Sep 2007 14:07:07 -0500
Subject: [Swift-user] follow up of kickstart executable not found problem
In-Reply-To:
References: <1189358977.21560.1.camel@blabla.mcs.anl.gov>
 <1189439244.11819.0.camel@blabla.mcs.anl.gov>
 <1189441360.14271.0.camel@blabla.mcs.anl.gov>
 <1189442106.14271.3.camel@blabla.mcs.anl.gov>
 <1189449465.16438.1.camel@blabla.mcs.anl.gov>
Message-ID: <1189451227.17221.3.camel@blabla.mcs.anl.gov>

It looks like something is writing to wavelet.sh.

On Mon, 2007-09-10 at 13:48 -0500, Jing Tie wrote:
> Sure.
>
> On 9/10/07, Mihael Hategan wrote:
> > On Mon, 2007-09-10 at 12:27 -0500, Jing Tie wrote:
> > > Hi,
> > >
> > > The test job works fine with jobmanager-condor. But SID program has an
> > > exception:
> > >
> > > cwtsmall failed
> > > The following errors have occurred:
> > > 1. Application "cwtsmall" failed (Exit code 126)
> > > Arguments: "scripts/runWaveletsAvg.R, 102, FB"
> > > Host: NWICG_NotreDame
> > > Directory: sid-wf1-5blglq655nj21/cwtsmall-7hdhm1hi
> > > STDERR: shared/wrapper.sh: line 164:
> > > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh: Text file busy
> > > STDOUT:
> > >
> > > log file is attached.
> > >
> > > I checked that wavelet.sh is in good condition under
> > > /dscratch/osg/app/osg/jtie/SIDGrid/. No wrapper.log was generated.
> >
> > The wrapper log is no more. Relevant information about a job (including
> > what its wrapper does) can be found in info/-info. In this case
> > info/cwtsmall-7hdhm1hi-info.
> >
> > Can you post that file?
> >
> > Mihael
> >
> > >
> > > Thanks,
> > > Jing
> > >
> > > On 9/10/07, Mihael Hategan wrote:
> > > > Can you try it with Condor? It should work now.
> > > > > > > > On Mon, 2007-09-10 at 11:26 -0500, Jing Tie wrote: > > > > > Hi, > > > > > > > > > > It works fine now (log is attached). I'll try sid program next. > > > > > > > > > > Many thanks, > > > > > Jing > > > > > > > > > > On 9/10/07, Mihael Hategan wrote: > > > > > > You need to do an SVN update for both CoG and Swift. > > > > > > > > > > > > On Mon, 2007-09-10 at 11:03 -0500, Jing Tie wrote: > > > > > > > Hi, > > > > > > > > > > > > > > It has the same exception. Log is attached. > > > > > > > > > > > > > > Thanks, > > > > > > > Jing > > > > > > > > > > > > > > On 9/10/07, Mihael Hategan wrote: > > > > > > > > On Sun, 2007-09-09 at 12:29 -0500, Mihael Hategan wrote: > > > > > > > > > It's definitely a bug in the code I put in a few days ago. But > > > > > I don't > > > > > > > > > quite see how it happens. Such simple code. Yet how complex. > > > > > I'll have > > > > > > > > > to get back to you on it. > > > > > > > > > > > > > > > > Fixed, I think. Can you try again? > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > On Sun, 2007-09-09 at 12:09 -0500, Jing Tie wrote: > > > > > > > > > > Sure. > > > > > > > > > > > > > > > > > > > > On 9/9/07, Mihael Hategan < hategan at mcs.anl.gov> wrote: > > > > > > > > > > > Please post the whole workflow and the whole log. > > > > > > > > > > > > > > > > > > > > > > On Sun, 2007-09-09 at 11:43 -0500, Jing Tie wrote: > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > I tried duplicate job again using the latest swift with > > > > > > > > > > > > "log4j.logger.org.globus.cog.abstraction=DEBUG ". > > > > > > > > > > > > > > > > > > > > > > > > It didn't generated duplicate-*** directory under > > > > > simple-wf-*** > > > > > > > > > > > > directory. 
Details: > > > > > > > > > > > > Resource > > > > > ( org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > > > > > > > > > > successfully released > > > > > > > > > > > > Task(type=4, identity=urn:0-0-1189348296362) setting > > > > > status to Failed > > > > > > > > > > > > Could not set current directory to "null" > > > > > > > > > > > > duplicate failed > > > > > > > > > > > > The following errors have occurred: > > > > > > > > > > > > 1. Could not initialize shared directory on > > > > > NWICG_NotreDame > > > > > > > > > > > > Caused by: > > > > > > > > > > > > Could not set current directory to "null" > > > > > > > > > > > > Caused by: > > > > > > > > > > > > Required argument missing > > > > > > > > > > > > > > > > > > > > > > > > in simple-wf-fc06kzz28d880.log : > > > > > > > > > > > > 2007-09-09 09:31:39,911 DEBUG FileResourceCache Resource > > > > > > > > > > > > > > > > > (org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl at cbc2d3) > > > > > > > > > > > > successfully released > > > > > > > > > > > > 2007-09-09 09:31:39,911 DEBUG TaskImpl Task(type=4, > > > > > > > > > > > > identity=urn:0-0-1189348296362) setting status to Failed > > > > > Could not set > > > > > > > > > > > > current directory to "null" > > > > > > > > > > > > 2007-09-09 09:31:39,938 INFO vdl:mains Errors detected. > > > > > Cleanup not done. > > > > > > > > > > > > 2007-09-09 09:31:39,963 DEBUG VDL2ExecutionContext > > > > > Execution completed > > > > > > > > > > > > with errors > > > > > > > > > > > > Execution completed with errors > > > > > > > > > > > > > > > > > > > > > > > > There is nothing under > > > > > > > > > > > > > > > > > osg.hpcc.nd.edu/dscratch/osg/data/osg/jtie/simple-wf-fc06kzz28d880 > > > > > > > > > > > > except empty shared directory. 
> > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan < hategan at mcs.anl.gov> wrote: > > > > > > > > > > > > > Nevermind that. It's eating the empty stdin argument. > > > > > This doesn't make > > > > > > > > > > > > > sense. Fork used to behave. > > > > > > > > > > > > > > > > > > > > > > > > > > Can you add this to log4j.properties, run again, and > > > > > post the log? > > > > > > > > > > > > > > > > > > > > > > > > > > log4j.logger.org.globus.cog.abstraction=DEBUG > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 23:02 -0500, Mihael Hategan > > > > > wrote: > > > > > > > > > > > > > > Can you post the workflow? > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 21:56 -0500, Jing Tie wrote: > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I checked the procedure, and found that the > > > > > filenames of the output > > > > > > > > > > > > > > > files of "myapp" are the same as "File f". So the > > > > > first option should > > > > > > > > > > > > > > > not be the problem. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Then I tried a very simple swift script that > > > > > doesn't need R: > > > > > > > > > > > > > > > app function: add some lines to the input file to > > > > > generate the output file > > > > > > > > > > > > > > > input file name: simpleFile.txt > > > > > > > > > > > > > > > output file name: simpleFile.output > > > > > > > > > > > > > > > application script location: > > > > > $OSG_APP/osg/jtie/duplicate.sh > > > > > > > > > > > > > > > jobmanager: jobmanager-fork > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > duplicate failed > > > > > > > > > > > > > > > The following errors have occurred: > > > > > > > > > > > > > > > 1. 
Application "duplicate" failed (Failed to link > > > > > input file > > > > > > > > > > > > > > > /dscratch/osg/app/osg/jtie/duplicate.sh) > > > > > > > > > > > > > > > Arguments: "simpleFile.txt" > > > > > > > > > > > > > > > Host: NWICG_NotreDame > > > > > > > > > > > > > > > Directory: > > > > > simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi > > > > > > > > > > > > > > > STDERR: > > > > > > > > > > > > > > > STDOUT: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > after execution, 3 empty directories (not files, > > > > > and also weird name) > > > > > > > > > > > > > > > were generated: > > > > > > > > > > > > > > > duplicate-gt7hyvgi-simpleFile.output > > > > > > > > > > > > > > > duplicate-ht7hyvgi-simpleFile.output > > > > > > > > > > > > > > > duplicate-it7hyvgi-simpleFile.output > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrapper.log: > > > > > > > > > > > > > > > DIR=duplicate-gt7hyvgi > > > > > > > > > > > > > > > STDOUT=simpleFile.output > > > > > > > > > > > > > > > STDERR=stderr.txt > > > > > > > > > > > > > > > DIRS=simpleFile.output > > > > > > > > > > > > > > > LINKS=/dscratch/osg/app/osg/jtie/duplicate.sh > > > > > > > > > > > > > > > OUTS=simpleFile.txt > > > > > > > > > > > > > > > ln: creating symbolic link > > > > > > > > > > > > > > > > > > > > `duplicate-gt7hyvgi//dscratch/osg/app/osg/jtie/duplicate.sh' to > > > > > > > > > > > > > > > > > > > > `/dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared//dscratch/osg/app/osg/jtie/duplicate.sh': > > > > > > > > > > > > > > > No such file or directory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > under > > > > > dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/shared: > > > > > > > > > > > > > > > seq.sh, simpleFile.txt, wrapper.sh > > > > > > > > > > > > > > > under > > > > > dir /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi: > > > > > > > > > > > > > > > one directory - simpleFile.output > > > > > > > > > > > > > > > > > > 
> > /dscratch/osg/data/osg/jtie/simple-wf-xgh9e9q1z0af2/duplicate-it7hyvgi/simpleFile.output: > > > > > > > > > > > > > > > empty > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the swift program is right, since I run it > > > > > successfully using > > > > > > > > > > > > > > > localhost duplicate.sh. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Many thanks, > > > > > > > > > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 9/7/07, Mihael Hategan < hategan at mcs.anl.gov> > > > > > wrote: > > > > > > > > > > > > > > > > So when the output files are not created, there > > > > > can be two reasons: > > > > > > > > > > > > > > > > 1. The specification of what files should be > > > > > created is broken. This is, > > > > > > > > > > > > > > > > at this time, done by looking at the filenames > > > > > of the return values from > > > > > > > > > > > > > > > > the atomic procedure. Normally one passes those > > > > > file names to the > > > > > > > > > > > > > > > > application as output file parameters. Example: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (File f) proc(...) { > > > > > > > > > > > > > > > > app { > > > > > > > > > > > > > > > > myapp ... "-o" @filename(f); > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. The specification is correct, but the > > > > > application doesn't behave. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 2007-09-07 at 12:35 -0500, Jing Tie > > > > > wrote: > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I tried jobmanger-fork instead of > > > > > jobmanager-condor on osg.hpcc.nd.edu site: > > > > > > > > > > > > > > > > > > > > > kickstart executable > > > > > > > > > > > > > > > > > (101-FBchannel18_cwt- avgResults.Rdata) not > > > > > found> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-fork: > > > > > > > > > > > > > > > > > ------------------------ > > > > > > > > > > > > > > > > > Application exception: The following output > > > > > files were not created by > > > > > > > > > > > > > > > > > the > > > > > application: /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > > osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/cwtsmall-0bhnnvgi > > > > > > > > > > > > > > > > > lrwxrwxrwx 1 osg osgusers 93 Sep 7 12:42 > > > > > > > > > > > > > > > > > 101-FBchannel10_cwt-avgResults.Rdata -> > > > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared/101-FBchannel10_cwt- avgResults.Rdata > > > > > > > > > > > > > > > > > ... 
(total 28 links, the same number as the > > > > > number of the expected output files) > > > > > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 12:42 > > > > > 101_FB- epochs.Rdata > > > > > > > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 12:42 > > > > > scripts > > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 58 Sep 7 12:42 > > > > > stderr.txt > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > > osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/shared > > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 > > > > > 12:41 101_FB- epochs.Rdata > > > > > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 > > > > > 12:41 scripts > > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 > > > > > 12:41 seq.sh > > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 > > > > > 12:41 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > empty kickstart directory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > > osg.hpcc.nd.edu/jobmanager /bin/cat > > > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-2g48a936yixu1/status/cwtsmall-0bhnnvgi-error > > > > > > > > > > > > > > > > > The following output files were not created by > > > > > the application: > > > > > > > > > > > > > > > > > /dscratch/osg/app/osg/jtie/SIDGrid/wavelet.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > jobmanager-condor: > > > > > > > > > > > > > > > > > ------------------------------- > > > > > > > > > > > > > > > > > Application exception: The following output > > > > > files were not created by > > > > > > > > > > > > > > > > > the application: > > > > > 101-FBchannel20_cwt-avgResults.Rdata > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > globus-job-run > > > > > osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/cwtsmall-cpteovgi > > > > > > > > > > > > > > > > > lrwxrwxrwx 1 osg osgusers 76 Sep 7 13:01 > > > > > 101_FB-epochs.Rdata -> > > > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared/101_FB- > > > > > epochs.Rdata > > > > > > > > > > > > > > > > > drwxr-xr-x 3 osg osgusers 4096 Sep 7 13:01 > > > > > scripts > > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 70 Sep 7 13:01 > > > > > stderr.txt > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > globus-job-run > > > > > osg.hpcc.nd.edu/jobmanager /bin/ls -al > > > > > > > > > > > > > > > > > > > > > > /dscratch/osg/data/osg/jtie/sid-wf1-3my5pn01t3ov0/shared > > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 4109752 Sep 7 > > > > > 13:00 101_FB- epochs.Rdata > > > > > > > > > > > > > > > > > drwxr-xr-x 2 osg osgusers 4096 Sep 7 > > > > > 13:00 scripts > > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 571 Sep 7 > > > > > 13:00 seq.sh > > > > > > > > > > > > > > > > > -rw-r--r-- 1 osg osgusers 3278 Sep 7 > > > > > 13:00 wrapper.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the descriptions of exception are all > > > > > right now. The > > > > > > > > > > > > > > > > > difference between fork and condor was that > > > > > fork created the output > > > > > > > > > > > > > > > > > links to the shared directory, but condor > > > > > didn't. But the essential > > > > > > > > > > > > > > > > > problem is the output files not being created. > > > > > I will do more > > > > > > > > > > > > > > > > > experiments to see whether the problem of file > > > > > system or application. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > Jing > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > > > > Swift-user mailing list > > > > > > > > > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > Swift-user mailing list > > > > > > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > Swift-user mailing list > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > From tiejing at gmail.com Mon Sep 10 16:42:39 2007 From: tiejing at gmail.com (Jing Tie) Date: Mon, 10 Sep 2007 16:42:39 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor Message-ID: Hi, I think there is a problem running swift script with jobmanager-condor on some OSG sites. I run simple-wf.dtm (very simple swift script to copy content of input file to output file) and SID script on GLOW site separately. Everything is great when running by jobmanager-fork, but "exception in getFile" happened with jobmanager-condor. The log from swift client is attached. 
However, no log, info, or output files were generated in the Swift work
cache, nor was any duplicate-*** directory, although the log file suggests
that the directory had been created.

The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run
globus-url-copy to copy files between OSG_DATA and OSG_WN_TMP.

Exception:
Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed
Exception in getFile
File transfer failed
duplicate failed
The following errors have occurred:
1. Application "duplicate" failed (No status file was found. Check the
shared filesystem on GLOW)
        Arguments: "simpleFile.txt"
        Host: GLOW
        Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi
        STDERR:
        STDOUT:

Thanks,
Jing
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple-wf-8fzfp19rn1in0.log
Type: application/octet-stream
Size: 77141 bytes
Desc: not available
URL:
From hategan at mcs.anl.gov Mon Sep 10 17:46:45 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 10 Sep 2007 17:46:45 -0500
Subject: [Swift-user] Success with fork, but exception in getFile with condor
In-Reply-To:
References:
Message-ID: <1189464405.22457.1.camel@blabla.mcs.anl.gov>

The wrapper produces exactly one status file: -success or -error.
If neither is present, then either the wrapper didn't write one (very
unlikely, and due to something odd that I'm missing), or GridFTP on the
head node doesn't see what the wrapper has written.

On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote:
> Hi,
>
> I think there is a problem running swift script with jobmanager-condor
> on some OSG sites. I run simple-wf.dtm (very simple swift script to
> copy content of input file to output file) and SID script on GLOW site
> separately. Everything is great when running by jobmanager-fork, but
> "exception in getFile" happened with jobmanager-condor. The log from
> swift client is attached.
However, no log/info/output files were > generated in the swift work cache, neither was any duplicate-*** > directory, though in the log file the directory seemed had been > created. > > The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run > globus-url-copy, copy files between OSG_DATA and OSG_WN_TMP. > > Exception: > Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed > Exception in getFile > File transfer failed > duplicate failed > The following errors have occurred: > 1. Application "duplicate" failed (No status file was found. Check the > shared filesystem on GLOW) > Arguments: "simpleFile.txt" > Host: GLOW > Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi > STDERR: > STDOUT: > > Thanks, > Jing > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From tiejing at gmail.com Tue Sep 11 00:09:07 2007 From: tiejing at gmail.com (Jing Tie) Date: Tue, 11 Sep 2007 00:09:07 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: <1189464405.22457.1.camel@blabla.mcs.anl.gov> References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> Message-ID: Hi, Thanks! Is it possible that the status file was generated in an unexpected directory? I run SID application on another site atlas.dpcc.uta.edu (jobmanager-pbs), and it succeed! But on site u2-grid.ccr.buffalo.edu (jobmanager-pbs), there was an execution error after task submitting: "FileResourceCache Maximum idle time exceeded. Removing resource for gsiftp://u2-grid.ccr.buffalo.edu". logs are attached (sid*.log --- u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu). Thanks, Jing On 9/10/07, Mihael Hategan wrote: > The wrapper produces exactly one status file: -success or > -error. 
If none is present it means that either the very unlikely > thing that the wrapper didn't write any of them, due to some weird thing > I'm missing, or that GridFTP on the head node doesn't see what the > wrapper has written. > > On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote: > > Hi, > > > > I think there is a problem running swift script with jobmanager-condor > > on some OSG sites. I run simple-wf.dtm (very simple swift script to > > copy content of input file to output file) and SID script on GLOW site > > separately. Everything is great when running by jobmanager-fork, but > > "exception in getFile" happened with jobmanager-condor. The log from > > swift client is attached. However, no log/info/output files were > > generated in the swift work cache, neither was any duplicate-*** > > directory, though in the log file the directory seemed had been > > created. > > > > The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run > > globus-url-copy, copy files between OSG_DATA and OSG_WN_TMP. > > > > Exception: > > Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed > > Exception in getFile > > File transfer failed > > duplicate failed > > The following errors have occurred: > > 1. Application "duplicate" failed (No status file was found. Check the > > shared filesystem on GLOW) > > Arguments: "simpleFile.txt" > > Host: GLOW > > Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi > > STDERR: > > STDOUT: > > > > Thanks, > > Jing > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > -------------- next part -------------- A non-text attachment was scrubbed... Name: sid-wf1-ryatce3d38vg1.log Type: application/octet-stream Size: 216808 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: simple-wf-8fzfp19rn1in0.log Type: application/octet-stream Size: 77141 bytes Desc: not available URL: From hategan at mcs.anl.gov Tue Sep 11 09:42:26 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Sep 2007 09:42:26 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> Message-ID: <1189521746.22539.3.camel@blabla.mcs.anl.gov> On Tue, 2007-09-11 at 00:09 -0500, Jing Tie wrote: > Hi, > > Thanks! Is it possible that the status file was generated in an > unexpected directory? Very unlikely. > > I run SID application on another site atlas.dpcc.uta.edu > (jobmanager-pbs), and it succeed! But on site u2-grid.ccr.buffalo.edu > (jobmanager-pbs), there was an execution error after task submitting: > "FileResourceCache Maximum idle time exceeded. Removing resource for > gsiftp://u2-grid.ccr.buffalo.edu". That's not an error. Idle GridFTP connections are removed from the cache after a while. Your log shows simply that nothing is happening. Mihael > logs are attached (sid*.log --- > u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu). > > Thanks, > Jing > > On 9/10/07, Mihael Hategan wrote: > > The wrapper produces exactly one status file: -success or > > -error. If none is present it means that either the very unlikely > > thing that the wrapper didn't write any of them, due to some weird thing > > I'm missing, or that GridFTP on the head node doesn't see what the > > wrapper has written. > > > > On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote: > > > Hi, > > > > > > I think there is a problem running swift script with jobmanager-condor > > > on some OSG sites. I run simple-wf.dtm (very simple swift script to > > > copy content of input file to output file) and SID script on GLOW site > > > separately. Everything is great when running by jobmanager-fork, but > > > "exception in getFile" happened with jobmanager-condor. 
The log from > > > swift client is attached. However, no log/info/output files were > > > generated in the swift work cache, neither was any duplicate-*** > > > directory, though in the log file the directory seemed had been > > > created. > > > > > > The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run > > > globus-url-copy, copy files between OSG_DATA and OSG_WN_TMP. > > > > > > Exception: > > > Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed > > > Exception in getFile > > > File transfer failed > > > duplicate failed > > > The following errors have occurred: > > > 1. Application "duplicate" failed (No status file was found. Check the > > > shared filesystem on GLOW) > > > Arguments: "simpleFile.txt" > > > Host: GLOW > > > Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi > > > STDERR: > > > STDOUT: > > > > > > Thanks, > > > Jing > > > _______________________________________________ > > > Swift-user mailing list > > > Swift-user at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > From hategan at mcs.anl.gov Wed Sep 12 15:09:12 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 12 Sep 2007 15:09:12 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: <46E84198.7050009@uiowa.edu> References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> Message-ID: <1189627752.10401.8.camel@blabla.mcs.anl.gov> On Wed, 2007-09-12 at 14:44 -0500, Anand Padmanabhan wrote: > Hi Michael, > > The OSG troubleshooting team has been working with Jing to identify and > correct the problems she is having when running on the OSG infrastructure. > > Looking at the logs Jing sent us on one of the OSG site (possibly few > more) she is getting the following information in the log: > ... > 2007-09-10 15:46:37,446 DEBUG vdl:execute2 Application exception: No > status file was found. 
Check the shared filesystem on GLOW
> ...
> 2007-09-10 15:46:37,498 DEBUG DelegatedFileTransferHandler File transfer
> with resource remote->tmp
> 2007-09-10 15:46:37,730 DEBUG DelegatedFileTransferHandler Exception in
> transfer
> org.globus.cog.abstraction.impl.file.FileResourceException: Exception in
> getFile
> ...
>
> I don't think I have a clear understanding of what this error means.
> Does this mean that there was an application error because it did not
> find the files it was expecting or do you think this some problem
> related with to the OSG infrastructure. If so, could you tell me what
> exactly swift was trying to do in at these steps when it failed.

Every application is run by a wrapper on the worker node. When the
application is done, the wrapper produces either an error file or a
success file. It should always produce exactly one of the two (which one
depends on whether the run was successful or not). This happens on the
worker node, and the files are assumed to be written to a shared file
system.

After the job is done, Swift, from the comfort of the submit host,
checks, through GridFTP, first whether the success file is there, and if
not, whether the error file is there. It finds neither, which means that
these files, although presumably written by the wrapper on the worker
node, cannot be seen on the head node through GridFTP.

So it looks to me like there might be something wrong with the file
system?

Mihael

>
> Thanks,
> Anand
>
> Mihael Hategan wrote:
> > On Tue, 2007-09-11 at 00:09 -0500, Jing Tie wrote:
> >> Hi,
> >>
> >> Thanks! Is it possible that the status file was generated in an
> >> unexpected directory?
> >
> > Very unlikely.
> > DelegatedFileTransferHandler Exception in transfer
> org.globus.cog.abstraction.impl.file.FileResourceException: Exception in
> getFile
> >> I run SID application on another site atlas.dpcc.uta.edu
> >> (jobmanager-pbs), and it succeed!
But on site u2-grid.ccr.buffalo.edu > >> (jobmanager-pbs), there was an execution error after task submitting: > >> "FileResourceCache Maximum idle time exceeded. Removing resource for > >> gsiftp://u2-grid.ccr.buffalo.edu". > > > > That's not an error. Idle GridFTP connections are removed from the cache > > after a while. Your log shows simply that nothing is happening. > > > > Mihael > > > >> logs are attached (sid*.log --- > >> u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu). > >> > >> Thanks, > >> Jing > >> > >> On 9/10/07, Mihael Hategan wrote: > >>> The wrapper produces exactly one status file: -success or > >>> -error. If none is present it means that either the very unlikely > >>> thing that the wrapper didn't write any of them, due to some weird thing > >>> I'm missing, or that GridFTP on the head node doesn't see what the > >>> wrapper has written. > >>> > >>> On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote: > >>>> Hi, > >>>> > >>>> I think there is a problem running swift script with jobmanager-condor > >>>> on some OSG sites. I run simple-wf.dtm (very simple swift script to > >>>> copy content of input file to output file) and SID script on GLOW site > >>>> separately. Everything is great when running by jobmanager-fork, but > >>>> "exception in getFile" happened with jobmanager-condor. The log from > >>>> swift client is attached. However, no log/info/output files were > >>>> generated in the swift work cache, neither was any duplicate-*** > >>>> directory, though in the log file the directory seemed had been > >>>> created. > >>>> > >>>> The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run > >>>> globus-url-copy, copy files between OSG_DATA and OSG_WN_TMP. > >>>> > >>>> Exception: > >>>> Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed > >>>> Exception in getFile > >>>> File transfer failed > >>>> duplicate failed > >>>> The following errors have occurred: > >>>> 1. 
Application "duplicate" failed (No status file was found. Check the
> >>>> shared filesystem on GLOW)
> >>>> Arguments: "simpleFile.txt"
> >>>> Host: GLOW
> >>>> Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi
> >>>> STDERR:
> >>>> STDOUT:
> >>>>
> >>>> Thanks,
> >>>> Jing
> >>>> _______________________________________________
> >>>> Swift-user mailing list
> >>>> Swift-user at ci.uchicago.edu
> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >>>
> >
>
From anand-padmanabhan-1 at uiowa.edu Wed Sep 12 14:44:24 2007
From: anand-padmanabhan-1 at uiowa.edu (Anand Padmanabhan)
Date: Wed, 12 Sep 2007 14:44:24 -0500
Subject: [Swift-user] Success with fork, but exception in getFile with condor
In-Reply-To: <1189521746.22539.3.camel@blabla.mcs.anl.gov>
References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov>
Message-ID: <46E84198.7050009@uiowa.edu>

Hi Michael,

The OSG troubleshooting team has been working with Jing to identify and
correct the problems she is having when running on the OSG infrastructure.

Looking at the logs Jing sent us, on one of the OSG sites (possibly a few
more) she is getting the following information in the log:

...
2007-09-10 15:46:37,446 DEBUG vdl:execute2 Application exception: No status file was found. Check the shared filesystem on GLOW
...
2007-09-10 15:46:37,498 DEBUG DelegatedFileTransferHandler File transfer with resource remote->tmp
2007-09-10 15:46:37,730 DEBUG DelegatedFileTransferHandler Exception in transfer
org.globus.cog.abstraction.impl.file.FileResourceException: Exception in getFile
...

I don't think I have a clear understanding of what this error means.
Does it mean that there was an application error because the application
did not find the files it was expecting, or do you think this is some
problem related to the OSG infrastructure? If so, could you tell me what
exactly Swift was trying to do at these steps when it failed?
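
The check that Mihael describes in his reply can be sketched as follows.
This is a minimal illustration in Python, not Swift's actual code. The
function name `check_job_status` and the `exists` callback (a stand-in for
an existence query made through GridFTP from the submit host) are
hypothetical, and the `status/<job>-success` and `status/<job>-error`
layout is an assumption inferred from the `status/cwtsmall-0bhnnvgi-error`
path quoted earlier in this archive.

```python
from pathlib import PurePosixPath


def check_job_status(run_dir, job_id, exists):
    """Classify a finished job by the status file its wrapper left behind.

    `exists` is a hypothetical callback that reports whether a remote path
    is visible from the head node (assumed here to stand in for a GridFTP
    lookup).
    """
    status_dir = PurePosixPath(run_dir) / "status"
    # The success file is looked for first, then the error file.
    if exists(str(status_dir / f"{job_id}-success")):
        return "success"
    if exists(str(status_dir / f"{job_id}-error")):
        return "error"
    # Neither file is visible from the head node.
    return "no status file"


if __name__ == "__main__":
    # Pretend only an error file is visible on the head node.
    visible = {"simple-wf-7l8vqstrkud90/status/duplicate-7niqt1hi-error"}
    print(check_job_status("simple-wf-7l8vqstrkud90",
                           "duplicate-7niqt1hi",
                           visible.__contains__))
```

Under these assumptions, the "No status file was found" message corresponds
to the last branch: the wrapper may well have written a file on the worker
node, but the lookup from the head node does not see it.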
Thanks, Anand Mihael Hategan wrote: > On Tue, 2007-09-11 at 00:09 -0500, Jing Tie wrote: >> Hi, >> >> Thanks! Is it possible that the status file was generated in an >> unexpected directory? > > Very unlikely. > DelegatedFileTransferHandler Exception in transfer org.globus.cog.abstraction.impl.file.FileResourceException: Exception in getFile >> I run SID application on another site atlas.dpcc.uta.edu >> (jobmanager-pbs), and it succeed! But on site u2-grid.ccr.buffalo.edu >> (jobmanager-pbs), there was an execution error after task submitting: >> "FileResourceCache Maximum idle time exceeded. Removing resource for >> gsiftp://u2-grid.ccr.buffalo.edu". > > That's not an error. Idle GridFTP connections are removed from the cache > after a while. Your log shows simply that nothing is happening. > > Mihael > >> logs are attached (sid*.log --- >> u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu). >> >> Thanks, >> Jing >> >> On 9/10/07, Mihael Hategan wrote: >>> The wrapper produces exactly one status file: -success or >>> -error. If none is present it means that either the very unlikely >>> thing that the wrapper didn't write any of them, due to some weird thing >>> I'm missing, or that GridFTP on the head node doesn't see what the >>> wrapper has written. >>> >>> On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote: >>>> Hi, >>>> >>>> I think there is a problem running swift script with jobmanager-condor >>>> on some OSG sites. I run simple-wf.dtm (very simple swift script to >>>> copy content of input file to output file) and SID script on GLOW site >>>> separately. Everything is great when running by jobmanager-fork, but >>>> "exception in getFile" happened with jobmanager-condor. The log from >>>> swift client is attached. However, no log/info/output files were >>>> generated in the swift work cache, neither was any duplicate-*** >>>> directory, though in the log file the directory seemed had been >>>> created. 
>>>> >>>> The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run >>>> globus-url-copy, copy files between OSG_DATA and OSG_WN_TMP. >>>> >>>> Exception: >>>> Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed >>>> Exception in getFile >>>> File transfer failed >>>> duplicate failed >>>> The following errors have occurred: >>>> 1. Application "duplicate" failed (No status file was found. Check the >>>> shared filesystem on GLOW) >>>> Arguments: "simpleFile.txt" >>>> Host: GLOW >>>> Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi >>>> STDERR: >>>> STDOUT: >>>> >>>> Thanks, >>>> Jing >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>> > From dmallen at mitre.org Thu Sep 13 13:36:36 2007 From: dmallen at mitre.org (Allen, M. David) Date: Thu, 13 Sep 2007 14:36:36 -0400 Subject: [Swift-user] Error: Attempted to close nonexistent channel buffers Message-ID: Hello, I'm just getting started with Swift, and trying to program a fairly trivial sample to get started. My swiftscript fails with the message: "Execution failed: grep started Attempted to close nonexistent channel buffers" Can anyone point me to the documentation that describes such errors? This is referring to a spot in my code that is executing a very vanilla grep operation. My input file is just 14 lines long, and this error consistently happens towards the end of the overall workflow execution. 
The code:

type blog {
    string name;
    string feedURL;
}

type file { }

(file headlines) getHeadlines(blog b) {
    app {
        feeder @b.feedURL stdout=@filename(headlines);
    }
}

(file results[]) processBlogs(blog blogs[]) {

    foreach blog el, index in blogs {
        results[index] = getHeadlines( el ) ;
    }
}

(string matches) findSingleMatch(file input, string searchTerm) {
    app {
        grep "-i" searchTerm @filename(input) stdout=@matches;
    }
}

(file matches) findMatches(file inputs[], string searchTerm) {
    string final;

    foreach input, index in inputs {
        string intermed = findSingleMatch(input, searchTerm);
        final = strcat(final, intermed);
    }

    matches = dumpString(final);
}

(int retVal) debug(string m) {
    app {
        echo m ;
    }
}

(file t) dumpString(string m) {
    app {
        echo m stdout=@filename(t);
    }
}

blog blogs[] ;
file output[] ;
file final ;

output = processBlogs(blogs);
final = findMatches(output, "ARG 0");

Any help would be greatly appreciated.
--
M. David Allen
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From hategan at mcs.anl.gov Thu Sep 13 15:22:17 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 13 Sep 2007 15:22:17 -0500
Subject: [Swift-user] Error: Attempted to close nonexistent channel buffers
In-Reply-To:
References:
Message-ID: <1189714937.29364.3.camel@blabla.mcs.anl.gov>

What version are you using?
That error shows a bug in the Swift implementation, and it should have
been fixed, at least in SVN.

Mihael

On Thu, 2007-09-13 at 14:36 -0400, Allen, M. David wrote:
> Hello,
>
> I'm just getting started with Swift, and trying to program a fairly
> trivial sample to get started.
>
> My swiftscript fails with the message:
> "Execution failed:
> grep started
> Attempted to close nonexistent channel buffers"
>
> Can anyone point me to the documentation that describes such errors?
> This is referring to a spot in my code that is executing a very
> vanilla grep operation.
My input file is just 14 lines long, and this > error consistently happens towards the end of the overall workflow > execution. > > The code: > > type blog { > string name; > string feedURL; > } > > type file { } > > (file headlines) getHeadlines(blog b) { > app { > feeder @b.feedURL stdout=@filename(headlines); > } > } > > (file results[]) processBlogs(blog blogs[]) { > > foreach blog el, index in blogs { > results[index] = getHeadlines( el ) ; > } > } > > (string matches) findSingleMatch(file input, string searchTerm) { > app { > grep "-i" searchTerm @filename(input) stdout=@matches; > } > } > > (file matches) findMatches(file inputs[], string searchTerm) { > string final; > > foreach input, index in inputs { > string intermed = findSingleMatch(input, searchTerm); > final = strcat(final, intermed); > } > > matches = dumpString(final); > } > > (int retVal) debug(string m) { > app { > echo m ; > } > } > > (file t) dumpString(string m) { > app { > echo m stdout=@filename(t); > } > } > > blog blogs[] header="true">; > file output[] prefix="output/blogHeadlines",suffix=".txt">; > file final ; > > output = processBlogs(blogs); > final = findMatches(output, "ARG 0"); > > Any help would be greatly appreciated. > -- > M. David Allen > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From dmallen at mitre.org Thu Sep 13 15:24:24 2007 From: dmallen at mitre.org (Allen, M. David) Date: Thu, 13 Sep 2007 16:24:24 -0400 Subject: [Swift-user] Error: Attempted to close nonexistent channelbuffers In-Reply-To: <1189714937.29364.3.camel@blabla.mcs.anl.gov> References: <1189714937.29364.3.camel@blabla.mcs.anl.gov> Message-ID: I'm using vdsk-0.2, downloaded yesterday. -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Thursday, September 13, 2007 4:22 PM To: Allen, M. 
David Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Error: Attempted to close nonexistent channelbuffers What version are you using? That error shows a bug in the Swift implementation, and it should have been fixed, at least in SVN. Mihael On Thu, 2007-09-13 at 14:36 -0400, Allen, M. David wrote: > Hello, > > I'm just getting started with Swift, and trying to program a fairly > trivial sample to get started. > > My swiftscript fails with the message: > "Execution failed: > grep started > Attempted to close nonexistent channel buffers" > > Can anyone point me to the documentation that describes such errors? > This is referring to a spot in my code that is executing a very > vanilla grep operation. My input file is just 14 lines long, and this > error consistently happens towards the end of the overall workflow > execution. > > The code: > > type blog { > string name; > string feedURL; > } > > type file { } > > (file headlines) getHeadlines(blog b) { > app { > feeder @b.feedURL stdout=@filename(headlines); > } > } > > (file results[]) processBlogs(blog blogs[]) { > > foreach blog el, index in blogs { > results[index] = getHeadlines( el ) ; > } > } > > (string matches) findSingleMatch(file input, string searchTerm) { > app { > grep "-i" searchTerm @filename(input) stdout=@matches; > } > } > > (file matches) findMatches(file inputs[], string searchTerm) { > string final; > > foreach input, index in inputs { > string intermed = findSingleMatch(input, searchTerm); > final = strcat(final, intermed); > } > > matches = dumpString(final); > } > > (int retVal) debug(string m) { > app { > echo m ; > } > } > > (file t) dumpString(string m) { > app { > echo m stdout=@filename(t); > } > } > > blog blogs[] header="true">; > file output[] prefix="output/blogHeadlines",suffix=".txt">; > file final ; > > output = processBlogs(blogs); > final = findMatches(output, "ARG 0"); > > Any help would be greatly appreciated. > -- > M. 
David Allen > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From nefedova at mcs.anl.gov Thu Sep 13 15:29:40 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Thu, 13 Sep 2007 15:29:40 -0500 Subject: [Swift-user] Error: Attempted to close nonexistent channel buffers In-Reply-To: <1189714937.29364.3.camel@blabla.mcs.anl.gov> References: <1189714937.29364.3.camel@blabla.mcs.anl.gov> Message-ID: actually, this error shows up when there is a mismatch between input/ output lists in declaration and in the function call. I've seen this error today - the number of parameters didn't match in declaration and actual function call. Once it was fixed, the error went away. Nika On Sep 13, 2007, at 3:22 PM, Mihael Hategan wrote: > What version are you using? > That error shows a bug in the Swift implementation, and it should have > been fixed, at least in SVN. > > Mihael > > On Thu, 2007-09-13 at 14:36 -0400, Allen, M. David wrote: >> Hello, >> >> I'm just getting started with Swift, and trying to program a fairly >> trivial sample to get started. >> >> My swiftscript fails with the message: >> "Execution failed: >> grep started >> Attempted to close nonexistent channel buffers" >> >> Can anyone point me to the documentation that describes such errors? >> This is referring to a spot in my code that is executing a very >> vanilla grep operation. My input file is just 14 lines long, and >> this >> error consistently happens towards the end of the overall workflow >> execution. 
>> >> The code: >> >> type blog { >> string name; >> string feedURL; >> } >> >> type file { } >> >> (file headlines) getHeadlines(blog b) { >> app { >> feeder @b.feedURL stdout=@filename(headlines); >> } >> } >> >> (file results[]) processBlogs(blog blogs[]) { >> >> foreach blog el, index in blogs { >> results[index] = getHeadlines( el ) ; >> } >> } >> >> (string matches) findSingleMatch(file input, string searchTerm) { >> app { >> grep "-i" searchTerm @filename(input) >> stdout=@matches; >> } >> } >> >> (file matches) findMatches(file inputs[], string searchTerm) { >> string final; >> >> foreach input, index in inputs { >> string intermed = findSingleMatch(input, searchTerm); >> final = strcat(final, intermed); >> } >> >> matches = dumpString(final); >> } >> >> (int retVal) debug(string m) { >> app { >> echo m ; >> } >> } >> >> (file t) dumpString(string m) { >> app { >> echo m stdout=@filename(t); >> } >> } >> >> blog blogs[] > header="true">; >> file output[] > prefix="output/blogHeadlines",suffix=".txt">; >> file final ; >> >> output = processBlogs(blogs); >> final = findMatches(output, "ARG 0"); >> >> Any help would be greatly appreciated. >> -- >> M. David Allen >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From hategan at mcs.anl.gov Thu Sep 13 15:32:40 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 13 Sep 2007 15:32:40 -0500 Subject: [Swift-user] Error: Attempted to close nonexistent channel buffers In-Reply-To: References: <1189714937.29364.3.camel@blabla.mcs.anl.gov> Message-ID: <1189715560.7979.0.camel@blabla.mcs.anl.gov> Hmm. Do you have a simple example that triggers it? 
On Thu, 2007-09-13 at 15:29 -0500, Veronika Nefedova wrote: > actually, this error shows up when there is a mismatch between input/ > output lists in declaration and in the function call. I've seen this > error today - the number of parameters didn't match in declaration > and actual function call. Once it was fixed, the error went away. > > Nika > > On Sep 13, 2007, at 3:22 PM, Mihael Hategan wrote: > > > What version are you using? > > That error shows a bug in the Swift implementation, and it should have > > been fixed, at least in SVN. > > > > Mihael > > > > On Thu, 2007-09-13 at 14:36 -0400, Allen, M. David wrote: > >> Hello, > >> > >> I'm just getting started with Swift, and trying to program a fairly > >> trivial sample to get started. > >> > >> My swiftscript fails with the message: > >> "Execution failed: > >> grep started > >> Attempted to close nonexistent channel buffers" > >> > >> Can anyone point me to the documentation that describes such errors? > >> This is referring to a spot in my code that is executing a very > >> vanilla grep operation. My input file is just 14 lines long, and > >> this > >> error consistently happens towards the end of the overall workflow > >> execution. 
> >> > >> The code: > >> > >> type blog { > >> string name; > >> string feedURL; > >> } > >> > >> type file { } > >> > >> (file headlines) getHeadlines(blog b) { > >> app { > >> feeder @b.feedURL stdout=@filename(headlines); > >> } > >> } > >> > >> (file results[]) processBlogs(blog blogs[]) { > >> > >> foreach blog el, index in blogs { > >> results[index] = getHeadlines( el ) ; > >> } > >> } > >> > >> (string matches) findSingleMatch(file input, string searchTerm) { > >> app { > >> grep "-i" searchTerm @filename(input) > >> stdout=@matches; > >> } > >> } > >> > >> (file matches) findMatches(file inputs[], string searchTerm) { > >> string final; > >> > >> foreach input, index in inputs { > >> string intermed = findSingleMatch(input, searchTerm); > >> final = strcat(final, intermed); > >> } > >> > >> matches = dumpString(final); > >> } > >> > >> (int retVal) debug(string m) { > >> app { > >> echo m ; > >> } > >> } > >> > >> (file t) dumpString(string m) { > >> app { > >> echo m stdout=@filename(t); > >> } > >> } > >> > >> blog blogs[] >> header="true">; > >> file output[] >> prefix="output/blogHeadlines",suffix=".txt">; > >> file final ; > >> > >> output = processBlogs(blogs); > >> final = findMatches(output, "ARG 0"); > >> > >> Any help would be greatly appreciated. > >> -- > >> M. 
David Allen
> >> _______________________________________________
> >> Swift-user mailing list
> >> Swift-user at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >
>

From benc at hawaga.org.uk Thu Sep 13 16:17:00 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 13 Sep 2007 21:17:00 +0000 (GMT)
Subject: [Swift-user] Error: Attempted to close nonexistent channel buffers
In-Reply-To: <1189715560.7979.0.camel@blabla.mcs.anl.gov>
References: <1189714937.29364.3.camel@blabla.mcs.anl.gov> <1189715560.7979.0.camel@blabla.mcs.anl.gov>
Message-ID:

I just tried two different kinds of argument mismatch with r1213 and I don't get this error - I get different ones:

i) with too many args in the function declaration / not enough in the invocation:

    type messagefile;

    (messagefile t) greeting(int monkeys) {
        app {
            echo "hello" stdout=@filename(t);
        }
    }

    messagefile outfile <"001-echo.out">;

    outfile = greeting();

gives:

    $ swift nec.swift
    Swift v0.2-dev r1213 (modified locally)

    RunID: pi3drupzxfh12
    Execution failed:
        Missing argument monkeys for sys:element(t, monkeys)

ii) and the other way round, with not enough args in the function declaration / too many in the invocation:

    type messagefile;

    (messagefile t) greeting() {
        app {
            echo "hello" stdout=@filename(t);
        }
    }

    messagefile outfile <"001-echo.out">;

    outfile = greeting("monkeys");

gives:

    $ swift nec.swift
    Swift v0.2-dev r1213 (modified locally)

    RunID: 7q7d6rygrl180
    Execution failed:
        Illegal extra argument to greeting @ nec.kml, line: 44

On Thu, 13 Sep 2007, Mihael Hategan wrote:

> Hmm. Do you have a simple example that triggers it?
>
> On Thu, 2007-09-13 at 15:29 -0500, Veronika Nefedova wrote:
> > actually, this error shows up when there is a mismatch between input/
> > output lists in declaration and in the function call.
I've seen this > > error today - the number of parameters didn't match in declaration > > and actual function call. Once it was fixed, the error went away. > > > > Nika > > > > On Sep 13, 2007, at 3:22 PM, Mihael Hategan wrote: > > > > > What version are you using? > > > That error shows a bug in the Swift implementation, and it should have > > > been fixed, at least in SVN. > > > > > > Mihael > > > > > > On Thu, 2007-09-13 at 14:36 -0400, Allen, M. David wrote: > > >> Hello, > > >> > > >> I'm just getting started with Swift, and trying to program a fairly > > >> trivial sample to get started. > > >> > > >> My swiftscript fails with the message: > > >> "Execution failed: > > >> grep started > > >> Attempted to close nonexistent channel buffers" > > >> > > >> Can anyone point me to the documentation that describes such errors? > > >> This is referring to a spot in my code that is executing a very > > >> vanilla grep operation. My input file is just 14 lines long, and > > >> this > > >> error consistently happens towards the end of the overall workflow > > >> execution. 
> > >> > > >> The code: > > >> > > >> type blog { > > >> string name; > > >> string feedURL; > > >> } > > >> > > >> type file { } > > >> > > >> (file headlines) getHeadlines(blog b) { > > >> app { > > >> feeder @b.feedURL stdout=@filename(headlines); > > >> } > > >> } > > >> > > >> (file results[]) processBlogs(blog blogs[]) { > > >> > > >> foreach blog el, index in blogs { > > >> results[index] = getHeadlines( el ) ; > > >> } > > >> } > > >> > > >> (string matches) findSingleMatch(file input, string searchTerm) { > > >> app { > > >> grep "-i" searchTerm @filename(input) > > >> stdout=@matches; > > >> } > > >> } > > >> > > >> (file matches) findMatches(file inputs[], string searchTerm) { > > >> string final; > > >> > > >> foreach input, index in inputs { > > >> string intermed = findSingleMatch(input, searchTerm); > > >> final = strcat(final, intermed); > > >> } > > >> > > >> matches = dumpString(final); > > >> } > > >> > > >> (int retVal) debug(string m) { > > >> app { > > >> echo m ; > > >> } > > >> } > > >> > > >> (file t) dumpString(string m) { > > >> app { > > >> echo m stdout=@filename(t); > > >> } > > >> } > > >> > > >> blog blogs[] > >> header="true">; > > >> file output[] > >> prefix="output/blogHeadlines",suffix=".txt">; > > >> file final ; > > >> > > >> output = processBlogs(blogs); > > >> final = findMatches(output, "ARG 0"); > > >> > > >> Any help would be greatly appreciated. > > >> -- > > >> M. 
David Allen
> > >> _______________________________________________
> > >> Swift-user mailing list
> > >> Swift-user at ci.uchicago.edu
> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> > >
> > > _______________________________________________
> > > Swift-user mailing list
> > > Swift-user at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> > >
> >
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>
>

From benc at hawaga.org.uk Thu Sep 13 16:18:45 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 13 Sep 2007 21:18:45 +0000 (GMT)
Subject: [Swift-user] Error: Attempted to close nonexistent channel buffers
In-Reply-To:
References: <1189714937.29364.3.camel@blabla.mcs.anl.gov>
Message-ID:

On Thu, 13 Sep 2007, Veronika Nefedova wrote:

> I've seen this error today -

when you say 'today', do you mean with code executed today or code >= r1207?

--

From benc at hawaga.org.uk Thu Sep 13 16:38:36 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 13 Sep 2007 21:38:36 +0000 (GMT)
Subject: [Swift-user] Error: Attempted to close nonexistent channel buffers
In-Reply-To:
References:
Message-ID:

On Thu, 13 Sep 2007, Allen, M. David wrote:

> (file headlines) getHeadlines(blog b) {
>     app {
>         feeder @b.feedURL stdout=@filename(headlines);
>     }
> }

Probably unrelated, but b.feedURL is a string, not a file, so you should probably say "b.feedURL" rather than "@b.feedURL" - that will pass the feedURL string as a parameter to the feeder program. @foo means the same as @filename(foo), which only makes sense when foo is a file.

> (string matches) findSingleMatch(file input, string searchTerm) {
>     app {
>         grep "-i" searchTerm @filename(input) stdout=@matches;
>     }
> }

More of a problem is here: data in Swift is either 'in memory' - like ints or strings or floats - or 'in files'. You can't really mix the two. matches is declared as a string, but the stdout=@matches line treats it as a file, and then later on you treat it as an 'in memory' value again. That is not allowed.

> (file matches) findMatches(file inputs[], string searchTerm) {
>     string final;
>
>     foreach input, index in inputs {
>         string intermed = findSingleMatch(input, searchTerm);
>         final = strcat(final, intermed);
>     }
>
>     matches = dumpString(final);
> }

Another problem here is the way you are trying to assemble the values into a final collection. Here's a program which does something similar, but using only files rather than SwiftScript in-memory values:

    type messagefile;

    (messagefile t) greeting(string s) {
        app {
            echo s stdout=@filename(t);
        }
    }

    (messagefile o) join(messagefile s[]) {
        app {
            cat @filenames(s) stdout=@o;
        }
    }

    messagefile outfile <"001-echo.out">;
    messagefile qqqfile <"b001-echo.out">;

    outfile = greeting("monkeys");
    qqqfile = greeting("monkeys22");

    messagefile everything <"output.txt">;

    everything = join([outfile, qqqfile]);

--

From nefedova at mcs.anl.gov Thu Sep 13 17:14:51 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Thu, 13 Sep 2007 17:14:51 -0500
Subject: [Swift-user] Error: Attempted to close nonexistent channel buffers
In-Reply-To: <1189715560.7979.0.camel@blabla.mcs.anl.gov>
References: <1189714937.29364.3.camel@blabla.mcs.anl.gov> <1189715560.7979.0.camel@blabla.mcs.anl.gov>
Message-ID: <3C831E5D-A2C2-47CE-BE6B-853594470BE4@mcs.anl.gov>

No, I do not have an example since it's been fixed. I think the function definition had one output parameter, as in

    (string foo) BLA (file f1)

while the function was called with like 5 output files:

    (f1,f2,f3,f4,f5) = BLA(fff);

Nika

On Sep 13, 2007, at 3:32 PM, Mihael Hategan wrote:

> Hmm. Do you have a simple example that triggers it?
>
> On Thu, 2007-09-13 at 15:29 -0500, Veronika Nefedova wrote:
>> actually, this error shows up when there is a mismatch between input/
>> output lists in declaration and in the function call.
I've seen this >> error today - the number of parameters didn't match in declaration >> and actual function call. Once it was fixed, the error went away. >> >> Nika >> >> On Sep 13, 2007, at 3:22 PM, Mihael Hategan wrote: >> >>> What version are you using? >>> That error shows a bug in the Swift implementation, and it should >>> have >>> been fixed, at least in SVN. >>> >>> Mihael >>> >>> On Thu, 2007-09-13 at 14:36 -0400, Allen, M. David wrote: >>>> Hello, >>>> >>>> I'm just getting started with Swift, and trying to program a fairly >>>> trivial sample to get started. >>>> >>>> My swiftscript fails with the message: >>>> "Execution failed: >>>> grep started >>>> Attempted to close nonexistent channel buffers" >>>> >>>> Can anyone point me to the documentation that describes such >>>> errors? >>>> This is referring to a spot in my code that is executing a very >>>> vanilla grep operation. My input file is just 14 lines long, and >>>> this >>>> error consistently happens towards the end of the overall workflow >>>> execution. 
>>>> >>>> The code: >>>> >>>> type blog { >>>> string name; >>>> string feedURL; >>>> } >>>> >>>> type file { } >>>> >>>> (file headlines) getHeadlines(blog b) { >>>> app { >>>> feeder @b.feedURL stdout=@filename(headlines); >>>> } >>>> } >>>> >>>> (file results[]) processBlogs(blog blogs[]) { >>>> >>>> foreach blog el, index in blogs { >>>> results[index] = getHeadlines( el ) ; >>>> } >>>> } >>>> >>>> (string matches) findSingleMatch(file input, string searchTerm) { >>>> app { >>>> grep "-i" searchTerm @filename(input) >>>> stdout=@matches; >>>> } >>>> } >>>> >>>> (file matches) findMatches(file inputs[], string searchTerm) { >>>> string final; >>>> >>>> foreach input, index in inputs { >>>> string intermed = findSingleMatch(input, >>>> searchTerm); >>>> final = strcat(final, intermed); >>>> } >>>> >>>> matches = dumpString(final); >>>> } >>>> >>>> (int retVal) debug(string m) { >>>> app { >>>> echo m ; >>>> } >>>> } >>>> >>>> (file t) dumpString(string m) { >>>> app { >>>> echo m stdout=@filename(t); >>>> } >>>> } >>>> >>>> blog blogs[] >>> header="true">; >>>> file output[] >>> prefix="output/blogHeadlines",suffix=".txt">; >>>> file final ; >>>> >>>> output = processBlogs(blogs); >>>> final = findMatches(output, "ARG 0"); >>>> >>>> Any help would be greatly appreciated. >>>> -- >>>> M. 
David Allen >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>> >> > From hategan at mcs.anl.gov Thu Sep 13 19:31:47 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 13 Sep 2007 19:31:47 -0500 Subject: [Swift-user] Error: Attempted to close nonexistent channelbuffers In-Reply-To: References: <1189714937.29364.3.camel@blabla.mcs.anl.gov> Message-ID: <1189729907.19790.3.camel@blabla.mcs.anl.gov> What Ben said applies. You should not see this error if the workflow is otherwise correct. You can also try one of the latest nightly builds. Mihael On Thu, 2007-09-13 at 16:24 -0400, Allen, M. David wrote: > I'm using vdsk-0.2, downloaded yesterday. > > > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Thursday, September 13, 2007 4:22 PM > To: Allen, M. David > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Error: Attempted to close nonexistent > channelbuffers > > What version are you using? > That error shows a bug in the Swift implementation, and it should have > been fixed, at least in SVN. > > Mihael > > On Thu, 2007-09-13 at 14:36 -0400, Allen, M. David wrote: > > Hello, > > > > I'm just getting started with Swift, and trying to program a fairly > > trivial sample to get started. > > > > My swiftscript fails with the message: > > "Execution failed: > > grep started > > Attempted to close nonexistent channel buffers" > > > > Can anyone point me to the documentation that describes such errors? > > This is referring to a spot in my code that is executing a very > > vanilla grep operation. 
My input file is just 14 lines long, and > this > > error consistently happens towards the end of the overall workflow > > execution. > > > > The code: > > > > type blog { > > string name; > > string feedURL; > > } > > > > type file { } > > > > (file headlines) getHeadlines(blog b) { > > app { > > feeder @b.feedURL stdout=@filename(headlines); > > } > > } > > > > (file results[]) processBlogs(blog blogs[]) { > > > > foreach blog el, index in blogs { > > results[index] = getHeadlines( el ) ; > > } > > } > > > > (string matches) findSingleMatch(file input, string searchTerm) { > > app { > > grep "-i" searchTerm @filename(input) > stdout=@matches; > > } > > } > > > > (file matches) findMatches(file inputs[], string searchTerm) { > > string final; > > > > foreach input, index in inputs { > > string intermed = findSingleMatch(input, searchTerm); > > final = strcat(final, intermed); > > } > > > > matches = dumpString(final); > > } > > > > (int retVal) debug(string m) { > > app { > > echo m ; > > } > > } > > > > (file t) dumpString(string m) { > > app { > > echo m stdout=@filename(t); > > } > > } > > > > blog blogs[] > header="true">; > > file output[] > prefix="output/blogHeadlines",suffix=".txt">; > > file final ; > > > > output = processBlogs(blogs); > > final = findMatches(output, "ARG 0"); > > > > Any help would be greatly appreciated. > > -- > > M. 
David Allen > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From anand-padmanabhan-1 at uiowa.edu Thu Sep 13 21:58:13 2007 From: anand-padmanabhan-1 at uiowa.edu (Anand Padmanabhan) Date: Thu, 13 Sep 2007 21:58:13 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: <1189627752.10401.8.camel@blabla.mcs.anl.gov> References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> Message-ID: <46E9F8C5.8070709@uiowa.edu> > Every application is run by a wrapper on the worker node. When the > application is done, the wrapper produces either an error file or a > success file. It should always produce exactly one of the two (which one > depends on whether the run was successful or not). This is on the worker > node, and is assumed to be happening on a share file system. Thanks for your clarification. The shared file system requirement might be problem at some sites, but Jing is seeing errors even on sites with shared file system, so I will try to ignore that for the moment. > > After the job is done, Swift, from the comfort of the submit host, > checks, through GridFTP, first whether the success file is there, and if > not whether the error file is there. It finds none, which means that > these files, although presumably written by the wrapper on the worker > node, cannot be seen on the head node through GridFTP. > > So it looks to me like there might be something wrong with the file > system? Is there some logs that the Swift/application write on the server side, that might record if it had some problem writing these output/error files. Also I know some condor systems, job executables get dumped a temporary directory on a worker node's local file system. Would this have any effect on Swift? 
I will try to setup a debug session with one of the site admins, so we can trace in real time, what exactly is happening at the WNs. Thanks Anand > Mihael > >> Thanks, >> Anand >> >> Mihael Hategan wrote: >>> On Tue, 2007-09-11 at 00:09 -0500, Jing Tie wrote: >>>> Hi, >>>> >>>> Thanks! Is it possible that the status file was generated in an >>>> unexpected directory? >>> Very unlikely. >>> DelegatedFileTransferHandler Exception in transfer >> org.globus.cog.abstraction.impl.file.FileResourceException: Exception in >> getFile >>>> I run SID application on another site atlas.dpcc.uta.edu >>>> (jobmanager-pbs), and it succeed! But on site u2-grid.ccr.buffalo.edu >>>> (jobmanager-pbs), there was an execution error after task submitting: >>>> "FileResourceCache Maximum idle time exceeded. Removing resource for >>>> gsiftp://u2-grid.ccr.buffalo.edu". >>> That's not an error. Idle GridFTP connections are removed from the cache >>> after a while. Your log shows simply that nothing is happening. >>> >>> Mihael >>> >>>> logs are attached (sid*.log --- >>>> u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu). >>>> >>>> Thanks, >>>> Jing >>>> >>>> On 9/10/07, Mihael Hategan wrote: >>>>> The wrapper produces exactly one status file: -success or >>>>> -error. If none is present it means that either the very unlikely >>>>> thing that the wrapper didn't write any of them, due to some weird thing >>>>> I'm missing, or that GridFTP on the head node doesn't see what the >>>>> wrapper has written. >>>>> >>>>> On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote: >>>>>> Hi, >>>>>> >>>>>> I think there is a problem running swift script with jobmanager-condor >>>>>> on some OSG sites. I run simple-wf.dtm (very simple swift script to >>>>>> copy content of input file to output file) and SID script on GLOW site >>>>>> separately. Everything is great when running by jobmanager-fork, but >>>>>> "exception in getFile" happened with jobmanager-condor. 
The log from >>>>>> swift client is attached. However, no log/info/output files were >>>>>> generated in the swift work cache, neither was any duplicate-*** >>>>>> directory, though in the log file the directory seemed had been >>>>>> created. >>>>>> >>>>>> The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run >>>>>> globus-url-copy, copy files between OSG_DATA and OSG_WN_TMP. >>>>>> >>>>>> Exception: >>>>>> Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed >>>>>> Exception in getFile >>>>>> File transfer failed >>>>>> duplicate failed >>>>>> The following errors have occurred: >>>>>> 1. Application "duplicate" failed (No status file was found. Check the >>>>>> shared filesystem on GLOW) >>>>>> Arguments: "simpleFile.txt" >>>>>> Host: GLOW >>>>>> Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi >>>>>> STDERR: >>>>>> STDOUT: >>>>>> >>>>>> Thanks, >>>>>> Jing >>>>>> _______________________________________________ >>>>>> Swift-user mailing list >>>>>> Swift-user at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From hategan at mcs.anl.gov Thu Sep 13 22:05:22 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 13 Sep 2007 22:05:22 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: <46E9F8C5.8070709@uiowa.edu> References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> <46E9F8C5.8070709@uiowa.edu> Message-ID: <1189739122.15474.3.camel@blabla.mcs.anl.gov> On Thu, 2007-09-13 at 21:58 -0500, Anand Padmanabhan wrote: > > > Every application is run by a wrapper on the worker node. When the > > application is done, the wrapper produces either an error file or a > > success file. It should always produce exactly one of the two (which one > > depends on whether the run was successful or not). 
This is on the worker > > node, and is assumed to be happening on a share file system. > Thanks for your clarification. The shared file system requirement might > be problem at some sites, but Jing is seeing errors even on sites with > shared file system, so I will try to ignore that for the moment. Although the fact that there is a shared file system does not necessarily mean it works properly. > > > > > After the job is done, Swift, from the comfort of the submit host, > > checks, through GridFTP, first whether the success file is there, and if > > not whether the error file is there. It finds none, which means that > > these files, although presumably written by the wrapper on the worker > > node, cannot be seen on the head node through GridFTP. > > > > So it looks to me like there might be something wrong with the file > > system? > Is there some logs that the Swift/application write on the server side, > that might record if it had some problem writing these output/error > files. Yes. Jing can help you with finding these. Basically they are /info/-info > Also I know some condor systems, job executables get dumped a > temporary directory on a worker node's local file system. Would this > have any effect on Swift? As long as Condor/the job manager honor the directory rls setting, this shouldn't make any difference. > > I will try to setup a debug session with one of the site admins, so we > can trace in real time, what exactly is happening at the WNs. That could help. Mihael > > Thanks > Anand > > Mihael > > > >> Thanks, > >> Anand > >> > >> Mihael Hategan wrote: > >>> On Tue, 2007-09-11 at 00:09 -0500, Jing Tie wrote: > >>>> Hi, > >>>> > >>>> Thanks! Is it possible that the status file was generated in an > >>>> unexpected directory? > >>> Very unlikely. 
> >>> DelegatedFileTransferHandler Exception in transfer > >> org.globus.cog.abstraction.impl.file.FileResourceException: Exception in > >> getFile > >>>> I run SID application on another site atlas.dpcc.uta.edu > >>>> (jobmanager-pbs), and it succeed! But on site u2-grid.ccr.buffalo.edu > >>>> (jobmanager-pbs), there was an execution error after task submitting: > >>>> "FileResourceCache Maximum idle time exceeded. Removing resource for > >>>> gsiftp://u2-grid.ccr.buffalo.edu". > >>> That's not an error. Idle GridFTP connections are removed from the cache > >>> after a while. Your log shows simply that nothing is happening. > >>> > >>> Mihael > >>> > >>>> logs are attached (sid*.log --- > >>>> u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu). > >>>> > >>>> Thanks, > >>>> Jing > >>>> > >>>> On 9/10/07, Mihael Hategan wrote: > >>>>> The wrapper produces exactly one status file: -success or > >>>>> -error. If none is present it means that either the very unlikely > >>>>> thing that the wrapper didn't write any of them, due to some weird thing > >>>>> I'm missing, or that GridFTP on the head node doesn't see what the > >>>>> wrapper has written. > >>>>> > >>>>> On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote: > >>>>>> Hi, > >>>>>> > >>>>>> I think there is a problem running swift script with jobmanager-condor > >>>>>> on some OSG sites. I run simple-wf.dtm (very simple swift script to > >>>>>> copy content of input file to output file) and SID script on GLOW site > >>>>>> separately. Everything is great when running by jobmanager-fork, but > >>>>>> "exception in getFile" happened with jobmanager-condor. The log from > >>>>>> swift client is attached. However, no log/info/output files were > >>>>>> generated in the swift work cache, neither was any duplicate-*** > >>>>>> directory, though in the log file the directory seemed had been > >>>>>> created. 
> >>>>>>
> >>>>>> The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run
> >>>>>> globus-url-copy, copy files between OSG_DATA and OSG_WN_TMP.
> >>>>>>
> >>>>>> Exception:
> >>>>>> Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed
> >>>>>> Exception in getFile
> >>>>>> File transfer failed
> >>>>>> duplicate failed
> >>>>>> The following errors have occurred:
> >>>>>> 1. Application "duplicate" failed (No status file was found. Check the
> >>>>>> shared filesystem on GLOW)
> >>>>>> Arguments: "simpleFile.txt"
> >>>>>> Host: GLOW
> >>>>>> Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi
> >>>>>> STDERR:
> >>>>>> STDOUT:
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jing
> >>>>>> _______________________________________________
> >>>>>> Swift-user mailing list
> >>>>>> Swift-user at ci.uchicago.edu
> >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >
>

From dmallen at mitre.org Fri Sep 14 07:16:41 2007
From: dmallen at mitre.org (Allen, M. David)
Date: Fri, 14 Sep 2007 08:16:41 -0400
Subject: [Swift-user] Virtual data schema / catalog?
Message-ID:

Hello,

I'm just trying to find my way through the background on this software, and thought I'd see if anyone could point me in the right direction.

I first came to swift through reading the original Chimera paper from 2002. The ability to distribute jobs across different machines is less interesting to me than the idea of tracking provenance for complex creations.

Is there a way that I can get this kind of structured provenance information out of swift, or a related tool? Ideally, it would be structured similarly to the virtual data schema I saw in the 2002 Chimera paper.

I have taken a look at the KML & XML files, and they seem to more or less fit the bill. (The KML seems to have the derivations - invocation-specific information that's more or less a direct translation of the swiftscript - and the XML file seems to have the generic procedure metadata, the transformations.)

Are there any tools available that can manipulate this information for other purposes (or insert it into a relational database with a proper schema)?

If I'm overlooking some documentation, please provide a pointer. I'm more than happy to RTFM. :)

Thanks
--
M. David Allen

From benc at hawaga.org.uk Fri Sep 14 08:21:02 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 14 Sep 2007 13:21:02 +0000 (GMT)
Subject: [Swift-user] Virtual data schema / catalog?
In-Reply-To:
References:
Message-ID:

There's a long-term plan to have something like a VDC but no implementation. Tibi has worked on an experiment management database for one specific project - maybe he will comment on how that overlaps here.

The XML and KML files are roughly analogous to VDS1's XML form of VDL and to Condor DAGMan graphs (respectively).

The XML files are more or less the same description as the SwiftScript .swift files, but in an XML form. The KML files are lower-level, describing various things that need to happen inside the execution environment itself.

The intention (or at least my intention) was/is that the XML would be stuff that would go into a VDC, with the KML files not being standardised/shareable.

In addition, there's a utility called kickstart which will generate invocation information for runs on the actual sites that the jobs run on. Swift can be configured to always run that and bring the files back to the submitting system. kickstart was also present in VDS1 - it's the same executable for both. The invocation records from this are also in XML form.
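
As a rough illustration of how XML records like these could be inspected without any database at all, here is a minimal sketch using only the limited XPath subset in Python's standard library. The element and attribute names (invocation, job, exitcode, host, transformation) are made up for the example; they are assumptions, not the real kickstart or SwiftScript XML schema, so the queries would need adjusting against actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical invocation records. The real kickstart schema differs;
# these element and attribute names are invented for illustration only.
SAMPLE_RECORDS = [
    '<invocation transformation="getHeadlines" host="siteA"><job exitcode="0"/></invocation>',
    '<invocation transformation="grep" host="siteB"><job exitcode="1"/></invocation>',
    '<invocation transformation="grep" host="siteA"><job exitcode="0"/></invocation>',
]


def load_records(xml_strings):
    """Parse each record and collect them under one synthetic root element."""
    root = ET.Element("records")
    for text in xml_strings:
        root.append(ET.fromstring(text))
    return root


def failed_invocations(root):
    """Return (transformation, host) for every record with a non-zero exit code."""
    failed = []
    for inv in root.findall("./invocation"):
        job = inv.find("./job")
        if job is not None and job.get("exitcode") != "0":
            failed.append((inv.get("transformation"), inv.get("host")))
    return failed


def invocations_of(root, transformation):
    """Select records by attribute value using an XPath predicate."""
    return root.findall(f"./invocation[@transformation='{transformation}']")


records = load_records(SAMPLE_RECORDS)
print(failed_invocations(records))
print(len(invocations_of(records, "grep")))
```

In practice the strings in SAMPLE_RECORDS would be replaced by the contents of the files brought back to the submitting system; a dedicated XML database only becomes worthwhile once the queries outgrow what this XPath subset can express.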
So the short answer is: it's easy to get a bunch of XML descriptions of both the high-level workflow and actual invocations dumped into files in a directory. We haven't got anything in Swift that will do anything with those files. If I were hacking around with this, then given the XML nature of the invocation records and the XML intermediate form of SwiftScript, I'd be inclined to make something that would import them all into an XML database like Xindice and then play around making XPath queries against that. -- From dmallen at mitre.org Fri Sep 14 08:39:17 2007 From: dmallen at mitre.org (Allen, M. David) Date: Fri, 14 Sep 2007 09:39:17 -0400 Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: Thanks for the information. I think something like Xindice might be a decent starting point. The XML rather than the KML is more interesting to me. The paper already has a decent starting point for a relational schema to store this kind of information. Figuring out the XPath to massage the information from a hierarchical form into that schema would be a pain, but it's doable. This may be something that I'll explore. If I'm able to pull something together, I'll post the results. Being able to look through a provenance chain would be good because it could allow selective regeneration of data sets. That is, in some cases I don't want to run the entire workflow; I just want software that will figure out which datasets in a workflow are missing, and then recompute only those pieces (and any others that depend on them). Thanks again -- David -----Original Message----- From: Ben Clifford [mailto:benc at hawaga.org.uk] Sent: Friday, September 14, 2007 9:21 AM To: Allen, M. David Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Virtual data schema / catalog? There's a long term plan to have something like a VDC but no implementation. 
Tibi has worked on an experiment management database for one specific project - maybe he will comment on how that overlaps here. The origins of the XML and KML files are roughly analogous to VDS1's XML form of VDL and to Condor DAGman graphs (respectively). The XML files are more or less the same description as the SwiftScript .swift files, but in an XML form. The KML files are lower-level, describing various things that need to happen inside the execution environemnt itself. The intention (or at least my intention) was/is that the XML would be stuff that would go into a VDC, with the KML files not being standardised/shareable. In addition, there's a utility called kickstart which will generate invocation information for runs on the actual sites that the jobs run on. Swift can be configured to always run that and bring the files back to the submitting system. kickstart was also present in VDS1 - its the same executable for both. The invocation records from this are also in XML form. So the short answer is: its easy to get a bunch of XML descriptions of both the high level workflow and actual invocations dumped into files in a directory. We haven't got anything in Swift that will do anything with those files. If I was hacking round with this, then given the XML nature of the invocation records and the XML intermediate for of SwiftScript, I'd be inclined to make something that would import them all into an XML database like Xindice and then play around making XPath queries against that. -- From benc at hawaga.org.uk Fri Sep 14 08:46:35 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 14 Sep 2007 13:46:35 +0000 (GMT) Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: > Being able to look through a provenance chain would be good because it > could allow selective regeneration of data sets. I.e. 
in some cases I > don't want to run the entire workflow, I just want software that will > figure out which datasets in a workflow are missing, and then recompute > only those pieces (and any others that depend on them). Also related to this, then: Swift (or rather the Karajan workflow engine underneath it) has this concept of restart logs. These are implemented at a lower level than the XML. Briefly, if you have a KML file, you can run part of it, have a failure, let the system abort and write out a restart log; and then you can run again using the restart log to ignore work already done. Because this happens at the KML level, I suspect it's not something that you can really put in a database and come back to with a different (version of the) workflow - it's more intended for "I set this day-long workflow running; it died overnight; tomorrow I will restart it". There's a brief section on this in the tutorial, at http://www.ci.uchicago.edu/swift/guides/tutorial.php#id2860757 "16. Starting and restarting" -- From benc at hawaga.org.uk Fri Sep 14 08:52:08 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 14 Sep 2007 13:52:08 +0000 (GMT) Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: Also, if you're going to work with the XML intermediate form - make sure you use a recent Swift nightly build, not the v0.2 release. There was a substantial rewrite of the XML schema between v0.2 and now. The schema is in the source tree (which you can get from SVN) or here: http://www.ci.uchicago.edu/trac/swift/browser/trunk/resources/swiftscript.xsd -- From dmallen at mitre.org Fri Sep 14 09:00:26 2007 From: dmallen at mitre.org (Allen, M. David) Date: Fri, 14 Sep 2007 10:00:26 -0400 Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: I noticed the stop/restart feature when I read the user guide, and it's something I'm very interested in. 
One of the things on my to-do list is to think about developing a Perl script that can allow workflows to be intentionally stopped to allow human intervention. In other words, right now I can use Swift to chain a bunch of programs together, but what if I want to take a work product, assign it to a human to do something with it, and then use the output of the human's effort to act as the input to the next step? One way might be to intentionally stop the workflow at the point where the human should take the input. (Alternatively, to let it fail because the human work product doesn't exist yet.) The human would then go off and do whatever they're supposed to do, and subsequently upload the result to some website. The act of uploading that result would then restart the workflow to continue processing with the human's result as an intermediate work product. Voilà. Humans can be tasked to do arbitrarily complex things that computers can't do, just like they were an invocation of "grep". :) On a separate note (the XML schema): I will be using some of the nightly builds. Is the XML schema fairly stable? How often does it change? -- David -----Original Message----- From: Ben Clifford [mailto:benc at hawaga.org.uk] Sent: Friday, September 14, 2007 9:47 AM To: Allen, M. David Cc: swift-user at ci.uchicago.edu Subject: RE: [Swift-user] Virtual data schema / catalog? > Being able to look through a provenance chain would be good because it > could allow selective regeneration of data sets. I.e. in some cases I > don't want to run the entire workflow, I just want software that will > figure out which datasets in a workflow are missing, and then recompute > only those pieces (and any others that depend on them). Also related to this, then: Swift (or rather the Karajan workflow engine underneath it) has this concept of restart logs. These are implemented at a lower level than the XML. 
Briefly, if you have a KML file, you can run part of it, have a failure, let the system abort and write out a restart log; and then you can run again using the restart log to ignore work already done. Because this happens at the KML level, its suspect its not something that you can really put in a database and come back to with a different (version of the) workflow - its more intended for "i set this day long workflow running; it died overnight; tomorrow I will restart it". There's a brief section on this in the tutorial, at http://www.ci.uchicago.edu/swift/guides/tutorial.php#id2860757 "16. Starting and restarting" -- From nefedova at mcs.anl.gov Fri Sep 14 09:07:31 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Fri, 14 Sep 2007 09:07:31 -0500 Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: <986DBC46-85CA-42D6-AC0C-A58C50E5F549@mcs.anl.gov> I am not sure if swift has such features at the present (Ben or Mihael should comment on that), but a very simple workaround would be to have 2 independent workflows. When the first one finishes, a researcher would work with the data, upload it wherever it should go, etc, and then with just one extra push of a button the second part of the workflow could be started. Of course this won't work if you need to stop the workflow each time at different places, but if you need to stop it after one particular step (every time the same) - the above solution would work. Nika On Sep 14, 2007, at 9:00 AM, Allen, M. David wrote: > I noticed the stop/restart feature when I read the user guide, and > it's > something I'm very interested in. One of the things on my to do list > is to think about developing a perl script that can allow workflows to > be intentionally stopped to allow human intervention. 
> > In other words, right now I can use swift to chain a bunch of programs > together, but what if I want to take a work product, assign it to a > human to do something with it, and then use the output of the human's > effort to act as the input to the next step? One way might be to > intentionally stop the workflow at the point where the human should > take the input. (Alternately, to let it fail because the human work > product doesn't exist yet) The human would then go off and do > whatever > they're supposed to do, and subsequently upload the result to some > website. The act of uploading that result would then restart the > workflow to continue processing with the human's result as an > intermediate work product. Viola. Humans can be tasked to do > arbitrarily complex things that computers can't do, just like they > were > an invocation of "grep". :) > > On a separate note, (the XML schema) - I will be using some of the > nightly builds. Is the XML schema fairly stable? How often does it > change? > > -- David > > -----Original Message----- > From: Ben Clifford [mailto:benc at hawaga.org.uk] > Sent: Friday, September 14, 2007 9:47 AM > To: Allen, M. David > Cc: swift-user at ci.uchicago.edu > Subject: RE: [Swift-user] Virtual data schema / catalog? > > >> Being able to look through a provenance chain would be good because > it >> could allow selective regeneration of data sets. I.e. in some cases > I >> don't want to run the entire workflow, I just want software that will >> figure out which datasets in a workflow are missing, and then > recompute >> only those pieces (and any others that depend on them). > > Also related to this, then: > > Swift (or rather the Karajan workflow engine underneath it) has this > concept of restart logs. These are implemented at a lower level than > the > XML. 
> > Briefly, if you have a KML file, you can run part of it, have a > failure, > let the system abort and write out a restart log; and then you can run > again using the restart log to ignore work already done. > > Because this happens at the KML level, its suspect its not something > that > you can really put in a database and come back to with a different > (version of the) workflow - its more intended for "i set this day long > workflow running; it died overnight; tomorrow I will restart it". > > There's a brief section on this in the tutorial, at > http://www.ci.uchicago.edu/swift/guides/tutorial.php#id2860757 > > "16. Starting and restarting" > > -- > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From hategan at mcs.anl.gov Fri Sep 14 09:11:51 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 14 Sep 2007 09:11:51 -0500 Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: <1189779111.1345.8.camel@blabla.mcs.anl.gov> On Fri, 2007-09-14 at 10:00 -0400, Allen, M. David wrote: [...] > In other words, right now I can use swift to chain a bunch of programs > together, but what if I want to take a work product, assign it to a > human to do something with it, and then use the output of the human's > effort to act as the input to the next step? One way might be to > intentionally stop the workflow at the point where the human should > take the input. (Alternately, to let it fail because the human work > product doesn't exist yet) The human would then go off and do whatever > they're supposed to do, and subsequently upload the result to some > website. The act of uploading that result would then restart the > workflow to continue processing with the human's result as an > intermediate work product. Viola. 
Humans can be tasked to do > arbitrarily complex things that computers can't do, just like they were > an invocation of "grep". :) Human processes are, to a large extent, like all other processes (albeit somewhat nondeterministic). So you can model them as a process/app in Swift. You would have to take care of designing the bridging, but I don't think that differs much for Swift vs. other things. To be more specific, you can have an application String spellcheck(String), which is actually done by a person. The actual executable may display an editable box on the screen and write the output into a file when the user clicks "I'm done", or send email and expect a reply and write to the file or any other reasonable user interface. To swift it won't make a difference. In fact, this will work with any language/system. > [...] Mihael > > -- David > > -----Original Message----- > From: Ben Clifford [mailto:benc at hawaga.org.uk] > Sent: Friday, September 14, 2007 9:47 AM > To: Allen, M. David > Cc: swift-user at ci.uchicago.edu > Subject: RE: [Swift-user] Virtual data schema / catalog? > > > > Being able to look through a provenance chain would be good because > it > > could allow selective regeneration of data sets. I.e. in some cases > I > > don't want to run the entire workflow, I just want software that will > > figure out which datasets in a workflow are missing, and then > recompute > > only those pieces (and any others that depend on them). > > Also related to this, then: > > Swift (or rather the Karajan workflow engine underneath it) has this > concept of restart logs. These are implemented at a lower level than > the > XML. > > Briefly, if you have a KML file, you can run part of it, have a > failure, > let the system abort and write out a restart log; and then you can run > again using the restart log to ignore work already done. 
> > Because this happens at the KML level, its suspect its not something > that > you can really put in a database and come back to with a different > (version of the) workflow - its more intended for "i set this day long > workflow running; it died overnight; tomorrow I will restart it". > > There's a brief section on this in the tutorial, at > http://www.ci.uchicago.edu/swift/guides/tutorial.php#id2860757 > > "16. Starting and restarting" > From benc at hawaga.org.uk Fri Sep 14 09:12:29 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 14 Sep 2007 14:12:29 +0000 (GMT) Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: On Fri, 14 Sep 2007, Allen, M. David wrote: > In other words, right now I can use swift to chain a bunch of programs > together, but what if I want to take a work product, assign it to a > human to do something with it, and then use the output of the human's > effort to act as the input to the next step? One way might be to > intentionally stop the workflow at the point where the human should > take the input. (Alternately, to let it fail because the human work > product doesn't exist yet) The human would then go off and do whatever > they're supposed to do, and subsequently upload the result to some > website. The act of uploading that result would then restart the > workflow to continue processing with the human's result as an > intermediate work product. Viola. Humans can be tasked to do > arbitrarily complex things that computers can't do, just like they were > an invocation of "grep". :) I think what you describe there (at least the Swift side of it) is pretty much doable now. Make your application be:

#!/bin/bash
# $1: string identifying the request; $2: mapped output filename
if [ -f "/tmp/userresponse$1" ]; then
  # The human's response exists: hand it to the workflow as the output.
  cp "/tmp/userresponse$1" "$2"
  exit 0
else
  # No response yet: fail so the workflow stops and can be restarted later.
  exit 1
fi

And have the first parameter be a string (not a mapped filename) that describes or identifies the request; and $2 be a mapped filename that will give the output when it's ready. 
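The same check can also be written so that it optionally waits for the response instead of failing on the first look, which is closer to the "keep the job running until the human answers" style. This is only a sketch in Python and not part of Swift: the /tmp/userresponse prefix is taken from the script above, while the function name and the timeout and poll_interval parameters are assumptions added for illustration.

```python
import os
import shutil
import time


def wait_for_response(request_id, output_path, prefix="/tmp/userresponse",
                      timeout=0.0, poll_interval=5.0):
    """Copy the human's response file to output_path once it exists.

    Returns 0 on success and 1 if no response appeared within `timeout`
    seconds. With timeout=0 the file is checked exactly once, which
    matches the behaviour of the shell version.
    """
    response = prefix + request_id
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(response):
            shutil.copyfile(response, output_path)
            return 0
        if time.monotonic() >= deadline:
            return 1
        time.sleep(poll_interval)
```

A wrapper script would call wait_for_response with the request identifier and the mapped output filename and exit with the returned code. Whether a long wait is acceptable depends on how the site's batch system treats idle jobs.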
Another way you could implement that is to write a custom execution provider (i.e. the abstraction that lets Swift submit a job to run locally using the fork provider, through Globus using the globus provider, through PBS using the PBS provider). You could write a provider-human. In that model, Swift would keep running, treating your human job the same as any other long-running queued job; that may or may not be desirable. > On a separate note, (the XML schema) - I will be using some of the > nightly builds. Is the XML schema fairly stable? How often does it > change? There was a massive rewrite that went in a week or so ago. The previous change before that was in February and before that was the import into our SVN repository in December 2006. So historically it's been fairly stable. Where it's likely to change is when new language features are added - for example, I have a patch that provides a different kind of iteration loop. That adds a new language construct to the SwiftScript text syntax and a corresponding XML representation. -- From dmallen at mitre.org Fri Sep 14 09:13:35 2007 From: dmallen at mitre.org (Allen, M. David) Date: Fri, 14 Sep 2007 10:13:35 -0400 Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: <986DBC46-85CA-42D6-AC0C-A58C50E5F549@mcs.anl.gov> References: <986DBC46-85CA-42D6-AC0C-A58C50E5F549@mcs.anl.gov> Message-ID: Two workflows could work, but that really would be a workaround. The goal would be to have a single workflow so that it could be understood and maintained as a single workflow. I don't necessarily think that this even should be done within Swift. It could probably be accomplished with a relatively simple Perl script. 
It could use one of two approaches:
(1) The script could intentionally cause an error to stop the workflow after having sent notification to the human; when the human responds, a separate script could restart the workflow to continue.
(2) The script could send notification to the human, and then just sleep indefinitely until signaled by some other thread to wake up when the human responds.
All Swift would know or care is that you're invoking a simple script with a few simple parameters. (That script might take 3 days to complete, but that's another story...) -- David -----Original Message----- From: Veronika Nefedova [mailto:nefedova at mcs.anl.gov] Sent: Friday, September 14, 2007 10:08 AM To: Allen, M. David Cc: Ben Clifford; swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Virtual data schema / catalog? I am not sure if swift has such features at the present (Ben or Mihael should comment on that), but a very simple workaround would be to have 2 independent workflows. When the first one finishes, a researcher would work with the data, upload it wherever it should go, etc, and then with just one extra push of a button the second part of the workflow could be started. Of course this won't work if you need to stop the workflow each time at different places, but if you need to stop it after one particular step (every time the same) - the above solution would work. Nika On Sep 14, 2007, at 9:00 AM, Allen, M. David wrote: > I noticed the stop/restart feature when I read the user guide, and > it's > something I'm very interested in. One of the things on my to do list > is to think about developing a perl script that can allow workflows to > be intentionally stopped to allow human intervention. > > In other words, right now I can use swift to chain a bunch of programs > together, but what if I want to take a work product, assign it to a > human to do something with it, and then use the output of the human's > effort to act as the input to the next step? 
One way might be to > intentionally stop the workflow at the point where the human should > take the input. (Alternately, to let it fail because the human work > product doesn't exist yet) The human would then go off and do > whatever > they're supposed to do, and subsequently upload the result to some > website. The act of uploading that result would then restart the > workflow to continue processing with the human's result as an > intermediate work product. Viola. Humans can be tasked to do > arbitrarily complex things that computers can't do, just like they > were > an invocation of "grep". :) > > On a separate note, (the XML schema) - I will be using some of the > nightly builds. Is the XML schema fairly stable? How often does it > change? > > -- David > > -----Original Message----- > From: Ben Clifford [mailto:benc at hawaga.org.uk] > Sent: Friday, September 14, 2007 9:47 AM > To: Allen, M. David > Cc: swift-user at ci.uchicago.edu > Subject: RE: [Swift-user] Virtual data schema / catalog? > > >> Being able to look through a provenance chain would be good because > it >> could allow selective regeneration of data sets. I.e. in some cases > I >> don't want to run the entire workflow, I just want software that will >> figure out which datasets in a workflow are missing, and then > recompute >> only those pieces (and any others that depend on them). > > Also related to this, then: > > Swift (or rather the Karajan workflow engine underneath it) has this > concept of restart logs. These are implemented at a lower level than > the > XML. > > Briefly, if you have a KML file, you can run part of it, have a > failure, > let the system abort and write out a restart log; and then you can run > again using the restart log to ignore work already done. 
> > Because this happens at the KML level, its suspect its not something > that > you can really put in a database and come back to with a different > (version of the) workflow - its more intended for "i set this day long > workflow running; it died overnight; tomorrow I will restart it". > > There's a brief section on this in the tutorial, at > http://www.ci.uchicago.edu/swift/guides/tutorial.php#id2860757 > > "16. Starting and restarting" > > -- > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From tiberius at ci.uchicago.edu Fri Sep 14 09:52:59 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Fri, 14 Sep 2007 09:52:59 -0500 Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: One of the swift-related tools that I was working on is a "experiment tracking" utility. Essentially it's a python script that is executed by the user after the workflow has finished, and which parses the execution log file for useful information, such as parameters that have been passed to the atomic procedures, and the names of the files that were involved in the current execution. Since the swift script from which the workflow engine produces the log is up to the user, the log parser has been built with a lot of assumptions in mind, and still needs to be refined further. When all the information is gathered, there is a storage step, where the metadata-looking data is stored in a relational database, and the files are copied to a local cache in the user's home directory, set aside for this purpose The current code for this tool is here: http://www.ci.uchicago.edu/trac/swift/browser/SwiftApps/Aphasia/saveExperiment.py ... 
and an example of what ends up in the database is here: http://tp-neurodb.ci.uchicago.edu:8080/ExperimentManagement This tool was built to address urgent needs of some of the researchers using Swift, and it will be superseded by the redesigned and reimplemented VDC (when it's done). Tibi On 9/14/07, Ben Clifford wrote: > > There's a long term plan to have something like a VDC but no > implementation. > > Tibi has worked on an experiment management database for one specific > project - maybe he will comment on how that overlaps here. > > The origins of the XML and KML files are roughly analogous to VDS1's XML > form of VDL and to Condor DAGman graphs (respectively). > > The XML files are more or less the same description as the SwiftScript > .swift files, but in an XML form. The KML files are lower-level, > describing various things that need to happen inside the execution > environemnt itself. > > The intention (or at least my intention) was/is that the XML would be > stuff that would go into a VDC, with the KML files not being > standardised/shareable. > > > > In addition, there's a utility called kickstart which will generate > invocation information for runs on the actual sites that the jobs run on. > Swift can be configured to always run that and bring the files back to > the submitting system. > > kickstart was also present in VDS1 - its the same executable for both. > > The invocation records from this are also in XML form. > > > So the short answer is: its easy to get a bunch of XML descriptions of > both the high level workflow and actual invocations dumped into files in a > directory. We haven't got anything in Swift that will do anything with > those files. > > If I was hacking round with this, then given the XML nature of the > invocation records and the XML intermediate for of SwiftScript, I'd be > inclined to make something that would import them all into an XML database > like Xindice and then play around making XPath queries against that. 
> > -- > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From yongzh at cs.uchicago.edu Fri Sep 14 11:41:19 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Fri, 14 Sep 2007 11:41:19 -0500 (CDT) Subject: [Swift-user] Virtual data schema / catalog? In-Reply-To: References: Message-ID: Allen, There is an in-depth paper that talks about provenance issues in VDS/Swift; you can find it here: http://people.cs.uchicago.edu/~yongzh/pub/IPAW_VDSProvenance.pdf I also want to mention that I implemented a VDC (for the former VDS) on top of eXist, a very good native XML database, which is far better than Xindice. The XML-based implementation could potentially support all the queries we mentioned in the provenance paper. For Swift, the organization of the workflow is a bit different, as there is not an explicit graph representation for a workflow, but the Swift runtime can generate a graph for a specific workflow instance (i.e. the workflow is operating on a specific set of inputs). As for human intervention, there is a BPEL4People spec that talks about integrating human interactions into workflows, but I don't know the details. Yong. On Fri, 14 Sep 2007, Allen, M. David wrote: > Hello, > > I'm just trying to find my way through the background on this software, > and thought see if anyone could point me in the right direction. > > I first came to swift through reading the original Chimera paper from > 2002. The ability to distribute jobs across different machines is less > interesting to me than the idea of tracking provenance for complex > creations. Is there a way that I can get this kind of structured > provenance information out of swift, or a related tool? 
Ideally, it > would be structured similarly to the virtual data schema I saw in the > 2002 Chimera paper. > > I have taken a look at the KML & XML files, and they seem to be more or > less fit the bill. (KML seems to have derivations - > invocation-specific information that's more or less a direct > translation of the swiftscript, and the XML file seems to have generic > procedure metadata, the transformations) Are there any tools available > that can manipulate this information for other purposes (or insert it > into a relational database with a proper schema?) > > If I'm overlooking some documentation, please provide a pointer. I'm > more than happy to RTFM. :) > > Thanks > -- > M. David Allen > > From anand-padmanabhan-1 at uiowa.edu Fri Sep 14 14:08:42 2007 From: anand-padmanabhan-1 at uiowa.edu (Anand Padmanabhan) Date: Fri, 14 Sep 2007 14:08:42 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: <1189739122.15474.3.camel@blabla.mcs.anl.gov> References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> <46E9F8C5.8070709@uiowa.edu> <1189739122.15474.3.camel@blabla.mcs.anl.gov> Message-ID: <46EADC3A.1030009@uiowa.edu> > >>> After the job is done, Swift, from the comfort of the submit host, >>> checks, through GridFTP, first whether the success file is there, and if >>> not whether the error file is there. It finds none, which means that >>> these files, although presumably written by the wrapper on the worker >>> node, cannot be seen on the head node through GridFTP. >>> >>> So it looks to me like there might be something wrong with the file >>> system? >> Is there some logs that the Swift/application write on the server side, >> that might record if it had some problem writing these output/error >> files. > > Yes. Jing can help you with finding these. 
Basically they are > /info/-info Jing, could you send me this file? Also, is it possible to add logging statements to the application, so that if needed we can find more information from this file? > >> Also I know some condor systems, job executables get dumped a >> temporary directory on a worker node's local file system. Would this >> have any effect on Swift? > > As long as Condor/the job manager honor the directory rls setting, this > shouldn't make any difference. We need to make sure this is the case. I know we had an earlier problem at FNAL_FERMIGRID on which the initial dir globus parameter was not respected. You can find details at https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/NewUserRunningJobsFailureFNAL Is there a way to get the RSL that gets submitted to the site from Swift? Thanks Anand From anand-padmanabhan-1 at uiowa.edu Mon Sep 17 15:09:37 2007 From: anand-padmanabhan-1 at uiowa.edu (Anand Padmanabhan) Date: Mon, 17 Sep 2007 15:09:37 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: <46EADC3A.1030009@uiowa.edu> References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> <46E9F8C5.8070709@uiowa.edu> <1189739122.15474.3.camel@blabla.mcs.anl.gov> <46EADC3A.1030009@uiowa.edu> Message-ID: <46EEDF01.8030407@uiowa.edu> >> >>> Also I know some condor systems, job executables get dumped a >>> temporary directory on a worker node's local file system. Would this >>> have any effect on Swift? >> >> As long as Condor/the job manager honor the directory rls setting, this >> shouldn't make any difference. > This is something we need to make sure this is the case. I know we had a > earlier problem at FNAL_FERMIGRID on which the initial dir globus > parameter was not respected. 
You can find details at
> https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/NewUserRunningJobsFailureFNAL

I checked with FNAL about 2 of the sites on which Jing was having
problems. As I suspected, the site admin confirmed that the two
gatekeepers in question were running NFS-lite and do not respect the RSL
initial directory variable. The lack of support for initialdir on some
OSG sites is a known issue and it does break compatibility.

Also, is there a way (some parameter we can specify) to make Swift not
set the initialdir parameter? This way the job can finish in whatever
directory it gets dumped into by the batch system and then possibly copy
files over to the expected location in the $OSG_DATA directory.

Thanks,
Anand

From iraicu at cs.uchicago.edu Tue Sep 18 00:05:49 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 18 Sep 2007 00:05:49 -0500
Subject: [Swift-user] test message
Message-ID: <46EF5CAD.6040403@cs.uchicago.edu>

Hi,
I have not been receiving any messages from the Swift user mailing list,
although I have been on it for months. Can someone confirm that this
message goes through, both via the mailing list and via a personal email?

Thanks,
Ioan

--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E.
58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================

From benc at hawaga.org.uk Tue Sep 18 02:43:33 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 18 Sep 2007 07:43:33 +0000 (GMT)
Subject: [Swift-user] test message
In-Reply-To: <46EF5CAD.6040403@cs.uchicago.edu>
References: <46EF5CAD.6040403@cs.uchicago.edu>
Message-ID:

On Tue, 18 Sep 2007, Ioan Raicu wrote:

> I have not been receiving any messages from the Swift user mailing list,
> although I have been on it for months. Can someone confirm that this message
> goes through, both via the mailing list and via a personal email? Thanks,
> Ioan

You are. You can see what you've missed in the archives:

http://mail.ci.uchicago.edu/pipermail/swift-user/

--

From iraicu at cs.uchicago.edu Tue Sep 18 06:56:02 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 18 Sep 2007 06:56:02 -0500
Subject: [Swift-user] test message
In-Reply-To:
References: <46EF5CAD.6040403@cs.uchicago.edu>
Message-ID: <46EFBCD2.8050600@cs.uchicago.edu>

OK, I got this message, and also found the rest of the messages going to
a different folder that I seldom check :(... sorry about the confusion.

Thanks again,
Ioan

Ben Clifford wrote:
> On Tue, 18 Sep 2007, Ioan Raicu wrote:
>
>
>> I have not been receiving any messages from the Swift user mailing list,
>> although I have been on it for months. Can someone confirm that this message
>> goes through, both via the mailing list and via a personal email? Thanks,
>> Ioan
>>
>
> You are. You can see what you've missed in the archives:
>
> http://mail.ci.uchicago.edu/pipermail/swift-user/
>

--
============================================
Ioan Raicu
Ph.D.
Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================

From jbk at fnal.gov Tue Sep 18 11:45:59 2007
From: jbk at fnal.gov (Jim Kowalkowski)
Date: Tue, 18 Sep 2007 11:45:59 -0500
Subject: [Swift-user] question about SwiftScript
Message-ID: <46F000C7.2090302@fnal.gov>

Hello,

I have a question about SwiftScript. Should I expect the following to work?
If not, why should I not expect it to work?

Please assume that the app works - I've tested this part on a simpler
script.
At the bottom is the test functions that actually work.

Thanks in advance,
Jim Kowalkowski
Fermilab CD
--------------------
type file { };

(file o) gen0() {
    app {
        echo2 @filename(o);
    }
}

(file o) genn(file i)
{
    app {
        echo @filename(i) @filename(o);
    }
}

(file o) appl_0 ()
{
    o = gen0();
}
(file all[]) appl_n (int j)
{
    if (j == 0) {
        all[0] = gen0();
    } else {
        int k = j - 1;
        all = appl_n (k);
        all[j] = genn(all[k]);
    }
}

string filenames = "0.txt 1.txt 2.txt 3.txt 4.txt 5.txt";
file ifiles[] ;

ifiles = appl_n(5);

#------ if I run the function below, it works fine -------

(file inputfiles[]) simple_run ()
{
    inputfiles[5] = genn (inputfiles[4]);
    inputfiles[4] = genn (inputfiles[3]);
    inputfiles[3] = genn (inputfiles[2]);
    inputfiles[2] = genn (inputfiles[1]);
    inputfiles[1] = genn (inputfiles[0]);
    inputfiles[0] = gen0 ();
}

files more[] = ;
more = simple_run();

#-------- I ran also run this and it works ----------

run_dep (file fs[], int curr)
{
    int prev = curr - 1;
    fs[curr] = genn( fs[prev] );
}

(file inputfiles[]) other_run ()
{
    inputfiles = run_dep(5);
    inputfiles =
run_dep(4);
    inputfiles = run_dep(3);
    inputfiles = run_dep(2);
    inputfiles = run_dep(1);
    inputfiles[0] = gen0();
}

more = other_run();

From hategan at mcs.anl.gov Tue Sep 18 11:57:29 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 18 Sep 2007 11:57:29 -0500
Subject: [Swift-user] question about SwiftScript
In-Reply-To: <46F000C7.2090302@fnal.gov>
References: <46F000C7.2090302@fnal.gov>
Message-ID: <1190134649.19051.10.camel@blabla.mcs.anl.gov>

On Tue, 2007-09-18 at 11:45 -0500, Jim Kowalkowski wrote:
> Hello,
>
> I have a question about SwiftScript. Should I expect the following to work?

Nope, as far as I can tell.

> If not, why should I not expect it to work?

No recursion support in Swift. Which is somewhat silly. But Ben has been
working on a way to do that kind of folding operation. Maybe he can
provide some details.

>
> Please assume that the app works - I've tested this part on a simpler
> script.
> At the bottom is the test functions that actually work.
>
> Thanks in advance,
> Jim Kowalkowski
> Fermilab CD
> --------------------
> type file { };
>
> (file o) gen0() {
>     app {
>         echo2 @filename(o);
>     }
> }
>
> (file o) genn(file i)
> {
>     app {
>         echo @filename(i) @filename(o);
>     }
> }
>
> (file o) appl_0 ()
> {
>     o = gen0();
> }
> (file all[]) appl_n (int j)
> {
>     if (j == 0) {
>         all[0] = gen0();
>     } else {
>         int k = j - 1;
>         all = appl_n (k);
>         all[j] = genn(all[k]);
>     }
> }
>
> string filenames = "0.txt 1.txt 2.txt 3.txt 4.txt 5.txt";
> file ifiles[] ;
>
> ifiles = appl_n(5);
>
> #------ if I run the function below, it works fine -------
>
> (file inputfiles[]) simple_run ()
> {
>     inputfiles[5] = genn (inputfiles[4]);
>     inputfiles[4] = genn (inputfiles[3]);
>     inputfiles[3] = genn (inputfiles[2]);
>     inputfiles[2] = genn (inputfiles[1]);
>     inputfiles[1] = genn (inputfiles[0]);
>     inputfiles[0] = gen0 ();
> }
>
> files more[] = ;
> more = simple_run();
>
> #-------- I ran also run this and it works ----------
>
> run_dep (file fs[], int curr)
> {
>
int prev = curr - 1;
>     fs[curr] = genn( fs[prev] );
> }
>
> (file inputfiles[]) other_run ()
> {
>     inputfiles = run_dep(5);
>     inputfiles = run_dep(4);
>     inputfiles = run_dep(3);
>     inputfiles = run_dep(2);
>     inputfiles = run_dep(1);
>     inputfiles[0] = gen0();
> }
>
> more = other_run();
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>

From benc at hawaga.org.uk Tue Sep 18 11:57:46 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 18 Sep 2007 16:57:46 +0000 (GMT)
Subject: [Swift-user] question about SwiftScript
In-Reply-To: <46F000C7.2090302@fnal.gov>
References: <46F000C7.2090302@fnal.gov>
Message-ID:

I would expect you to have some problems there with some stuff that we
haven't documented properly with regard to 'closing data sets'.

Basically:

> (file all[]) appl_n (int j)
> {
>     if (j == 0) {
>         all[0] = gen0();
>     } else {
>         int k = j - 1;
>         all = appl_n (k);
>         all[j] = genn(all[k]);
>     }
> }

when appl_n returns, you can no longer assign anything more to all.

So you might get problems here:

>         all = appl_n (k);
>         all[j] = genn(all[k]);

because you've assigned a value to all in the first line and are now no
longer allowed to change it.

I just added some docs this morning about something that might do
something similar to what you are trying to do, which is the iterate
language feature.

Look at this chapter:

http://www.ci.uchicago.edu/swift/guides/tutorial.php#tutorial.iterate

on the 'sequential iteration construct'.

I think that corresponds roughly with what you are trying to do, but does
it in an iterative style rather than a recursive style.

(if you want to use this, grab the latest from SVN because I only just
committed it and you'd be the first user of this...)
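
The 'closing' behaviour described above can be modelled in a few lines.
What follows is a toy Python sketch, not SwiftScript and not Swift's
actual implementation. WriteOnceArray, ClosedArrayError, build_chain and
recursive_step are hypothetical names, and gen0/genn are stand-ins that
build strings instead of running apps. It assumes only what the message
states: once a procedure has returned an array, nothing more may be
assigned to it.

```python
class ClosedArrayError(Exception):
    """Raised on a write to a closed array or to an already-set element."""


class WriteOnceArray:
    """Toy single-assignment array (hypothetical model, not Swift code)."""

    def __init__(self):
        self._items = {}
        self._closed = False

    def __setitem__(self, index, value):
        if self._closed:
            raise ClosedArrayError("array is closed; no further assignment allowed")
        if index in self._items:
            raise ClosedArrayError(f"element {index} was already assigned")
        self._items[index] = value

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)

    def close(self):
        """Mark the array as complete, as if its producer had returned."""
        self._closed = True
        return self


def gen0():
    """Stand-in for the gen0 app: produces the first item of the chain."""
    return "seed"


def genn(prev):
    """Stand-in for the genn app: derives the next item from the previous."""
    return prev + "+"


def build_chain(n):
    """Iterative style: fill one array element by element, then close it."""
    out = WriteOnceArray()
    out[0] = gen0()
    for j in range(1, n + 1):
        out[j] = genn(out[j - 1])
    return out.close()


def recursive_step(k):
    """Recursive style from the thread: take a returned array, then extend it."""
    all_ = build_chain(k)           # plays the role of `all = appl_n(k)`
    all_[k + 1] = genn(all_[k])     # fails: the returned array is closed
    return all_


if __name__ == "__main__":
    chain = build_chain(5)
    print(len(chain), chain[5])
    try:
        recursive_step(2)
    except ClosedArrayError as err:
        print("recursive style failed:", err)
```

In this model the iterative version succeeds because every element is
written before the array is handed back, while the recursive version
fails on its second assignment, which is the pattern the quoted
`all = appl_n (k); all[j] = genn(all[k]);` lines follow.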
--

From hategan at mcs.anl.gov Tue Sep 18 12:02:40 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 18 Sep 2007 12:02:40 -0500
Subject: [Swift-user] question about SwiftScript
In-Reply-To:
References: <46F000C7.2090302@fnal.gov>
Message-ID: <1190134960.19850.1.camel@blabla.mcs.anl.gov>

On Tue, 2007-09-18 at 16:57 +0000, Ben Clifford wrote:
> I would expect you to have some problems there with some stuff that we
> haven't documented properly with regard to 'closing data sets'.
>
> Basically:
>
> > (file all[]) appl_n (int j)
> > {
> >     if (j == 0) {
> >         all[0] = gen0();
> >     } else {
> >         int k = j - 1;
> >         all = appl_n (k);
> >         all[j] = genn(all[k]);
> >     }
> > }
>
> when appl_n returns, you can no longer assign anything more to all.
>
> So you might get problems here:
>
> >         all = appl_n (k);
> >         all[j] = genn(all[k]);
>
> because you've assign a value to all in the first line and are now no
> longer allowed to change it.

That, I believe, is easy to solve:

file[] tmp = appl_n(k);
foreach i in [0:k] {
    all[i] = tmp[i];
}
all[j] = genn(all[k]);

Mihael

>
> I just added some docs this morning about something that might do
> something similar to what you are trying to do, which is the iterate
> language feature.
>
> Look at this chapter:
>
> http://www.ci.uchicago.edu/swift/guides/tutorial.php#tutorial.iterate
>
> on the 'sequential interation construct'.
>
> I think that corresponds roughly with what you are trying to do, but does
> it in an iterative style rather than a recursive style.
>
> (if you want to use this, grab the latest from SVN because I only just
> committed it and you'd be the first user of this...)
> From hategan at mcs.anl.gov Tue Sep 18 16:08:18 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 18 Sep 2007 16:08:18 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: <46EEDF01.8030407@uiowa.edu> References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> <46E9F8C5.8070709@uiowa.edu> <1189739122.15474.3.camel@blabla.mcs.anl.gov> <46EADC3A.1030009@uiowa.edu> <46EEDF01.8030407@uiowa.edu> Message-ID: <1190149699.22749.29.camel@blabla.mcs.anl.gov> On Mon, 2007-09-17 at 15:09 -0500, Anand Padmanabhan wrote: > >> > >>> Also I know some condor systems, job executables get dumped a > >>> temporary directory on a worker node's local file system. Would this > >>> have any effect on Swift? > >> > >> As long as Condor/the job manager honor the directory rls setting, this > >> shouldn't make any difference. > > This is something we need to make sure this is the case. I know we had a > > earlier problem at FNAL_FERMIGRID on which the initial dir globus > > parameter was not respected. You can find details at > > https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/NewUserRunningJobsFailureFNAL > I checked with FNAL with 2 of the sites on which Jing was having > problems with. As I suspected the siteadmin confirmed that the two > gatekeepers in question were running NFS-lite and do not respect the RSL > initial directory variable. The lack of support of initialdir on some > OSG sites is known issue and it does break compatibility. > > Also, is there a way (some parameter we specify), so that swift not to > set the initialdir parameter. This way the job can finish in what ever > directory it gets dumped to by the batch system and then possibly copy > files over to expected location in the $OSG_DATA directory. No. There is no such parameter yet. 
Is not implementing random bits of an otherwise standard interface an
acceptable thing on OSG? Are there any other "surprises" we should be
aware of?

Mihael

>
> Thanks,
> Anand
>

From anand-padmanabhan-1 at uiowa.edu Tue Sep 18 16:53:44 2007
From: anand-padmanabhan-1 at uiowa.edu (Anand Padmanabhan)
Date: Tue, 18 Sep 2007 16:53:44 -0500
Subject: [Swift-user] Success with fork, but exception in getFile with condor
In-Reply-To: <1190149699.22749.29.camel@blabla.mcs.anl.gov>
References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> <46E9F8C5.8070709@uiowa.edu> <1189739122.15474.3.camel@blabla.mcs.anl.gov> <46EADC3A.1030009@uiowa.edu> <46EEDF01.8030407@uiowa.edu> <1190149699.22749.29.camel@blabla.mcs.anl.gov>
Message-ID: <46F048E8.7080508@uiowa.edu>

>> Also, is there a way (some parameter we specify), so that swift not to
>> set the initialdir parameter. This way the job can finish in what ever
>> directory it gets dumped to by the batch system and then possibly copy
>> files over to expected location in the $OSG_DATA directory.
>
> No. There is no such parameter yet.
>
> Is not implementing random bits of an otherwise standard interface an
> acceptable thing on OSG? Are there any other "surprises" we should be
> aware of?
>
In OSG it is up to the sites to decide what they want to support; this is
not one of the mandatory requirements placed on a site. A large number of
sites do support it (Jing is running successfully on around 15 sites now),
though there are a few that do not.

Another thing you need to be aware of is that there is no requirement to
have a shared file system in OSG, though again many sites do provide a
DATA directory.
Thanks Anand From benc at hawaga.org.uk Tue Sep 18 16:57:18 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Sep 2007 21:57:18 +0000 (GMT) Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: <46F048E8.7080508@uiowa.edu> References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> <46E9F8C5.8070709@uiowa.edu> <1189739122.15474.3.camel@blabla.mcs.anl.gov> <46EADC3A.1030009@uiowa.edu> <46EEDF01.8030407@uiowa.edu> <1190149699.22749.29.camel@blabla.mcs.anl.gov> <46F048E8.7080508@uiowa.edu> Message-ID: > In OSG it is upto the sites to decide what they want to support, this is not > one of the mandatory requirement that is placed on the site. A large number of > sites do support it (Jing is running successfully on around 15 sites now), > though there are few that do not support it. > > Another thing you need to be aware that there is no requirement to have a > shared file system in OSG, though again many sites do provide a DATA > directory. probably we (swift) should clarify that those are requirements that we make on sites and that attempting to run on sites that don't support that won't work. that way we get to solve the problem by writing a paragraph in the user manual rather than fixing anything. 
-- From anand-padmanabhan-1 at uiowa.edu Tue Sep 18 17:03:21 2007 From: anand-padmanabhan-1 at uiowa.edu (Anand Padmanabhan) Date: Tue, 18 Sep 2007 17:03:21 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> <46E9F8C5.8070709@uiowa.edu> <1189739122.15474.3.camel@blabla.mcs.anl.gov> <46EADC3A.1030009@uiowa.edu> <46EEDF01.8030407@uiowa.edu> <1190149699.22749.29.camel@blabla.mcs.anl.gov> <46F048E8.7080508@uiowa.edu> Message-ID: <46F04B29.8040702@uiowa.edu> Ben Clifford wrote: >> In OSG it is upto the sites to decide what they want to support, this is not >> one of the mandatory requirement that is placed on the site. A large number of >> sites do support it (Jing is running successfully on around 15 sites now), >> though there are few that do not support it. >> >> Another thing you need to be aware that there is no requirement to have a >> shared file system in OSG, though again many sites do provide a DATA >> directory. > > probably we (swift) should clarify that those are requirements that we > make on sites and that attempting to run on sites that don't support that > won't work. that way we get to solve the problem by writing a paragraph in > the user manual rather than fixing anything. > Yes, I would agree and OSG also needs a better way of publishing which sites make non standard assumption. 
Thanks, Anand From hategan at mcs.anl.gov Tue Sep 18 17:10:18 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 18 Sep 2007 17:10:18 -0500 Subject: [Swift-user] Success with fork, but exception in getFile with condor In-Reply-To: References: <1189464405.22457.1.camel@blabla.mcs.anl.gov> <1189521746.22539.3.camel@blabla.mcs.anl.gov> <46E84198.7050009@uiowa.edu> <1189627752.10401.8.camel@blabla.mcs.anl.gov> <46E9F8C5.8070709@uiowa.edu> <1189739122.15474.3.camel@blabla.mcs.anl.gov> <46EADC3A.1030009@uiowa.edu> <46EEDF01.8030407@uiowa.edu> <1190149699.22749.29.camel@blabla.mcs.anl.gov> <46F048E8.7080508@uiowa.edu> Message-ID: <1190153418.30363.29.camel@blabla.mcs.anl.gov> On Tue, 2007-09-18 at 21:57 +0000, Ben Clifford wrote: > > In OSG it is upto the sites to decide what they want to support, this is not > > one of the mandatory requirement that is placed on the site. A large number of > > sites do support it (Jing is running successfully on around 15 sites now), > > though there are few that do not support it. > > > > Another thing you need to be aware that there is no requirement to have a > > shared file system in OSG, though again many sites do provide a DATA > > directory. > > probably we (swift) should clarify that those are requirements that we > make on sites and that attempting to run on sites that don't support that > won't work. that way we get to solve the problem by writing a paragraph in > the user manual rather than fixing anything. We would have to state the extent to which a Globus deployment on a site needs to be functional, which doesn't seem like a trivial thing. >