[Swift-user] Success with fork, but exception in getFile with condor

Mihael Hategan hategan at mcs.anl.gov
Wed Sep 12 15:09:12 CDT 2007


On Wed, 2007-09-12 at 14:44 -0500, Anand Padmanabhan wrote:
> Hi Mihael,
> 
> The OSG troubleshooting team has been working with Jing to identify and 
> correct the problems she is having when running on the OSG infrastructure.
> 
> Looking at the logs Jing sent us from one of the OSG sites (possibly a few 
> more), she is getting the following messages in the log:
> ...
> 2007-09-10 15:46:37,446 DEBUG vdl:execute2 Application exception: No 
> status file was found. Check the shared filesystem on GLOW
> ...
> 2007-09-10 15:46:37,498 DEBUG DelegatedFileTransferHandler File transfer 
> with resource remote->tmp
> 2007-09-10 15:46:37,730 DEBUG DelegatedFileTransferHandler Exception in 
> transfer
> org.globus.cog.abstraction.impl.file.FileResourceException: Exception in 
> getFile
> ...
> 
> I don't think I have a clear understanding of what this error means. 
> Does this mean that there was an application error because it did not 
> find the files it was expecting, or do you think this is some problem 
> related to the OSG infrastructure? If so, could you tell me what 
> exactly Swift was trying to do at these steps when it failed?

Every application is run by a wrapper on the worker node. When the
application is done, the wrapper produces either an error file or a
success file. It should always produce exactly one of the two (which one
depends on whether the run was successful or not). This is on the worker
node, and is assumed to be happening on a shared file system.
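
Roughly, and only as a sketch of the idea (this is not the actual wrapper
code; the function name, arguments, and what gets written into the error
file are assumptions made for illustration), the "exactly one status file"
behaviour looks like this:

```python
import subprocess
from pathlib import Path


def run_wrapped(job_id, argv, status_dir):
    """Run argv and leave exactly one status file in status_dir.

    Hypothetical sketch: only the <jobid>-success / <jobid>-error
    naming comes from the discussion; everything else is assumed.
    """
    status_dir = Path(status_dir)
    status_dir.mkdir(parents=True, exist_ok=True)
    try:
        result = subprocess.run(argv, capture_output=True, text=True)
        ok = result.returncode == 0
        detail = result.stderr
    except OSError as exc:
        # The application could not even be started (e.g. not found).
        ok = False
        detail = str(exc)
    if ok:
        status = status_dir / f"{job_id}-success"
        status.touch()
    else:
        status = status_dir / f"{job_id}-error"
        status.write_text(detail)
    return status
```

Whatever happens to the application, one of the two files ends up in
status_dir, which is assumed here to sit on the shared file system.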

After the job is done, Swift, from the comfort of the submit host,
checks, through GridFTP, first whether the success file is there, and if
not whether the error file is there. It finds neither, which means that
these files, although presumably written by the wrapper on the worker
node, cannot be seen on the head node through GridFTP.
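
The order of the checks can be sketched as below. This is an illustration
under assumptions, not Swift's implementation: check_status and the exists
callback are made-up names, and exists stands in for whatever remote
lookup (GridFTP in this setup) tells the submit host whether a file is
visible.

```python
from typing import Callable


def check_status(job_id: str, exists: Callable[[str], bool]) -> str:
    """Return "success" or "error"; raise if neither file is visible.

    `exists` is a hypothetical callback standing in for the remote
    existence check done from the submit host.
    """
    # Look for the success file first.
    if exists(f"{job_id}-success"):
        return "success"
    # Only if that is missing, look for the error file.
    if exists(f"{job_id}-error"):
        return "error"
    # Neither is visible: the situation reported in the log.
    raise FileNotFoundError(
        "No status file was found. Check the shared filesystem"
    )
```

With a set of visible file names, check_status("job1", visible.__contains__)
returns "success" when "job1-success" is in the set, and raises when neither
file is there, which is the case being reported for GLOW.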

So it looks to me like there might be something wrong with the file
system?

Mihael

> 
> Thanks,
> Anand
> 
> Mihael Hategan wrote:
> > On Tue, 2007-09-11 at 00:09 -0500, Jing Tie wrote:
> >> Hi,
> >>
> >> Thanks! Is it possible that the status file was generated in an
> >> unexpected directory?
> > 
> > Very unlikely.
> >> I ran the SID application on another site, atlas.dpcc.uta.edu
> >> (jobmanager-pbs), and it succeeded! But on site u2-grid.ccr.buffalo.edu
> >> (jobmanager-pbs), there was an execution error after task submission:
> >> "FileResourceCache Maximum idle time exceeded. Removing resource for
> >> gsiftp://u2-grid.ccr.buffalo.edu".
> > 
> > That's not an error. Idle GridFTP connections are removed from the cache
> > after a while. Your log shows simply that nothing is happening. 
> > 
> > Mihael
> > 
> >> Logs are attached (sid*.log ---
> >> u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu).
> >>
> >> Thanks,
> >> Jing
> >>
> >> On 9/10/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>> The wrapper produces exactly one status file: <jobid>-success or
> >>> <jobid>-error. If neither is present, it means either that the wrapper
> >>> didn't write one (very unlikely, unless there is something weird I'm
> >>> missing), or that GridFTP on the head node doesn't see what the
> >>> wrapper has written.
> >>>
> >>> On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote:
> >>>> Hi,
> >>>>
> >>>> I think there is a problem running Swift scripts with jobmanager-condor
> >>>> on some OSG sites. I ran simple-wf.dtm (a very simple Swift script that
> >>>> copies the content of an input file to an output file) and the SID script
> >>>> on the GLOW site separately. Everything works when running with
> >>>> jobmanager-fork, but "exception in getFile" happened with
> >>>> jobmanager-condor. The log from the Swift client is attached. However,
> >>>> no log/info/output files were generated in the Swift work cache, nor was
> >>>> any duplicate-*** directory, though according to the log file the
> >>>> directory seemed to have been created.
> >>>>
> >>>> The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run
> >>>> globus-url-copy to copy files between OSG_DATA and OSG_WN_TMP.
> >>>>
> >>>> Exception:
> >>>> Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed
> >>>> Exception in getFile
> >>>> File transfer failed
> >>>> duplicate failed
> >>>> The following errors have occurred:
> >>>> 1. Application "duplicate" failed (No status file was found. Check the
> >>>> shared filesystem on GLOW)
> >>>>         Arguments: "simpleFile.txt"
> >>>>         Host: GLOW
> >>>>         Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi
> >>>>         STDERR:
> >>>>         STDOUT:
> >>>>
> >>>> Thanks,
> >>>> Jing
> >>>> _______________________________________________
> >>>> Swift-user mailing list
> >>>> Swift-user at ci.uchicago.edu
> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >>>
> > 
> 



