[Swift-user] Success with fork, but exception in getFile with condor

Mihael Hategan hategan at mcs.anl.gov
Thu Sep 13 22:05:22 CDT 2007


On Thu, 2007-09-13 at 21:58 -0500, Anand Padmanabhan wrote:
> 
> > Every application is run by a wrapper on the worker node. When the
> > application is done, the wrapper produces either an error file or a
> > success file. It should always produce exactly one of the two (which one
> > depends on whether the run was successful or not). This is on the worker
> > node, and is assumed to be happening on a shared file system.
> Thanks for your clarification. The shared file system requirement might 
> be a problem at some sites, but Jing is seeing errors even on sites with 
> a shared file system, so I will try to ignore that for the moment.

Keep in mind, though, that having a shared file system does not
necessarily mean it works properly.

> 
> > 
> > After the job is done, Swift, from the comfort of the submit host,
> > checks, through GridFTP, first whether the success file is there, and if
> > not whether the error file is there. It finds none, which means that
> > these files, although presumably written by the wrapper on the worker
> > node, cannot be seen on the head node through GridFTP.
> > 
> > So it looks to me like there might be something wrong with the file
> > system?
> Are there any logs that Swift or the application writes on the server side 
> that might record whether it had a problem writing these output/error 
> files?

Yes. Jing can help you with finding these. Basically they are
<workflow-id>/info/<job-id>-info
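
As an aside, the status check described in the quoted text can be sketched
roughly as follows in Python. This is only an illustration, not Swift's
actual code: the function name, the return values, and the status_dir
parameter are made up, and it looks at a local directory where Swift would
go through GridFTP. Only the <jobid>-success / <jobid>-error naming comes
from this thread.

```python
from pathlib import Path


def check_job_status(status_dir, job_id):
    """Hypothetical helper: report which status file exists for a job.

    Mirrors the order described in this thread: look for the success
    file first, then the error file. Finding neither means the files
    written by the wrapper are not visible from where we are looking.
    """
    base = Path(status_dir)
    if (base / f"{job_id}-success").exists():
        return "success"
    if (base / f"{job_id}-error").exists():
        return "error"
    # Neither file is visible from this side.
    return "missing"
```

The "missing" case corresponds to the "No status file was found" error in
Jing's output.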

> Also, I know that on some Condor systems, job executables get dumped into a 
> temporary directory on a worker node's local file system. Would this 
> have any effect on Swift?

As long as Condor/the job manager honors the directory RSL setting, this
shouldn't make any difference.

> 
> I will try to set up a debug session with one of the site admins, so we 
> can trace in real time what exactly is happening at the WNs.

That could help.

Mihael

> 
> Thanks
> Anand
> > Mihael
> > 
> >> Thanks,
> >> Anand
> >>
> >> Mihael Hategan wrote:
> >>> On Tue, 2007-09-11 at 00:09 -0500, Jing Tie wrote:
> >>>> Hi,
> >>>>
> >>>> Thanks! Is it possible that the status file was generated in an
> >>>> unexpected directory?
> >>> Very unlikely.
> >>>  DelegatedFileTransferHandler Exception in transfer
> >> org.globus.cog.abstraction.impl.file.FileResourceException: Exception in 
> >> getFile
> >>>> I ran the SID application on another site, atlas.dpcc.uta.edu
> >>>> (jobmanager-pbs), and it succeeded! But on site u2-grid.ccr.buffalo.edu
> >>>> (jobmanager-pbs), there was an execution error after task submission:
> >>>> "FileResourceCache Maximum idle time exceeded. Removing resource for
> >>>> gsiftp://u2-grid.ccr.buffalo.edu".
> >>> That's not an error. Idle GridFTP connections are removed from the cache
> >>> after a while. Your log shows simply that nothing is happening. 
> >>>
> >>> Mihael
> >>>
> >>>>  logs are attached (sid*.log ---
> >>>> u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu).
> >>>>
> >>>> Thanks,
> >>>> Jing
> >>>>
> >>>> On 9/10/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>>>> The wrapper produces exactly one status file: <jobid>-success or
> >>>>> <jobid>-error. If neither is present, it means either that the wrapper
> >>>>> didn't write one (very unlikely, and due to something weird I'm
> >>>>> missing), or that GridFTP on the head node doesn't see what the
> >>>>> wrapper has written.
> >>>>>
> >>>>> On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I think there is a problem running Swift scripts with jobmanager-condor
> >>>>>> on some OSG sites. I ran simple-wf.dtm (a very simple Swift script that
> >>>>>> copies the content of an input file to an output file) and the SID script
> >>>>>> on the GLOW site separately. Everything works when running with
> >>>>>> jobmanager-fork, but "exception in getFile" happened with
> >>>>>> jobmanager-condor. The log from the Swift client is attached. However,
> >>>>>> no log/info/output files were generated in the swift work cache, nor
> >>>>>> was any duplicate-*** directory, though in the log file the directory
> >>>>>> seemed to have been created.
> >>>>>>
> >>>>>> The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run
> >>>>>> globus-url-copy to copy files between OSG_DATA and OSG_WN_TMP.
> >>>>>>
> >>>>>> Exception:
> >>>>>> Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed
> >>>>>> Exception in getFile
> >>>>>> File transfer failed
> >>>>>> duplicate failed
> >>>>>> The following errors have occurred:
> >>>>>> 1. Application "duplicate" failed (No status file was found. Check the
> >>>>>> shared filesystem on GLOW)
> >>>>>>         Arguments: "simpleFile.txt"
> >>>>>>         Host: GLOW
> >>>>>>         Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi
> >>>>>>         STDERR:
> >>>>>>         STDOUT:
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jing
> >>>>>> _______________________________________________
> >>>>>> Swift-user mailing list
> >>>>>> Swift-user at ci.uchicago.edu
> >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> > 
> 



