[Swift-user] Success with fork, but exception in getFile with condor

Anand Padmanabhan anand-padmanabhan-1 at uiowa.edu
Thu Sep 13 21:58:13 CDT 2007



> Every application is run by a wrapper on the worker node. When the
> application is done, the wrapper produces either an error file or a
> success file. It should always produce exactly one of the two (which one
> depends on whether the run was successful or not). This is on the worker
> node, and is assumed to be happening on a shared file system.
Thanks for your clarification. The shared file system requirement might 
be a problem at some sites, but Jing is seeing errors even on sites with 
a shared file system, so I will ignore that for the moment.
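
To make the contract above concrete, here is a minimal sketch of a wrapper that leaves exactly one status file per job. It is not Swift's actual wrapper: the function name `run_with_status_file` and the `status_dir` argument are hypothetical, and only the `<jobid>-success` / `<jobid>-error` naming comes from this thread.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_status_file(job_id, command, status_dir):
    """Run command and leave exactly one of <job_id>-success or <job_id>-error.

    Hypothetical helper for illustration; status_dir stands in for the job
    directory that is assumed to live on the shared file system.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True)
        succeeded = result.returncode == 0
        detail = result.stderr
    except OSError as exc:
        # The executable could not be started at all: still report an error.
        succeeded = False
        detail = str(exc)
    suffix = "success" if succeeded else "error"
    status_file = Path(status_dir) / f"{job_id}-{suffix}"
    status_file.write_text(detail)
    return status_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared_dir:
        status = run_with_status_file("demo", [sys.executable, "-c", "pass"], shared_dir)
        print(status.name)
```

If `status_dir` is really on a worker node's local disk rather than on the shared file system, the file is written but the head node never sees it, which would match the symptom reported here.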

> 
> After the job is done, Swift, from the comfort of the submit host,
> checks, through GridFTP, first whether the success file is there, and if
> not whether the error file is there. It finds none, which means that
> these files, although presumably written by the wrapper on the worker
> node, cannot be seen on the head node through GridFTP.
> 
> So it looks to me like there might be something wrong with the file
> system?
Are there any logs that Swift or the application writes on the server 
side that might record whether it had a problem writing these 
output/error files? Also, I know that on some Condor systems, job 
executables get dumped into a temporary directory on a worker node's 
local file system. Would this have any effect on Swift?
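
The submit-side check order Mihael describes (success file first, then error file, otherwise nothing found) can be sketched as below. This is an illustration only, not Swift's code: `find_status` is a hypothetical name, and the `exists` callable is a stand-in for whatever remote lookup (GridFTP in this thread) is actually used.

```python
def find_status(job_id, exists):
    """Return "success", "error", or "missing" for a finished job.

    exists is a caller-supplied function that takes a file name and returns
    True if that file is visible from the head node. It is an assumption of
    this sketch, standing in for a GridFTP existence check.
    """
    # Look for the success file first, as described in the thread.
    if exists(f"{job_id}-success"):
        return "success"
    # Only if there is no success file, look for the error file.
    if exists(f"{job_id}-error"):
        return "error"
    # Neither file is visible: the "No status file was found" case.
    return "missing"
```

For a quick local experiment, `exists` can be as simple as membership in a set of file names, for example `find_status("job1", {"job1-error"}.__contains__)`.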

I will try to set up a debug session with one of the site admins so we 
can trace in real time exactly what is happening on the WNs.

Thanks
Anand
> Mihael
> 
>> Thanks,
>> Anand
>>
>> Mihael Hategan wrote:
>>> On Tue, 2007-09-11 at 00:09 -0500, Jing Tie wrote:
>>>> Hi,
>>>>
>>>> Thanks! Is it possible that the status file was generated in an
>>>> unexpected directory?
>>> Very unlikely.
>>>  DelegatedFileTransferHandler Exception in transfer
>> org.globus.cog.abstraction.impl.file.FileResourceException: Exception in getFile
>>>> I ran the SID application on another site, atlas.dpcc.uta.edu
>>>> (jobmanager-pbs), and it succeeded! But on site u2-grid.ccr.buffalo.edu
>>>> (jobmanager-pbs), there was an execution error after task submission:
>>>> "FileResourceCache Maximum idle time exceeded. Removing resource for
>>>> gsiftp://u2-grid.ccr.buffalo.edu".
>>> That's not an error. Idle GridFTP connections are removed from the cache
>>> after a while. Your log shows simply that nothing is happening. 
>>>
>>> Mihael
>>>
>>>> The logs are attached (sid*.log ---
>>>> u2-grid.ccr.buffalo.edu, simple*.log --- cmsgrid01.hep.wisc.edu).
>>>>
>>>> Thanks,
>>>> Jing
>>>>
>>>> On 9/10/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>>> The wrapper produces exactly one status file: <jobid>-success or
>>>>> <jobid>-error. If neither is present, it means either that the wrapper
>>>>> didn't write either of them (very unlikely, and due to some weird thing
>>>>> I'm missing), or that GridFTP on the head node doesn't see what the
>>>>> wrapper has written.
>>>>>
>>>>> On Mon, 2007-09-10 at 16:42 -0500, Jing Tie wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I think there is a problem running Swift scripts with jobmanager-condor
>>>>>> on some OSG sites. I ran simple-wf.dtm (a very simple Swift script that
>>>>>> copies the content of an input file to an output file) and the SID
>>>>>> script separately on the GLOW site. Everything works with
>>>>>> jobmanager-fork, but "exception in getFile" occurs with
>>>>>> jobmanager-condor. The log from the Swift client is attached. However,
>>>>>> no log/info/output files were generated in the Swift work cache, nor
>>>>>> was any duplicate-*** directory, though according to the log file the
>>>>>> directory seemed to have been created.
>>>>>>
>>>>>> The site GLOW (cmsgrid01.hep.wisc.edu) can successfully run
>>>>>> globus-url-copy to copy files between OSG_DATA and OSG_WN_TMP.
>>>>>>
>>>>>> Exception:
>>>>>> Task(type=2, identity=urn:0-0-1189455037519) setting status to Failed
>>>>>> Exception in getFile
>>>>>> File transfer failed
>>>>>> duplicate failed
>>>>>> The following errors have occurred:
>>>>>> 1. Application "duplicate" failed (No status file was found. Check the
>>>>>> shared filesystem on GLOW)
>>>>>>         Arguments: "simpleFile.txt"
>>>>>>         Host: GLOW
>>>>>>         Directory: simple-wf-7l8vqstrkud90/duplicate-7niqt1hi
>>>>>>         STDERR:
>>>>>>         STDOUT:
>>>>>>
>>>>>> Thanks,
>>>>>> Jing
>>>>>> _______________________________________________
>>>>>> Swift-user mailing list
>>>>>> Swift-user at ci.uchicago.edu
>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> 


