[Swift-devel] Re: Fwd: Considering Running errors

Wed Sep 5 18:09:19 CDT 2007

Well, it looks like OSG makes what I would think would be a relatively
simple process complicated. Maybe I'm missing something.

However, there is one thing that we should address in Swift. And that is
the separation between storage and temporary job directories. We
currently make the assumption that the latter are one level below the
former. But I don't think this restriction is necessary, and I don't
think there are many places in the code that make that assumption.
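
A minimal sketch of what dropping that assumption could look like,
written in Python for brevity and not taken from the Swift code base.
The names SiteDirs, storage_dir and job_tmp_root, and all of the paths,
are made up for illustration. The only idea shown is that the job
directory root becomes its own setting and falls back to the storage
directory when it is not set:

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class SiteDirs:
    """Hypothetical per-site layout; not Swift's actual configuration."""
    storage_dir: PurePosixPath                     # where staged files live on the site
    job_tmp_root: Optional[PurePosixPath] = None   # where per-job scratch directories go

    def job_dir(self, job_id: str) -> PurePosixPath:
        # Fall back to the current assumption (one level below storage)
        # only when no separate temporary root was configured.
        root = self.job_tmp_root if self.job_tmp_root is not None else self.storage_dir
        return root / job_id


# Current assumption: the job directory sits directly under the storage directory.
nested = SiteDirs(PurePosixPath("/shared/swift/run1"))
# Separated layout: the job directory is on node-local scratch, unrelated to storage.
split = SiteDirs(PurePosixPath("/shared/swift/run1"), PurePosixPath("/scratch/tmp"))

print(nested.job_dir("job-0001"))  # /shared/swift/run1/job-0001
print(split.job_dir("job-0001"))   # /scratch/tmp/job-0001
```

With something like this, code that needs a job directory would ask for
it instead of building it from the storage path, so the few places that
hard-code the nesting are the only ones that have to change.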

As for the lack of a shared file system and the necessity to address
this using an intricate and seemingly unstable procedure that needs to
happen on the head node, I don't know what to say. It looks like OSG
could, in a best-case scenario, provide some form of service (not
necessarily in the TCP/IP/web service sense) with a well-defined
interface that allows these things to be done in a uniform way across
sites. Any ideas?

Mihael

On Wed, 2007-09-05 at 17:40 -0500, Jing Tie wrote: 
> Hi Mihael,
> 
> I am forwarding the email from the OSG troubleshooting group. I think a
> shared file system is not common on OSG sites.
> 
> Thanks,
> Jing
> 
> ---------- Forwarded message ----------
> From: Anand Padmanabhan <anand-padmanabhan-1 at uiowa.edu>
> Date: Aug 31, 2007 4:10 PM
> Subject: Re: Considering Running errors
> To: Jing Tie <tiejing at gmail.com>
> Cc: "Wang, Shaowen" <shaowen-wang at uiowa.edu>
> 
> 
> Hi Jing,
> Jing Tie wrote:
> > Hi Anand,
> >
> > I am sorry I can't explain why the GLOW site can run the application
> > successfully, but others can't.
> >
> > I am using VDL to describe SID application workflow, and SWIFT
> > generates the submit file itself. I have sent email to ask the swift
> > developer for the submit file that swift generated.
> I am not quite familiar with VDL or SWIFT. Please send me the submit
> files as soon as you hear from the developers. If possible, submit a
> simple Condor-G/Globus job that replicates what VDL/SWIFT does so that
> you can replicate the failure.
> >
> > For the "Exception on getFile" error, it says that the result files
> > under $data_dir cannot be opened. Is the $data_dir on a shared file
> > system?
> I did not know you were expecting a shared $data; the requirements on
> OSG_DATA are much more complicated. I guess if $OSG_DATA is defined
> there is a requirement that it should be accessible from the CE, but I
> don't think this means that it has to be a shared file system. In fact I
> believe $OSG_DATA can be UNAVAILABLE if OSG_SITE_READ and OSG_SITE_WRITE
> have been defined.
> https://twiki.grid.iu.edu/twiki/bin/view/Integration/ITB_0_7/LocalStorageRequirements
> should tell you what needs to be defined. I know a few sites don't have
> NFS mounted $DATA, but I know quite a few that do.
> E.g. 1. osg.hpcc.nd.edu has a mounted $OSG_DATA
> 2. FNAL_FERMIGRID has it local to the Worker Node
> 
> Also you should follow the same directory structure for DATA as you do
> for APP (i.e. copy data to $OSG_DATA/osg/jtie, this will avoid any
> potential conflicts). Do you do this?
> 
> The way you should access data from $OSG_DATA is to copy it to $WN_TMP
> (more than just cp, see best practices) before you start using it from
> worker nodes in your job. I don't know if this is what is causing the
> issue, but it could explain why it is working for fork and not for
> Condor jobs. Also it is good practice to copy to WN_TMP. The best
> practices guide we created for the use of an SE is available at
> https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/StorageElementBestPractices
> and should give you some useful pointers.
> 
> Let me know if you have questions.
> 
> Thanks
> Anand
> >
> > Thanks,
> > Jing
> >
> > On 8/31/07, Anand Padmanabhan <anand-padmanabhan-1 at uiowa.edu> wrote:
> >> Hi Jing,
> >>
> >> I don't think I have heard from you on this thread.
> >>
> >> Thanks
> >> Anand
> >>
> >> Anand Padmanabhan wrote:
> >>> Hi Jing,
> >>>
> >>> Yesterday in the troubleshooting meeting you mentioned that we should
> >>> concentrate on errors encountered during the running of your jobs. As
> >>> your document has identified, there are 3 types of errors you see at
> >>> run time. I think we need to try and understand them one by one.
> >>>
> >>> The first kind of error affects the following eight sites
> >>> "osg.hpcc.nd.edu, osg.rcac.purdue.edu, cmsgrid02.hep.wisc.edu,
> >>> fiupg.ampath.net, hg.ihepa.ufl.edu, pg.ihepa.ufl.edu, nest.phys.uwm.edu,
> >>> gate01.aglt2.org". You list all of them failing with "Application
> >>> exception: Job cwtsmall failed with an exit code of 1024". In the
> >>> reasons you list "All of them use condor job managers. The condor job
> >>> manager has a bug. It does not properly quote arguments. So some strange
> >>> things might happen like this if using it". How exactly did you figure
> >>> out the problem was a "quoting"/"bug in Condor" issue? You had success
> >>> on the site "cmsgrid01.hep.wisc.edu" which also uses the Condor JM, how
> >>> would you explain this? Do they have some sort of patched Condor, which
> >>> we can also release to other sites?
> >>>
> >>> Also, can you send me the JDL/Condor submit file that you are using to
> >>> submit your jobs? I would like to better understand the steps you are
> >>> taking so that we can determine if this is indeed an OSG issue. I
> >>> would also like to submit a few pilot jobs myself and see if I can
> >>> reproduce the errors.
> >>>
> >>> Thanks,
> >>> Anand
> >>>
>