[Swift-devel] Re: Fwd: Considering Running errors

Wed Sep 5 18:27:18 CDT 2007

On Wed, 2007-09-05 at 23:17 +0000, Ben Clifford wrote:
> The site model that VDS and swift ended up with is based around having 
> something like OSG_DATA available. I don't see enough explicitly listed in 
> the below model to see what the problem is there.

What model are you referring to?

> 
> As for removing the need for a shared file system on a site, we've talked 
> about that in the past - I don't see anything new that is introduced by 
> this.

OSG specific mumbo-jumbo? Which would imply OSG specific implementation.
Which may very well be what we talked about before.

> 
> On Wed, 5 Sep 2007, Mihael Hategan wrote:
> 
> > Well, it looks like OSG makes what I would think would be a relatively
> > simple process complicated. Maybe I'm missing something.
> > 
> > However, there is one thing that we should address in Swift. And that is
> > the separation between storage and temporary job directories. We
> > currently make the assumption that the latter are one level below the
> > former. But I don't think this restriction is necessary, and I don't
> > think there are many places in the code that make that assumption.
> > 
> > As for the lack of a shared file system and the necessity to address
> > this using an intricate and seemingly unstable procedure that needs to
> > happen on the head node, I don't know what to say. It looks like OSG
> > could, in a best case scenario, provide some form of service (not
> > necessarily in the TCP/IP/web service sense) with a well defined
> > interface that allows these things to be done in a uniform way across
> > sites. Any ideas?
> > 
> > Mihael
> > 
> > On Wed, 2007-09-05 at 17:40 -0500, Jing Tie wrote: 
> > > Hi Mihael,
> > > 
> > > I am forwarding the email from the OSG troubleshooting group. I think a
> > > shared file system is not common on OSG sites.
> > > 
> > > Thanks,
> > > Jing
> > > 
> > > ---------- Forwarded message ----------
> > > From: Anand Padmanabhan <anand-padmanabhan-1 at uiowa.edu>
> > > Date: Aug 31, 2007 4:10 PM
> > > Subject: Re: Considering Running errors
> > > To: Jing Tie <tiejing at gmail.com>
> > > Cc: "Wang, Shaowen" <shaowen-wang at uiowa.edu>
> > > 
> > > 
> > > Hi Jing,
> > > Jing Tie wrote:
> > > > Hi Anand,
> > > >
> > > > I am sorry I can't explain why GLOW site can run the application
> > > > successfully, but others can't.
> > > >
> > > > I am using VDL to describe SID application workflow, and SWIFT
> > > > generates the submit file itself. I have sent email to ask the swift
> > > > developer for the submit file that swift generated.
> > > I am not quite familiar with VDL or SWIFT. Please send me the submit
> > > files as soon as you hear from the developers. If possible submit a
> > > simple Condor-G/Globus job that replicates what VDL/SWIFT does so that
> > > you can replicate the failure.
> > > >
> > > > For the "Exception on getFile" error, it says that the result files
> > > > under $data_dir cannot be opened. Is the $data_dir on a shared file
> > > > system?
> > > I did not know you were expecting a shared $data, the requirements on
> > > OSG_DATA are much more complicated. I guess if $OSG_DATA is defined
> > > there is a requirement that it should be accessible from the CE, but I
> > > don't think this means that it has to be a shared file system. In fact I
> > > believe $OSG_DATA can be UNAVAILABLE if OSG_SITE_READ and OSG_SITE_WRITE
> > > have been defined.
> > > https://twiki.grid.iu.edu/twiki/bin/view/Integration/ITB_0_7/LocalStorageRequirements
> > > should give you what needs to be defined. I know a few sites don't have
> > > NFS-mounted $DATA, but I know quite a few that do.
> > > E.g. 1. osg.hpcc.nd.edu has a mounted $OSG_DATA
> > > 2. FNAL_FERMIGRID has it local to the Worker Node
> > > 
> > > Also you should follow the same directory structure for DATA as you do
> > > for APP (i.e. copy data to $OSG_DATA/osg/jtie, this will avoid any
> > > potential conflicts). Do you do this?
> > > 
> > > The way you should access data from $OSG_DATA is to copy it to $WN_TMP
> > > (more than just cp, see best practices) before you start using it from
> > > worker nodes in your job. I don't know if this is what is causing the
> > > issue, but it could explain why it is working for fork and not for
> > > Condor jobs. Also it is good practice to copy to WN_TMP.  The best
> > > practices guide we created for the use of an SE is available at
> > > https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/StorageElementBestPractices
> > > and should give you some useful pointers.
> > > 
> > > Let me know if you have questions.
> > > 
> > > Thanks
> > > Anand
> > > >
> > > > Thanks,
> > > > Jing
> > > >
> > > > On 8/31/07, Anand Padmanabhan <anand-padmanabhan-1 at uiowa.edu> wrote:
> > > >> Hi Jing,
> > > >>
> > > >> I don't think I have heard from you on this thread.
> > > >>
> > > >> Thanks
> > > >> Anand
> > > >>
> > > >> Anand Padmanabhan wrote:
> > > >>> Hi Jing,
> > > >>>
> > > >>> Yesterday in the troubleshooting meeting you mentioned we should
> > > >>> concentrate on errors encountered during the running of your jobs. As
> > > >>> your document has identified we have 3 types of errors you see at run
> > > >>> time. I think we need to try and understand them one by one.
> > > >>>
> > > >>> The first kind of error affects the following eight sites
> > > >>> "osg.hpcc.nd.edu, osg.rcac.purdue.edu, cmsgrid02.hep.wisc.edu,
> > > >>> fiupg.ampath.net, hg.ihepa.ufl.edu, pg.ihepa.ufl.edu, nest.phys.uwm.edu,
> > > >>> gate01.aglt2.org". You list all of them failing with "Application
> > > >>> exception: Job cwtsmall failed with an exit code of 1024". In the
> > > >>> reasons you list "All of them use condor job managers. The condor job
> > > >>> manager has a bug. It does not properly quote arguments. So some strange
> > > >>> things might happen like this if using it". How exactly did you figure
> > > >>> out the problem was a "quoting"/"bug in Condor" issue? You had success
> > > >>> on the site "cmsgrid01.hep.wisc.edu" which also uses the Condor JM, how
> > > >>> would you explain this? Do they have some sort of patched Condor, which
> > > >>> we can also release to other sites?
> > > >>>
> > > >>> Also can you send me the JDL/Condor submit file that you are using to
> > > >>> submit your jobs. I would like to better understand the steps you are
> > > >>> taking so that we can determine if this is indeed an OSG issue. Also I
> > > >>> would like to submit a few pilot jobs myself and see if I can reproduce
> > > >>> the errors.
> > > >>>
> > > >>> Thanks,
> > > >>> Anand
> > > >>>
> > > 
> > 
>
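
For reference, the staging pattern described in the forwarded message (copy
input from $OSG_DATA to $WN_TMP before the job reads it) might look roughly
like the sketch below. This is only an illustration, not what Swift or the OSG
best practices guide actually does. The function names, the per-job
subdirectory, and the size check are assumptions made for the example; the
environment variable names OSG_DATA and WN_TMP are taken from the thread as
written.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_to_scratch(source: Path, scratch_root: Path) -> Path:
    """Copy `source` into a fresh per-job directory under `scratch_root`.

    Returns the path of the local copy. Raises OSError if the copy's
    size differs from the source.
    """
    # A unique subdirectory keeps concurrent jobs on one node apart
    # (an assumption of this sketch, not something the thread specifies).
    job_dir = Path(tempfile.mkdtemp(prefix="job-", dir=scratch_root))
    local_copy = job_dir / source.name
    shutil.copy2(source, local_copy)

    # Hypothetical sanity check: compare sizes so a truncated copy fails
    # here rather than inside the application.
    if local_copy.stat().st_size != source.stat().st_size:
        raise OSError(f"incomplete copy of {source} in {job_dir}")
    return local_copy


def stage_from_env(relative_name: str) -> Path:
    """Resolve the data and scratch roots from the environment.

    Assumes OSG_DATA and WN_TMP are set, using the names from the thread.
    """
    data_root = Path(os.environ["OSG_DATA"])
    scratch_root = Path(os.environ["WN_TMP"])
    return stage_to_scratch(data_root / relative_name, scratch_root)
```

A job wrapper could call stage_from_env("osg/jtie/input.dat") (a made-up
path following the $OSG_DATA/osg/jtie layout suggested above) and then hand
the returned local path to the application, so the application itself never
needs $OSG_DATA to be visible as a shared file system.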