[Swift-devel] Re: Fwd: Considering Running errors

Ben Clifford benc at hawaga.org.uk
Wed Sep 5 18:17:19 CDT 2007


The site model that VDS and Swift ended up with is based around having 
something like OSG_DATA available. I don't see enough detail in the model 
described below to tell what the problem is there.

As for removing the need for a shared file system on a site, we've talked 
about that in the past - I don't see anything new that is introduced by 
this.

On Wed, 5 Sep 2007, Mihael Hategan wrote:

> Well, it looks like OSG makes what I would think would be a relatively
> simple process complicated. Maybe I'm missing something.
> 
> However, there is one thing that we should address in Swift. And that is
> the separation between storage and temporary job directories. We
> currently make the assumption that the latter are one level below the
> former. But I don't think this restriction is necessary, and I don't
> think there are many places in the code that make that assumption.
> 
> As for the lack of a shared file system and the necessity to address
> this using an intricate and seemingly unstable procedure that needs to
> happen on the head node, I don't know what to say. It looks like OSG
> could, in a best case scenario, provide some form of service (not
> necessarily in the TCP/IP/web service sense) with a well defined
> interface that allows these things to be done in a uniform way across
> sites. Any ideas?
> 
> Mihael
> 
> On Wed, 2007-09-05 at 17:40 -0500, Jing Tie wrote: 
> > Hi Mihael,
> > 
> > I am forwarding the email from the OSG troubleshooting group. I think a
> > shared file system is not a common thing on OSG sites.
> > 
> > Thanks,
> > Jing
> > 
> > ---------- Forwarded message ----------
> > From: Anand Padmanabhan <anand-padmanabhan-1 at uiowa.edu>
> > Date: Aug 31, 2007 4:10 PM
> > Subject: Re: Considering Running errors
> > To: Jing Tie <tiejing at gmail.com>
> > Cc: "Wang, Shaowen" <shaowen-wang at uiowa.edu>
> > 
> > 
> > Hi Jing,
> > Jing Tie wrote:
> > > Hi Anand,
> > >
> > > I am sorry I can't explain why the GLOW site can run the application
> > > successfully, but others can't.
> > >
> > > I am using VDL to describe the SID application workflow, and SWIFT
> > > generates the submit file itself. I have emailed the Swift developers
> > > to ask for the submit file that Swift generated.
> > I am not quite familiar with VDL or SWIFT. Please send me the submit
> > files as soon as you hear from the developers. If possible, submit a
> > simple Condor-G/Globus job that replicates what VDL/SWIFT does so that
> > you can reproduce the failure.
> > >
> > > For the "Exception on getFile" error, it says that the result files
> > > under $data_dir cannot be opened. Is the $data_dir on a shared file
> > > system?
> > I did not know you were expecting a shared $data; the requirements on
> > OSG_DATA are much more complicated. I guess if $OSG_DATA is defined
> > there is a requirement that it should be accessible from the CE, but I
> > don't think this means that it has to be a shared file system. In fact I
> > believe $OSG_DATA can be UNAVAILABLE if OSG_SITE_READ and OSG_SITE_WRITE
> > have been defined.
> > https://twiki.grid.iu.edu/twiki/bin/view/Integration/ITB_0_7/LocalStorageRequirements
> > should give you what needs to be defined. I know a few sites don't have
> > NFS-mounted $DATA, but I know quite a few that do.
> > E.g. 1. osg.hpcc.nd.edu has a mounted $OSG_DATA
> > 2. FNAL_FERMIGRID has it local to the Worker Node
> > 
> > Also you should follow the same directory structure for DATA as you do
> > for APP (i.e. copy data to $OSG_DATA/osg/jtie, this will avoid any
> > potential conflicts). Do you do this?
> > 
> > The way you should access data from $OSG_DATA is to copy it to $WN_TMP
> > (more than just cp, see best practices) before you start using it from
> > worker nodes in your job. I don't know if this is what is causing the
> > issue, but it could explain why it is working for fork and not for
> > Condor jobs. Also it is good practice to copy to WN_TMP.  The best
> > practices guide we created for the use of an SE is available at
> > https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/StorageElementBestPractices
> > and should give you some useful pointers.
> > 
> > Let me know if you have questions.
> > 
> > Thanks
> > Anand
> > >
> > > Thanks,
> > > Jing
> > >
> > > On 8/31/07, Anand Padmanabhan <anand-padmanabhan-1 at uiowa.edu> wrote:
> > >> Hi Jing,
> > >>
> > >> I don't think I have heard from you on this thread.
> > >>
> > >> Thanks
> > >> Anand
> > >>
> > >> Anand Padmanabhan wrote:
> > >>> Hi Jing,
> > >>>
> > >>> Yesterday in the troubleshooting meeting you mentioned we should
> > >>> concentrate on errors encountered during the running of your jobs. As
> > >>> your document has identified, there are 3 types of errors you see at
> > >>> run time. I think we need to try and understand them one by one.
> > >>>
> > >>> The first kind of error affects the following eight sites
> > >>> "osg.hpcc.nd.edu, osg.rcac.purdue.edu, cmsgrid02.hep.wisc.edu,
> > >>> fiupg.ampath.net, hg.ihepa.ufl.edu, pg.ihepa.ufl.edu, nest.phys.uwm.edu,
> > >>> gate01.aglt2.org". You list all of them failing with "Application
> > >>> exception: Job cwtsmall failed with an exit code of 1024". In the
> > >>> reasons you list "All of them use condor job managers. The condor job
> > >>> manager has a bug. It does not properly quote arguments. So some strange
> > >>> things might happen like this if using it". How exactly did you figure
> > >>> out the problem was a "quoting"/"bug in Condor" issue? You had success
> > >>> on the site "cmsgrid01.hep.wisc.edu" which also uses the Condor JM, how
> > >>> would you explain this? Do they have some sort of patched Condor, which
> > >>> we can also release to other sites?
> > >>>
> > >>> Also, can you send me the JDL/Condor submit file that you are using to
> > >>> submit your jobs? I would like to better understand the steps you are
> > >>> taking so that we can determine if this is indeed an OSG issue. Also I
> > >>> would like to submit a few pilot jobs myself and see if I can reproduce
> > >>> the errors.
> > >>>
> > >>> Thanks,
> > >>> Anand
> > >>>
> > 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 
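
The staging advice in the quoted OSG message (copy inputs from $OSG_DATA
into $WN_TMP before a job touches them) can be sketched roughly as below.
This is only an illustration under stated assumptions, not what Swift or the
OSG best-practices guide actually does: the function names and the size check
are hypothetical, and the guide's "more than just cp" may well mean something
different from the simple verification shown here.

```python
import shutil
import tempfile
from pathlib import Path


def stage_to_scratch(src: Path, scratch_root: Path) -> Path:
    """Copy one input file into a fresh private directory under scratch_root.

    scratch_root stands in for $WN_TMP and src for a file under $OSG_DATA;
    both are passed in explicitly so nothing is assumed about the site.
    """
    # A per-job directory avoids collisions between jobs sharing a worker node.
    job_dir = Path(tempfile.mkdtemp(prefix="job-", dir=scratch_root))
    dst = job_dir / src.name
    shutil.copy2(src, dst)
    # Assumed minimal verification: make sure the copy is not truncated.
    if dst.stat().st_size != src.stat().st_size:
        raise IOError(f"incomplete copy of {src} to {dst}")
    return dst


def cleanup(staged: Path) -> None:
    """Remove the per-job scratch directory that holds the staged file."""
    shutil.rmtree(staged.parent, ignore_errors=True)
```

A job wrapper would call stage_to_scratch() with something like
Path(os.environ["OSG_DATA"]) / "osg" / "jtie" / "input.dat" (the file name
is made up) and Path(os.environ["WN_TMP"]), work on the returned local path,
and call cleanup() when done.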

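Mihael's point in the quoted message about separating storage from temporary
job directories could look something like the following. This is a
hypothetical sketch, not Swift's actual code: SiteLayout, storage_dir and
job_root are invented names. It only shows that the "job directory is one
level below storage" assumption can become a default rather than a
requirement.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class SiteLayout:
    """Hypothetical per-site directory settings."""

    storage_dir: PurePosixPath                 # e.g. somewhere under $OSG_DATA
    job_root: Optional[PurePosixPath] = None   # e.g. $WN_TMP; None = old layout

    def job_dir(self, job_id: str) -> PurePosixPath:
        # Old assumption: job directories sit one level below storage.
        # If job_root is set, the two locations are fully independent.
        root = self.job_root if self.job_root is not None else self.storage_dir
        return root / job_id
```

With only storage_dir set, job_dir("job-1") stays under the storage
directory as today; setting job_root moves job directories elsewhere
without touching where data is stored.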

