[Swift-devel] Re: [Swift-user] MolDyn workflow failing on GPFS at NCSA

Mihael Hategan hategan at mcs.anl.gov
Wed Apr 11 14:45:33 CDT 2007


[moving to swift-devel]

On Wed, 2007-04-11 at 14:06 -0500, Mike Wilde wrote:
> Hi all,
> 
> I'm moving this discussion to swift-devel. This concerns a problem that is 
> preventing Nika's MolDyn workflow from running on the NCSA TG IA64 cluster 
> under GPFS.
> 
[...]
> So let's push forward and solve this one.  Here are my questions:
> 
> 0) What do we know about the problem from tests to date?  Is the problem (the 
> timeout?) due to not getting a job-completion notification (as it appears from 
> the log below) or a data transfer timeout (as I got the possibly mistaken 
> impression from my discussion yesterday with Nika)?
> 
> 1) As it's only been run at NCSA, I suggested yesterday that we take more 
> advantage of the uc-teragrid node - less loaded, and we have very capable 
> eager staff at Argonne to help us debug.  It would also help us better assess 
> whether we are looking at a GPFS problem, a local NCSA problem, a Globus 
> problem, or a Swift problem.  I don't see any conclusive evidence yet that this 
> problem is specifically in GPFS or that GPFS is behaving incorrectly.

Me neither. And I find it difficult to see what the file system has to
do with the outcome that we see. From the runs Nika did so far, it seems
that there is a correlation. I am however concerned about correlations
inferred from a low number of experiments.

> 
> 2) It might be worth a try to isolate the problem with simpler Swift code, but 
> that will take time and we have no guarantee up-front *how* much time.  Could 
> be a long hunt.  Do we have a clue as to what pattern to start with in simpler 
> Swift code?  Could we dial down the job duration of this run - way down - so 
> that we can make the problem happen quicker, with less waiting?  It's more 
> important that we can make it happen quickly and repeatedly than that we do it 
> with simpler code.  The latter would be nice but does not seem to be a necessity 
> (yet).

Sorry. I wrote that with the assumption that speed and reproducibility
are closely related to simpler code.
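
One cheap way to get speed and reproducibility without simplifying the
workflow might be to take swift out of the picture and submit a trivial
short job through the same pre-WS GRAM service twice, once with the
directory on /gpfs_scratch1 and once on /home, and see whether the final
status comes back in both cases. A rough sketch; the GRAM contact and user
paths below are placeholders, not the real NCSA values:

   # short job with the working directory on GPFS
   globusrun -r <ncsa-gram-contact>/jobmanager-pbs \
     '&(executable=/bin/sleep)(arguments=60)(directory=/gpfs_scratch1/<user>)(maxwalltime=2)'

   # same job with the working directory on /home, for comparison
   globusrun -r <ncsa-gram-contact>/jobmanager-pbs \
     '&(executable=/bin/sleep)(arguments=60)(directory=/home/ac/<user>)(maxwalltime=2)'

If the /gpfs variant hangs waiting for the final state while the /home
variant returns, that would give us a reproducer that takes minutes rather
than hours.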

> 
> 3) Do we have enough logging turned on in the current runs to isolate the 
> problem a bit further? Do we need to insert additional logging?

I don't know. From my analysis of the logs, I couldn't tell that there is
an abnormal problem. Other workflows finish successfully with similar log
patterns.

I have also instructed Nika to always run this with remote debugging
enabled. That way I can hook up to the running process and get details
that would otherwise be hard to see.
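
For reference, the standard JDWP options for that look roughly like the
sketch below; the elided part and the exact way the swift launcher passes
extra JVM options are assumptions about the local setup, not a recipe:

   # start the JVM that runs swift with a JDWP agent listening on port 8000
   java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 ...

   # attach later from another host with jdb (or an IDE debugger)
   jdb -connect com.sun.jdi.SocketAttach:hostname=<submit-host>,port=8000

With suspend=n the workflow runs normally, and a debugger only needs to be
attached when something looks stuck.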

> 
> 4) Does this problem bear any relation to observed pre-WS-GRAM problems in 
> getting back job completion notification on NFS sites?  These are, eg, 
> problems that Jens got deeply into with VDS/Pegasus/DAGman.  He made mods to 
> the PBS and Condor job managers to reduce a suspected NFS-aggravated 
> race condition.  Could there be a similar issue in the NCSA Globus PBS 
> jobmanager?  (Note that Jens's issues were with NFS, but similar phenomena 
> could plague any shared FS that globus depends on for completion 
> synchronization). This problem was, to my knowledge, never fully resolved.

It's all cool, but the job manager stores its files on the same
filesystem in both cases. GPFS vs. home only refers to the swift temporary
directories, which should be a separate concern (and this is what makes it
odd, since the expected behavior in case of problems would be missing-file
errors, not notification issues).
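
It could still be worth comparing what the job manager left behind for a
job whose notification never arrived. A sketch, run on the NCSA login node,
assuming the usual pre-WS GRAM defaults (per-job state under ~/.globus/job
and, when job manager logging is enabled, gram_job_mgr_*.log files in the
home directory):

   # per-job state kept by the pre-WS GRAM job manager, newest first
   ls -ltR $HOME/.globus/job/ | head -40

   # most recent job manager log, if job manager logging is enabled
   tail -100 "$(ls -t $HOME/gram_job_mgr_*.log | head -1)"

If the job manager recorded the job as finished but the callback never
reached the submit side, that points at the notification path rather than
at the filesystem the application writes to.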

> 
> 5) Could we re-run this workflow using WS-GRAM?

I think that would only complicate the issue since it's harder to debug
WS-GRAM than pre-WS GRAM.

> 
> 6) I see notes in the thread below that say the problem happens only on GPFS, 
> and other notes that it also happens on the /home filesystem.  What's the 
> current knowledge on this?  Have you determined conclusively that this is a 
> GPFS-only problem?

No. See above.

Mihael

> 
> Thanks,
> 
> Mike
> 
> 
> Veronika V. Nefedova wrote, On 4/11/2007 1:16 PM:
> > Mike, This is the thread that Mihael was involved with (with help at teragrid)
> > 
> > Nika
> > 
> > 
> >> Date: Mon, 9 Apr 2007 22:15:36 -0500
> >> Subject: Re: workflow not working on GPFS
> >> To: "Veronika  V. Nefedova" <nefedova at mcs.anl.gov>
> >> From: help at teragrid.org
> >> Cc:
> >>
> >> FROM: McWilliams, David G
> >> (Concerning ticket No. 138013)
> >>
> >> Nika,
> >>
> >> I asked the system administrator to create a directory for you. I will 
> >> let you
> >> know when it's been created.
> >>
> >> Dave McWilliams            (217) 244-1144          consult at ncsa.uiuc.edu
> >> NCSA Consulting Services    http://www.ncsa.uiuc.edu/UserInfo/Consulting/
> >> -------------------------------------------------------------------------- 
> >>
> >>
> >> Veronika V. Nefedova <help at teragrid.org> writes:
> >> >I have some preliminary numbers. It's 121.7 MB per molecule. We would
> >> >like to do 250 molecules at once (which is our goal), but for now it
> >> >would be very good to be able to do 50 molecules at a time.
> >> >What do I need to do in order to have access to such disk space?
> >> >
> >> >Thanks,
> >> >
> >> >Nika
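
At 121.7 MB per molecule that works out to roughly 6 GB for 50 molecules
and roughly 30 GB for the full 250:

   echo '121.7 * 50  / 1024' | bc -l    # ~ 5.9 GB
   echo '121.7 * 250 / 1024' | bc -l    # ~ 29.7 GB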
> >> >
> >> >At 03:25 PM 4/9/2007, Veronika  V. Nefedova wrote:
> >> >>Yep, my main purpose for using gpfs was the disk space. And it's only
> >> >>temporary disk space (for the duration of the job run time), not for
> >> >>storage. Let me try to estimate how much space I need (as I mentioned,
> >> >>my directories get removed once the job finishes successfully).
> >> >>
> >> >>Thanks!
> >> >>
> >> >>Nika
> >> >>
> >> >>At 03:15 PM 4/9/2007, help at teragrid.org wrote:
> >> >>>FROM: McWilliams, David G
> >> >>>(Concerning ticket No. 138013)
> >> >>>
> >> >>>Nika,
> >> >>>
> >> >>>Is it critical that you run out of /gpfs, or do you just need more space
> >> >>>than is available in your home directory? Another alternative would be to
> >> >>>request space in the /usr/projects directory, which is the same type of
> >> >>>filesystem as the home directory filesystem. If that would work, how much
> >> >>>disk space do you need?
> >> >>>
> >> >>>Dave McWilliams            (217) 244-1144          consult at ncsa.uiuc.edu
> >> >>>NCSA Consulting Services    http://www.ncsa.uiuc.edu/UserInfo/Consulting/
> >> >>>--------------------------------------------------------------------------
> >> >>>
> >> >>>Veronika V. Nefedova <help at teragrid.org> writes:
> >> >>> >Hi, Dave:
> >> >>> >
> >> >>> >OK, I do not have a single successful run of my workflow on the /gpfs
> >> >>> >filesystem. I ran the exact same job on /home and it worked just fine.
> >> >>> >Today the 3rd job of the workflow failed because of the timeout (that's
> >> >>> >my guess; it never times out on /home). Last time I tried on gpfs, that
> >> >>> >job finished just fine, and in fact it was the next jobs in the workflow
> >> >>> >that failed. Something is wrong with gpfs, it seems.
> >> >>> >
> >> >>> >Thanks,
> >> >>> >
> >> >>> >Nika
> >> >>> >
> >> >>> >
> >> >>> >At 05:06 PM 4/4/2007, Mihael Hategan wrote:
> >> >>> >>Nika reports that some jobs have finished correctly. This seems to
> >> >>> >>indicate that things are OK on the Globus side, and that it's a problem
> >> >>> >>with the specification.
> >> >>> >>
> >> >>> >>Mihael
> >> >>> >>
> >> >>> >>On Wed, 2007-04-04 at 17:03 -0500, Veronika V. Nefedova wrote:
> >> >>> >> > Hi, Dave:
> >> >>> >> >
> >> >>> >> > the workflow engine is waiting for the completion notification from
> >> >>> >> > Globus. Now all my 68 jobs have finished, but no notification was
> >> >>> >> > sent back, so the workflow doesn't submit any new jobs. I am CCing
> >> >>> >> > Mihael, who might be able to suggest what is going wrong here.
> >> >>> >> >
> >> >>> >> > Could the problem be the telnet problem that happened earlier today?
> >> >>> >> > My jobs were already submitted but not finished yet (and I got
> >> >>> >> > kicked off the login node).
> >> >>> >> >
> >> >>> >> > Thanks for looking into this.
> >> >>> >> >
> >> >>> >> > Nika
> >> >>> >> >
> >> >>> >> > At 03:46 PM 4/4/2007, help at teragrid.org wrote:
> >> >>> >> > >FROM: McWilliams, David G
> >> >>> >> > >(Concerning ticket No. 138013)
> >> >>> >> > >
> >> >>> >> > >Nika,
> >> >>> >> > >
> >> >>> >> > >Thanks for the additional information. It is good to know that the
> >> >>> >> > >problem is not specific to jobs that use GPFS. How does the workflow
> >> >>> >> > >engine know that the job is done? Does the workflow engine wait for
> >> >>> >> > >the globusrun command to return, or is it looking in a file for the
> >> >>> >> > >job status?
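
One quick check the next time this happens: if the GRAM job contact (the
https://... URL returned at submission) can be recovered from the swift
log, GRAM can be asked for the job state directly from the submit host. If
GRAM reports DONE while swift still thinks the job is active, the loss is
in the callback path rather than in PBS or the filesystem. A sketch; the
contact below is a placeholder:

   # <job-contact> is the https://host:port/id/... string GRAM returned at submission
   globus-job-status '<job-contact>'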
> >> >>> >> > >
> >> >>> >> > >Dave McWilliams            (217) 244-1144          consult at ncsa.uiuc.edu
> >> >>> >> > >NCSA Consulting Services    http://www.ncsa.uiuc.edu/UserInfo/Consulting/
> >> >>> >> > >--------------------------------------------------------------------------
> >> >>> >> > >
> >> >>> >> > >Veronika V. Nefedova <help at teragrid.org> writes:
> >> >>> >> > > >Actually, this current job is failing despite the workdir being
> >> >>> >> > > >set to /home. I had 68 jobs (75 mins each) in the queue. I see
> >> >>> >> > > >that I now have 19 running and 27 queued jobs, which means that
> >> >>> >> > > >68-19-27=22 jobs have already finished (the output from those jobs
> >> >>> >> > > >confirms it), but none of these finished jobs are marked as 'done'
> >> >>> >> > > >on my side. I.e., gram didn't send the completion info back to the
> >> >>> >> > > >submit side. So the problem is not unique to gpfs...
> >> >>> >> > > >
> >> >>> >> > > >Nika
> >> >>> >> > > >
> >> >>> >> > > >At 02:00 PM 4/4/2007, Veronika  V. Nefedova wrote:
> >> >>> >> > > >>Hi, Dave:
> >> >>> >> > > >>
> >> >>> >> > > >>The jobs are submitted in the form of RSL via API calls, so I do
> >> >>> >> > > >>not have scripts per se. This is what I have in my log file:
> >> >>> >> > > >>
> >> >>> >> > > >>2007-04-04 13:15:51,653 DEBUG JobSubmissionTaskHandler RSL:
> >> >>> >> > > >>&( directory = "/home/ac/nefedova/SWIFT/MolDyn-ty2oh8r2ki171" )
> >> >>> >> > > >>( arguments = "shared/wrapper.sh" "chrm_long-720q4k9i"
> >> >>> >> > > >>"solv_repu_0.8_0.9_b0_m001.out" "stderr.txt" "solv.inp" ""
> >> >>> >> > > >>"solv.inp parm03_gaffnb_all.prm parm03_gaff_all.rtf m001_am1.rtf
> >> >>> >> > > >>m001_am1.prm solv_m001.psf solv_m001_eq.crd solv_repu_0.8_0.9_b0.prt"
> >> >>> >> > > >>"solv_repu_0.8_0.9_b0_m001.wham solv_repu_0.8_0.9_b0_m001.crd
> >> >>> >> > > >>solv_repu_0.8_0.9_b0_m001.out solv_repu_0.8_0.9_b0_m001_done" ""
> >> >>> >> > > >>"/home/ac/yqdeng/c34a2/exec/altix/charmm" "pstep:40000"
> >> >>> >> > > >>"prtfile:solv_repu_0.8_0.9_b0" "system:solv_m001" "stitle:m001"
> >> >>> >> > > >>"rtffile:parm03_gaff_all.rtf" "paramfile:parm03_gaffnb_all.prm"
> >> >>> >> > > >>"gaff:m001_am1" "vac:" "restart:NONE" "faster:off" "rwater:15"
> >> >>> >> > > >>"chem:chem" "minstep:0" "rforce:0" "ligcrd:lyz" "stage:repu"
> >> >>> >> > > >>"urandseed:2640378" "dirname:solv_repu_0.8_0.9_b0_m001" "rcut1:0.8"
> >> >>> >> > > >>"rcut2:0.9" )( executable = "/bin/sh" )( maxwalltime = "75" )
> >> >>> >> > > >>( environment = ( "PATH" "/home/ac/nefedova/bin/tools/bin/:/usr/bin:/bin" ) )
> >> >>> >> > > >>
> >> >>> >> > > >>The working directory is specified at the very beginning
> >> >>> >> > > >>( directory = "/home/ac/nefedova/SWIFT/MolDyn-ty2oh8r2ki171" ).
> >> >>> >> > > >>When I try to specify /gpfs_scratch1/nefedova/bla, the 'finished'
> >> >>> >> > > >>status doesn't come back to my workflow engine. Right now I have
> >> >>> >> > > >>the working version of the workflow running (in /home).
> >> >>> >> > > >>
> >> >>> >> > > >>Thank you very much for looking into this!
> >> >>> >> > > >>
> >> >>> >> > > >>Nika
> >> >>> >> > > >>
> >> >>> >> > > >>At 07:51 PM 4/3/2007, help at teragrid.org wrote:
> >> >>> >> > > >>>FROM: McWilliams, David G
> >> >>> >> > > >>>(Concerning ticket No. 138013)
> >> >>> >> > > >>>
> >> >>> >> > > >>>Nika,
> >> >>> >> > > >>>
> >> >>> >> > > >>>I would suggest that we start by comparing the PBS batch scripts
> >> >>> >> > > >>>that are created for the jobs with both working directories. Does
> >> >>> >> > > >>>the workflow engine report the PBS job IDs? If so, please send the
> >> >>> >> > > >>>job IDs for both job types.
> >> >>> >> > > >>>
> >> >>> >> > > >>>Dave McWilliams            (217) 244-1144          consult at ncsa.uiuc.edu
> >> >>> >> > > >>>NCSA Consulting Services    http://www.ncsa.uiuc.edu/UserInfo/Consulting/
> >> >>> >> > > >>>--------------------------------------------------------------------------
> >> >>> >> > > >>>
> >> >>> >> > > >>>Veronika V. Nefedova <help at teragrid.org> writes:
> >> >>> >> > > >>> >Hi,
> >> >>> >> > > >>> >
> >> >>> >> > > >>> >I am using the Swift workflow engine
> >> >>> >> > > >>> >(http://www.ci.uchicago.edu/swift/) to submit my jobs to
> >> >>> >> > > >>> >Teragrid/NCSA from my machine at ANL. I have some problems
> >> >>> >> > > >>> >running my workflow on /gpfs on Teragrid. When my working
> >> >>> >> > > >>> >directory is set to /home, the workflow works just fine. When I
> >> >>> >> > > >>> >want to do a large run I set my workdir to /gpfs (which has a
> >> >>> >> > > >>> >lot of disk space), and then my workflow stops working. The very
> >> >>> >> > > >>> >beginning of the workflow (which has some small jobs) works OK,
> >> >>> >> > > >>> >but after the long jobs (66 min each) are finished the system
> >> >>> >> > > >>> >never returns the 'done' status back to Swift (@ANL), and thus
> >> >>> >> > > >>> >no new jobs are submitted and the workflow stops. Again, exactly
> >> >>> >> > > >>> >the same workflow works fine when the workdir is set to /home.
> >> >>> >> > > >>> >
> >> >>> >> > > >>> >Thanks!
> >> >>> >> > > >>> >
> >> >>> >> > > >>> >Nika
> >> >>> >> >
> >> >>> >> >
> > 
> > 
> > 
> 




More information about the Swift-devel mailing list