[Swift-user] MolDyn workflow failing on GPFS at NCSA
Mike Wilde
wilde at mcs.anl.gov
Wed Apr 11 14:06:10 CDT 2007
Hi all,
I'm moving this discussion to swift-devel. This concerns a problem that is
preventing Nika's MolDyn workflow from running on the NCSA TG IA64 cluster
under GPFS.
I've read the message dialog below with TG support, as well as Mihael's comment
in another message that this is hard to solve because it's sporadic and
seemingly load-related, and I have the following thoughts and questions.
It's important that we move forward on this workflow - it's the next step with
the MolDyn group. It's also important that we run correctly on GPFS. I think
that requiring non-GPFS filesystems is at best a short-term and limiting
solution, and it leaves a bug unresolved that may well turn out to be in
Swift, CoG, or Globus.
So let's push forward and solve this one. Here are my questions:
0) What do we know about the problem from tests to date? Is the problem (the
timeout?) due to not getting a job-completion notification (as it appears from
the log below), or a data-transfer timeout (as I got the possibly mistaken
impression from my discussion yesterday with Nika)?
1) As it's only been run at NCSA, I suggested yesterday that we take more
advantage of the uc-teragrid node - it is less loaded, and we have very capable,
eager staff at Argonne to help us debug. It would also help us better assess
whether we are looking at a GPFS problem, a local NCSA problem, a Globus
problem, or a Swift problem. I don't see any conclusive evidence yet that this
problem is specifically in GPFS or that GPFS is behaving incorrectly.
2) It might be worth a try to isolate the problem with simpler Swift code, but
that will take time and we have no guarantee up front *how* much time. It could
be a long hunt. Do we have a clue as to what pattern to start with in simpler
Swift code? Could we dial the job duration of this run way down, so that we can
make the problem happen quicker, with less waiting? It's more important that we
can make it happen quickly and repeatedly than that we do it with simpler code.
The latter would be nice but seems not to be a necessity (yet); a rough sketch
of the kind of stress test I have in mind follows.
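Something like this - the same number of short dummy jobs pushed through the
same site entry and work directory - might reproduce the missing-notification
problem without waiting on 75-minute CHARMM runs. I'm writing the syntax from
memory, so treat it as approximate ("sleep" would also need its own tc.data
entry); it's only meant to show the shape of the test:

    type file;

    // one trivial job: sleep briefly, and let the stdout redirect create
    // the (empty) output file that the mapping expects
    (file o) shortjob() {
        app {
            sleep "30" stdout=@filename(o);
        }
    }

    // one output file per job, named by simple_mapper
    file outs[] <simple_mapper; prefix="short_", suffix=".out">;

    // 68 jobs, matching the MolDyn stage that loses its notifications
    foreach i in [1:68] {
        outs[i] = shortjob();
    }

If that hangs on GPFS but not on /home, we have a small, fast reproducer we
can hand to NCSA.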
3) Do we have enough logging turned on in the current runs to isolate the
problem a bit further? Do we need to insert additional logging? (A sketch of
the settings I have in mind follows.)
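Assuming the log4j.properties that ships with Swift is what controls this -
and the package names below are my best guess, so they may need adjusting for
the release Nika is running - turning up the GRAM client and task-handler
loggers should show whether the status callbacks ever arrive on the submit
side:

    # best-guess logger names - adjust to whatever the actual release uses
    log4j.logger.org.globus.gram=DEBUG
    log4j.logger.org.globus.cog.abstraction=DEBUG
    log4j.logger.org.griphyn.vdl.karajan=DEBUG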
4) Does this problem bear any relation to the pre-WS-GRAM problems we observed
in getting back job-completion notification on NFS sites? These are, e.g., the
problems that Jens got deeply into with VDS/Pegasus/DAGMan. He made mods to
the PBS and Condor job managers to reduce a suspected NFS-aggravated race
condition. Could there be a similar issue in the NCSA Globus PBS jobmanager?
(Note that Jens's issues were with NFS, but similar phenomena could plague any
shared FS that Globus depends on for completion synchronization.) This problem
was, to my knowledge, never fully resolved.
5) Could we re-run this workflow using WS-GRAM?
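If we try that, it would be worth first sanity-checking WS-GRAM + PBS on the
NCSA node by hand, outside of Swift. Something along these lines should do it
(the factory host below is a placeholder - substitute whatever contact the TG
info pages list for the IA64 cluster):

    globusrun-ws -submit -streaming -Ft PBS \
        -F https://<ncsa-ia64-headnode>:8443/wsrf/services/ManagedJobFactoryService \
        -c /bin/hostname

If that behaves under load, switching the Swift site entry over to the WS-GRAM
provider and re-running MolDyn would tell us whether the lost 'done' status is
specific to pre-WS GRAM.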
6) I see notes in the thread below that say the problem happens only on GPFS,
and other notes that it also happens on the /home filesystem. What's the
current knowledge on this? Have you determined conclusively that this is a
GPFS-only problem?
Thanks,
Mike
Veronika V. Nefedova wrote, On 4/11/2007 1:16 PM:
> Mike, This is the thread that Mihael was involved with (with help at teragrid)
>
> Nika
>
>
>> Date: Mon, 9 Apr 2007 22:15:36 -0500
>> Subject: Re: workflow not working on GPFS
>> To: "Veronika V. Nefedova" <nefedova at mcs.anl.gov>
>> From: help at teragrid.org
>> Cc:
>>
>> FROM: McWilliams, David G
>> (Concerning ticket No. 138013)
>>
>> Nika,
>>
>> I asked the system administrator to create a directory for you. I will
>> let you
>> know when its been created.
>>
>> Dave McWilliams (217) 244-1144 consult at ncsa.uiuc.edu
>> NCSA Consulting Services http://www.ncsa.uiuc.edu/UserInfo/Consulting/
>> --------------------------------------------------------------------------
>>
>>
>> Veronika V. Nefedova <help at teragrid.org> writes:
>> >I have some preliminary number. Its 121.7MB per molecule. We would like to
>> >do 250 molecules at once (which is our goal), but for now it would be very
>> >good to be able to do 50 molecules at a time.
>> >What do I need to do in order to have an access to such disk space ?
>> >
>> >Thanks,
>> >
>> >Nika
>> >
>> >At 03:25 PM 4/9/2007, Veronika V. Nefedova wrote:
>> >>yep, my main purpose for using gpfs was the disk space. And its only a
>> >>temporary disk space (for the duration of the job run time), not for
>> >>storage. Let me try to estimate how much space I need (as I mentioned - my
>> >>directories got removed once the job finishes successfully).
>> >>
>> >>Thanks!
>> >>
>> >>Nika
>> >>
>> >>At 03:15 PM 4/9/2007, help at teragrid.org wrote:
>> >>>FROM: McWilliams, David G
>> >>>(Concerning ticket No. 138013)
>> >>>
>> >>>Nika,
>> >>>
>> >>>Is it critical that you run out of /gpfs or do you just need more space
>> >>>than is available in your home directory? Another alternative would be to
>> >>>request space in the /usr/projects directory which is the same type
>> >>>filesystem as the home directory filesystem. If that would work, how much
>> >>>disk space do you need?
>> >>>
>> >>>Dave McWilliams (217) 244-1144 consult at ncsa.uiuc.edu
>> >>>NCSA Consulting Services http://www.ncsa.uiuc.edu/UserInfo/Consulting/
>> >>>--------------------------------------------------------------------------
>> >>>
>> >>>Veronika V. Nefedova <help at teragrid.org> writes:
>> >>> >Hi, Dave:
>> >>> >
>> >>> >OK, I do not have single successful run on /gpfs system of my workflow.
>> >>> >I ran the same exact job on /home and it worked just fine.
>> >>> >Today the 3-rd job of the workflow failed because of the timeout (thats
>> >>> >my guess, it never timeouts on /home). Last time I tried on gpfs that
>> >>> >job finished just fine and in fact that were the next jobs in the
>> >>> >workflow that failed. Something is wrong with gpfs it seems.
>> >>> >
>> >>> >Thanks,
>> >>> >
>> >>> >Nika
>> >>> >
>> >>> >
>> >>> >At 05:06 PM 4/4/2007, Mihael Hategan wrote:
>> >>> >>Nika reports that some jobs have finished correctly. This seems to
>> >>> >>indicate that things are OK on the Globus side, and that it's a problem
>> >>> >>with the specification.
>> >>> >>
>> >>> >>Mihael
>> >>> >>
>> >>> >>On Wed, 2007-04-04 at 17:03 -0500, Veronika V. Nefedova wrote:
>> >>> >> > Hi, Dave:
>> >>> >> >
>> >>> >> > the workflow engine is waiting for the completion notification from
>> >>> >> > Globus. Now all my 68 jobs have finished, but no notification was
>> >>> >> > sent back. So the workflow doesn't submit any new jobs. I am Cc
>> >>> >> > Mihael who might be able to suggest what is going on wrong here.
>> >>> >> >
>> >>> >> > Could the problem be that telnet problem that happened earlier
>> >>> >> > today ? My jobs were already submitted, but not finished yet. (and I
>> >>> >> > got kicked off from the login node)
>> >>> >> >
>> >>> >> > Thanks for looking into this.
>> >>> >> >
>> >>> >> > Nika
>> >>> >> >
>> >>> >> > At 03:46 PM 4/4/2007, help at teragrid.org wrote:
>> >>> >> > >FROM: McWilliams, David G
>> >>> >> > >(Concerning ticket No. 138013)
>> >>> >> > >
>> >>> >> > >Nika,
>> >>> >> > >
>> >>> >> > >Thanks for the additional information. It is good to know that the
>> >>> >> > >problem is not specific to jobs that use GPFS. How does the workflow
>> >>> >> > >engine know that the job is done? Does that workflow engine wait for
>> >>> >> > >the globusrun command to return or is it looking in a file for the
>> >>> >> > >job status?
>> >>> >> > >
>> >>> >> > >Dave McWilliams (217) 244-1144 consult at ncsa.uiuc.edu
>> >>> >> > >NCSA Consulting Services http://www.ncsa.uiuc.edu/UserInfo/Consulting/
>> >>> >> > >--------------------------------------------------------------------------
>> >>> >> > >
>> >>> >> > >Veronika V. Nefedova <help at teragrid.org> writes:
>> >>> >> > > >Actually, this current job is failing despite the workdir set to
>> >>> >> > > >/home. I had 68 jobs (75 mins each) in the queue. I see that I
>> >>> >> > > >have now 19 running and 27 queued jobs, which means that
>> >>> >> > > >68-19-27=22 jobs have already finished (the output from those jobs
>> >>> >> > > >confirms it) but none of these finished jobs are marked as 'done'
>> >>> >> > > >on my side. I.e. gram didn't send the completion info back to the
>> >>> >> > > >submit side. So the problem is not unique to gpfs...
>> >>> >> > > >
>> >>> >> > > >Nika
>> >>> >> > > >
>> >>> >> > > >At 02:00 PM 4/4/2007, Veronika V. Nefedova wrote:
>> >>> >> > > >>Hi, Dave:
>> >>> >> > > >>
>> >>> >> > > >>The jobs are submitted in the form of RSL via API calls. So I do
>> >>> >> > > >>not have scripts per se. This what I have in my log file:
>> >>> >> > > >>
>> >>> >> > > >>2007-04-04 13:15:51,653 DEBUG JobSubmissionTaskHandler RSL:
>> >>> >> > > >>&( directory = "/home/ac/nefedova/SWIFT/MolDyn-ty2oh8r2ki171" )
>> >>> >> > > >>( arguments = "shared/wrapper.sh" "chrm_long-720q4k9i"
>> >>> >> > > >>"solv_repu_0.8_0.9_b0_m001.out" "stderr.txt" "solv.inp" ""
>> >>> >> > > >>"solv.inp parm03_gaffnb_all.prm parm03_gaff_all.rtf m001_am1.rtf
>> >>> >> > > >>m001_am1.prm solv_m001.psf solv_m001_eq.crd solv_repu_0.8_0.9_b0.prt"
>> >>> >> > > >>"solv_repu_0.8_0.9_b0_m001.wham solv_repu_0.8_0.9_b0_m001.crd
>> >>> >> > > >>solv_repu_0.8_0.9_b0_m001.out solv_repu_0.8_0.9_b0_m001_done" ""
>> >>> >> > > >>"/home/ac/yqdeng/c34a2/exec/altix/charmm" "pstep:40000"
>> >>> >> > > >>"prtfile:solv_repu_0.8_0.9_b0" "system:solv_m001" "stitle:m001"
>> >>> >> > > >>"rtffile:parm03_gaff_all.rtf" "paramfile:parm03_gaffnb_all.prm"
>> >>> >> > > >>"gaff:m001_am1" "vac:" "restart:NONE" "faster:off" "rwater:15"
>> >>> >> > > >>"chem:chem" "minstep:0" "rforce:0" "ligcrd:lyz" "stage:repu"
>> >>> >> > > >>"urandseed:2640378" "dirname:solv_repu_0.8_0.9_b0_m001" "rcut1:0.8"
>> >>> >> > > >>"rcut2:0.9" )( executable = "/bin/sh" )( maxwalltime = "75" )
>> >>> >> > > >>( environment = ( "PATH" "/home/ac/nefedova/bin/tools/bin/:/usr/bin:/bin" ) )
>> >>> >> > > >>
>> >>> >> > > >>the working directory is specified at the very beginning
>> >>> >> > > >>( directory = "/home/ac/nefedova/SWIFT/MolDyn-ty2oh8r2ki171" ).
>> >>> >> > > >>When I try to specify the /gpfs_scratch1/nefedova/bla -- the
>> >>> >> > > >>'finished' status doesn't return back to my workflow engine.
>> >>> >> > > >>Right now I have the working version of the workflow running
>> >>> >> > > >>(in /home).
>> >>> >> > > >>
>> >>> >> > > >>Thank you very much for looking into this!
>> >>> >> > > >>
>> >>> >> > > >>Nika
>> >>> >> > > >>
>> >>> >> > > >>At 07:51 PM 4/3/2007, help at teragrid.org wrote:
>> >>> >> > > >>>FROM: McWilliams, David G
>> >>> >> > > >>>(Concerning ticket No. 138013)
>> >>> >> > > >>>
>> >>> >> > > >>>Nika,
>> >>> >> > > >>>
>> >>> >> > > >>>I would suggest that we start by comparing the PBS batch scripts
>> >>> >> > > >>>that are created for the jobs with both working directories.
>> >>> >> > > >>>Does the workflow engine report the PBS job IDs? If so, please
>> >>> >> > > >>>send the job IDs for both job types?
>> >>> >> > > >>>
>> >>> >> > > >>>Dave McWilliams (217) 244-1144 consult at ncsa.uiuc.edu
>> >>> >> > > >>>NCSA Consulting Services http://www.ncsa.uiuc.edu/UserInfo/Consulting/
>> >>> >> > > >>>--------------------------------------------------------------------------
>> >>> >> > > >>>
>> >>> >> > > >>>Veronika V. Nefedova <help at teragrid.org> writes:
>> >>> >> > > >>> >Hi,
>> >>> >> > > >>> >
>> >>> >> > > >>> >I am using Swift workflow engine (http://www.ci.uchicago.edu/swift/)
>> >>> >> > > >>> >to submit my jobs to Teragrid/NCSA from my machine at ANL.
>> >>> >> > > >>> >I have some problems with having my workflow ran on /gpfs on
>> >>> >> > > >>> >Teragrid. When my working directory is set to /home, the
>> >>> >> > > >>> >workflow works just fine. When I want to run a large run I set
>> >>> >> > > >>> >my workdir to /gpfs (which has a lot of disk space) and then
>> >>> >> > > >>> >my workflow stops working. The very beginning of the workflow
>> >>> >> > > >>> >(that has some small jobs) is working ok, but after the long
>> >>> >> > > >>> >jobs (66 min each) are finished the system never returns the
>> >>> >> > > >>> >'done' status back to Swift (@ANL) and thus no new jobs are
>> >>> >> > > >>> >submitted and the workflow stops. Again, exactly the same
>> >>> >> > > >>> >workflow works fine when workdir is set to /home.
>> >>> >> > > >>> >
>> >>> >> > > >>> >Thanks!
>> >>> >> > > >>> >
>> >>> >> > > >>> >Nika
>> >>> >> >
>> >>> >> >
>
>
>
--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997