From wilde at mcs.anl.gov  Wed Apr 11 14:06:10 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Wed, 11 Apr 2007 14:06:10 -0500
Subject: [Swift-user] MolDyn workflow failing on GPFS at NCSA
In-Reply-To: <6.0.0.22.2.20070411131554.03a8be80@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070411131554.03a8be80@mail.mcs.anl.gov>
Message-ID: <461D31A2.4080403@mcs.anl.gov>

Hi all,

I'm moving this discussion to swift-devel. It concerns a problem that is
preventing Nika's MolDyn workflow from running on the NCSA TG IA64 cluster
under GPFS.

I've read the message dialog below with TG support, as well as Mihael's
comment in another message that this is hard to solve because it's sporadic
and seemingly load-related, and I have the following thoughts and questions.

It's important that we move forward on this workflow - it's the next step
with the MolDyn group. It's also important that we run correctly on GPFS. I
think that requiring non-GPFS filesystems is at best a short-term and
limiting solution, and it leaves a bug unresolved that may well turn out to
be in Swift, CoG, or Globus. So let's push forward and solve this one.

Here are my questions:

0) What do we know about the problem from tests to date? Is the problem (the
timeout?) due to not getting a job-completion notification (as it appears
from the log below), or a data-transfer timeout (as I got the possibly
mistaken impression from my discussion yesterday with Nika)?

1) As it's only been run at NCSA, I suggested yesterday that we take more
advantage of the UC TeraGrid node - it is less loaded, and we have very
capable, eager staff at Argonne to help us debug. It would also help us
better assess whether we are looking at a GPFS problem, a local NCSA
problem, a Globus problem, or a Swift problem. I don't see any conclusive
evidence yet that this problem is specifically in GPFS or that GPFS is
behaving incorrectly.

2) It might be worth a try to isolate the problem with simpler Swift code,
but that will take time, and we have no guarantee up front *how* much time -
it could be a long hunt. Do we have a clue as to what pattern to start with
in simpler Swift code? Could we dial the job duration of this run way down,
so that we can make the problem happen quicker, with less waiting? It's more
important that we can make it happen quickly and repeatedly than that we do
it with simpler code. The latter would be nice but does not seem to be a
necessity (yet).

3) Do we have enough logging turned on in the current runs to isolate the
problem a bit further? Do we need to insert additional logging?

4) Does this problem bear any relation to the pre-WS GRAM problems observed
on NFS sites with getting job-completion notifications back? These are,
e.g., problems that Jens got deeply into with VDS/Pegasus/DAGMan. He made
mods to the PBS and Condor job managers to reduce a suspected NFS-aggravated
race condition. Could there be a similar issue in the NCSA Globus PBS
jobmanager? (Note that Jens's issues were with NFS, but similar phenomena
could plague any shared filesystem that Globus depends on for completion
synchronization.) To my knowledge, this problem was never fully resolved.

5) Could we re-run this workflow using WS-GRAM?

6) I see notes in the thread below that say the problem happens only on
GPFS, and other notes that it happens also on the /home filesystem. What's
the current knowledge on this? Have you determined conclusively that this is
a GPFS-only problem?

Thanks,

Mike
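On point (2), one way to dial the duration way down without changing the
shape of the run is to replace the charmm invocations with a trivial app
that just sleeps. The sketch below is illustrative only - the "sleep"
transformation would have to be added to tc.data, the file names are made
up, and the syntax may need adjusting for the Swift version we're running -
but it keeps the 68-way concurrency of the failing stage while shrinking
each job to about two minutes:

type file;

// Stand-in for the 75-minute charmm jobs: it just sleeps for a short,
// configurable time and produces an (empty) output file via stdout
// redirection. "sleep" is assumed to be catalogued in tc.data, pointing
// at /bin/sleep on the NCSA nodes.
(file o) shortjob(string secs) {
    app {
        sleep secs stdout=@filename(o);
    }
}

// Keep the concurrency of the failing stage (68 simultaneous jobs), since
// the problem looks load-related; only the per-job duration shrinks.
foreach i in [1:68] {
    file out <single_file_mapper; file=@strcat("short.", i, ".out")>;
    out = shortjob("120");
}

Running that once with the work directory on /home and once on
/gpfs_scratch1 would tell us fairly quickly whether the missing 'done'
notifications reproduce with trivial jobs, which would take the application
and its data out of the picture.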
Veronika V. Nefedova wrote, on 4/11/2007 1:16 PM:

Mike, this is the thread that Mihael was involved with (with help at
teragrid).

Nika


Date: Mon, 9 Apr 2007 22:15:36 -0500
Subject: Re: workflow not working on GPFS
To: "Veronika V. Nefedova"
From: help at teragrid.org

FROM: McWilliams, David G
(Concerning ticket No. 138013)

Nika,

I asked the system administrator to create a directory for you. I will let
you know when it has been created.

Dave McWilliams (217) 244-1144   consult at ncsa.uiuc.edu
NCSA Consulting Services   http://www.ncsa.uiuc.edu/UserInfo/Consulting/
--------------------------------------------------------------------------

Veronika V. Nefedova writes:

I have a preliminary number: it's 121.7 MB per molecule. We would like to do
250 molecules at once (which is our goal), but for now it would be very good
to be able to do 50 molecules at a time. What do I need to do in order to
have access to that much disk space?

Thanks,

Nika

At 03:25 PM 4/9/2007, Veronika V. Nefedova wrote:

Yep, my main purpose for using GPFS was the disk space. And it's only
temporary disk space (for the duration of the job run), not for storage. Let
me try to estimate how much space I need (as I mentioned, my directories get
removed once the job finishes successfully).

Thanks!

Nika

At 03:15 PM 4/9/2007, help at teragrid.org wrote:

FROM: McWilliams, David G
(Concerning ticket No. 138013)

Nika,

Is it critical that you run out of /gpfs, or do you just need more space
than is available in your home directory? Another alternative would be to
request space in the /usr/projects directory, which is the same type of
filesystem as the home directory filesystem. If that would work, how much
disk space do you need?

Dave McWilliams (217) 244-1144   consult at ncsa.uiuc.edu
NCSA Consulting Services   http://www.ncsa.uiuc.edu/UserInfo/Consulting/
--------------------------------------------------------------------------

Veronika V. Nefedova writes:

Hi, Dave:

OK, I do not have a single successful run of my workflow on /gpfs. I ran the
same exact job on /home and it worked just fine. Today the third job of the
workflow failed because of the timeout (that's my guess; it never times out
on /home). Last time I tried on GPFS, that job finished just fine, and it
was in fact the next jobs in the workflow that failed. Something seems to be
wrong with GPFS.

Thanks,

Nika

At 05:06 PM 4/4/2007, Mihael Hategan wrote:

Nika reports that some jobs have finished correctly. This seems to indicate
that things are OK on the Globus side, and that it's a problem with the
specification.

Mihael

On Wed, 2007-04-04 at 17:03 -0500, Veronika V. Nefedova wrote:
Hi, Dave:

The workflow engine is waiting for the completion notification from Globus.
Now all my 68 jobs have finished, but no notification was sent back, so the
workflow doesn't submit any new jobs. I am Cc'ing Mihael, who might be able
to suggest what is going wrong here.

Could the problem be that telnet problem that happened earlier today? My
jobs were already submitted, but not finished yet (and I got kicked off the
login node).

Thanks for looking into this.

Nika

At 03:46 PM 4/4/2007, help at teragrid.org wrote:

FROM: McWilliams, David G
(Concerning ticket No. 138013)

Nika,

Thanks for the additional information. It is good to know that the problem
is not specific to jobs that use GPFS. How does the workflow engine know
that a job is done? Does the workflow engine wait for the globusrun command
to return, or is it looking in a file for the job status?

Dave McWilliams (217) 244-1144   consult at ncsa.uiuc.edu
NCSA Consulting Services   http://www.ncsa.uiuc.edu/UserInfo/Consulting/
--------------------------------------------------------------------------

Veronika V. Nefedova writes:

Actually, this current job is failing despite the workdir being set to
/home. I had 68 jobs (75 minutes each) in the queue. I see that I now have
19 running and 27 queued jobs, which means that 68-19-27=22 jobs have
already finished (the output from those jobs confirms it), but none of these
finished jobs are marked as 'done' on my side. I.e., GRAM didn't send the
completion info back to the submit side. So the problem is not unique to
GPFS...

Nika

At 02:00 PM 4/4/2007, Veronika V. Nefedova wrote:

Hi, Dave:

The jobs are submitted in the form of RSL via API calls, so I do not have
scripts per se. This is what I have in my log file:

2007-04-04 13:15:51,653 DEBUG JobSubmissionTaskHandler RSL:
&( directory = "/home/ac/nefedova/SWIFT/MolDyn-ty2oh8r2ki171" )
 ( arguments = "shared/wrapper.sh" "chrm_long-720q4k9i"
   "solv_repu_0.8_0.9_b0_m001.out" "stderr.txt" "solv.inp" ""
   "solv.inp parm03_gaffnb_all.prm parm03_gaff_all.rtf m001_am1.rtf
    m001_am1.prm solv_m001.psf solv_m001_eq.crd solv_repu_0.8_0.9_b0.prt"
   "solv_repu_0.8_0.9_b0_m001.wham solv_repu_0.8_0.9_b0_m001.crd
    solv_repu_0.8_0.9_b0_m001.out solv_repu_0.8_0.9_b0_m001_done" ""
   "/home/ac/yqdeng/c34a2/exec/altix/charmm"
   "pstep:40000" "prtfile:solv_repu_0.8_0.9_b0" "system:solv_m001"
   "stitle:m001" "rtffile:parm03_gaff_all.rtf"
   "paramfile:parm03_gaffnb_all.prm" "gaff:m001_am1" "vac:" "restart:NONE"
   "faster:off" "rwater:15" "chem:chem" "minstep:0" "rforce:0" "ligcrd:lyz"
   "stage:repu" "urandseed:2640378" "dirname:solv_repu_0.8_0.9_b0_m001"
   "rcut1:0.8" "rcut2:0.9" )
 ( executable = "/bin/sh" )
 ( maxwalltime = "75" )
 ( environment = ( "PATH" "/home/ac/nefedova/bin/tools/bin/:/usr/bin:/bin" ) )

The working directory is specified at the very beginning ( directory =
"/home/ac/nefedova/SWIFT/MolDyn-ty2oh8r2ki171" ). When I try to specify
/gpfs_scratch1/nefedova/bla, the 'finished' status doesn't come back to my
workflow engine. Right now I have the working version of the workflow
running (in /home).

Thank you very much for looking into this!

Nika

At 07:51 PM 4/3/2007, help at teragrid.org wrote:

FROM: McWilliams, David G
(Concerning ticket No. 138013)

Nika,

I would suggest that we start by comparing the PBS batch scripts that are
created for the jobs with both working directories. Does the workflow engine
report the PBS job IDs? If so, please send the job IDs for both job types.

Dave McWilliams (217) 244-1144   consult at ncsa.uiuc.edu
NCSA Consulting Services   http://www.ncsa.uiuc.edu/UserInfo/Consulting/
--------------------------------------------------------------------------

Veronika V. Nefedova writes:

Hi,

I am using the Swift workflow engine (http://www.ci.uchicago.edu/swift/) to
submit my jobs to TeraGrid/NCSA from my machine at ANL. I have some problems
getting my workflow to run on /gpfs on TeraGrid. When my working directory
is set to /home, the workflow works just fine. When I want to do a large run
I set my workdir to /gpfs (which has a lot of disk space), and then my
workflow stops working. The very beginning of the workflow (which has some
small jobs) works OK, but after the long jobs (66 min each) are finished,
the system never returns the 'done' status back to Swift (@ANL), so no new
jobs are submitted and the workflow stops. Again, exactly the same workflow
works fine when workdir is set to /home.

Thanks!
Nika

--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division, Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497   fax 630-252-1997