[Swift-devel] swift problem?

Thu Mar 22 09:34:05 CDT 2007

Yes, the first 3 stages of this workflow worked just fine -- I have some 
results transferred back to my submit host (from the original run, not the 
restart). None of the results from stage 4 were transferred back (both from 
the original run and the restart run). Stage 4 has about 3500 jobs and this 
is when the queue got saturated. Stages 1-3 never had more than 50 jobs at
the same time... The workflow is still going...

Nika

At 09:18 AM 3/22/2007, Mihael Hategan wrote:
>On Thu, 2007-03-22 at 08:43 -0500, Veronika V. Nefedova wrote:
> > Ok, After I restarted the run, I have a similar behavior:
> > the queue got saturated at first with 384 jobs, but then the number started
> > to decline as the jobs get finished. I have now only 230 jobs (vs 384 max).
> > Another weird thing: I see that the jobs that finished - all 192 finished
> > successfully but one (the one has this error: forrtl: error (78): process
> > killed (SIGTERM) - probably it was killed for some reason). Anyway -- none
> > of the results of the finished jobs were transferred back to my submit host.
>
>That would indicate that Swift doesn't know that the jobs finished. Does
>a simple workflow still work on NCSA?
>
> >
> > I should probably just kill the whole thing and start it fresh - the
> > restart thing probably is not working properly (?). The only question is -
> > should I modify anything in the settings to produce more of the debug
> > output, etc ?
> >
> > Thanks,
> >
> > Nika
> >
> > At 06:14 PM 3/21/2007, Veronika V. Nefedova wrote:
> > >You want me to cancel the whole job and then restart it?
> > >
> > >At 05:37 PM 3/21/2007, Mihael Hategan wrote:
> > >>On Wed, 2007-03-21 at 17:32 -0500, Veronika V. Nefedova wrote:
> > >> > I am not sure what I should look for. I have several hundred GRAM
> > >> > logs -- I checked a few of them and they looked normal (all
> > >> > approximately the same size). I also didn't see any stderr in my
> > >> > outputs (usually when the job is killed you get some kind of GRAM
> > >> > and/or PBS error in stderr.txt file)...
> > >> >
> > >> > The number of jobs in the queue is decreasing
> > >>
> > >>The fact that the number of jobs in the queue is decreasing doesn't mean
> > >>that Swift knows about it.
> > >>Can you add
> > >>"log4j.logger.org.globus.cog.abstraction.impl.common.task.TaskImpl=DEBUG"
> > >>in log4j.properties and try it again?
> > >>
> > >>Mihael
> > >>
> > >> > -- i.e. the jobs are finishing and nothing new is submitted...
> > >> >
> > >> > Nika
> > >> >
> > >> > At 05:16 PM 3/21/2007, Mihael Hategan wrote:
> > >> > > I've never seen this error before, but it's coming from the GRAM
> > >> > > service. It's not the reason why more jobs were not submitted
> > >> > > properly, but it may be related to it. My guess is that something
> > >> > > happened on the server side that caused most jobs to not send
> > >> > > notifications and some (or one) to fail in that way, and Swift
> > >> > > thinks most of these jobs are still running.
> > >> > >
> > >> > > Did the jobs get killed? Do the GRAM logs give any details?
> > >> > >
> > >> > > Mihael
> > >> > >
> > >> > > On Wed, 2007-03-21 at 17:08 -0500, Veronika V. Nefedova wrote:
> > >> > > > I've submitted a big job to TG NCSA today. At some point it filled
> > >> > > > up the PBS queue completely - I had 384 jobs queued/running (that's
> > >> > > > the limit). And I know that I had many more jobs waiting on my
> > >> > > > local machine to be submitted to TG. Once the jobs started to leave
> > >> > > > the queue (i.e. were finished) - no more jobs were submitted. So I
> > >> > > > now have only 372 jobs in the queue while I should have 384. Any
> > >> > > > idea why this is happening?
> > >> > > >
> > >> > > > I checked my log on wiggum:
> > >> > > > /sandbox/ydeng/alamines/swift-MolDyn-free-final-c2eygeq2do861.log
> > >> > > >
> > >> > > > and found this error:
> > >> > > >
> > >> > > > 2007-03-21 15:51:35,963 INFO  vdl:execute2 Running job chrm_long-8qmvzv8i chrm_long with arguments [pstep:40000, prtfile:solv_chg_a3, system:solv_m018, stitle:m018, rtffile:parm03_gaff_all.rtf, paramfile:parm03_gaffnb_all.prm, gaff:m018_am1, vac:, restart:NONE, faster:off, rwater:15, chem:chem, minstep:0, rforce:0, ligcrd:lyz, stage:chg, urandseed:4212951, dirname:solv_chg_a3_m018] in swift-MolDyn-free-final-c2eygeq2do861/chrm_long-8qmvzv8i on TG-NCSA
> > >> > > > 2007-03-21 15:51:38,162 DEBUG vdl:execute2 Application exception: It is unknown if the job was submitted
> > >> > > >          task:execute @ vdl-int.k, line: 352
> > >> > > >          vdl:execute2 @ execute-default.k, line: 22
> > >> > > >          vdl:execute @ swift-MolDyn-free-final.kml, line: 142
> > >> > > >          charmm2 @ swift-MolDyn-free-final.kml, line: 155790
> > >> > > >          vdl:mains @ swift-MolDyn-free-final.kml, line: 122678
> > >> > > > Caused by: org.globus.gram.GramException: It is unknown if the job was submitted
> > >> > > >
> > >> > > > I am not sure if it's causing the job submission problems?
> > >> > > > I am using this swift code: /sandbox/nefedova/SWIFT/vdsk-0.1rc2
> > >> > > > (with some options tweaked in scheduler.xml and swift exec)
> > >> > > > Thanks!
> > >> > > >
> > >> > > > Nika
> > >> > > >
> > >> > > >
> > >> > > > _______________________________________________
> > >> > > > Swift-devel mailing list
> > >> > > > Swift-devel at ci.uchicago.edu
> > >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >> > > >
> > >> >
> >
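
The thread turns on whether Swift's view of the jobs matches what the PBS queue shows. One rough way to get Swift's side of the picture is to scan the Swift log for the two kinds of lines quoted above. The following is only a sketch: the patterns are inferred from the single log excerpt in this thread (and assume each log record sits on one line in the real file, unlike the mail-wrapped copy), and the function name summarize is made up for illustration, not part of Swift.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt quoted in this thread; this is an
# assumption, and other Swift/VDSK versions may format these lines differently.
RUNNING = re.compile(r"INFO\s+vdl:execute2 Running job (\S+)")
EXCEPTION = re.compile(r"DEBUG\s+vdl:execute2 Application exception: (.*)")


def summarize(lines):
    """Return (number of distinct jobs started, Counter of exception messages).

    `lines` is any iterable of log lines, e.g. an open file object.
    """
    started = set()
    errors = Counter()
    for line in lines:
        match = RUNNING.search(line)
        if match:
            # The first token after "Running job" is the job id.
            started.add(match.group(1))
            continue
        match = EXCEPTION.search(line)
        if match:
            errors[match.group(1).strip()] += 1
    return len(started), errors
```

To use it, open the Swift log file and pass the file object to summarize. Comparing the count of started jobs with the number of jobs PBS reports as queued, running or finished gives a first hint of whether notifications are being lost, and the exception counter shows how often messages such as "It is unknown if the job was submitted" occur.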