[Swift-devel] swift problem?

Veronika V. Nefedova nefedova at mcs.anl.gov
Wed Mar 21 18:14:39 CDT 2007


Do you want me to cancel the whole job and then restart it?

At 05:37 PM 3/21/2007, Mihael Hategan wrote:
>On Wed, 2007-03-21 at 17:32 -0500, Veronika V. Nefedova wrote:
> > I am not sure what I should look for. I have several hundred GRAM
> > logs -- I checked a few of them and they looked normal (all
> > approximately the same size). I also didn't see any stderr in my
> > outputs (usually when a job is killed you get some kind of GRAM
> > and/or PBS error in the stderr.txt file)...
> >
> > The number of jobs in the queue is decreasing
>
>The fact that the number of jobs in the queue is decreasing doesn't mean
>that Swift knows about it.
>Can you add
>"log4j.logger.org.globus.cog.abstraction.impl.common.task.TaskImpl=DEBUG" 
>in log4j.properties and try it again?
>
>Mihael
>
> > -- i.e. the jobs are finishing and nothing new is submitted...
> >
> > Nika
> >
> > At 05:16 PM 3/21/2007, Mihael Hategan wrote:
> > > > I've never seen this error before, but it's coming from the GRAM
> > > > service. It's not the reason why more jobs were not submitted
> > > > properly, but it may be related to it. My guess is that something
> > > > happened on the server side that caused most jobs to not send
> > > > notifications and some (or one) to fail in that way, and Swift
> > > > thinks most of these jobs are still running.
> > >
> > > Did the jobs get killed? Do the GRAM logs give any details?
> > >
> > > Mihael
> > >
> > > On Wed, 2007-03-21 at 17:08 -0500, Veronika V. Nefedova wrote:
> > > > > I've submitted a big job to TG NCSA today. At some point it filled
> > > > > up the PBS queue completely - I had 384 jobs queued/running (that's
> > > > > the limit). And I know that I had many more jobs waiting on my
> > > > > local machine to be submitted to TG. Once the jobs started to leave
> > > > > the queue (i.e. were finished) - no more jobs were submitted. So I
> > > > > now have only 372 jobs in the queue while I should have 384. Any
> > > > > idea why this is happening?
> > > >
> > > > I checked my log on wiggum:
> > > > /sandbox/ydeng/alamines/swift-MolDyn-free-final-c2eygeq2do861.log
> > > >
> > > > and found this error:
> > > >
> > > > 2007-03-21 15:51:35,963 INFO  vdl:execute2 Running job
> > > chrm_long-8qmvzv8i
> > > > chrm_long with arguments [pstep:40000, prtfile:solv_chg_a3,
> > > > system:solv_m018, stitle:m018, rtffile:parm03_gaff_all.rtf,
> > > > paramfile:parm03_gaffnb_all.prm, gaff:m018_am1, vac:,
> > > restart:NONE,
> > > > faster:off, rwater:15, chem:chem, minstep:0, rforce:0,
> > > ligcrd:lyz,
> > > > stage:chg, urandseed:4212951, dirname:solv_chg_a3_m018] in
> > > > swift-MolDyn-free-final-c2eygeq2do861/chrm_long-8qmvzv8i on
> > > TG-NCSA
> > > > 2007-03-21 15:51:38,162 DEBUG vdl:execute2 Application exception:
> > > It is
> > > > unknown if the job was submitted
> > > >          task:execute @ vdl-int.k, line: 352
> > > >          vdl:execute2 @ execute-default.k, line: 22
> > > >          vdl:execute @ swift-MolDyn-free-final.kml, line: 142
> > > >          charmm2 @ swift-MolDyn-free-final.kml, line: 155790
> > > >          vdl:mains @ swift-MolDyn-free-final.kml, line: 122678
> > > > Caused by: org.globus.gram.GramException: It is unknown if the job
> > > was
> > > > submitted
> > > >
> > > > > I am not sure whether it's causing the job submission problems.
> > > > > I am using this swift code: /sandbox/nefedova/SWIFT/vdsk-0.1rc2
> > > > > (with some options tweaked in scheduler.xml and swift exec)
> > > > Thanks!
> > > >
> > > > Nika
> > > >
> > > >
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >
> >




