[Swift-devel] swift problem?
Mihael Hategan
hategan at mcs.anl.gov
Thu Mar 22 09:18:30 CDT 2007
On Thu, 2007-03-22 at 08:43 -0500, Veronika V. Nefedova wrote:
> OK, after I restarted the run, I see similar behavior:
> the queue was saturated at first with 384 jobs, but then the number started
> to decline as the jobs finished. I now have only 230 jobs (vs. 384 max).
> Another weird thing: of the 192 jobs that have finished, all finished
> successfully except one (that one has this error: forrtl: error (78): process
> killed (SIGTERM) - probably it was killed for some reason). Anyway -- none
> of the results of the finished jobs were transferred back to my submit host.
That would indicate that Swift doesn't know that the jobs finished. Does
a simple workflow still work on NCSA?
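
To get a rough picture of what Swift itself recorded, something like the
sketch below could be run against the Swift log. It assumes each log record
sits on a single line and has one of the two shapes quoted further down in
this thread ("Running job ..." and "Application exception: ..."). The
function name summarize is made up for illustration and is not part of
Swift; no other log message formats are assumed.

```python
import re
import sys
from collections import Counter

# Line shapes taken from the log excerpt quoted in this thread.
# Assumption: each record is on one physical line in the real log file.
RUNNING = re.compile(r"INFO\s+vdl:execute2 Running job (\S+)")
EXCEPTION = re.compile(r"DEBUG\s+vdl:execute2 Application exception: (.*)")


def summarize(lines):
    """Return (set of job ids Swift logged as started, Counter of exception messages)."""
    started = set()
    exceptions = Counter()
    for line in lines:
        match = RUNNING.search(line)
        if match:
            started.add(match.group(1))
            continue
        match = EXCEPTION.search(line)
        if match:
            exceptions[match.group(1).strip()] += 1
    return started, exceptions


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_log.py <path to the Swift .log file>
    with open(sys.argv[1]) as log_file:
        jobs, errors = summarize(log_file)
    print(f"jobs logged as started: {len(jobs)}")
    for message, count in errors.most_common():
        print(f"{count:6d}  {message}")
```

Comparing the number of started jobs with what the PBS queue shows only
says how many submissions Swift attempted; it does not show which jobs
Swift believes have completed.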
>
> I should probably just kill the whole thing and start it fresh - the
> restart is probably not working properly (?). The only question is -
> should I modify anything in the settings to produce more debug
> output, etc.?
>
> Thanks,
>
> Nika
>
> At 06:14 PM 3/21/2007, Veronika V. Nefedova wrote:
> >You want me to cancel the whole job and then restart it?
> >
> >At 05:37 PM 3/21/2007, Mihael Hategan wrote:
> >>On Wed, 2007-03-21 at 17:32 -0500, Veronika V. Nefedova wrote:
> >> > I am not sure what I should look for. I have several hundred GRAM
> >> > logs -- I checked a few of them and they looked normal (all
> >> > approximately the same size). I also didn't see any stderr in my
> >> > outputs (usually when a job is killed you get some kind of GRAM
> >> > and/or PBS error in the stderr.txt file)...
> >> >
> >> > The number of jobs in the queue is decreasing
> >>
> >>The fact that the number of jobs in the queue is decreasing doesn't mean
> >>that Swift knows about it.
> >>Can you add
> >>"log4j.logger.org.globus.cog.abstraction.impl.common.task.TaskImpl=DEBUG"
> >>in log4j.properties and try it again?
> >>
> >>Mihael
> >>
> >> > -- i.e. the jobs are finishing and nothing new is submitted...
> >> >
> >> > Nika
> >> >
> >> > At 05:16 PM 3/21/2007, Mihael Hategan wrote:
> >> > > I've never seen this error before, but it's coming from the GRAM
> >> > > service. It's not the reason why more jobs were not submitted properly,
> >> > > but it may be related to it. My guess is that something happened on the
> >> > > server side that caused most jobs to not send notifications and some (or
> >> > > one) to fail in that way, and Swift thinks most of these jobs are still
> >> > > running.
> >> > >
> >> > > Did the jobs get killed? Do the GRAM logs give any details?
> >> > >
> >> > > Mihael
> >> > >
> >> > > On Wed, 2007-03-21 at 17:08 -0500, Veronika V. Nefedova wrote:
> >> > > > I've submitted a big job to TG NCSA today. At some point it filled up the
> >> > > > PBS queue completely - I had 384 jobs queued/running (that's the limit). And
> >> > > > I know that I had many more jobs waiting on my local machine to be
> >> > > > submitted to TG. Once the jobs started to leave the queue (i.e. were
> >> > > > finished) - no more jobs were submitted. So I now have only 372 jobs in the
> >> > > > queue while I should have 384. Any idea why this is happening?
> >> > > >
> >> > > > I checked my log on wiggum:
> >> > > > /sandbox/ydeng/alamines/swift-MolDyn-free-final-c2eygeq2do861.log
> >> > > >
> >> > > > and found this error:
> >> > > >
> >> > > > 2007-03-21 15:51:35,963 INFO vdl:execute2 Running job chrm_long-8qmvzv8i
> >> > > > chrm_long with arguments [pstep:40000, prtfile:solv_chg_a3,
> >> > > > system:solv_m018, stitle:m018, rtffile:parm03_gaff_all.rtf,
> >> > > > paramfile:parm03_gaffnb_all.prm, gaff:m018_am1, vac:, restart:NONE,
> >> > > > faster:off, rwater:15, chem:chem, minstep:0, rforce:0, ligcrd:lyz,
> >> > > > stage:chg, urandseed:4212951, dirname:solv_chg_a3_m018] in
> >> > > > swift-MolDyn-free-final-c2eygeq2do861/chrm_long-8qmvzv8i on TG-NCSA
> >> > > > 2007-03-21 15:51:38,162 DEBUG vdl:execute2 Application exception:
> >> > > > It is unknown if the job was submitted
> >> > > > task:execute @ vdl-int.k, line: 352
> >> > > > vdl:execute2 @ execute-default.k, line: 22
> >> > > > vdl:execute @ swift-MolDyn-free-final.kml, line: 142
> >> > > > charmm2 @ swift-MolDyn-free-final.kml, line: 155790
> >> > > > vdl:mains @ swift-MolDyn-free-final.kml, line: 122678
> >> > > > Caused by: org.globus.gram.GramException:
> >> > > > It is unknown if the job was submitted
> >> > > >
> >> > > > I am not sure whether it's causing the job submission problems.
> >> > > > I am using this swift code: /sandbox/nefedova/SWIFT/vdsk-0.1rc2 (with some
> >> > > > options tweaked in scheduler.xml and swift exec)
> >> > > > Thanks!
> >> > > >
> >> > > > Nika
> >> > > >
> >> > > >
> >> > > > _______________________________________________
> >> > > > Swift-devel mailing list
> >> > > > Swift-devel at ci.uchicago.edu
> >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >> > > >
> >> >
> >
> >
>
>