[Swift-devel] swift problem?

Veronika V. Nefedova nefedova at mcs.anl.gov
Thu Mar 22 08:43:55 CDT 2007


OK, after I restarted the run, I see similar behavior: the queue was
saturated at first with 384 jobs, but then the number started to decline
as jobs finished. I now have only 230 jobs (vs. the 384 maximum).
Another odd thing: of the 192 jobs that have finished, all completed
successfully except one (that one has this error: forrtl: error (78):
process killed (SIGTERM); it was probably killed for some reason). In any
case, none of the results of the finished jobs were transferred back to my
submit host.

I should probably just kill the whole thing and start it fresh; the
restart feature is probably not working properly (?). The only question is:
should I modify anything in the settings to produce more debug output,
etc.?
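
While waiting for more detailed logging, one rough way to compare what
Swift believes against what the PBS queue shows is to tally the existing
log. The following is only a minimal sketch: the function name
summarize_log is hypothetical, the two patterns are taken from the log
excerpt quoted further down in this thread, and it assumes each log entry
sits on a single line in the actual log file (the excerpt below was wrapped
by the mail client). Other Swift versions may word these messages
differently.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt quoted in this thread (assumption:
# the real log file keeps each entry on one line).
RUNNING = re.compile(r"vdl:execute2 Running job (\S+)")
EXCEPTION = re.compile(r"vdl:execute2 Application exception: (.*)")


def summarize_log(lines):
    """Return (number of distinct jobs started, Counter of exception messages)."""
    started = set()
    exceptions = Counter()
    for line in lines:
        match = RUNNING.search(line)
        if match:
            # The first token after "Running job" is the job id.
            started.add(match.group(1))
            continue
        match = EXCEPTION.search(line)
        if match:
            exceptions[match.group(1).strip()] += 1
    return len(started), exceptions
```

Passing an open log file to summarize_log (for example the .log file
mentioned below) gives the number of jobs Swift started and how often each
exception message occurred; these can then be compared by hand with the
job count reported by the queue.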

Thanks,

Nika

At 06:14 PM 3/21/2007, Veronika V. Nefedova wrote:
>You want me to cancel the whole job and then restart it?
>
>At 05:37 PM 3/21/2007, Mihael Hategan wrote:
>>On Wed, 2007-03-21 at 17:32 -0500, Veronika V. Nefedova wrote:
>> > I am not sure what I should look for. I have several hundred GRAM
>> > logs. I checked a few of them and they looked normal (all
>> > approximately the same size). I also didn't see any stderr in my
>> > outputs (usually when a job is killed you get some kind of GRAM
>> > and/or PBS error in the stderr.txt file)...
>> >
>> > The number of jobs in the queue is decreasing
>>
>>The fact that the number of jobs in the queue is decreasing doesn't mean
>>that Swift knows about it.
>>Can you add
>>"log4j.logger.org.globus.cog.abstraction.impl.common.task.TaskImpl=DEBUG" 
>>in log4j.properties and try it again?
>>
>>Mihael
>>
>> > -- i.e. the jobs are finishing and nothing new is submitted...
>> >
>> > Nika
>> >
>> > At 05:16 PM 3/21/2007, Mihael Hategan wrote:
>> > > I've never seen this error before, but it's coming from the GRAM
>> > > service. It's not the reason why more jobs were not submitted
>> > > properly, but it may be related to it. My guess is that something
>> > > happened on the server side that caused most jobs to not send
>> > > notifications and some (or one) to fail in that way, and Swift
>> > > thinks most of these jobs are still running.
>> > >
>> > > Did the jobs get killed? Do the GRAM logs give any details?
>> > >
>> > > Mihael
>> > >
>> > > On Wed, 2007-03-21 at 17:08 -0500, Veronika V. Nefedova wrote:
>> > > > I've submitted a big job to TG NCSA today. At some point it filled up
>> > > > the PBS queue completely: I had 384 jobs queued/running (that's the
>> > > > limit). And I know that I had many more jobs waiting on my local
>> > > > machine to be submitted to TG. Once the jobs started to leave the
>> > > > queue (i.e. were finished), no more jobs were submitted. So I now have
>> > > > only 372 jobs in the queue while I should have 384. Any ideas why this
>> > > > is happening?
>> > > >
>> > > > I checked my log on wiggum:
>> > > > /sandbox/ydeng/alamines/swift-MolDyn-free-final-c2eygeq2do861.log
>> > > >
>> > > > and found this error:
>> > > >
>> > > > 2007-03-21 15:51:35,963 INFO  vdl:execute2 Running job chrm_long-8qmvzv8i
>> > > > chrm_long with arguments [pstep:40000, prtfile:solv_chg_a3,
>> > > > system:solv_m018, stitle:m018, rtffile:parm03_gaff_all.rtf,
>> > > > paramfile:parm03_gaffnb_all.prm, gaff:m018_am1, vac:, restart:NONE,
>> > > > faster:off, rwater:15, chem:chem, minstep:0, rforce:0, ligcrd:lyz,
>> > > > stage:chg, urandseed:4212951, dirname:solv_chg_a3_m018] in
>> > > > swift-MolDyn-free-final-c2eygeq2do861/chrm_long-8qmvzv8i on TG-NCSA
>> > > > 2007-03-21 15:51:38,162 DEBUG vdl:execute2 Application exception: It is
>> > > > unknown if the job was submitted
>> > > >          task:execute @ vdl-int.k, line: 352
>> > > >          vdl:execute2 @ execute-default.k, line: 22
>> > > >          vdl:execute @ swift-MolDyn-free-final.kml, line: 142
>> > > >          charmm2 @ swift-MolDyn-free-final.kml, line: 155790
>> > > >          vdl:mains @ swift-MolDyn-free-final.kml, line: 122678
>> > > > Caused by: org.globus.gram.GramException: It is unknown if the job was
>> > > > submitted
>> > > >
>> > > > I am not sure whether it's causing the job submission problems.
>> > > > I am using this Swift code: /sandbox/nefedova/SWIFT/vdsk-0.1rc2 (with
>> > > > some options tweaked in scheduler.xml and swift exec).
>> > > > Thanks!
>> > > >
>> > > > Nika
>> > > >
>> > > >
>> > > > _______________________________________________
>> > > > Swift-devel mailing list
>> > > > Swift-devel at ci.uchicago.edu
>> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>> > > >
>> >




