[Swift-devel] Standard Swift coaster behavior doesn't work well for sporadic jobs

Michael Wilde wilde at mcs.anl.gov
Tue Oct 12 10:26:25 CDT 2010


I forgot to say: I will retest on an unmodified trunk to see if this problem is innate in the latest code or introduced by my mod.
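
To make the comparison between the two runs easier, here is a minimal sketch (not part of Swift or the coaster code) that pairs each SUBMITJOB line in a coaster log with the qsz value from the next BlockQueueProcessor status line. It assumes the log line format shown in the excerpts quoted below, read from the unwrapped log file rather than from this mail; the function name queue_after_submits and both regular expressions are my own illustration.

```python
import re

# Timestamp prefix as it appears in the log lines quoted in this thread
# (assumed format: "YYYY-MM-DD HH:MM:SS,mmm-ZZZZ INFO  ...").
_TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}[-+]\d{4}\s+INFO\s+"

# A client-side "Sending Command(n, SUBMITJOB)" line.
SUBMIT_RE = re.compile(_TS + r"Command\s+Sending\s+Command\(\d+,\s*SUBMITJOB\)")

# A BlockQueueProcessor status line; only qsz is captured.
QUEUE_RE = re.compile(
    _TS + r"BlockQueueProcessor\s+allocsize\s*=\s*[\d.]+,"
    r"\s*queuedsize\s*=\s*[\d.]+,\s*qsz\s*=\s*(\d+)"
)


def queue_after_submits(log_text):
    """Return a list of (submit_timestamp, qsz) pairs.

    qsz is taken from the first BlockQueueProcessor status line that
    follows the SUBMITJOB line, or None if no status line follows.
    """
    events = []
    for m in SUBMIT_RE.finditer(log_text):
        events.append((m.start(), "submit", m.group(1), None))
    for m in QUEUE_RE.finditer(log_text):
        events.append((m.start(), "queue", m.group(1), int(m.group(2))))
    # Order events by their position in the log text.
    events.sort(key=lambda e: e[0])

    results = []
    pending = []  # submits still waiting for a status line
    for _, kind, ts, qsz in events:
        if kind == "submit":
            pending.append(ts)
        else:
            results.extend((t, qsz) for t in pending)
            pending = []
    # Submits at the end of the log with no later status line.
    results.extend((t, None) for t in pending)
    return results
```

Calling queue_after_submits(open(path).read()) on a log would list every SUBMITJOB whose following status line reports qsz = 0, which is the symptom described in the quoted message. Note that a job can produce more than one SUBMITJOB line (one per channel), so the pairs are per log line, not per job.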

- Mike

----- wilde at mcs.anl.gov wrote:

> I see:
> 
> 2010-10-11 14:54:35,010-0500 INFO  CoasterService Started coaster service: http://192.5.86.5:34445
> 2010-10-11 14:54:35,021-0500 INFO  Command Sending Command(2, SUBMITJOB) on null[244757954: {}]
> 2010-10-11 14:54:35,080-0500 INFO  BlockQueueProcessor allocsize = 0.0, queuedsize = 1.0660596665516473, qsz = 1
> 2010-10-11 14:54:35,081-0500 INFO  BlockQueueProcessor Requeued 1 non-fitting jobs
> 2010-10-11 14:54:35,082-0500 INFO  BlockQueueProcessor
> Settings {
>         slots = 32
>         workersPerNode = 1
>    
> 
> but for the second job, I see:
> 
> 2010-10-11 14:59:55,200-0500 INFO  Command Sending Command(3, SUBMITJOB) on null[244757954: {}]
> 2010-10-11 14:59:55,224-0500 INFO  Cpu 1011-540235-000000:0 pull
> 2010-10-11 14:59:55,225-0500 INFO  Cpu 1011-540235-000000:0 submitting urn:1286826874044-1286826874046-1286826874047
> 2010-10-11 14:59:55,226-0500 INFO  Command Sending Command(3, SUBMITJOB) on SC-1011-540235-000000-000000
> 2010-10-11 14:59:55,226-0500 INFO  AbstractStreamKarajanChannel Sender 390276053 queue size: 0
> 2010-10-11 14:59:56,620-0500 INFO  BlockQueueProcessor allocsize = 0.0, queuedsize = 0.0, qsz = 0
> 2010-10-11 14:59:56,620-0500 INFO  BlockQueueProcessor Plan time: 0
> 2010-10-11 14:59:58,822-0500 INFO  BlockQueueProcessor allocsize = 0.0, queuedsize = 0.0, qsz = 0
> 2010-10-11 14:59:58,822-0500 INFO  BlockQueueProcessor Plan time: 0
> 2010-10-11 14:59:59,935-0500 INFO  AbstractStreamKarajanChannel$Multiplexer Avg stream buf: 0
> 2010-10-11 15:00:00,790-0500 INFO  Cpu runTime: 2, sleepTime: 10036
> 2010-10-11 15:00:01,024-0500 INFO  BlockQueueProcessor allocsize = 0.0, queuedsize = 0.0, qsz = 0
> 2010-10-11 15:00:01,024-0500 INFO  BlockQueueProcessor Plan time: 0
> 2010-10-11 15:00:03,226-0500 INFO  BlockQueueProcessor allocsize = 0.0, queuedsize = 0.0, qsz = 0
> 2010-10-11 15:00:03,226-0500 INFO  BlockQueueProcessor Plan time: 0
> 2010-10-11 15:00:05,112-0500 INFO  CoasterService Idle time: 0
> 2010-10-11 15:00:05,122-0500 INFO  TaskNotifier Congestion queue size: 0
> 
> 
> ...which suggests that the coaster service doesn't really see the job
> in the queue?
> 
> The one mod I made that may be causing this was to set the
> service-side timeout value for the coaster provider to a high value;
> this was needed to keep manually configured passive-persistent
> configurations alive while the Swift client was idle (e.g., for the
> multiple-ssh-server configuration of the R server).
> 
> - Mike
> 
> 
> ----- wilde at mcs.anl.gov wrote:
> 
> > Mihael, Justin, does the following sound like a likely coaster issue:
> > 
> > When using the standard Swift coaster code (not passive or
> > persistent), if I have a job that runs at the start, and then there
> > is a long delay before the next job, such that the coaster worker
> > times out, then the coaster scheduler doesn't think that there is a
> > valid block into which the job can fit, and Swift just hangs, with
> > the job in submitted state but never getting assigned to a block.
> > 
> > The behavior seems similar to what you see when you try to run a job
> > that doesn't fit into any block that you have defined using the
> > coasters sites.xml parameters: in that case, too, Swift just hangs.
> > 
> > Both of these situations (whether or not they are indeed due to the
> > same algorithmic issue) seem to be problems that we need to address.
> > For the first case (which is my immediate and more important problem)
> > you can see a log for the problem in ~wilde/swift-rserver-hangs,
> > along with the swift script, tc, sites, and properties file (cf).
> > 
> > Thanks,
> > 
> > Mike
> > 
> > -- 
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



