[Swift-devel] Re: Broken pipe on persistent coasters (was Re: Next steps on making the ExTENCI SCEC workflow run reliably)

Allan Espinosa aespinosa at cs.uchicago.edu
Wed May 11 20:00:35 CDT 2011


Redirecting the thread to swift-devel.

I ran a simple test in which I killed a worker while a single job was
being dispatched (persistent coasters, passive workers).

$run_service.sh
$cat workflow.swift
int t = 300;

app (external o) sleep_pads(int time) {
  sleep_pads time;
}
external o_pads;
o_pads = sleep_pads(t);

$swift workflow.swift
Swift svn swift-r4399 cog-r3087

RunID: 20110511-1908-kv67luid
Progress:
Find: https://communicado.ci.uchicago.edu:64999
Find:  keepalive(120), reconnect - https://communicado.ci.uchicago.edu:64999
Passive queue processor initialized. Callback URI is http://128.135.125.17:63999
Progress:  Submitted:1
Progress:  Active:1
Progress:  Active:1
Progress:  Active:1
Progress:  Active:1
Progress:  Active:1
...
...

(on a parallel terminal):
$/worker.pl     http://communicado.ci.uchicago.edu:63999 PADS /scratch
Ctrl-C  # pressed after the sleep_pads() job had been running for a while
$

Upon killing the worker, the application terminated as well, but the
Swift console session still reports the job as 'Active'.  Also, no
error has been reported (yet) in the coaster service log.  Maybe these
will register later after a sufficient amount of time?  I'll report on
this again once the run has progressed further.

In the meantime, the last few dozen lines of the Swift log just repeat
the same pattern:

2011-05-11 19:54:18,868-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:54:28,747-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:54:28,873-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:54:38,754-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:54:38,881-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:54:48,756-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:54:48,885-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:54:58,762-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:54:58,903-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:55:08,768-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:55:08,912-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:55:18,772-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:55:18,921-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:55:28,779-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:55:28,926-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:55:38,784-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:55:38,933-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:55:48,791-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:55:48,954-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:55:58,800-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:55:58,955-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:56:08,808-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:56:08,972-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:56:18,816-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:56:18,974-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:56:28,822-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:56:28,984-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:56:38,825-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:56:38,988-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:56:48,832-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:56:48,999-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:56:58,845-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:56:59,001-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:57:08,846-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:57:09,014-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2011-05-11 19:57:18,854-0500 INFO  AbstractStreamKarajanChannel Sender 1545215993 queue size: 0
2011-05-11 19:57:19,025-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams

2011/5/11 Mihael Hategan <hategan at mcs.anl.gov>:
> On Wed, 2011-05-11 at 16:42 -0500, Allan Espinosa wrote:
>> Right. Workers die because they exceed the maximum walltime.  Does the
>> coaster service expect the workers to die cleanly (passive ones)?
>
> Hmm. They aren't expected to die. Which may be a problem.
>
> We (as in I) need to change that. Passive workers should advertise their
> walltime to the service and the service should take that into account so
> that jobs don't get sent to workers who don't have enough time left.
>
> However, as inefficient as this may be, the service should notify the
> client that the jobs that were running on a dying worker have failed,
> and those jobs should be restarted by swift. Is that not happening?
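
To make that concrete, here is a rough, hypothetical sketch of the
walltime-aware dispatch plus the failure notification on worker death.
None of the class or method names below come from the coaster service
code; they only illustrate the two rules from the reply above: don't
hand a job to a passive worker that doesn't have enough advertised
walltime left, and report the jobs running on a dead worker as failed
so Swift can restart them.

// Hypothetical sketch only; these classes do not exist in the coaster service.
import java.util.ArrayList;
import java.util.List;

public class WalltimeAwareDispatch {

    // A passive worker that advertised its remaining walltime at registration.
    static class Worker {
        final String id;
        final long deadlineMillis;  // wall-clock time at which the worker will be killed
        Worker(String id, long walltimeSeconds) {
            this.id = id;
            this.deadlineMillis = System.currentTimeMillis() + walltimeSeconds * 1000;
        }
        long remainingMillis() {
            return deadlineMillis - System.currentTimeMillis();
        }
    }

    static class Job {
        final String name;
        final long maxWalltimeMillis;  // requested job walltime
        Job(String name, long seconds) {
            this.name = name;
            this.maxWalltimeMillis = seconds * 1000;
        }
    }

    private final List<Worker> workers = new ArrayList<>();

    void register(Worker w) {
        workers.add(w);
    }

    // Only hand a job to a worker that has enough walltime left to finish it.
    Worker select(Job job) {
        for (Worker w : workers) {
            if (w.remainingMillis() > job.maxWalltimeMillis) {
                return w;
            }
        }
        return null;  // no eligible worker; the job stays queued
    }

    // When a worker dies (connection drop, walltime expiry), every job it was
    // running must be reported to the client as failed so Swift can retry it.
    void workerDied(Worker w, List<Job> runningOnWorker) {
        workers.remove(w);
        for (Job j : runningOnWorker) {
            System.out.println("notify client: job " + j.name + " failed on " + w.id);
        }
    }

    public static void main(String[] args) {
        WalltimeAwareDispatch d = new WalltimeAwareDispatch();
        Worker w = new Worker("PADS-0001", 600);    // worker with 10 minutes left
        d.register(w);

        Job sleepJob = new Job("sleep_pads", 300);  // 5-minute job fits, gets dispatched
        System.out.println("selected: " + (d.select(sleepJob) != null ? w.id : "none"));

        // Simulate the worker being killed while the job is still active.
        d.workerDied(w, List.of(sleepJob));
    }
}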



-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


