[Swift-devel] Coaster Task Submission Stalling

Thu Sep 11 13:10:52 CDT 2014

I meant the github master., but it turns out that I had had the wrong Swift
on my path.  Apologies for the confusion.

I've rerun with the current one.

I'm getting a null pointer exception on line 226 of
BlockQueueProcessor.java.  Adding some printfs revealed that settings was
null.

Log attached.

- Tim

Job: Job(id:0 600.000s)
Settings: null
java.lang.NullPointerException
    at
org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228)
    at
org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210)
    at
org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204)
    at
org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103)
    at
org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96)
    at
org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112)
    at
org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589)
    at
org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175)
    at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90)

On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong <tim.g.armstrong at gmail.com>
wrote:

> I thought I was running the latest trunk, I'll rebuild and see if I can
> reproduce the issue.
>
>  - Tim
>
> On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>
>> The method "getMetaChannel()" has been removed. Where did you get the
>> code from?
>>
>> Mihael
>>
>> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote:
>> > I'm seeing failures when running Swift/T tests with
>> > start-coaster-service.sh.
>> >
>> > E.g. the turbine test coaster-exec-1.  I can provide instructions for
>> > running the test if needed (roughly, you need to build Swift/T with
>> coaster
>> > support enabled, then make tests/coaster-exec-1.result in the turbine
>> > directory).  The github swift-t release is up to date if you want to use
>> > that.
>> >
>> > Full log is attached, stack trace excerpt is below.
>> >
>> > - Tim
>> >
>> > 2014-09-11 12:11:13,708-0500 INFO  BlockQueueProcessor Starting...
>> > id=0911-1112130
>> > Using threaded sender for TCPChannel [type: server, contact:
>> 127.0.0.1:48242
>> > ]
>> > 2014-09-11 12:11:13,708-0500 INFO  AbstractStreamCoasterChannel Using
>> > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242]
>> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null
>> > @id://null-nullS
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452)
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
>> >     at
>> >
>> org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41)
>> >     at
>> >
>> org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135)
>> >     at
>> >
>> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77)
>> >     at
>> >
>> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88)
>> >     at
>> >
>> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527)
>> >     at
>> >
>> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173)
>> >     at
>> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70)
>> > provider=local
>> > 2014-09-11 12:11:13,930-0500 INFO  ExecutionTaskHandler provider=local
>> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null
>> > @id://null-nullS
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452)
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432)
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472)
>> >     at
>> >
>> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80)
>> >     at
>> >
>> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88)
>> >     at
>> >
>> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527)
>> >     at
>> >
>> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173)
>> >     at
>> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70)
>> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job
>> > description
>> > org.globus.cog.coaster.ProtocolException: Could not deserialize job
>> > description
>> >     at
>> >
>> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84)
>> >     at
>> >
>> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88)
>> >     at
>> >
>> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527)
>> >     at
>> >
>> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173)
>> >     at
>> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70)
>> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid
>> > channel: null at id://null-nullS
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452)
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432)
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472)
>> >     at
>> >
>> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80)
>> >     ... 4 more
>> > 2014-09-11 12:11:13,937-0500 INFO  RequestReply Handler(tag: 38907,
>> > SUBMITJOB) sending error: Could not deserialize job description
>> > org.globus.cog.coaster.ProtocolException: Could not deserialize job
>> > description
>> >     at
>> >
>> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84)
>> >     at
>> >
>> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88)
>> >     at
>> >
>> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527)
>> >     at
>> >
>> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173)
>> >     at
>> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70)
>> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid
>> > channel: null at id://null-nullS
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452)
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432)
>> >     at
>> >
>> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472)
>> >     at
>> >
>> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80)
>> >     ... 4 more
>> >
>> >
>> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong <
>> tim.g.armstrong at gmail.com>
>> > wrote:
>> >
>> > > This all sounds great.
>> > >
>> > > Just to check that I've understood correctly, from the client's point
>> of
>> > > view:
>> > > * The per-client settings behave the same if -shared is not provided.
>> > > * Per-client settings are ignored if -shared is provided
>> > >
>> > > I had one question:
>> > > * Do automatically allocated workers work with per-client settings?  I
>> > > understand there were some issues related to sharing workers between
>> > > clients.  Was the solution to have separate worker pools, or is this
>> just
>> > > not supported?
>> > >
>> > > - Tim
>> > >
>> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan <hategan at mcs.anl.gov>
>> > > wrote:
>> > >
>> > >> So...
>> > >>
>> > >> There were bugs. Lots of bugs.
>> > >> I did some work over the weekend to fix some of these and clean up
>> the
>> > >> coaster code. Here's a summary:
>> > >>
>> > >> - there was some stuff in the low level coaster code to deal with
>> > >> persisting coaster channels over multiple connections with various
>> > >> options, like periodic connections, client or server initiated
>> > >> connections, buffering of commands, etc. None of this was used by
>> Swift,
>> > >> and the code was pretty messy. I removed that.
>> > >> - there were some issues with multiple clients:
>> > >>   * improper shutdown of relevant workers when a client disconnected
>> > >>   * the worker task dispatcher was a singleton and had a reference to
>> > >> one block allocator, whereas multiple clients involved multiple
>> > >> allocators.
>> > >> - there were a bunch of locking issues in the C client that valgrind
>> > >> caught
>> > >> - the idea of remote job ids was a bit hard to work with. This
>> remote id
>> > >> was the job id that the service assigned to a job. This is necessary
>> > >> because two different clients can submit jobs with the same id. The
>> > >> remote id would be communicated to the client as the reply to the
>> submit
>> > >> request. However, it was entirely possible for a notification about
>> job
>> > >> status to be sent to the client before the submit reply was. Since
>> > >> notifications were sent using the remote-id, the client would have no
>> > >> idea what job the notifications belonged to. Now, the server might
>> need
>> > >> a unique job id, but there is no reason why it cannot use the client
>> id
>> > >> when communicating the status to a client. So that's there now.
>> > >> - the way the C client was working, its jobs ended up not going to
>> the
>> > >> workers, but the local queue. The service settings now allow
>> specifying
>> > >> the provider/jobManager/url to be used to start blocks, and jobs are
>> > >> routed appropriately if they do not have the batch job flag set.
>> > >>
>> > >> I also added a shared service mode. We discussed this before.
>> Basically
>> > >> you start the coaster service with "-shared <service.properties>" and
>> > >> all the settings are read from that file. In this case, all clients
>> > >> share the same worker pool, and client settings are ignored.
>> > >>
>> > >> The C client now has a multi-job testing tool which can submit many
>> jobs
>> > >> with the desired level of concurrency.
>> > >>
>> > >> I have tested the C client with both shared and non-shared mode, with
>> > >> various levels of jobs being sent, with either one or two concurrent
>> > >> clients.
>> > >>
>> > >> I haven't tested manual workers.
>> > >>
>> > >> I've also decided that during normal operation (i.e. client connects,
>> > >> submits jobs, shuts down gracefully), there should be no exceptions
>> in
>> > >> the coaster log. I think we should stick to that principle. This was
>> the
>> > >> case last I tested, and we should consider any deviation from that
>> to be
>> > >> a problem. Of course, there are some things for which there is no
>> > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions
>> are
>> > >> fine in that case.
>> > >>
>> > >> So anyway, let's start from here.
>> > >>
>> > >> Mihael
>> > >>
>> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote:
>> > >> > Thanks, let me know if there's anything I can help do.
>> > >> >
>> > >> > - Tim
>> > >> >
>> > >> >
>> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan <
>> hategan at mcs.anl.gov>
>> > >> wrote:
>> > >> >
>> > >> > > Thanks. It also seems that there is an older bug in there in
>> which the
>> > >> > > client connection is not properly accounted for and things start
>> > >> failing
>> > >> > > two minutes after the client connects (which is also probably
>> why you
>> > >> > > didn't see this in runs with many short client connections). I'm
>> not
>> > >> > > sure why the fix for that bug isn't in the trunk code.
>> > >> > >
>> > >> > > In any event, I'll set up a client submission loop and fix all
>> these
>> > >> > > things.
>> > >> > >
>> > >> > > Mihael
>> > >> > >
>> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote:
>> > >> > > > Ok, here it is with the additional debug messages.  Source code
>> > >> change is
>> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278.
>> > >> > > >
>> > >> > > > Warning: the tarball will expand to several gigabytes of logs.
>> > >> > > >
>> > >> > > > I had to do multiple client runs to trigger it.  It seems like
>> the
>> > >> > > problem
>> > >> > > > might be triggered by abnormal termination of the client.
>> First 18
>> > >> runs
>> > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t
>> run #19
>> > >> > > before
>> > >> > > > the run #20 that exhibited delays.
>> > >> > > >
>> > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz
>> > >> > > >
>> > >> > > > - Tim
>> > >> > > >
>> > >> > > >
>> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong <
>> > >> tim.g.armstrong at gmail.com
>> > >> > > >
>> > >> > > > wrote:
>> > >> > > >
>> > >> > > > > It's here:
>> > >> > > > >
>> > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz
>> .
>> > >> > > > >
>> > >> > > > > I'll add some extra debug messages in the coaster C++ client
>> and
>> > >> see
>> > >> > > if I
>> > >> > > > > can recreate the scenario.
>> > >> > > > >
>> > >> > > > > - Tim
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan <
>> > >> hategan at mcs.anl.gov>
>> > >> > > > > wrote:
>> > >> > > > >
>> > >> > > > >> Ok, so that's legit.
>> > >> > > > >>
>> > >> > > > >> It does look like shut down workers are not properly
>> accounted
>> > >> for in
>> > >> > > > >> some places (and I believe Yadu submitted a bug for this).
>> > >> However, I
>> > >> > > do
>> > >> > > > >> not see the dead time you mention in either of the last two
>> sets
>> > >> of
>> > >> > > > >> logs. It looks like each client instance submits a continous
>> > >> stream of
>> > >> > > > >> jobs.
>> > >> > > > >>
>> > >> > > > >> So let's get back to the initial log. Can I have the full
>> > >> service log?
>> > >> > > > >> I'm trying to track what happened with the jobs submitted
>> before
>> > >> the
>> > >> > > > >> first big pause.
>> > >> > > > >>
>> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or
>> > >> friends)
>> > >> > > > >> would probably help a lot here.
>> > >> > > > >>
>> > >> > > > >> Mihael
>> > >> > > > >>
>> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote:
>> > >> > > > >> > Should be here:
>> > >> > > > >> >
>> > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan <
>> > >> hategan at mcs.anl.gov
>> > >> > > >
>> > >> > > > >> wrote:
>> > >> > > > >> >
>> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log
>> looks
>> > >> funny
>> > >> > > at
>> > >> > > > >> the
>> > >> > > > >> > > end.
>> > >> > > > >> > >
>> > >> > > > >> > > Can you git pull and re-run? The worker is getting some
>> > >> command
>> > >> > > at the
>> > >> > > > >> > > end there and doing nothing about it and I wonder why.
>> > >> > > > >> > >
>> > >> > > > >> > > Mihael
>> > >> > > > >> > >
>> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote:
>> > >> > > > >> > > > Ok, now I have some worker logs:
>> > >> > > > >> > > >
>> > >> > > > >> > > >
>> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz
>> > >> > > > >> > > >
>> > >> > > > >> > > > There's nothing obvious I see in the worker logs that
>> would
>> > >> > > > >> indicate why
>> > >> > > > >> > > > the connection was broken.
>> > >> > > > >> > > >
>> > >> > > > >> > > > - Tim
>> > >> > > > >> > > >
>> > >> > > > >> > > >
>> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong <
>> > >> > > > >> tim.g.armstrong at gmail.com
>> > >> > > > >> > > >
>> > >> > > > >> > > > wrote:
>> > >> > > > >> > > >
>> > >> > > > >> > > > > This is all running locally on my laptop, so I
>> think we
>> > >> can
>> > >> > > rule
>> > >> > > > >> out
>> > >> > > > >> > > 1).
>> > >> > > > >> > > > >
>> > >> > > > >> > > > > It also seems like it's a state the coaster service
>> gets
>> > >> into
>> > >> > > > >> after a
>> > >> > > > >> > > few
>> > >> > > > >> > > > > client sessions: generally the first coaster run
>> works
>> > >> fine,
>> > >> > > then
>> > >> > > > >> > > after a
>> > >> > > > >> > > > > few runs the problem occurs more frequently.
>> > >> > > > >> > > > >
>> > >> > > > >> > > > > I'm going to try and get worker logs, in the
>> meantime
>> > >> i've got
>> > >> > > > >> some
>> > >> > > > >> > > > > jstacks (attached).
>> > >> > > > >> > > > >
>> > >> > > > >> > > > > Matching service logs (largish) are here if needed:
>> > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz
>> > >> > > > >> > > > >
>> > >> > > > >> > > > >
>> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan <
>> > >> > > > >> hategan at mcs.anl.gov>
>> > >> > > > >> > > > > wrote:
>> > >> > > > >> > > > >
>> > >> > > > >> > > > >> Ah, makes sense.
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live
>> connection
>> > >> is
>> > >> > > > >> guaranteed
>> > >> > > > >> > > to
>> > >> > > > >> > > > >> have some communication for any 2 minute time
>> window,
>> > >> > > partially
>> > >> > > > >> due to
>> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no
>> > >> packets flow
>> > >> > > > >> for the
>> > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed
>> broken
>> > >> and
>> > >> > > all
>> > >> > > > >> jobs
>> > >> > > > >> > > > >> that were submitted to the respective workers are
>> > >> considered
>> > >> > > > >> failed.
>> > >> > > > >> > > So
>> > >> > > > >> > > > >> there seems to be an issue with the connections to
>> some
>> > >> of
>> > >> > > the
>> > >> > > > >> > > workers,
>> > >> > > > >> > > > >> and it takes 2 minutes to detect them.
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >> Since the service seems to be alive (although a
>> jstack
>> > >> on the
>> > >> > > > >> service
>> > >> > > > >> > > > >> when thing seem to hang might help), this leaves
>> two
>> > >> > > > >> possibilities:
>> > >> > > > >> > > > >> 1 - some genuine network problem
>> > >> > > > >> > > > >> 2 - the worker died without properly closing TCP
>> > >> connections
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >> If (2), you could enable worker logging
>> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to
>> see
>> > >> if
>> > >> > > > >> anything
>> > >> > > > >> > > shows
>> > >> > > > >> > > > >> up.
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >> Mihael
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong
>> wrote:
>> > >> > > > >> > > > >> > Here are client and service logs, with part of
>> > >> service log
>> > >> > > > >> edited
>> > >> > > > >> > > down
>> > >> > > > >> > > > >> to
>> > >> > > > >> > > > >> > be a reasonable size (I have the full thing if
>> > >> needed, but
>> > >> > > it
>> > >> > > > >> was
>> > >> > > > >> > > over a
>> > >> > > > >> > > > >> > gigabyte).
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards.
>> The
>> > >> client
>> > >> > > > >> submits 4
>> > >> > > > >> > > > >> jobs
>> > >> > > > >> > > > >> > (its limit), but they don't complete until
>> 19:51:32
>> > >> or so
>> > >> > > (I
>> > >> > > > >> can see
>> > >> > > > >> > > > >> that
>> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the
>> > >> > > check_tasks log
>> > >> > > > >> > > > >> message).
>> > >> > > > >> > > > >> > It looks like something has happened with broken
>> > >> pipes and
>> > >> > > > >> workers
>> > >> > > > >> > > being
>> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of
>> > >> that is
>> > >> > > > >> likely to
>> > >> > > > >> > > be.
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > - Tim
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <
>> > >> > > > >> hategan at mcs.anl.gov
>> > >> > > > >> > > >
>> > >> > > > >> > > > >> wrote:
>> > >> > > > >> > > > >> >
>> > >> > > > >> > > > >> > > Hi Tim,
>> > >> > > > >> > > > >> > >
>> > >> > > > >> > > > >> > > I've never seen this before with pure Java.
>> > >> > > > >> > > > >> > >
>> > >> > > > >> > > > >> > > Do you have logs from these runs?
>> > >> > > > >> > > > >> > >
>> > >> > > > >> > > > >> > > Mihael
>> > >> > > > >> > > > >> > >
>> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim
>> Armstrong
>> > >> wrote:
>> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit
>> > >> tasks to
>> > >> > > > >> Coasters
>> > >> > > > >> > > > >> through
>> > >> > > > >> > > > >> > > the
>> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour
>> > >> where task
>> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2
>> minute
>> > >> periods.
>> > >> > > > >> For
>> > >> > > > >> > > > >> example, I'm
>> > >> > > > >> > > > >> > > > seeing submit log messages like "submitting
>> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378:
>> > >> /bin/hostname" in
>> > >> > > > >> bursts of
>> > >> > > > >> > > > >> several
>> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in
>> > >> between,
>> > >> > > e.g.
>> > >> > > > >> I'm
>> > >> > > > >> > > seeing
>> > >> > > > >> > > > >> > > bursts
>> > >> > > > >> > > > >> > > > with the following intervals in my logs.
>> > >> > > > >> > > > >> > > >
>> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391
>> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076
>> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770
>> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296
>> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602
>> > >> > > > >> > > > >> > > >
>> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the
>> coaster
>> > >> > > service
>> > >> > > > >> side:
>> > >> > > > >> > > the
>> > >> > > > >> > > > >> C
>> > >> > > > >> > > > >> > > > client is just waiting for a response.
>> > >> > > > >> > > > >> > > >
>> > >> > > > >> > > > >> > > > The jobs are just being submitted through the
>> > >> local job
>> > >> > > > >> > > manager, so
>> > >> > > > >> > > > >> I
>> > >> > > > >> > > > >> > > > wouldn't expect any delays there.  The tasks
>> are
>> > >> also
>> > >> > > just
>> > >> > > > >> > > > >> > > "/bin/hostname",
>> > >> > > > >> > > > >> > > > so should return immediately.
>> > >> > > > >> > > > >> > > >
>> > >> > > > >> > > > >> > > > I'm going to continue digging into this on my
>> > >> own, but
>> > >> > > the
>> > >> > > > >> 2
>> > >> > > > >> > > minute
>> > >> > > > >> > > > >> delay
>> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an
>> idea
>> > >> what
>> > >> > > could
>> > >> > > > >> cause
>> > >> > > > >> > > > >> stalls
>> > >> > > > >> > > > >> > > in
>> > >> > > > >> > > > >> > > > task submission of 2 minute duration?
>> > >> > > > >> > > > >> > > >
>> > >> > > > >> > > > >> > > > Cheers,
>> > >> > > > >> > > > >> > > > Tim
>> > >> > > > >> > > > >> > >
>> > >> > > > >> > > > >> > >
>> > >> > > > >> > > > >> > >
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >>
>> > >> > > > >> > > > >
>> > >> > > > >> > >
>> > >> > > > >> > >
>> > >> > > > >> > >
>> > >> > > > >>
>> > >> > > > >>
>> > >> > > > >>
>> > >> > > > >
>> > >> > >
>> > >> > >
>> > >> > >
>> > >>
>> > >>
>> > >>
>> > >
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20140911/57824dd7/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: start-coaster-service.log.gz
Type: application/x-gzip
Size: 1370 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20140911/57824dd7/attachment.bin>