[Swift-devel] Localhost coasters not working on Beagle

Mihael Hategan hategan at mcs.anl.gov
Sun Jun 8 22:22:51 CDT 2014


That's odd. Have you tried netstat -lntp to see what is actually
listening, or telnet to that port?

I'll give it a shot, but this looks rather strange.
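
If telnet is not handy, the same check can be sketched in a few lines of Python. This is only an illustration, not part of Swift or the coaster worker: the probe() helper is a made-up name, and the port list is simply the ports that appear in the logs quoted below (50000 to 50003). The point is to separate "refused" (nothing listening) from "timeout" (packets apparently being dropped somewhere), since the worker log reports the latter.

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt a TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <reason>".
    (Hypothetical helper for illustration only.)
    """
    try:
        # create_connection performs the full TCP handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening there.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Ports taken from the service log quoted below; adjust as needed.
    for port in (50000, 50001, 50002, 50003):
        print(f"127.0.0.1:{port} -> {probe('127.0.0.1', port)}")
```

On a loopback address one would normally expect either "open" or "refused" almost immediately, so a "timeout" there would point toward something like a packet filter rather than a missing listener, though that is a guess until netstat confirms what is bound.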

Mihael

On Sun, 2014-06-08 at 22:10 -0500, Michael Wilde wrote:
> login1$ more /home/wilde/.globus/coasters/worker-0608-0710120-000000.log
> 2014/06/08 22:07:12.296 INFO  - 0608-0710120-000000 Logging started: Sun 
> Jun  8 22:07:12 2014
> 2014/06/08 22:07:12.296 INFO  - Running on node 
> login1.beagle.ci.uchicago.edu
> 2014/06/08 22:07:12.296 DEBUG - uri=http://127.0.0.1:50003
> 2014/06/08 22:07:12.296 DEBUG - scheme=http
> 2014/06/08 22:07:12.297 DEBUG - host=127.0.0.1
> 2014/06/08 22:07:12.297 DEBUG - port=50003
> 2014/06/08 22:07:12.297 DEBUG - blockid=0608-0710120-000000
> 2014/06/08 22:07:12.297 INFO  - Connect attempt: 0 ...
> 2014/06/08 22:07:12.297 INFO  - Trying 127.0.0.1:50003 ...
> 2014/06/08 22:07:33.296 INFO  - Connection failed: Connection timed out. 
> Trying other addresses
> 2014/06/08 22:07:33.296 ERROR - Connection failed for all addresses.
> 2014/06/08 22:07:33.296 ERROR - Retrying in 1 seconds
> 2014/06/08 22:07:34.297 INFO  - Connect attempt: 1 ...
> 2014/06/08 22:07:34.297 INFO  - Trying 127.0.0.1:50003 ...
> 2014/06/08 22:07:55.295 INFO  - Connection failed: Connection timed out. 
> Trying other addresses
> 2014/06/08 22:07:55.296 ERROR - Connection failed for all addresses.
> 2014/06/08 22:07:55.296 ERROR - Retrying in 2 seconds
> 2014/06/08 22:07:57.298 INFO  - Connect attempt: 2 ...
> 2014/06/08 22:07:57.298 INFO  - Trying 127.0.0.1:50003 ...
> 2014/06/08 22:08:18.295 INFO  - Connection failed: Connection timed out. 
> Trying other addresses
> 2014/06/08 22:08:18.295 ERROR - Connection failed for all addresses.
> 2014/06/08 22:08:18.295 ERROR - Failed to connect: Connection timed out
> login1$
> 
> 
> On 6/8/14, 5:33 PM, Mihael Hategan wrote:
> > Can you enable worker logging and post the worker log?
> >
> > Mihael
> >
> > On Sun, 2014-06-08 at 16:48 -0500, Michael Wilde wrote:
> >> Mihael - I'm not able to get a simple localhost coasters run working on
> >> Beagle login1.
> >>
> >> All: Is anyone seeing something similar?  It looks to me like my coaster
> >> worker is not able to connect to the Swift coaster service (using
> >> standard automatic coasters).
> >>
> >> I'm working in /lustre/beagle/wilde/swift/lab/fastio (where you can find
> >> logs and configs).  Running 0.95RC6.
> >>
> >> I'm setting GLOBUS_HOSTNAME (to 127.0.0.1) and have tried
> >> internalHostname as well:
> >>
> >> login1$ swift -config cf -tc.file apps -sites.file localcoast.xml
> >> catsn.swift
> >>
> >> login1$ cat localcoast.xml
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <config xmlns="http://www.ci.uchicago.edu/swift/SwiftSites">
> >>
> >> <pool handle="localhost">
> >>
> >> <execution provider="coaster" jobmanager="local:local"/>
> >>
> >> <profile namespace="globus" key="internalHostname">127.0.0.1</profile>
> >>     <profile namespace="globus" key="maxwalltime">00:01:00</profile>
> >>     <profile namespace="globus" key="maxtime">3600</profile>
> >>
> >>     <profile namespace="globus" key="jobsPerNode">1</profile>
> >>     <profile namespace="globus" key="slots">1</profile>
> >>     <profile namespace="globus" key="nodeGranularity">1</profile>
> >>     <profile namespace="globus" key="maxNodes">1</profile>
> >>
> >>     <profile namespace="karajan" key="jobThrottle">12</profile>
> >>     <profile namespace="karajan" key="initialScore">10000</profile>
> >>
> >>     <profile namespace="karajan" key="lowOverAllocation">100</profile>
> >>     <profile namespace="karajan" key="highOverAllocation">100</profile>
> >>
> >> <filesystem provider="local"/>
> >> <workdirectory>/tmp/swiftwork</workdirectory>
> >>
> >>
> >> </pool>
> >>
> >> I get error 110 connection timeouts:
> >>
> >> 2014-06-08 16:37:50,762-0500 DEBUG swift JOB_START jobid=cat-7jiymsrl
> >> tr=cat arguments=[data.txt] tmpdir=catsn-run013/jobs/7/cat-7jiymsrl
> >> host=localhost
> >> 2014-06-08 16:37:50,829-0500 INFO  LocalService Started local service:
> >> 127.0.0.1:50000
> >> 2014-06-08 16:37:50,837-0500 INFO  BootstrapService Socket bound. URL is
> >> http://127.0.0.1:50001
> >> 2014-06-08 16:37:50,914-0500 INFO  Settings Local contacts:
> >> [http://127.0.0.2:50003, http://192.5.86.104:50003,
> >> http://10.128.2.244:50003]
> >> 2014-06-08 16:37:50,917-0500 INFO  CoasterService Started local service:
> >> http://127.0.0.1:50003
> >> 2014-06-08 16:37:50,917-0500 INFO  CoasterService Reserving channel for
> >> registration
> >> 2014-06-08 16:37:50,942-0500 INFO  MetaChannel MetaChannel [context:
> >> cpipe, boundTo: null] binding to cpipe://1
> >> 2014-06-08 16:37:50,942-0500 INFO  MetaChannel MetaChannel [context:
> >> spipe, boundTo: null] binding to spipe://1
> >> 2014-06-08 16:37:50,942-0500 INFO  CoasterService Sending registration
> >> 2014-06-08 16:37:50,948-0500 INFO  MetaChannel Trying to re-bind current
> >> channel
> >> 2014-06-08 16:37:50,949-0500 INFO  RequestHandler Handler(tag: 1,
> >> REGISTER) unregistering (send)
> >> 2014-06-08 16:37:50,949-0500 INFO  CoasterService Registration complete
> >> 2014-06-08 16:37:50,949-0500 INFO  CoasterService Started coaster
> >> service: http://127.0.0.1:50002
> >> 2014-06-08 16:37:50,952-0500 WARN  Settings original callback URI is
> >> http://10.128.2.244:50003
> >> 2014-06-08 16:37:50,952-0500 WARN  Settings callback URI has been
> >> overridden to http://127.0.0.1:50003
> >> 2014-06-08 16:37:50,953-0500 INFO  RequestHandler Handler(tag: 1,
> >> CONFIGSERVICE) unregistering (send)
> >> 2014-06-08 16:37:50,969-0500 INFO  BlockQueueProcessor Starting...
> >> id=0608-3704500
> >> 2014-06-08 16:37:50,969-0500 INFO  RequestHandler Handler(tag: 2,
> >> SUBMITJOB) unregistering (send)
> >> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor
> >> Settings {
> >>       slots = 1
> >>       jobsPerNode = 1
> >>       workersPerNode = 1
> >>       nodeGranularity = 1
> >>       allocationStepSize = 0.1
> >>       maxNodes = 1
> >>       lowOverallocation = 10.0
> >>       highOverallocation = 1.0
> >>       overallocationDecayFactor = 0.001
> >>       spread = 0.9
> >>       reserve = 60.000s
> >>       maxtime = 3600
> >>       remoteMonitorEnabled = false
> >>       internalHostname = 127.0.0.1
> >>       hookClass = null
> >>       workerManager = block
> >>       workerLoggingLevel = NONE
> >>       workerLoggingDirectory = DEFAULT
> >>       ldLibraryPath = null
> >>       workerCopies = null
> >>       directory = null
> >>       useHashBang = null
> >>       parallelism = 0.01
> >>       coresPerNode = 1
> >>       perfTraceWorker = false
> >>       perfTraceInterval = -1
> >>       attributes = {}
> >>       callbackURIs = [http://127.0.0.1:50003]
> >> }
> >>
> >> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor Jobs in holding
> >> queue: 1
> >> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor Time estimate for
> >> holding queue (seconds): 1
> >> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor Allocating blocks
> >> for a total walltime of: 1s
> >> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor  Considering:
> >> Job(id:0 60.000s)
> >> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor       Max
> >> Walltime (seconds):   60
> >> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor       Time
> >> estimate (seconds):  600
> >> 2014-06-08 16:37:51,010-0500 INFO  BlockQueueProcessor       Total for
> >> this new Block (est. seconds): 0
> >> 2014-06-08 16:37:51,013-0500 INFO  BlockQueueProcessor index: 0, last:
> >> 0, holding.size(): 1
> >> 2014-06-08 16:37:51,014-0500 INFO  BlockQueueProcessor Queued: 1 jobs to
> >> new Block
> >> 2014-06-08 16:37:51,014-0500 INFO  BlockQueueProcessor index: 0, last:
> >> 0, ii: 1, holding.size(): 1
> >> 2014-06-08 16:37:51,014-0500 INFO  Block Starting block: workers=1,
> >> walltime=600.000s
> >> 2014-06-08 16:37:51,016-0500 INFO  RemoteLogHandler BLOCK_REQUESTED
> >> id=0608-3704500-000000, cores=1, coresPerWorker=1, walltime=600
> >> 2014-06-08 16:37:51,016-0500 INFO  RequestHandler Handler(tag: 2, RLOG)
> >> unregistering (send)
> >> 2014-06-08 16:37:51,018-0500 INFO  BlockTaskSubmitter Queuing block
> >> Block 0608-3704500-000000 (1x600.000s) for submission
> >> 2014-06-08 16:37:51,018-0500 INFO  BlockQueueProcessor Added 1 jobs to
> >> new blocks
> >> 2014-06-08 16:37:51,018-0500 INFO  BlockTaskSubmitter Submitting block
> >> Block 0608-3704500-000000 (1x600.000s)
> >> 2014-06-08 16:37:51,018-0500 INFO  ExecutionTaskHandler provider=local
> >> 2014-06-08 16:37:51,023-0500 INFO  Block Block task status changed:
> >> Submitting
> >> 2014-06-08 16:37:51,023-0500 INFO  JobSubmissionTaskHandler Submit: in:
> >> / command: /usr/bin/perl
> >> /home/wilde/.globus/coasters/cscript2445623341660096310.pl
> >> http://127.0.0.1:50003 0608-3704500-000000 NOLOGGING
> >> 2014-06-08 16:37:51,024-0500 INFO  Block Block task status changed:
> >> Submitted
> >> 2014-06-08 16:37:51,027-0500 INFO  Block Block task status changed: Active
> >> 2014-06-08 16:37:51,027-0500 INFO  RemoteLogHandler BLOCK_ACTIVE
> >> id=0608-3704500-000000
> >> 2014-06-08 16:37:51,027-0500 INFO  RequestHandler Handler(tag: 3, RLOG)
> >> unregistering (send)
> >> 2014-06-08 16:37:51,681-0500 INFO  RuntimeStats$ProgressTicker Submitted:1
> >> 2014-06-08 16:37:51,681-0500 INFO  RuntimeStats$ProgressTicker HeapMax:
> >> 954466304, CrtHeap: 253624320, UsedHeap: 28583112
> >> 2014-06-08 16:38:21,683-0500 INFO  RuntimeStats$ProgressTicker Submitted:1
> >> 2014-06-08 16:38:21,683-0500 INFO  RuntimeStats$ProgressTicker HeapMax:
> >> 954466304, CrtHeap: 253624320, UsedHeap: 29067208
> >> 2014-06-08 16:38:51,686-0500 INFO  RuntimeStats$ProgressTicker Submitted:1
> >> 2014-06-08 16:38:51,686-0500 INFO  RuntimeStats$ProgressTicker HeapMax:
> >> 954466304, CrtHeap: 253624320, UsedHeap: 29551304
> >> 2014-06-08 16:38:57,113-0500 INFO  Block Block task status changed:
> >> Failed Job failed with an exit code of 110
> >> 2014-06-08 16:38:57,115-0500 INFO  Block Failed task spec: Job:
> >>       executable: /usr/bin/perl
> >>       arguments:
> >> /home/wilde/.globus/coasters/cscript2445623341660096310.pl
> >> http://127.0.0.1:50003 0608-3704500-000000 NOLOGGING
> >>       stdout:     null
> >>       stderr:     null
> >>       directory:  /
> >>       batch:      false
> >>       redirected: false
> >>       attributes:
> >> hostcount=1,count=1,jobspernode=1,corespernode=1,maxwalltime=10
> >>       env:        WORKER_LOGGING_LEVEL=NONE
> >>
> >> 2014-06-08 16:38:57,115-0500 INFO  Block Worker task failed:
> >> Failed to connect: Connection timed out at
> >> /home/wilde/.globus/coasters/cscript2445623341660096310.pl line 1101.
> >>
> >>
> >>
> >
> 




