[Swift-devel] Localhost coasters not working on Beagle

Michael Wilde wilde at anl.gov
Sun Jun 8 22:31:35 CDT 2014


I'll try the other addresses for that host.

Maybe something changed there in iptables or similar.
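
For trying the other addresses, here is a minimal Python sketch of the same listen-and-connect check Mihael ran with netcat and telnet, so each address can be probed in one go. The function name probe_listener and the 3-second timeout are illustrative choices only, not anything taken from Swift or the coaster code.

```python
import socket


def probe_listener(addresses, timeout=3.0):
    """Listen on all interfaces, then try to connect via each address.

    Returns a dict mapping each address to "ok" or "failed: <reason>".
    """
    results = {}
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind(("0.0.0.0", 0))  # port 0: let the OS pick a free port
        server.listen(max(1, len(addresses)))
        port = server.getsockname()[1]
        for addr in addresses:
            try:
                # The kernel completes the handshake from the listen
                # backlog, so no accept() call is needed for this test.
                with socket.create_connection((addr, port), timeout=timeout):
                    results[addr] = "ok"
            except OSError as exc:  # includes timeouts and refusals
                results[addr] = f"failed: {exc}"
    return results
```

Calling probe_listener with 127.0.0.1 plus the addresses from the "Local contacts" log line should show whether only loopback times out or every local address does. If loopback alone fails, that would point toward a local filtering rule rather than Swift.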

- Mike

On 6/8/14, 10:27 PM, Mihael Hategan wrote:
> Ok, so:
>
> shell1: hategan@login1:~> netcat -l -p 50003
>
> shell2: hategan@login1:~> netstat -lntp
> ...
> tcp        0      0 0.0.0.0:50003           0.0.0.0:*               LISTEN      22806/netcat
> ...
>
> hategan@login1:~> telnet 127.0.0.1 50003
> Trying 127.0.0.1...
> telnet: connect to address 127.0.0.1: Connection timed out
>
> I don't think this has anything to do with swift or coasters.
>
> Mihael
>
> On Sun, 2014-06-08 at 20:22 -0700, Mihael Hategan wrote:
>> That's odd. Have you tried netstat -lntp? telnet?
>>
>> I'll give it a shot, but this looks rather strange.
>>
>> Mihael
>>
>> On Sun, 2014-06-08 at 22:10 -0500, Michael Wilde wrote:
>>> login1$ more /home/wilde/.globus/coasters/worker-0608-0710120-000000.log
>>> 2014/06/08 22:07:12.296 INFO  - 0608-0710120-000000 Logging started: Sun
>>> Jun  8 22:07:12 2014
>>> 2014/06/08 22:07:12.296 INFO  - Running on node
>>> login1.beagle.ci.uchicago.edu
>>> 2014/06/08 22:07:12.296 DEBUG - uri=http://127.0.0.1:50003
>>> 2014/06/08 22:07:12.296 DEBUG - scheme=http
>>> 2014/06/08 22:07:12.297 DEBUG - host=127.0.0.1
>>> 2014/06/08 22:07:12.297 DEBUG - port=50003
>>> 2014/06/08 22:07:12.297 DEBUG - blockid=0608-0710120-000000
>>> 2014/06/08 22:07:12.297 INFO  - Connect attempt: 0 ...
>>> 2014/06/08 22:07:12.297 INFO  - Trying 127.0.0.1:50003 ...
>>> 2014/06/08 22:07:33.296 INFO  - Connection failed: Connection timed out.
>>> Trying other addresses
>>> 2014/06/08 22:07:33.296 ERROR - Connection failed for all addresses.
>>> 2014/06/08 22:07:33.296 ERROR - Retrying in 1 seconds
>>> 2014/06/08 22:07:34.297 INFO  - Connect attempt: 1 ...
>>> 2014/06/08 22:07:34.297 INFO  - Trying 127.0.0.1:50003 ...
>>> 2014/06/08 22:07:55.295 INFO  - Connection failed: Connection timed out.
>>> Trying other addresses
>>> 2014/06/08 22:07:55.296 ERROR - Connection failed for all addresses.
>>> 2014/06/08 22:07:55.296 ERROR - Retrying in 2 seconds
>>> 2014/06/08 22:07:57.298 INFO  - Connect attempt: 2 ...
>>> 2014/06/08 22:07:57.298 INFO  - Trying 127.0.0.1:50003 ...
>>> 2014/06/08 22:08:18.295 INFO  - Connection failed: Connection timed out.
>>> Trying other addresses
>>> 2014/06/08 22:08:18.295 ERROR - Connection failed for all addresses.
>>> 2014/06/08 22:08:18.295 ERROR - Failed to connect: Connection timed out
>>> login1$
>>>
>>>
>>> On 6/8/14, 5:33 PM, Mihael Hategan wrote:
>>>> Can you enable worker logging and post the worker log?
>>>>
>>>> Mihael
>>>>
>>>> On Sun, 2014-06-08 at 16:48 -0500, Michael Wilde wrote:
>>>>> Mihael - I'm not able to get a simple localhost coasters run working on
>>>>> Beagle login1.
>>>>>
>>>>> All: Is anyone seeing something similar?  It looks to me like my coaster
>>>>> worker is not able to connect to the Swift coaster service (using
>>>>> standard automatic coasters).
>>>>>
>>>>> I'm working in /lustre/beagle/wilde/swift/lab/fastio (where you can find
>>>>> logs and configs).  Running 0.95RC6.
>>>>>
>>>>> I'm setting GLOBUS_HOSTNAME (to 127.0.0.1) and have tried
>>>>> internalHostname as well:
>>>>>
>>>>> login1$ swift -config cf -tc.file apps -sites.file localcoast.xml
>>>>> catsn.swift
>>>>>
>>>>> login1$ cat localcoast.xml
>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>> <config xmlns="http://www.ci.uchicago.edu/swift/SwiftSites">
>>>>>
>>>>> <pool handle="localhost">
>>>>>
>>>>> <execution provider="coaster" jobmanager="local:local"/>
>>>>>
>>>>> <profile namespace="globus" key="internalHostname">127.0.0.1</profile>
>>>>>      <profile namespace="globus" key="maxwalltime">00:01:00</profile>
>>>>>      <profile namespace="globus" key="maxtime">3600</profile>
>>>>>
>>>>>      <profile namespace="globus" key="jobsPerNode">1</profile>
>>>>>      <profile namespace="globus" key="slots">1</profile>
>>>>>      <profile namespace="globus" key="nodeGranularity">1</profile>
>>>>>      <profile namespace="globus" key="maxNodes">1</profile>
>>>>>
>>>>>      <profile namespace="karajan" key="jobThrottle">12</profile>
>>>>>      <profile namespace="karajan" key="initialScore">10000</profile>
>>>>>
>>>>>      <profile namespace="karajan" key="lowOverAllocation">100</profile>
>>>>>      <profile namespace="karajan" key="highOverAllocation">100</profile>
>>>>>
>>>>> <filesystem provider="local"/>
>>>>> <workdirectory>/tmp/swiftwork</workdirectory>
>>>>>
>>>>>
>>>>> </pool>
>>>>> </config>
>>>>>
>>>>> I get error 110 connection timeouts:
>>>>>
>>>>> 2014-06-08 16:37:50,762-0500 DEBUG swift JOB_START jobid=cat-7jiymsrl
>>>>> tr=cat arguments=[data.txt] tmpdir=catsn-run013/jobs/7/cat-7jiymsrl
>>>>> host=localhost
>>>>> 2014-06-08 16:37:50,829-0500 INFO  LocalService Started local service:
>>>>> 127.0.0.1:50000
>>>>> 2014-06-08 16:37:50,837-0500 INFO  BootstrapService Socket bound. URL is
>>>>> http://127.0.0.1:50001
>>>>> 2014-06-08 16:37:50,914-0500 INFO  Settings Local contacts:
>>>>> [http://127.0.0.2:50003, http://192.5.86.104:50003,
>>>>> http://10.128.2.244:50003]
>>>>> 2014-06-08 16:37:50,917-0500 INFO  CoasterService Started local service:
>>>>> http://127.0.0.1:50003
>>>>> 2014-06-08 16:37:50,917-0500 INFO  CoasterService Reserving channel for
>>>>> registration
>>>>> 2014-06-08 16:37:50,942-0500 INFO  MetaChannel MetaChannel [context:
>>>>> cpipe, boundTo: null] binding to cpipe://1
>>>>> 2014-06-08 16:37:50,942-0500 INFO  MetaChannel MetaChannel [context:
>>>>> spipe, boundTo: null] binding to spipe://1
>>>>> 2014-06-08 16:37:50,942-0500 INFO  CoasterService Sending registration
>>>>> 2014-06-08 16:37:50,948-0500 INFO  MetaChannel Trying to re-bind current
>>>>> channel
>>>>> 2014-06-08 16:37:50,949-0500 INFO  RequestHandler Handler(tag: 1,
>>>>> REGISTER) unregistering (send)
>>>>> 2014-06-08 16:37:50,949-0500 INFO  CoasterService Registration complete
>>>>> 2014-06-08 16:37:50,949-0500 INFO  CoasterService Started coaster
>>>>> service: http://127.0.0.1:50002
>>>>> 2014-06-08 16:37:50,952-0500 WARN  Settings original callback URI is
>>>>> http://10.128.2.244:50003
>>>>> 2014-06-08 16:37:50,952-0500 WARN  Settings callback URI has been
>>>>> overridden to http://127.0.0.1:50003
>>>>> 2014-06-08 16:37:50,953-0500 INFO  RequestHandler Handler(tag: 1,
>>>>> CONFIGSERVICE) unregistering (send)
>>>>> 2014-06-08 16:37:50,969-0500 INFO  BlockQueueProcessor Starting...
>>>>> id=0608-3704500
>>>>> 2014-06-08 16:37:50,969-0500 INFO  RequestHandler Handler(tag: 2,
>>>>> SUBMITJOB) unregistering (send)
>>>>> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor
>>>>> Settings {
>>>>>        slots = 1
>>>>>        jobsPerNode = 1
>>>>>        workersPerNode = 1
>>>>>        nodeGranularity = 1
>>>>>        allocationStepSize = 0.1
>>>>>        maxNodes = 1
>>>>>        lowOverallocation = 10.0
>>>>>        highOverallocation = 1.0
>>>>>        overallocationDecayFactor = 0.001
>>>>>        spread = 0.9
>>>>>        reserve = 60.000s
>>>>>        maxtime = 3600
>>>>>        remoteMonitorEnabled = false
>>>>>        internalHostname = 127.0.0.1
>>>>>        hookClass = null
>>>>>        workerManager = block
>>>>>        workerLoggingLevel = NONE
>>>>>        workerLoggingDirectory = DEFAULT
>>>>>        ldLibraryPath = null
>>>>>        workerCopies = null
>>>>>        directory = null
>>>>>        useHashBang = null
>>>>>        parallelism = 0.01
>>>>>        coresPerNode = 1
>>>>>        perfTraceWorker = false
>>>>>        perfTraceInterval = -1
>>>>>        attributes = {}
>>>>>        callbackURIs = [http://127.0.0.1:50003]
>>>>> }
>>>>>
>>>>> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor Jobs in holding
>>>>> queue: 1
>>>>> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor Time estimate for
>>>>> holding queue (seconds): 1
>>>>> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor Allocating blocks
>>>>> for a total walltime of: 1s
>>>>> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor  Considering:
>>>>> Job(id:0 60.000s)
>>>>> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor       Max
>>>>> Walltime (seconds):   60
>>>>> 2014-06-08 16:37:51,009-0500 INFO  BlockQueueProcessor       Time
>>>>> estimate (seconds):  600
>>>>> 2014-06-08 16:37:51,010-0500 INFO  BlockQueueProcessor       Total for
>>>>> this new Block (est. seconds): 0
>>>>> 2014-06-08 16:37:51,013-0500 INFO  BlockQueueProcessor index: 0, last:
>>>>> 0, holding.size(): 1
>>>>> 2014-06-08 16:37:51,014-0500 INFO  BlockQueueProcessor Queued: 1 jobs to
>>>>> new Block
>>>>> 2014-06-08 16:37:51,014-0500 INFO  BlockQueueProcessor index: 0, last:
>>>>> 0, ii: 1, holding.size(): 1
>>>>> 2014-06-08 16:37:51,014-0500 INFO  Block Starting block: workers=1,
>>>>> walltime=600.000s
>>>>> 2014-06-08 16:37:51,016-0500 INFO  RemoteLogHandler BLOCK_REQUESTED
>>>>> id=0608-3704500-000000, cores=1, coresPerWorker=1, walltime=600
>>>>> 2014-06-08 16:37:51,016-0500 INFO  RequestHandler Handler(tag: 2, RLOG)
>>>>> unregistering (send)
>>>>> 2014-06-08 16:37:51,018-0500 INFO  BlockTaskSubmitter Queuing block
>>>>> Block 0608-3704500-000000 (1x600.000s) for submission
>>>>> 2014-06-08 16:37:51,018-0500 INFO  BlockQueueProcessor Added 1 jobs to
>>>>> new blocks
>>>>> 2014-06-08 16:37:51,018-0500 INFO  BlockTaskSubmitter Submitting block
>>>>> Block 0608-3704500-000000 (1x600.000s)
>>>>> 2014-06-08 16:37:51,018-0500 INFO  ExecutionTaskHandler provider=local
>>>>> 2014-06-08 16:37:51,023-0500 INFO  Block Block task status changed:
>>>>> Submitting
>>>>> 2014-06-08 16:37:51,023-0500 INFO  JobSubmissionTaskHandler Submit: in:
>>>>> / command: /usr/bin/perl
>>>>> /home/wilde/.globus/coasters/cscript2445623341660096310.pl
>>>>> http://127.0.0.1:50003 0608-3704500-000000 NOLOGGING
>>>>> 2014-06-08 16:37:51,024-0500 INFO  Block Block task status changed:
>>>>> Submitted
>>>>> 2014-06-08 16:37:51,027-0500 INFO  Block Block task status changed: Active
>>>>> 2014-06-08 16:37:51,027-0500 INFO  RemoteLogHandler BLOCK_ACTIVE
>>>>> id=0608-3704500-000000
>>>>> 2014-06-08 16:37:51,027-0500 INFO  RequestHandler Handler(tag: 3, RLOG)
>>>>> unregistering (send)
>>>>> 2014-06-08 16:37:51,681-0500 INFO  RuntimeStats$ProgressTicker Submitted:1
>>>>> 2014-06-08 16:37:51,681-0500 INFO  RuntimeStats$ProgressTicker HeapMax:
>>>>> 954466304, CrtHeap: 253624320, UsedHeap: 28583112
>>>>> 2014-06-08 16:38:21,683-0500 INFO  RuntimeStats$ProgressTicker Submitted:1
>>>>> 2014-06-08 16:38:21,683-0500 INFO  RuntimeStats$ProgressTicker HeapMax:
>>>>> 954466304, CrtHeap: 253624320, UsedHeap: 29067208
>>>>> 2014-06-08 16:38:51,686-0500 INFO  RuntimeStats$ProgressTicker Submitted:1
>>>>> 2014-06-08 16:38:51,686-0500 INFO  RuntimeStats$ProgressTicker HeapMax:
>>>>> 954466304, CrtHeap: 253624320, UsedHeap: 29551304
>>>>> 2014-06-08 16:38:57,113-0500 INFO  Block Block task status changed:
>>>>> Failed Job failed with an exit code of 110
>>>>> 2014-06-08 16:38:57,115-0500 INFO  Block Failed task spec: Job:
>>>>>        executable: /usr/bin/perl
>>>>>        arguments:
>>>>> /home/wilde/.globus/coasters/cscript2445623341660096310.pl
>>>>> http://127.0.0.1:50003 0608-3704500-000000 NOLOGGING
>>>>>        stdout:     null
>>>>>        stderr:     null
>>>>>        directory:  /
>>>>>        batch:      false
>>>>>        redirected: false
>>>>>        attributes:
>>>>> hostcount=1,count=1,jobspernode=1,corespernode=1,maxwalltime=10
>>>>>        env:        WORKER_LOGGING_LEVEL=NONE
>>>>>
>>>>> 2014-06-08 16:38:57,115-0500 INFO  Block Worker task failed:
>>>>> Failed to connect: Connection timed out at
>>>>> /home/wilde/.globus/coasters/cscript2445623341660096310.pl line 1101.
>>>>>
>>>>>
>>>>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago



