[Swift-devel] Localhost coasters not working on Beagle
Michael Wilde
wilde at anl.gov
Sun Jun 8 23:12:50 CDT 2014
Yadu pointed out that Beagle's login hosts have their ports open in a
different range.
When I set the correct port range in GLOBUS_TCP_PORT_RANGE and
GLOBUS_TCP_SOURCE_RANGE, it works.
The swift module on Beagle does this automatically, but I was using my own
download of 0.95-RC6, so I was not getting those settings.
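
For anyone else running a private Swift download, here is a minimal Python
sketch of setting the two variables before launching swift. The helper name
globus_port_env and the 50000-51000 range are hypothetical; the actual open
range on Beagle is not stated in this thread, and the "min,max" value format
is my understanding of the usual Globus convention, so check both against
what the site module sets.

```python
import os


def globus_port_env(low, high, base_env=None):
    """Return a copy of the environment with the Globus port ranges set.

    Hypothetical helper: 'low' and 'high' must be the range that the site
    actually leaves open. The "min,max" format is assumed, not taken from
    this thread.
    """
    if not (0 < low <= high <= 65535):
        raise ValueError("invalid port range: %r-%r" % (low, high))
    env = dict(os.environ if base_env is None else base_env)
    port_range = "%d,%d" % (low, high)
    env["GLOBUS_TCP_PORT_RANGE"] = port_range    # listening ports
    env["GLOBUS_TCP_SOURCE_RANGE"] = port_range  # outgoing source ports
    return env


# Placeholder range, for illustration only.
env = globus_port_env(50000, 51000)
print(env["GLOBUS_TCP_PORT_RANGE"])
```

The returned dictionary can be passed as env= to subprocess.run when
starting swift with the command line quoted further down in this thread.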
Thanks, Mihael and Yadu.
- Mike
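
The netcat/telnet check quoted below (listen on a port, then connect to it
over 127.0.0.1) can be repeated for several ports with a short Python
sketch. The function names probe_loopback and probe_range are made up for
this example. It only tells you whether a loopback connection to a given
port completes, times out, or is refused; it does not say why.

```python
import socket


def probe_loopback(port, host="127.0.0.1", timeout=5.0):
    """Listen on host:port, connect to it, and report what happened.

    Returns 'ok', 'timeout', 'refused', 'bind-failed' or 'error'.
    Port 0 asks the OS to pick any free port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        try:
            server.bind((host, port))
        except OSError:
            return "bind-failed"
        server.listen(1)
        actual_port = server.getsockname()[1]
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.settimeout(timeout)
        try:
            client.connect((host, actual_port))
            return "ok"
        except (socket.timeout, TimeoutError):
            # Packets silently dropped, as in the telnet test in this thread.
            return "timeout"
        except ConnectionRefusedError:
            return "refused"
        except OSError:
            return "error"
        finally:
            client.close()
    finally:
        server.close()


def probe_range(low, high, timeout=5.0):
    """Probe every port in [low, high] and return {port: result}."""
    return {p: probe_loopback(p, timeout=timeout) for p in range(low, high + 1)}


if __name__ == "__main__":
    # 50003 is the port from the worker log below.
    print(probe_loopback(50003))
```

A 'timeout' result on one range and 'ok' on another would be consistent with
the port-range explanation above.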
On 6/8/14, 10:31 PM, Michael Wilde wrote:
> I'll try the other addresses for that host.
>
> Maybe something changed there in iptables or similar.
>
> - Mike
>
> On 6/8/14, 10:27 PM, Mihael Hategan wrote:
>> Ok, so:
>>
>> shell1: hategan@login1:~> netcat -l -p 50003
>>
>> shell2: hategan@login1:~> netstat -lntp
>> ...
>> tcp 0 0 0.0.0.0:50003 0.0.0.0:* LISTEN 22806/netcat
>> ...
>>
>> hategan@login1:~> telnet 127.0.0.1 50003
>> Trying 127.0.0.1...
>> telnet: connect to address 127.0.0.1: Connection timed out
>>
>> I don't think this has anything to do with swift or coasters.
>>
>> Mihael
>>
>> On Sun, 2014-06-08 at 20:22 -0700, Mihael Hategan wrote:
>>> That's odd. Have you tried netstat -lntp? telnet?
>>>
>>> I'll give it a shot, but this looks rather strange.
>>>
>>> Mihael
>>>
>>> On Sun, 2014-06-08 at 22:10 -0500, Michael Wilde wrote:
>>>> login1$ more /home/wilde/.globus/coasters/worker-0608-0710120-000000.log
>>>> 2014/06/08 22:07:12.296 INFO - 0608-0710120-000000 Logging started: Sun
>>>> Jun 8 22:07:12 2014
>>>> 2014/06/08 22:07:12.296 INFO - Running on node
>>>> login1.beagle.ci.uchicago.edu
>>>> 2014/06/08 22:07:12.296 DEBUG - uri=http://127.0.0.1:50003
>>>> 2014/06/08 22:07:12.296 DEBUG - scheme=http
>>>> 2014/06/08 22:07:12.297 DEBUG - host=127.0.0.1
>>>> 2014/06/08 22:07:12.297 DEBUG - port=50003
>>>> 2014/06/08 22:07:12.297 DEBUG - blockid=0608-0710120-000000
>>>> 2014/06/08 22:07:12.297 INFO - Connect attempt: 0 ...
>>>> 2014/06/08 22:07:12.297 INFO - Trying 127.0.0.1:50003 ...
>>>> 2014/06/08 22:07:33.296 INFO - Connection failed: Connection timed out.
>>>> Trying other addresses
>>>> 2014/06/08 22:07:33.296 ERROR - Connection failed for all addresses.
>>>> 2014/06/08 22:07:33.296 ERROR - Retrying in 1 seconds
>>>> 2014/06/08 22:07:34.297 INFO - Connect attempt: 1 ...
>>>> 2014/06/08 22:07:34.297 INFO - Trying 127.0.0.1:50003 ...
>>>> 2014/06/08 22:07:55.295 INFO - Connection failed: Connection timed out.
>>>> Trying other addresses
>>>> 2014/06/08 22:07:55.296 ERROR - Connection failed for all addresses.
>>>> 2014/06/08 22:07:55.296 ERROR - Retrying in 2 seconds
>>>> 2014/06/08 22:07:57.298 INFO - Connect attempt: 2 ...
>>>> 2014/06/08 22:07:57.298 INFO - Trying 127.0.0.1:50003 ...
>>>> 2014/06/08 22:08:18.295 INFO - Connection failed: Connection timed out.
>>>> Trying other addresses
>>>> 2014/06/08 22:08:18.295 ERROR - Connection failed for all addresses.
>>>> 2014/06/08 22:08:18.295 ERROR - Failed to connect: Connection timed out
>>>> login1$
>>>>
>>>>
>>>> On 6/8/14, 5:33 PM, Mihael Hategan wrote:
>>>>> Can you enable worker logging and post the worker log?
>>>>>
>>>>> Mihael
>>>>>
>>>>> On Sun, 2014-06-08 at 16:48 -0500, Michael Wilde wrote:
>>>>>> Mihael - I'm not able to get a simple localhost coasters run working on
>>>>>> Beagle login1.
>>>>>>
>>>>>> All: Is anyone seeing something similar? It looks to me like my coaster
>>>>>> worker is not able to connect to the Swift coaster service (using
>>>>>> standard automatic coasters).
>>>>>>
>>>>>> I'm working in /lustre/beagle/wilde/swift/lab/fastio (where you can find
>>>>>> logs and configs). Running 0.95RC6.
>>>>>>
>>>>>> I'm setting GLOBUS_HOSTNAME (to 127.0.0.1) and have tried
>>>>>> internalHostname as well:
>>>>>>
>>>>>> login1$ swift -config cf -tc.file apps -sites.file localcoast.xml
>>>>>> catsn.swift
>>>>>>
>>>>>> login1$ cat localcoast.xml
>>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>>> <config xmlns="http://www.ci.uchicago.edu/swift/SwiftSites">
>>>>>>
>>>>>> <pool handle="localhost">
>>>>>>
>>>>>> <execution provider="coaster" jobmanager="local:local"/>
>>>>>>
>>>>>> <profile namespace="globus" key="internalHostname">127.0.0.1</profile>
>>>>>> <profile namespace="globus" key="maxwalltime">00:01:00</profile>
>>>>>> <profile namespace="globus" key="maxtime">3600</profile>
>>>>>>
>>>>>> <profile namespace="globus" key="jobsPerNode">1</profile>
>>>>>> <profile namespace="globus" key="slots">1</profile>
>>>>>> <profile namespace="globus" key="nodeGranularity">1</profile>
>>>>>> <profile namespace="globus" key="maxNodes">1</profile>
>>>>>>
>>>>>> <profile namespace="karajan" key="jobThrottle">12</profile>
>>>>>> <profile namespace="karajan" key="initialScore">10000</profile>
>>>>>>
>>>>>> <profile namespace="karajan" key="lowOverAllocation">100</profile>
>>>>>> <profile namespace="karajan" key="highOverAllocation">100</profile>
>>>>>>
>>>>>> <filesystem provider="local"/>
>>>>>> <workdirectory>/tmp/swiftwork</workdirectory>
>>>>>>
>>>>>>
>>>>>> </pool>
>>>>>>
>>>>>> I get error 110 connection timeouts:
>>>>>>
>>>>>> 2014-06-08 16:37:50,762-0500 DEBUG swift JOB_START jobid=cat-7jiymsrl
>>>>>> tr=cat arguments=[data.txt] tmpdir=catsn-run013/jobs/7/cat-7jiymsrl
>>>>>> host=localhost
>>>>>> 2014-06-08 16:37:50,829-0500 INFO LocalService Started local service:
>>>>>> 127.0.0.1:50000
>>>>>> 2014-06-08 16:37:50,837-0500 INFO BootstrapService Socket bound. URL is
>>>>>> http://127.0.0.1:50001
>>>>>> 2014-06-08 16:37:50,914-0500 INFO Settings Local contacts:
>>>>>> [http://127.0.0.2:50003, http://192.5.86.104:50003,
>>>>>> http://10.128.2.244:50003]
>>>>>> 2014-06-08 16:37:50,917-0500 INFO CoasterService Started local service:
>>>>>> http://127.0.0.1:50003
>>>>>> 2014-06-08 16:37:50,917-0500 INFO CoasterService Reserving channel for
>>>>>> registration
>>>>>> 2014-06-08 16:37:50,942-0500 INFO MetaChannel MetaChannel [context:
>>>>>> cpipe, boundTo: null] binding to cpipe://1
>>>>>> 2014-06-08 16:37:50,942-0500 INFO MetaChannel MetaChannel [context:
>>>>>> spipe, boundTo: null] binding to spipe://1
>>>>>> 2014-06-08 16:37:50,942-0500 INFO CoasterService Sending registration
>>>>>> 2014-06-08 16:37:50,948-0500 INFO MetaChannel Trying to re-bind current
>>>>>> channel
>>>>>> 2014-06-08 16:37:50,949-0500 INFO RequestHandler Handler(tag: 1,
>>>>>> REGISTER) unregistering (send)
>>>>>> 2014-06-08 16:37:50,949-0500 INFO CoasterService Registration complete
>>>>>> 2014-06-08 16:37:50,949-0500 INFO CoasterService Started coaster
>>>>>> service: http://127.0.0.1:50002
>>>>>> 2014-06-08 16:37:50,952-0500 WARN Settings original callback URI is
>>>>>> http://10.128.2.244:50003
>>>>>> 2014-06-08 16:37:50,952-0500 WARN Settings callback URI has been
>>>>>> overridden to http://127.0.0.1:50003
>>>>>> 2014-06-08 16:37:50,953-0500 INFO RequestHandler Handler(tag: 1,
>>>>>> CONFIGSERVICE) unregistering (send)
>>>>>> 2014-06-08 16:37:50,969-0500 INFO BlockQueueProcessor Starting...
>>>>>> id=0608-3704500
>>>>>> 2014-06-08 16:37:50,969-0500 INFO RequestHandler Handler(tag: 2,
>>>>>> SUBMITJOB) unregistering (send)
>>>>>> 2014-06-08 16:37:51,009-0500 INFO BlockQueueProcessor
>>>>>> Settings {
>>>>>> slots = 1
>>>>>> jobsPerNode = 1
>>>>>> workersPerNode = 1
>>>>>> nodeGranularity = 1
>>>>>> allocationStepSize = 0.1
>>>>>> maxNodes = 1
>>>>>> lowOverallocation = 10.0
>>>>>> highOverallocation = 1.0
>>>>>> overallocationDecayFactor = 0.001
>>>>>> spread = 0.9
>>>>>> reserve = 60.000s
>>>>>> maxtime = 3600
>>>>>> remoteMonitorEnabled = false
>>>>>> internalHostname = 127.0.0.1
>>>>>> hookClass = null
>>>>>> workerManager = block
>>>>>> workerLoggingLevel = NONE
>>>>>> workerLoggingDirectory = DEFAULT
>>>>>> ldLibraryPath = null
>>>>>> workerCopies = null
>>>>>> directory = null
>>>>>> useHashBang = null
>>>>>> parallelism = 0.01
>>>>>> coresPerNode = 1
>>>>>> perfTraceWorker = false
>>>>>> perfTraceInterval = -1
>>>>>> attributes = {}
>>>>>> callbackURIs = [http://127.0.0.1:50003]
>>>>>> }
>>>>>>
>>>>>> 2014-06-08 16:37:51,009-0500 INFO BlockQueueProcessor Jobs in holding
>>>>>> queue: 1
>>>>>> 2014-06-08 16:37:51,009-0500 INFO BlockQueueProcessor Time estimate for
>>>>>> holding queue (seconds): 1
>>>>>> 2014-06-08 16:37:51,009-0500 INFO BlockQueueProcessor Allocating blocks
>>>>>> for a total walltime of: 1s
>>>>>> 2014-06-08 16:37:51,009-0500 INFO BlockQueueProcessor Considering:
>>>>>> Job(id:0 60.000s)
>>>>>> 2014-06-08 16:37:51,009-0500 INFO BlockQueueProcessor Max
>>>>>> Walltime (seconds): 60
>>>>>> 2014-06-08 16:37:51,009-0500 INFO BlockQueueProcessor Time
>>>>>> estimate (seconds): 600
>>>>>> 2014-06-08 16:37:51,010-0500 INFO BlockQueueProcessor Total for
>>>>>> this new Block (est. seconds): 0
>>>>>> 2014-06-08 16:37:51,013-0500 INFO BlockQueueProcessor index: 0, last:
>>>>>> 0, holding.size(): 1
>>>>>> 2014-06-08 16:37:51,014-0500 INFO BlockQueueProcessor Queued: 1 jobs to
>>>>>> new Block
>>>>>> 2014-06-08 16:37:51,014-0500 INFO BlockQueueProcessor index: 0, last:
>>>>>> 0, ii: 1, holding.size(): 1
>>>>>> 2014-06-08 16:37:51,014-0500 INFO Block Starting block: workers=1,
>>>>>> walltime=600.000s
>>>>>> 2014-06-08 16:37:51,016-0500 INFO RemoteLogHandler BLOCK_REQUESTED
>>>>>> id=0608-3704500-000000, cores=1, coresPerWorker=1, walltime=600
>>>>>> 2014-06-08 16:37:51,016-0500 INFO RequestHandler Handler(tag: 2, RLOG)
>>>>>> unregistering (send)
>>>>>> 2014-06-08 16:37:51,018-0500 INFO BlockTaskSubmitter Queuing block
>>>>>> Block 0608-3704500-000000 (1x600.000s) for submission
>>>>>> 2014-06-08 16:37:51,018-0500 INFO BlockQueueProcessor Added 1 jobs to
>>>>>> new blocks
>>>>>> 2014-06-08 16:37:51,018-0500 INFO BlockTaskSubmitter Submitting block
>>>>>> Block 0608-3704500-000000 (1x600.000s)
>>>>>> 2014-06-08 16:37:51,018-0500 INFO ExecutionTaskHandler provider=local
>>>>>> 2014-06-08 16:37:51,023-0500 INFO Block Block task status changed:
>>>>>> Submitting
>>>>>> 2014-06-08 16:37:51,023-0500 INFO JobSubmissionTaskHandler Submit: in:
>>>>>> / command: /usr/bin/perl
>>>>>> /home/wilde/.globus/coasters/cscript2445623341660096310.pl
>>>>>> http://127.0.0.1:50003 0608-3704500-000000 NOLOGGING
>>>>>> 2014-06-08 16:37:51,024-0500 INFO Block Block task status changed:
>>>>>> Submitted
>>>>>> 2014-06-08 16:37:51,027-0500 INFO Block Block task status changed: Active
>>>>>> 2014-06-08 16:37:51,027-0500 INFO RemoteLogHandler BLOCK_ACTIVE
>>>>>> id=0608-3704500-000000
>>>>>> 2014-06-08 16:37:51,027-0500 INFO RequestHandler Handler(tag: 3, RLOG)
>>>>>> unregistering (send)
>>>>>> 2014-06-08 16:37:51,681-0500 INFO RuntimeStats$ProgressTicker Submitted:1
>>>>>> 2014-06-08 16:37:51,681-0500 INFO RuntimeStats$ProgressTicker HeapMax:
>>>>>> 954466304, CrtHeap: 253624320, UsedHeap: 28583112
>>>>>> 2014-06-08 16:38:21,683-0500 INFO RuntimeStats$ProgressTicker Submitted:1
>>>>>> 2014-06-08 16:38:21,683-0500 INFO RuntimeStats$ProgressTicker HeapMax:
>>>>>> 954466304, CrtHeap: 253624320, UsedHeap: 29067208
>>>>>> 2014-06-08 16:38:51,686-0500 INFO RuntimeStats$ProgressTicker Submitted:1
>>>>>> 2014-06-08 16:38:51,686-0500 INFO RuntimeStats$ProgressTicker HeapMax:
>>>>>> 954466304, CrtHeap: 253624320, UsedHeap: 29551304
>>>>>> 2014-06-08 16:38:57,113-0500 INFO Block Block task status changed:
>>>>>> Failed Job failed with an exit code of 110
>>>>>> 2014-06-08 16:38:57,115-0500 INFO Block Failed task spec: Job:
>>>>>> executable: /usr/bin/perl
>>>>>> arguments:
>>>>>> /home/wilde/.globus/coasters/cscript2445623341660096310.pl
>>>>>> http://127.0.0.1:50003 0608-3704500-000000 NOLOGGING
>>>>>> stdout: null
>>>>>> stderr: null
>>>>>> directory: /
>>>>>> batch: false
>>>>>> redirected: false
>>>>>> attributes:
>>>>>> hostcount=1,count=1,jobspernode=1,corespernode=1,maxwalltime=10
>>>>>> env: WORKER_LOGGING_LEVEL=NONE
>>>>>>
>>>>>> 2014-06-08 16:38:57,115-0500 INFO Block Worker task failed:
>>>>>> Failed to connect: Connection timed out at
>>>>>> /home/wilde/.globus/coasters/cscript2445623341660096310.pl line 1101.
>>>>>>
>>>>>>
>>>>>>
--
Michael Wilde
Mathematics and Computer Science Computation Institute
Argonne National Laboratory The University of Chicago