[Swift-user] Debugging Swift Coaster ServiceManager
Michael Wilde
wilde at mcs.anl.gov
Sat Jun 15 18:52:14 CDT 2013
Hi TJ,
Mihael would be able to interpret these errors better than I, but here are some things to check:
- Does the host on which you are running Swift (presumably a cluster login host?) have multiple network interfaces? If they are not all reachable from biox3.stanford.edu, then set GLOBUS_HOSTNAME=hostname, where hostname is either a DNS name or an IP address of the Swift host that is reachable from biox3.
- If your username on biox3 is different from your username on the Swift host, set it in $HOME/.ssh/config:
Host biox3.stanford.edu
Hostname biox3.stanford.edu
User TJsOtherUsername
- Also set the workdirectory accordingly for the remote host (i.e. if your username there is not tjlane).
- If biox3 cannot connect back to the Swift host on an arbitrary port (e.g. due to firewall rules), set the valid port range:
export GLOBUS_TCP_PORT_RANGE=50000,51000 # for example
export GLOBUS_TCP_SOURCE_RANGE=50000,51000
- Make sure java is in your PATH or available on biox3 by default. I *think* that the Swift coaster bootstrap causes your login shell's profile/rc to run; I need to check that. Make sure it's a reasonable Java: Sun/Oracle 1.6 or later.
I see some traces of GCJ in the traceback ("libgcj.so.10"), and that's a possible problem.
- Try ssh'ing a simple test to biox3 and make sure, for example, that your login there can write to scratch directories.
- Make sure that biox3 has a queue named "batch" and that it accepts one-hour jobs (the maxtime setting).
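The Java, scratch and queue checks above can be run by hand over ssh. Here is a rough sketch, assuming bash on the Swift host and password-less ssh to biox3. The function names looks_like_gcj and check_remote are made up for illustration, /scratch is only a guess at the scratch location, and "qstat -Q" is the PBS/Torque queue-status query. Note that a non-interactive ssh command may see a different PATH than your login shell does, so treat the result as a hint rather than proof.

```shell
#!/bin/bash
# Succeed (exit 0) if the given "java -version" text looks like GCJ/gij
# rather than a Sun/Oracle JVM. Plain text match; no network access.
looks_like_gcj() {
    printf '%s\n' "$1" | grep -qiE 'gcj|gij'
}

# Hypothetical helper: run the basic checks on a remote host over ssh.
#   $1 = remote host, $2 = scratch directory to test (site-specific guess)
check_remote() {
    local host=$1 scratch=$2 ver
    # "java -version" writes to stderr, so fold it into stdout.
    if ! ver=$(ssh "$host" 'java -version 2>&1'); then
        echo "could not run java on $host (ssh failure, or java not in PATH)"
        return 1
    fi
    if looks_like_gcj "$ver"; then
        echo "WARNING: default java on $host looks like GCJ:"
        echo "$ver"
    fi
    # Can this login create and delete a file in the scratch directory?
    ssh "$host" "touch '$scratch/.swift_write_test' && rm -f '$scratch/.swift_write_test'" \
        || echo "cannot write to $scratch on $host"
    # Does PBS know a queue named "batch"? (qstat -Q shows queue status)
    ssh "$host" 'qstat -Q batch' || echo "no usable queue named batch on $host"
}

# Example (not run automatically):
#   check_remote biox3.stanford.edu /scratch
```

If the GCJ warning fires, putting a Sun/Oracle JDK ahead of it in PATH on biox3 is the first thing to try.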
Several of these problems (e.g. the last one, the queue settings) won't stop the coaster service from starting, but would prevent coaster workers from starting.
Mihael can likely diagnose which of these or other routes are most likely the cause, and what evidence to look for on biox3.
Lastly, I see that the IPs listed for both your Swift host and biox3 are pingable on the public net, so it's not likely the first problem (IP reachability).
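Ping only shows that the hosts answer ICMP; it does not show that biox3 can open a TCP connection back to the Swift host inside the port range you export. A minimal sketch for checking that, assuming bash (for its /dev/tcp redirection) and the coreutils "timeout" command on biox3. Both function names are hypothetical, and the MIN,MAX format is simply the one used in the export lines above.

```shell
#!/bin/bash
# Succeed if $1 has the form MIN,MAX with 1 <= MIN <= MAX <= 65535,
# i.e. the shape used for GLOBUS_TCP_PORT_RANGE above.
valid_port_range() {
    local min max
    [[ $1 =~ ^([0-9]{1,5}),([0-9]{1,5})$ ]] || return 1
    min=${BASH_REMATCH[1]} max=${BASH_REMATCH[2]}
    # 10# forces base 10 so leading zeros are not read as octal.
    (( 10#$min >= 1 && 10#$min <= 10#$max && 10#$max <= 65535 ))
}

# Try to open a TCP connection to host $1, port $2, giving up after
# 5 seconds. Succeeds only if something is listening and reachable.
can_connect() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (not run automatically), on the Swift host before starting swift:
#   valid_port_range "$GLOBUS_TCP_PORT_RANGE" || echo "bad port range"
# and on biox3, with a listener running on the Swift host at port 50000:
#   can_connect <swift-host> 50000 && echo "callback port reachable"
```

can_connect needs something listening on the Swift host at that port (netcat, for example), otherwise a refusal tells you nothing about the firewall.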
- Mike
----- Original Message -----
> From: "TJ Lane" <tjlane at stanford.edu>
> To: swift-user at ci.uchicago.edu
> Sent: Saturday, June 15, 2013 5:15:04 PM
> Subject: [Swift-user] Debugging Swift Coaster ServiceManager
>
> Swift Users,
>
> Finally back to trying out swift after a delay -- thanks for all your
> help so far.
>
> I've got a functional swift script up and running, and am now trying to
> configure my sites.xml to get it running on 4 remote clusters. I've
> gotten it working on 2, so 2 more to go!
>
> Let's focus on one first. This cluster is running PBS and I'm trying to
> access it using coasters, via provider="ssh-cl:pbs". Unfortunately, it
> seems like swift can't boot up the coaster service for some reason,
> which I haven't been able to figure out. Maybe someone can help me debug
> this, or at least know where to start poking around!
>
> Here's the site xml entry:
>
> <pool handle="biox3">
>
> <execution provider="coaster" jobmanager="ssh-cl:pbs" url="biox3.stanford.edu"/>
>
> <profile namespace="globus" key="maxWalltime">00:30:00</profile>
>
> <profile namespace="globus" key="lowOverAllocation">100</profile>
> <profile namespace="globus" key="highOverAllocation">100</profile>
> <profile namespace="globus" key="maxtime">3600</profile>
>
> <profile namespace="globus" key="queue">batch</profile>
> <profile namespace="globus" key="slots">10</profile>
> <profile namespace="globus" key="maxnodes">1</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
>
> <profile namespace="globus" key="jobsPerNode">1</profile>
>
> <profile namespace="karajan" key="jobThrottle">1.0</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
>
> <!--
> <profile namespace="env" key="SWIFT_GEN_SCRIPTS">1</profile>
> -->
>
> <workdirectory>/home/tjlane/swiftwork</workdirectory>
>
> </pool>
>
> and here's what gets printed when I try and run a very basic "hello
> cluster" swift script:
>
> tjlane at vspm42 ~/swift_hello
> $ swift -sites.file ~/opt/swift-0.94/etc/sites.xml -tc.file
> ~/opt/swift-0.94/etc/tc.data -config swift.properties uname.swift
> Swift started
> Swift 0.94 swift-r6492 cog-r3658
>
> RunID: 20130615-1512-h2fskgme
> Progress: time: Sat, 15 Jun 2013 15:12:32 -0700
> Progress: time: Sat, 15 Jun 2013 15:12:34 -0700 Submitted:1
> Execution failed:
> Exception in uname:
> Arguments: [-a]
> Host: biox3
> Directory: uname-20130615-1512-h2fskgme/jobs/a/uname-aan4rzal
>
> Caused by:
> Could not submit job
> Caused by:
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
>
> Failed to start coaster service
> java.lang.NullPointerException
> at java.net.URI.compareTo(libgcj.so.10)
> at java.net.URI.compareTo(libgcj.so.10)
> at java.util.TreeMap.compare(libgcj.so.10)
> at java.util.TreeMap.put(libgcj.so.10)
> at java.util.TreeSet.addAll(libgcj.so.10)
> at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> java.lang.NullPointerException
> at java.net.URI.compareTo(libgcj.so.10)
> at java.net.URI.compareTo(libgcj.so.10)
> at java.util.TreeMap.compare(libgcj.so.10)
> at java.util.TreeMap.put(libgcj.so.10)
> at java.util.TreeSet.addAll(libgcj.so.10)
> at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
>
>
> uname, uname.swift, line 12
>
> Finally, here's part of what gets dumped to my log file:
>
> <snip>
> 2013-06-15 14:54:22,350-0700 INFO BootstrapService [/171.67.106.68:39309] GET /coaster-bootstrap.jar HTTP/1.0
> 2013-06-15 14:54:22,713-0700 INFO ServiceManager Service task Task(type=JOB_SUBMISSION, identity=urn:cog-1371333260175) terminated. Removing service.
> 2013-06-15 14:54:22,713-0700 INFO ServiceManager Service does not appear to be registered with this manager
> 2013-06-15 14:54:22,713-0700 INFO ServiceManager Coaster service ended.
> Reason: null
> stdout:
> stderr: Failed to start coaster service
> java.lang.NullPointerException
> at java.net.URI.compareTo(libgcj.so.10)
> at java.net.URI.compareTo(libgcj.so.10)
> at java.util.TreeMap.compare(libgcj.so.10)
> at java.util.TreeMap.put(libgcj.so.10)
> at java.util.TreeSet.addAll(libgcj.so.10)
> at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> java.lang.NullPointerException
> at java.net.URI.compareTo(libgcj.so.10)
> at java.net.URI.compareTo(libgcj.so.10)
> at java.util.TreeMap.compare(libgcj.so.10)
> at java.util.TreeMap.put(libgcj.so.10)
> at java.util.TreeSet.addAll(libgcj.so.10)
> at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
>
>
> 2013-06-15 14:54:22,714-0700 INFO NotificationManager biox3.stanford.edu
> 2013-06-15 14:54:22,771-0700 INFO RuntimeStats$ProgressTicker Submitted:1
> 2013-06-15 14:54:22,775-0700 DEBUG swift APPLICATION_EXCEPTION jobid=uname-d77eqzal - Application exception: Caused by: Could not submit job
> Caused by: Could not start coaster service
> Caused by: Task ended before registration was received.
>
> Failed to start coaster service
> java.lang.NullPointerException
> at java.net.URI.compareTo(libgcj.so.10)
> at java.net.URI.compareTo(libgcj.so.10)
> at java.util.TreeMap.compare(libgcj.so.10)
> at java.util.TreeMap.put(libgcj.so.10)
> at java.util.TreeSet.addAll(libgcj.so.10)
> at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> java.lang.NullPointerException
> at java.net.URI.compareTo(libgcj.so.10)
> at java.net.URI.compareTo(libgcj.so.10)
> at java.util.TreeMap.compare(libgcj.so.10)
> at java.util.TreeMap.put(libgcj.so.10)
> at java.util.TreeSet.addAll(libgcj.so.10)
> at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> <snip>
>
>
> Any help or advice on how to resolve this issue, much much
> appreciated!
>
> Thanks,
>
> TJ
>