[Swift-user] Debugging Swift Coaster ServiceManager

Michael Wilde wilde at mcs.anl.gov
Sat Jun 15 19:11:39 CDT 2013


The more I look at the error, the more it looks like the coaster service on biox3 is getting a null pointer exception "at java.net.URI.compareTo(libgcj.so.10)", and that may be the root cause.
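
One way to double-check which runtime produced a trace is to look for GCJ's library name in the frames. The following is only a sketch: it assumes the trace has been saved to a file, and the function name trace_runtime is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift): report whether a saved Java
# stack trace contains frames from GCJ's runtime library (libgcj).
trace_runtime() {
    # $1: path to a file holding the trace, e.g. a saved copy of stderr
    if grep -q 'libgcj\.so' "$1"; then
        echo "gcj"
    else
        echo "other"
    fi
}
```

Run against a file containing the trace from this thread, it would print gcj, since the java.net and java.util frames are all attributed to libgcj.so.10.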

To my knowledge, we don't yet test on open-source Java implementations, although we have discussed it.

We used to see lots of incompatibilities with Swift on them, but these have been diminishing.

Can you test again after making sure that Java 1.6 or higher is in your PATH on biox3?
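
To see which Java a non-interactive ssh session picks up on biox3, something like the following may help. It is a rough sketch, not part of Swift: classify_java_banner is a hypothetical helper, and it assumes GCJ's launcher identifies itself with "gij" or "libgcj" in its version banner.

```shell
#!/bin/sh
# Hypothetical helper: classify the banner printed by `java -version`.
# Reads the banner on stdin and prints "gcj" or "other".
# Assumption: a GCJ-based java mentions "gij" or "libgcj" in its banner.
classify_java_banner() {
    if grep -qiE 'gij|libgcj'; then
        echo "gcj"
    else
        echo "other"
    fi
}

# Possible usage (java -version writes to stderr, hence the 2>&1):
#   ssh biox3.stanford.edu 'java -version 2>&1' | classify_java_banner
```

If this prints gcj, the PATH seen by ssh on biox3 still resolves java to GCJ, and a Sun/Oracle Java would need to come earlier in the PATH.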

Thanks,

- Mike


----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "TJ Lane" <tjlane at stanford.edu>
> Cc: swift-user at ci.uchicago.edu
> Sent: Saturday, June 15, 2013 6:52:14 PM
> Subject: Re: [Swift-user] Debugging Swift Coaster ServiceManager
> 
> Hi TJ,
> 
> Mihael would be able to interpret these errors better than I, but
> here are some things to check:
> 
> - Does the host on which you are running Swift (presumably a cluster
> login host?) have multiple interfaces? If they are not all reachable
> from biox3.stanford.edu, then set GLOBUS_HOSTNAME=hostname, where
> hostname is either a DNS name or an IP address of the Swift host
> that is reachable from biox3.
> 
> - If your username on biox3 is different from your username on the
> Swift host, set it in $HOME/.ssh/config:
> 
> Host biox3.stanford.edu
>    Hostname biox3.stanford.edu
>    User TJsOtherUsername
> 
> - Also set the workdirectory accordingly for the remote host (i.e., if
> your username there is not tjlane).
> 
> - If biox3 cannot connect back to the Swift host on an arbitrary
> port (e.g., due to firewall rules), set the valid port range:
> 
> export GLOBUS_TCP_PORT_RANGE=50000,51000 # for example
> export GLOBUS_TCP_SOURCE_RANGE=50000,51000
> 
> - Make sure java is in your PATH or available on biox3 by default. I
> *think* that the Swift coaster bootstrap causes your login shell's
> profile/rc to run; I need to check that. Make sure it's a reasonable
> Java: Sun/Oracle 1.6 or later.
> I see some traces of GCJ in the traceback ("libgcj.so.10"), which is
> a possible problem.
> 
> - Try a simple ssh test to biox3 and make sure, for example, that
> your login there can write to scratch directories.
> 
> - Make sure that biox3 has a queue named "batch" and that it accepts
> one-hour jobs (maxtime setting).
> 
> Several of these problems (e.g., the last one) won't stop the coaster
> service from starting, but would prevent coaster workers from
> starting.
> 
> Mihael can likely diagnose which of these or other routes are most
> likely the cause, and what evidence to look for on biox3.
> 
> Lastly, I see that the listed IP addresses of both your Swift host and
> biox3 are pingable on the public net, so it's not likely the first
> problem (IP reachability).
> 
> - Mike
> 
> 
> 
> 
> ----- Original Message -----
> > From: "TJ Lane" <tjlane at stanford.edu>
> > To: swift-user at ci.uchicago.edu
> > Sent: Saturday, June 15, 2013 5:15:04 PM
> > Subject: [Swift-user] Debugging Swift Coaster ServiceManager
> > 
> > Swift Users,
> > 
> > Finally back to trying out swift after a delay -- thanks for all
> > your help so far.
> > 
> > I've got a functional swift script up and running, and am now
> > trying to configure my sites.xml to get it running on 4 remote
> > clusters. I've gotten it working on 2, so 2 more to go!
> > 
> > Let's focus on one first. This cluster is running PBS and I'm
> > trying to access it using coasters, via jobmanager="ssh-cl:pbs".
> > Unfortunately, it seems like swift can't boot up the coaster
> > service for some reason, which I haven't been able to figure out.
> > Maybe someone can help me debug this, or at least know where to
> > start poking around!
> > 
> > Here's the site xml entry:
> > 
> >   <pool handle="biox3">
> > 
> >     <execution provider="coaster" jobmanager="ssh-cl:pbs" url="biox3.stanford.edu"/>
> > 
> >     <profile namespace="globus" key="maxWalltime">00:30:00</profile>
> > 
> >     <profile namespace="globus" key="lowOverAllocation">100</profile>
> >     <profile namespace="globus" key="highOverAllocation">100</profile>
> >     <profile namespace="globus" key="maxtime">3600</profile>
> > 
> >     <profile namespace="globus" key="queue">batch</profile>
> >     <profile namespace="globus" key="slots">10</profile>
> >     <profile namespace="globus" key="maxnodes">1</profile>
> >     <profile namespace="globus" key="nodeGranularity">1</profile>
> > 
> >     <profile namespace="globus" key="jobsPerNode">1</profile>
> > 
> >     <profile namespace="karajan" key="jobThrottle">1.0</profile>
> >     <profile namespace="karajan" key="initialScore">10000</profile>
> > 
> >     <!--
> >     <profile namespace="env" key="SWIFT_GEN_SCRIPTS">1</profile>
> >     -->
> > 
> >     <workdirectory>/home/tjlane/swiftwork</workdirectory>
> > 
> >   </pool>
> > 
> > and here's what gets printed when I try and run a very basic "hello
> > cluster" swift script:
> > 
> > tjlane at vspm42 ~/swift_hello
> > $ swift -sites.file ~/opt/swift-0.94/etc/sites.xml -tc.file ~/opt/swift-0.94/etc/tc.data -config swift.properties uname.swift
> > Swift started
> > Swift 0.94 swift-r6492 cog-r3658
> > 
> > RunID: 20130615-1512-h2fskgme
> > Progress:  time: Sat, 15 Jun 2013 15:12:32 -0700
> > Progress:  time: Sat, 15 Jun 2013 15:12:34 -0700  Submitted:1
> > Execution failed:
> >     Exception in uname:
> >     Arguments: [-a]
> >     Host: biox3
> >     Directory: uname-20130615-1512-h2fskgme/jobs/a/uname-aan4rzal
> > 
> > Caused by:
> >     Could not submit job
> > Caused by:
> >     Could not start coaster service
> > Caused by:
> >     Task ended before registration was received.
> > 
> > Failed to start coaster service
> > java.lang.NullPointerException
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.util.TreeMap.compare(libgcj.so.10)
> >    at java.util.TreeMap.put(libgcj.so.10)
> >    at java.util.TreeSet.addAll(libgcj.so.10)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> > java.lang.NullPointerException
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.util.TreeMap.compare(libgcj.so.10)
> >    at java.util.TreeMap.put(libgcj.so.10)
> >    at java.util.TreeSet.addAll(libgcj.so.10)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> > 
> > 
> >     uname, uname.swift, line 12
> > 
> > Finally, here's part of what gets dumped to my log file:
> > 
> > <snip>
> > 2013-06-15 14:54:22,350-0700 INFO  BootstrapService [/171.67.106.68:39309] GET /coaster-bootstrap.jar HTTP/1.0
> > 2013-06-15 14:54:22,713-0700 INFO  ServiceManager Service task Task(type=JOB_SUBMISSION, identity=urn:cog-1371333260175) terminated. Removing service.
> > 2013-06-15 14:54:22,713-0700 INFO  ServiceManager Service does not appear to be registered with this manager
> > 2013-06-15 14:54:22,713-0700 INFO  ServiceManager Coaster service ended. Reason: null
> >         stdout:
> >         stderr: Failed to start coaster service
> > java.lang.NullPointerException
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.util.TreeMap.compare(libgcj.so.10)
> >    at java.util.TreeMap.put(libgcj.so.10)
> >    at java.util.TreeSet.addAll(libgcj.so.10)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> > java.lang.NullPointerException
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.util.TreeMap.compare(libgcj.so.10)
> >    at java.util.TreeMap.put(libgcj.so.10)
> >    at java.util.TreeSet.addAll(libgcj.so.10)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> > 
> > 
> > 2013-06-15 14:54:22,714-0700 INFO  NotificationManager
> > biox3.stanford.edu
> > 2013-06-15 14:54:22,771-0700 INFO  RuntimeStats$ProgressTicker
> >   Submitted:1
> > 2013-06-15 14:54:22,775-0700 DEBUG swift APPLICATION_EXCEPTION jobid=uname-d77eqzal - Application exception: Caused by: Could not submit job
> > Caused by: Could not start coaster service
> > Caused by: Task ended before registration was received.
> > 
> > Failed to start coaster service
> > java.lang.NullPointerException
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.util.TreeMap.compare(libgcj.so.10)
> >    at java.util.TreeMap.put(libgcj.so.10)
> >    at java.util.TreeSet.addAll(libgcj.so.10)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> > java.lang.NullPointerException
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.net.URI.compareTo(libgcj.so.10)
> >    at java.util.TreeMap.compare(libgcj.so.10)
> >    at java.util.TreeMap.put(libgcj.so.10)
> >    at java.util.TreeSet.addAll(libgcj.so.10)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setCallbackURIs(Settings.java:403)
> >    at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.<init>(JobQueue.java:41)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:148)
> >    at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:382)
> > <snip>
> > 
> > 
> > Any help or advice on how to resolve this issue would be much
> > appreciated!
> > 
> > Thanks,
> > 
> > TJ
> > 
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user


