[Swift-user] Remote SGE cluster
Mihael Hategan
hategan at mcs.anl.gov
Tue May 5 14:27:46 CDT 2015
Hi,
Have you modified any jar files or copied them from another swift
package?
The coaster bootstrap stores checksums of the jar files that it needs
(calculated at swift compile time) and checks all jar files that come
over an unsecured network against them. Maybe there should be a tool to
update these checksums when needed, not just at compile time.
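
In the meantime, a manual comparison along the following lines might help narrow down which jar differs. This is only a sketch, not the code the coaster bootstrap actually runs: the function name verify_jar_checksum is made up for illustration, and it assumes the checksums are MD5 sums (the 32-hex-digit values in the bootstrap log suggest so) and that md5sum from GNU coreutils is available.

```shell
#!/bin/sh
# Hypothetical helper, not part of Swift: compare a file's MD5 sum with an
# expected value and print both, in the style of the bootstrap log.
verify_jar_checksum() {
    file=$1
    expected=$2
    # md5sum prints "<hash>  <filename>"; keep only the hash.
    computed=$(md5sum "$file" | cut -d ' ' -f 1)
    echo "Expected checksum: $expected"
    echo "Computed checksum: $computed"
    # The return status is 0 only when the two values are identical.
    [ "$computed" = "$expected" ]
}
```

It would be called with the path to a jar and the checksum you expect for it, and it returns a non-zero status on a mismatch.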
Mihael
On Tue, 2015-05-05 at 11:01 -0300, Igor Russo wrote:
> Hi Mihael,
>
> Sorry to bother you again.
>
> You were right: after configuring the port forwarding, the script is able to
> connect.
>
> But I still get the error "Checksum does not match".
>
> Here is the content of the ~/coaster-bootstrap-xxx.log file:
>
> using plain mode
> BS: http://189.12.232.9:50006
> which: no gmd5sum in
> (/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/condor/bin:/opt/condor/sbin:/opt/gridengine/bin/linux-x64)
> Expected checksum: 9b7bd5a96a2912cf8d06d1a2fd891620
> Computed checksum: 9b7bd5a96a2912cf8d06d1a2fd891620
> JAVA=/usr/java/latest/bin/java
> plain /usr/java/latest/bin/java -Djava=/usr/java/latest/bin/java -Xmx64M
> -DGLOBUS_TCP_PORT_RANGE=
> -DX509_USER_PROXY=/home/igor/.globus/sshproxy-1344874142-1432003400
> -DX509_CERT_DIR=/home/igor/.globus/sshCAcert-1344874142-1432003400.pem
> -DGLOBUS_HOSTNAME=cluster.mmc.ufjf.br -Duser.home=/home/igor -jar
> /tmp/bootstrap.xTzo3v http://189.12.232.9:50006 https://189.12.232.9:50005
> 11100954039
> Failed to download cog-provider-coaster-0.3.jar:
> java.lang.RuntimeException: Checksum does not match.
>
>
> Thanks,
> Igor
>
> 2015-05-04 18:52 GMT-03:00 Mihael Hategan <hategan at mcs.anl.gov>:
>
> >
> > Hi,
> >
> > In most cases (globus, coasters), the service side (legion in this case)
> > needs the ability to connect back to the client (your home connection).
> >
> > Correct me if I'm wrong, but you are on a DSL line, behind a router with
> > NAT. If so, you must configure the router to forward some incoming
> > connections to the machine from which you are running swift.
> > Typically this is done by configuring a certain port range forwarding on
> > the router (Yadu suggested GLOBUS_TCP_PORT_RANGE=50000,51000, so that
> > port range should be matched on the router).
> >
> > The gist of it is that swift starts a simple shell script on legion that
> > downloads a small java app from the client side and launches it. Said
> > shell script logs things into ~/coaster-bootstrap-xxx.log files. The
> > contents of the bootstrap logs are probably very useful here.
> >
> > If all of that goes well, the aforementioned small java app downloads
> > the full coaster service from the client and starts it. Once started,
> > the coaster service connects back to Swift. The last two parts log their
> > doings in ~/.globus/coasters/*.log. Those can be useful, too, if they
> > exist.
> >
> > Mihael
> >
> > On Mon, 2015-05-04 at 18:27 -0300, Igor Russo wrote:
> > > Hi Yadu,
> > >
> > > Yes, I can ssh from my laptop to the cluster directly.
> > >
> > > The coaster-bootstrap-*.log files are created on the remote system.
> > >
> > > I have attached the log file.
> > >
> > > Thanks,
> > > Igor
> > >
> > > 2015-05-04 16:57 GMT-03:00 Yadu Nand Babuji <yadunand at uchicago.edu>:
> > >
> > > > Hi Igor,
> > > >
> > > > Are you able to ssh from your machine to legion directly without
> > > > entering passwords?
> > > > Could you please send us a tarball of the runNNN directories for a
> > > > failing run?
> > > >
> > > > I've put the following settings in my ~/.ssh/config on my laptop and
> > > > set up ssh keys on both socrates and legion. This allows me to use
> > > > "ssh legion.rc.ucl.ac.uk" and connect.
> > > >
> > > > Host legion.rc.ucl.ac.uk
> > > >     User YOUR_USERNAME
> > > >     Hostname legion.rc.ucl.ac.uk
> > > >     ProxyCommand ssh socrates -W %h:%p
> > > >
> > > > Host socrates
> > > >     Hostname socrates.ucl.ac.uk
> > > >     User YOUR_USERNAME
> > > >     ForwardAgent yes
> > > >
> > > > Thanks,
> > > > Yadu
> > > >
> > > >
> > > >
> > > > On 05/04/2015 07:51 AM, Igor Russo wrote:
> > > >
> > > > Hi Yadu,
> > > >
> > > > Thanks again.
> > > >
> > > > I tried your suggestion. Now I'm not getting the previous error, but
> > > > the jobs aren't being submitted:
> > > >
> > > > RunID: run001
> > > > Progress: Seg, 04 Mai 2015 09:32:54-0300
> > > > Progress: Seg, 04 Mai 2015 09:32:55-0300 Submitting:1
> > > > Progress: Seg, 04 Mai 2015 09:33:25-0300 Submitting:1
> > > > Progress: Seg, 04 Mai 2015 09:33:55-0300 Submitting:1
> > > > Progress: Seg, 04 Mai 2015 09:34:25-0300 Submitting:1
> > > > Progress: Seg, 04 Mai 2015 09:34:55-0300 Submitting:1
> > > > Progress: Seg, 04 Mai 2015 09:35:25-0300 Submitting:1
> > > > Progress: Seg, 04 Mai 2015 09:35:55-0300 Submitting:1
> > > > Progress: Seg, 04 Mai 2015 09:36:25-0300 Submitting:1
> > > >
> > > > In the log file, I notice the following errors:
> > > >
> > > > 2015-05-04 09:24:06,223-0300 INFO ServiceManager Service does not appear to be registered with this manager
> > > > 2015-05-04 09:24:06,223-0300 INFO ServiceManager Coaster service ended. Reason: null
> > > >
> > > > Thanks,
> > > > Igor
> > > >
> > > >
> > > > 2015-05-01 17:47 GMT-03:00 Yadu Nand Babuji <yadunand at uchicago.edu>:
> > > >
> > > >> Hi Igor,
> > > >>
> > > > >> The remote connection system requires that the local machine you run
> > > > >> the swift client on has a public IP address. It looks like swift was
> > > > >> not able to guess it and set it to http://igor-ubuntu:51251 instead.
> > > > >>
> > > > >> Could you retry running part04 after doing the following, and please
> > > > >> make sure your environment has these variables set whenever you run
> > > > >> swift against remote systems:
> > > >> export GLOBUS_HOSTNAME=<PUBLIC_IP_OF_YOUR_MACHINE>
> > > >> export GLOBUS_TCP_PORT_RANGE=50000,51000
> > > >>
> > > >> Thanks,
> > > >> Yadu
> > > >>
> > > >>
> > > >> On 05/01/2015 02:29 PM, Igor Russo wrote:
> > > >>
> > > >> Hi Yadu,
> > > >>
> > > >> Thank you very much!
> > > >>
> > > >> I changed the config file with the data from my cluster.
> > > >>
> > > > >> When executing the 4th part of the swift-tutorial, I'm getting the
> > > > >> following error:
> > > >> "Failed to download bootstrap jar from ..."
> > > >>
> > > >>
> > > >>
> > > >>
> > > > >> --------------------------------------------------------------------------------
> > > >>
> > > >> RunID: run031
> > > >> Progress: Sex, 01 Mai 2015 15:40:42-0300
> > > >> Progress: Sex, 01 Mai 2015 15:40:43-0300 Submitting:1
> > > >>
> > > >> Execution failed:
> > > >> Exception in sort:
> > > >> Arguments: [-n, unsorted.txt]
> > > >> Host: mmc
> > > >> Directory: p4-run031/jobs/s/sort-go28d68m
> > > >> exception @ swift-int-staging.k, line: 165
> > > >> Caused by:
> > > >> exception @ swift-int-staging.k, line: 160
> > > >> Caused by: null
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
> > > >> Failed to download bootstrap jar from http://igor-ubuntu:51251
> > > >>
> > > >> k:assign @ swift.k, line: 174
> > > >> Caused by: Exception in sort:
> > > >> Arguments: [-n, unsorted.txt]
> > > >> Host: mmc
> > > >> Directory: p4-run031/jobs/s/sort-go28d68m
> > > >> exception @ swift-int-staging.k, line: 165
> > > >> Caused by:
> > > >> exception @ swift-int-staging.k, line: 160
> > > >> Caused by: null
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
> > > >> Failed to download bootstrap jar from http://igor-ubuntu:51251
> > > >>
> > > >>
> > > >>
> > > > >> --------------------------------------------------------------------------------
> > > >>
> > > >> Thanks,
> > > >> Igor
> > > >>
> > > >> 2015-05-01 13:47 GMT-03:00 Yadu Nand Babuji <yadunand at uchicago.edu>:
> > > >>
> > > >>> Hi Igor,
> > > >>>
> > > > >>> Swift does support SGE clusters, and you can refer to the
> > > > >>> swift-tutorial for sample code and configurations from this link:
> > > > >>> https://github.com/swift-lang/swift-tutorial
> > > > >>>
> > > > >>> Here's a sample config from our test-suite for Godzilla, an SGE
> > > > >>> cluster at UChicago:
> > > > >>> https://github.com/swift-lang/swift-k/blob/master/tests/sites/godzilla/swift.conf
> > > > >>>
> > > > >>> You could modify and add this config to the swift.conf file in the
> > > > >>> swift-tutorial to run Swift on any machine and execute on a remote
> > > > >>> SGE cluster.
> > > > >>>
> > > > >>> SGE is a widely used resource manager, and most sites have
> > > > >>> differences in their setups that make each site unique. If you run
> > > > >>> into issues with the default swift package and could provide help in
> > > > >>> figuring out specifics of your cluster, we will help you adapt the
> > > > >>> Swift SGE provider to support your cluster.
> > > >>>
> > > >>> Thanks,
> > > >>> Yadu
> > > >>>
> > > >>>
> > > >>>
> > > >>> On 04/28/2015 05:09 PM, Igor Russo wrote:
> > > >>>
> > > >>> Hi All,
> > > >>>
> > > > >>> Is it possible to use Swift with a remote SGE/OGE cluster?
> > > >>>
> > > >>> Regards,
> > > >>> Igor
> > > >>>
> > > >>>
> > > >>> _______________________________________________
> > > >>> Swift-user mailing list
> > > >>> Swift-user at ci.uchicago.edu
> > > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> _______________________________________________
> > > >> Swift-user mailing listSwift-user at ci.uchicago.eduhttps://
> > lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >>
> > > >>
> > > >>
> > > >> _______________________________________________
> > > >> Swift-user mailing list
> > > >> Swift-user at ci.uchicago.edu
> > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >>
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Swift-user mailing listSwift-user at ci.uchicago.eduhttps://
> > lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Swift-user mailing list
> > > > Swift-user at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >
> > > _______________________________________________
> > > Swift-user mailing list
> > > Swift-user at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> >
> >
> >
> >
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user