[Swift-user] Remote SGE cluster
Igor Russo
igor.souza.russo at gmail.com
Wed May 6 07:41:24 CDT 2015
Hi,
I've downloaded the package again and it worked just fine.
Thank you very much, Yadu and Mihael!
Igor
2015-05-05 16:27 GMT-03:00 Mihael Hategan <hategan at mcs.anl.gov>:
> Hi,
>
> Have you modified any jar files or copied them from another swift
> package?
>
> The coaster bootstrap stores checksums of the jar files that it needs
> (calculated at swift compile time) and checks all jar files that come
> over an unsecured network against them. Maybe there should be a tool to
> update these checksums when needed, not just at compile time.
>
> Mihael
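Mihael's description of the bootstrap check can be illustrated with a small local sketch. It assumes the checksum is MD5, which is inferred from the 32-hex-digit values in the bootstrap log and from the script looking for `gmd5sum`. The helper names `md5_of_file` and `verify_jar` are hypothetical; this is not the coaster bootstrap's actual code. It only shows how a digest computed from a downloaded jar is compared against a value fixed earlier, such as at Swift compile time.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_jar(path, expected_md5):
    """Hypothetical check: raise if the file's MD5 differs from the expected value."""
    computed = md5_of_file(path)
    if computed.lower() != expected_md5.lower():
        raise RuntimeError(
            f"Checksum does not match for {Path(path).name}: "
            f"expected {expected_md5}, computed {computed}"
        )
    return computed
```

Running `md5sum` on the jars in each installed Swift package and comparing the results is a quick way to see whether jars were modified or mixed between packages, which is what Mihael asks about.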
>
> On Tue, 2015-05-05 at 11:01 -0300, Igor Russo wrote:
> > Hi Mihael,
> >
> > Sorry to bother you again.
> >
> > You were right: after configuring port forwarding, the script is able to
> > connect.
> >
> > But I still get an error: "Checksum does not match".
> >
> > Here goes the content of the ~/coaster-bootstrap-xxx.log file:
> >
> > using plain mode
> > BS: http://189.12.232.9:50006
> > which: no gmd5sum in (/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/condor/bin:/opt/condor/sbin:/opt/gridengine/bin/linux-x64)
> > Expected checksum: 9b7bd5a96a2912cf8d06d1a2fd891620
> > Computed checksum: 9b7bd5a96a2912cf8d06d1a2fd891620
> > JAVA=/usr/java/latest/bin/java
> > plain /usr/java/latest/bin/java -Djava=/usr/java/latest/bin/java -Xmx64M
> > -DGLOBUS_TCP_PORT_RANGE=
> > -DX509_USER_PROXY=/home/igor/.globus/sshproxy-1344874142-1432003400
> > -DX509_CERT_DIR=/home/igor/.globus/sshCAcert-1344874142-1432003400.pem
> > -DGLOBUS_HOSTNAME=cluster.mmc.ufjf.br -Duser.home=/home/igor -jar
> > /tmp/bootstrap.xTzo3v http://189.12.232.9:50006 https://189.12.232.9:50005
> > 11100954039
> > Failed to download cog-provider-coaster-0.3.jar:
> > java.lang.RuntimeException: Checksum does not match.
> >
> >
> > Thanks,
> > Igor
> >
> > 2015-05-04 18:52 GMT-03:00 Mihael Hategan <hategan at mcs.anl.gov>:
> >
> > >
> > > Hi,
> > >
> > > In most cases (Globus, coasters), the service side (legion in this case)
> > > needs to be able to connect back to the client (your home connection).
> > >
> > > Correct me if I'm wrong, but you are on a DSL line, behind a router with
> > > NAT. If so, you must configure the router to forward some incoming
> > > connections to the machine from which you are running Swift.
> > > Typically this is done by configuring port-range forwarding on the
> > > router. Yadu suggested GLOBUS_TCP_PORT_RANGE=50000,51000, so the router
> > > should forward that same port range.
> > >
> > > The gist of it is that Swift starts a simple shell script on legion that
> > > downloads a small Java app from the client side and launches it. That
> > > shell script logs to ~/coaster-bootstrap-xxx.log files. The contents of
> > > the bootstrap logs are probably very useful here.
> > >
> > > If all of that goes well, the small Java app downloads the full coaster
> > > service from the client and starts it. Once started, the coaster service
> > > connects back to Swift. These last two stages log their activity in
> > > ~/.globus/coasters/*.log. Those logs can be useful, too, if they exist.
> > >
> > > Mihael
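To sanity-check Mihael's point that the router must forward the same range Swift uses, one could test whether the ports Swift advertises (for example the `:50006` and `:50005` URLs in the bootstrap log above) fall inside GLOBUS_TCP_PORT_RANGE. This is a minimal sketch assuming the variable uses the `min,max` form from Yadu's suggestion; `parse_port_range` and `port_is_forwarded` are hypothetical helpers, not part of Swift or Globus.

```python
import os


def parse_port_range(value):
    """Parse a 'min,max' string such as '50000,51000' into a pair of ints."""
    low_str, high_str = value.split(",")
    low, high = int(low_str), int(high_str)  # int() tolerates surrounding spaces
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    return low, high


def port_is_forwarded(port, value=None):
    """True if port lies inside GLOBUS_TCP_PORT_RANGE (or the given range string)."""
    if value is None:
        value = os.environ.get("GLOBUS_TCP_PORT_RANGE", "")
    if not value:
        # Without a configured range, Swift may pick arbitrary ports.
        return False
    low, high = parse_port_range(value)
    return low <= port <= high
```

For example, `port_is_forwarded(50006, "50000,51000")` is true. A port outside the range would not reach the client through a router that forwards only 50000-51000.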
> > >
> > > On Mon, 2015-05-04 at 18:27 -0300, Igor Russo wrote:
> > > > Hi Yadu,
> > > >
> > > > Yes, I can ssh from my laptop to the cluster directly.
> > > >
> > > > The coaster-bootstrap-*.log files are created on the remote system.
> > > >
> > > > I've attached the log file.
> > > >
> > > > Thanks,
> > > > Igor
> > > >
> > > > 2015-05-04 16:57 GMT-03:00 Yadu Nand Babuji <yadunand at uchicago.edu>:
> > > >
> > > > > Hi Igor,
> > > > >
> > > > > Are you able to ssh from your machine to legion directly without
> > > > > entering passwords?
> > > > > Could you please send us a tarball of the runNNN directories for a
> > > > > failing run?
> > > > >
> > > > > I've put the following settings in ~/.ssh/config on my laptop and set
> > > > > up ssh keys on both socrates and legion. This allows me to run
> > > > > "ssh legion.rc.ucl.ac.uk" and connect.
> > > > >
> > > > > Host legion.rc.ucl.ac.uk
> > > > > User YOUR_USERNAME
> > > > > Hostname legion.rc.ucl.ac.uk
> > > > > ProxyCommand ssh socrates -W %h:%p
> > > > >
> > > > > Host socrates
> > > > > Hostname socrates.ucl.ac.uk
> > > > > User YOUR_USERNAME
> > > > > ForwardAgent yes
> > > > >
> > > > > Thanks,
> > > > > Yadu
> > > > >
> > > > >
> > > > >
> > > > > On 05/04/2015 07:51 AM, Igor Russo wrote:
> > > > >
> > > > > Hi Yadu,
> > > > >
> > > > > Thanks again.
> > > > >
> > > > > I tried your suggestion. I'm no longer getting the previous error, but
> > > > > the jobs aren't being submitted:
> > > > >
> > > > > RunID: run001
> > > > > Progress: Mon, 04 May 2015 09:32:54-0300
> > > > > Progress: Mon, 04 May 2015 09:32:55-0300 Submitting:1
> > > > > Progress: Mon, 04 May 2015 09:33:25-0300 Submitting:1
> > > > > Progress: Mon, 04 May 2015 09:33:55-0300 Submitting:1
> > > > > Progress: Mon, 04 May 2015 09:34:25-0300 Submitting:1
> > > > > Progress: Mon, 04 May 2015 09:34:55-0300 Submitting:1
> > > > > Progress: Mon, 04 May 2015 09:35:25-0300 Submitting:1
> > > > > Progress: Mon, 04 May 2015 09:35:55-0300 Submitting:1
> > > > > Progress: Mon, 04 May 2015 09:36:25-0300 Submitting:1
> > > > >
> > > > > In the log file, I noticed the following errors:
> > > > >
> > > > > 2015-05-04 09:24:06,223-0300 INFO ServiceManager Service does not
> > > > > appear to be registered with this manager
> > > > > 2015-05-04 09:24:06,223-0300 INFO ServiceManager Coaster service ended.
> > > > > Reason: null
> > > > >
> > > > > Thanks,
> > > > > Igor
> > > > >
> > > > >
> > > > > 2015-05-01 17:47 GMT-03:00 Yadu Nand Babuji <yadunand at uchicago.edu>:
> > > > >
> > > > >> Hi Igor,
> > > > >>
> > > > >> The remote connection system requires that the local machine on which
> > > > >> you run the Swift client have a public IP address. It looks like Swift
> > > > >> was not able to guess it and set it to http://igor-ubuntu:51251
> > > > >>
> > > > >> Could you retry running part04 after doing the next step? Please make
> > > > >> sure your environment has these variables set whenever you run Swift
> > > > >> against remote systems:
> > > > >> export GLOBUS_HOSTNAME=<PUBLIC_IP_OF_YOUR_MACHINE>
> > > > >> export GLOBUS_TCP_PORT_RANGE=50000,51000
> > > > >>
> > > > >> Thanks,
> > > > >> Yadu
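A quick pre-flight check of the two variables Yadu asks for could catch the `http://igor-ubuntu:51251` problem before a run. This is a minimal sketch; `looks_public` and `check_globus_env` are hypothetical helpers, and the check is deliberately strict: it accepts only a literal, globally routable IP address, matching the `<PUBLIC_IP_OF_YOUR_MACHINE>` placeholder, and rejects bare machine names and private (NAT) addresses.

```python
import ipaddress
import os


def looks_public(hostname):
    """True only if hostname is a literal IP address that is globally routable."""
    try:
        return ipaddress.ip_address(hostname).is_global
    except ValueError:
        # Not an IP literal, e.g. a bare machine name such as 'igor-ubuntu'.
        return False


def check_globus_env(env=None):
    """Return a list of problems with GLOBUS_HOSTNAME and GLOBUS_TCP_PORT_RANGE."""
    env = os.environ if env is None else env
    problems = []
    host = env.get("GLOBUS_HOSTNAME")
    if not host:
        problems.append("GLOBUS_HOSTNAME is not set")
    elif not looks_public(host):
        problems.append(f"GLOBUS_HOSTNAME={host!r} is not a public IP address")
    if not env.get("GLOBUS_TCP_PORT_RANGE"):
        problems.append("GLOBUS_TCP_PORT_RANGE is not set")
    return problems
```

Behind a NAT router the public address belongs to the router, so passing this check is necessary but not sufficient: the router must also forward the port range to the client machine.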
> > > > >>
> > > > >>
> > > > >> On 05/01/2015 02:29 PM, Igor Russo wrote:
> > > > >>
> > > > >> Hi Yadu,
> > > > >>
> > > > >> Thank you very much!
> > > > >>
> > > > >> I changed the config file with the data from my cluster.
> > > > >>
> > > > >> When executing the 4th part of the Swift-tutorial, I'm getting the
> > > > >> following error:
> > > > >> "Failed to download bootstrap jar from ..."
> > > > >>
> > > > >> --------------------------------------------------------------------------------
> > > > >>
> > > > >> RunID: run031
> > > > >> Progress: Fri, 01 May 2015 15:40:42-0300
> > > > >> Progress: Fri, 01 May 2015 15:40:43-0300 Submitting:1
> > > > >>
> > > > >> Execution failed:
> > > > >> Exception in sort:
> > > > >> Arguments: [-n, unsorted.txt]
> > > > >> Host: mmc
> > > > >> Directory: p4-run031/jobs/s/sort-go28d68m
> > > > >> exception @ swift-int-staging.k, line: 165
> > > > >> Caused by:
> > > > >> exception @ swift-int-staging.k, line: 160
> > > > >> Caused by: null
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
> > > > >> Failed to download bootstrap jar from http://igor-ubuntu:51251
> > > > >>
> > > > >> k:assign @ swift.k, line: 174
> > > > >> Caused by: Exception in sort:
> > > > >> Arguments: [-n, unsorted.txt]
> > > > >> Host: mmc
> > > > >> Directory: p4-run031/jobs/s/sort-go28d68m
> > > > >> exception @ swift-int-staging.k, line: 165
> > > > >> Caused by:
> > > > >> exception @ swift-int-staging.k, line: 160
> > > > >> Caused by: null
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
> > > > >> Caused by:
> > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
> > > > >> Failed to download bootstrap jar from http://igor-ubuntu:51251
> > > > >>
> > > > >> --------------------------------------------------------------------------------
> > > > >>
> > > > >> Thanks,
> > > > >> Igor
> > > > >>
> > > > >> 2015-05-01 13:47 GMT-03:00 Yadu Nand Babuji <yadunand at uchicago.edu>:
> > > > >>
> > > > >>> Hi Igor,
> > > > >>>
> > > > >>> Swift does support SGE clusters, and you can refer to the
> > > > >>> swift-tutorial for sample code and configurations:
> > > > >>> https://github.com/swift-lang/swift-tutorial
> > > > >>>
> > > > >>> Here's a sample config from our test suite for Godzilla, an SGE
> > > > >>> cluster at UChicago:
> > > > >>>
> > > > >>> https://github.com/swift-lang/swift-k/blob/master/tests/sites/godzilla/swift.conf
> > > > >>> You could modify this config and add it to the swift.conf file in the
> > > > >>> swift-tutorial to run Swift on any machine and execute on a remote SGE
> > > > >>> cluster.
> > > > >>>
> > > > >>> SGE is a widely used resource manager, and most sites have differences
> > > > >>> in their setups that make each one unique. If you run into issues with
> > > > >>> the default Swift package and can help us figure out the specifics of
> > > > >>> your cluster, we will help you adapt the Swift SGE provider to support
> > > > >>> it.
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Yadu
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On 04/28/2015 05:09 PM, Igor Russo wrote:
> > > > >>>
> > > > >>> Hi All,
> > > > >>>
> > > > >>> Is it possible to use Swift with a remote SGE/OGE cluster?
> > > > >>>
> > > > >>> Regards,
> > > > >>> Igor
> > > > >>>
> > > > >>>
> > > > >>> _______________________________________________
> > > > >>> Swift-user mailing list
> > > > >>> Swift-user at ci.uchicago.edu
> > > > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > > >
> > >
>