[Swift-devel] alcfbgpnat and BG/P compute-node-to-login-host connectivity

Michael Wilde wilde at mcs.anl.gov
Wed Dec 1 12:18:35 CST 2010


was: Re: [Swift-devel] coaster-service error on Intrepid

Mihael, how does "alcfbgpnat" work, and what does that imply for running manual persistent coasters on BG/P with the workers launched from a single qsub job?

I'm probing on Surveyor at the moment, trying to figure out how worker.pl can reach a persistent coaster service on the login node, and I seem unable to ping login6 from a compute node.

Does the worker.pl script (or coaster service) do something special when alcfbgpnat is set to enable connectivity?
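
Since a failed ping only shows that ICMP is not getting through, a TCP-level probe of the service port may be a more direct test of what worker.pl needs. Below is a minimal sketch, assuming a Python interpreter is available on the compute node (not verified for the ZeptoOS image); the function name tcp_reachable is made up for illustration, and the host and port in the usage note are placeholders taken from this thread.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it is not part of Swift or the
    coaster code. Name-resolution failures, refused connections, and
    timeouts are all reported as "not reachable".
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers socket.gaierror, ConnectionRefusedError, and timeouts.
        return False
```

From a compute node this could be called as tcp_reachable("login6", 2390), where 2390 is only the port used in the quoted Intrepid run, not necessarily the one in use on Surveyor. A False result combined with a failed ping would suggest that no route or NAT path exists at all, rather than ICMP alone being filtered.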

- Mike


----- Original Message -----
> Justin, I was experimenting on PADS with the persistent coaster
> service; that's where I tested Mihael's fix, which enabled the service
> to be used repeatedly and to remain up for extended periods of time.
> 
> I just started yesterday trying to move that to the BG/P - I think for
> the same reason as you.
> 
> My script is in /home/wilde/swift/lab/pecos/start-coasters on
> Surveyor.
> 
> I'll stop by to see if we can get this working, as it will help us
> both on the CDM runs.
> 
> One thing to note: I run one artificial job to put the service into
> passive mode, which seems necessary to enable externally started
> workers to connect to it. Ideally we'll soon just make this a command
> line flag to the service.
> 
> - Mike
> 
> 
> ----- Original Message -----
> > Hello all
> > I'm getting started with the coaster-service on Intrepid. I start
> > up the service and the first run completes. The second fails with
> > the trace below. sites.xml is also included below. I'm looking into
> > this but I thought I should post it...
> > Justin
> >
> > Intrepid: ~> coaster-service -p 2390 -nosec
> > Started coaster service: http://140.221.82.115:2390
> > original callback URI is http://10.40.5.144:32907
> > callback URI has been overridden to http://172.17.5.144:32907
> > Failed to send remote log message
> > org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
> > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
> > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
> > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
> > at org.globus.cog.abstraction.coaster.rlog.RemoteLogger.log(RemoteLogger.java:31)
> > at org.globus.cog.abstraction.coaster.service.job.manager.Block.start(Block.java:87)
> > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.addBlock(BlockQueueProcessor.java:213)
> > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.allocateBlocks(BlockQueueProcessor.java:395)
> > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:518)
> > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:100)
> >
> > <pool handle="coasters_alcfbgp">
> > <filesystem provider="local" />
> > <execution provider="coaster-persistent"
> > jobmanager="local:cobalt"
> > url="http://140.221.82.115:2390"
> > />
> > <!-- <profile namespace="swift" key="stagingMethod">local</profile> -->
> > <profile namespace="globus" key="internalHostname">172.17.5.144</profile>
> > <profile namespace="globus" key="project">HTCScienceApps</profile>
> > <profile namespace="globus" key="queue">prod-devel</profile>
> > <profile namespace="globus" key="kernelprofile">zeptoos</profile>
> > <profile namespace="globus" key="alcfbgpnat">true</profile>
> > <profile namespace="karajan" key="jobthrottle">21</profile>
> > <profile namespace="karajan" key="initialScore">10000</profile>
> > <profile namespace="globus" key="workersPerNode">1</profile>
> > <profile namespace="globus" key="slots">1</profile>
> > <profile namespace="globus" key="maxTime">3300</profile>
> > <profile namespace="globus" key="nodeGranularity">64</profile>
> > <profile namespace="globus" key="maxNodes">64</profile>
> > <profile namespace="globus" key="hookClass">org.globus.swift.data.policy.AllocationHook</profile>
> > <!-- <scratch>/scratch</scratch> -->
> > <workdirectory>/home/wozniak/work</workdirectory>
> > </pool>
> >
> >
> > --
> > Justin M Wozniak
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory