[Swift-user] Channel Timeout on Beagle?

Mihael Hategan hategan at mcs.anl.gov
Fri May 29 13:11:34 CDT 2015


My initial suspicion would be running out of space on /dev/shm.

But I think we need more information.
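
One way to test the /dev/shm suspicion is to look at how full it is on a
compute node while jobs are running. The following is a minimal Python
sketch using only the standard library. The function name shm_usage is
hypothetical (it is not part of Swift), and the check only makes sense
when run on the compute node itself, since /dev/shm is node-local.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (used_fraction, free_bytes) for the filesystem holding `path`.

    `path` defaults to /dev/shm, the location under which the quoted
    sites.xml places the Swift work directory.
    """
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return used_fraction, usage.free


if __name__ == "__main__":
    try:
        frac, free = shm_usage()
        print(f"/dev/shm: {frac:.1%} used, {free / 2**20:.0f} MiB free")
    except FileNotFoundError:
        # /dev/shm does not exist on every system (e.g. non-Linux hosts).
        print("/dev/shm not found on this machine")
```

If the used fraction climbs toward 100% as the run progresses, that would
be consistent with the workers dying from lack of space; if it stays low,
the cause is probably elsewhere.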

Swift 0.96 has worker health probes that periodically check the status
of various things, such as filesystem usage. That's choice #1.

#2 is worker logging, which is supported in 0.95. You enable it by
adding

<profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>

inside the relevant site (the <pool> element in your sites file). This
should give you a bunch of worker logs in
$userHomeOverride/.globus/coasters (I think; Yadu, correct me if I'm
wrong). They may provide more details about why the coaster workers are
misbehaving.
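
Once worker logging is on, the most recently written logs are usually the
ones worth reading first. The sketch below lists the newest files under a
log directory. The helper name newest_logs is hypothetical, and the
directory is an assumption: it follows the $userHomeOverride/.globus/coasters
guess above, which has not been confirmed. Note also that the
commented-out workerLoggingLevel line in the quoted sites file uses
namespace="karajan", while the profile shown above uses "globus".

```python
from pathlib import Path


def newest_logs(log_dir, limit=5):
    """Return up to `limit` regular files under `log_dir`, newest first.

    Returns an empty list if `log_dir` does not exist, so it is safe to
    call before any worker has written a log.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return []
    # Search recursively, since the layout below the directory is unknown.
    files = [p for p in root.rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

With the userHomeOverride from the quoted sites file, the call would be
newest_logs("/lustre/beagle2/mattshax/epsweep/swifthome/.globus/coasters"),
assuming the logs really do land there.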

Mihael

On Fri, 2015-05-29 at 13:59 -0400, Matthew Shaxted wrote:
> Mihael, please see the Swift run001 folder at the link below:
> 
> http://web.ci.uchicago.edu/~mattshax/epsweep-run001.tar.gz
> 
> MATTHEW SHAXTED
> SKIDMORE, OWINGS & MERRILL LLP
> 224 SOUTH MICHIGAN AVENUE
> CHICAGO, IL 60604
> T  (312) 360-4368
> MATTHEW.SHAXTED at SOM.COM
> 
> 
> The information contained in this communication may be confidential, is intended only for the use of the recipient(s) named above, and may be legally privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited and may be unlawful. If you have received this communication in error, please return it to the sender immediately and delete the original message and any copy of it from your computer system. If you have any questions concerning this message, please contact the sender.
> 
> 
> 
> -----Original Message-----
> From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Mihael Hategan
> Sent: Friday, May 29, 2015 12:48 PM
> To: Matthew Shaxted
> Cc: 'Swift User'
> Subject: Re: [Swift-user] Channel Timeout on Beagle?
> 
> Hi Matthew,
> 
> Can you send me the full swift log?
> 
> Mihael
> 
> On Fri, 2015-05-29 at 11:39 -0400, Matthew Shaxted wrote:
> > It looks like the timeout problem is not actually solved. For some reason I am having a lot of difficulty running on Beagle, and I have a feeling it is due to slow read/write.
> > 
> > For example, I finished ~1,200 of 12,000 runs before the failure (see the error below), and moving these results (the result files are not very large) to public_html is taking an hour or so. I'm hoping to scale up to 100-300k runs, so this will become a significant bottleneck. I have just emailed beagle-support about this issue.
> > 
> > In all test environments my Swift workflow is working well, but when submitting jobs to the Beagle queue, it completes some number of simulations before the timeout error occurs and all jobs stop. I'm using Swift-0.95-RC7 (and am in the process of updating to the latest 0.95), but I think these errors may also be due to this slow read/write.
> > 
> > Any suggestions?
> > 
> > Below is the error I see, after which the job stops completely:
> > 
> > Host: cluster
> > Directory: epsweep-run004/jobs/a/RunEP-ai2mic9m
> > exception @ swift-int-staging.k, line: 181
> > Caused by: exception @ swift-int-staging.k, line: 177
> > Caused by: Block task failed: Connection to worker lost
> > org.globus.cog.coaster.TimeoutException: Channel timed out. lastTime=150526-142313.128, 50526-142514.107, channel=TCPChannel [type: server, contact: 0526-0802460-000014-000456
> >         at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
> >         at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
> >         at java.util.TimerThread.mainLoop(Timer.java:566)
> >         at java.util.TimerThread.run(Timer.java:516)
> > 
> > 
> > From: Matthew Shaxted
> > Sent: Wednesday, May 27, 2015 2:04 PM
> > To: 'Swift User'
> > Subject: RE: Channel Timeout on Beagle?
> > 
> > Hi All: I was able to get the runs working successfully by changing the maxtime flag in the sites file.
> > 
> > Thanks
> > 
> > 
> > From: Matthew Shaxted
> > Sent: Wednesday, May 27, 2015 9:50 AM
> > To: Swift User
> > Subject: Channel Timeout on Beagle?
> > 
> > Hi Swift Users:
> > 
> > I am running some studies on Beagle using Swift and am experiencing a strange error. The Swift scripts run great in the cloud and on the Beagle login node, but they seem to be timing out for some reason.
> > 
> > Does anyone have insight into the cause of this? Thanks for any help.
> > 
> > Below is the error I am getting:
> > 
> > Host: cluster
> > Directory: epsweep-run004/jobs/a/RunEP-ai2mic9m
> > exception @ swift-int-staging.k, line: 181
> > Caused by: exception @ swift-int-staging.k, line: 177
> > Caused by: Block task failed: Connection to worker lost
> > org.globus.cog.coaster.TimeoutException: Channel timed out. lastTime=150526-142313.128, 50526-142514.107, channel=TCPChannel [type: server, contact: 0526-0802460-000014-000456
> >         at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
> >         at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
> >         at java.util.TimerThread.mainLoop(Timer.java:566)
> >         at java.util.TimerThread.run(Timer.java:516)
> > 
> > Below is my sites.xml file:
> > 
> > <pool handle="cluster">
> >     <execution provider="coaster" jobmanager="local:pbs" />
> >     <profile namespace="globus" key="project">CI-SES000178</profile>
> >     <profile namespace="globus" key="jobsPerNode">24</profile>
> >     <profile namespace="globus" key="lowOverAllocation">100</profile>
> >     <profile namespace="globus" key="highOverAllocation">100</profile>
> >     <profile namespace="globus" key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>
> >     <profile namespace="globus" key="maxtime">10800</profile>
> >     <profile namespace="globus" key="maxWalltime">01:25:00</profile>
> >     <profile namespace="globus" key="userHomeOverride">/lustre/beagle2/mattshax/epsweep/swifthome</profile>
> >     <profile namespace="globus" key="slots">20</profile>
> >     <profile namespace="globus" key="maxnodes">600</profile>
> >     <profile namespace="globus" key="nodeGranularity">1</profile>
> >     <profile namespace="karajan" key="jobThrottle">180</profile>
> >     <profile namespace="karajan" key="initialScore">10000</profile>
> >     <!-- <profile namespace="karajan" key="workerLoggingLevel">trace</profile> -->
> >     <workdirectory>/dev/shm/mattshax/swiftapp</workdirectory>
> > </pool>
> > 
> > 
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 