From jozik at uchicago.edu Thu Jul 3 17:18:21 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 3 Jul 2014 17:18:21 -0500 Subject: [Swift-user] Swift 0.95 RC6 on Windows Message-ID: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> Hi all, I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven't been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: Swift 0.95 RC6 swift-r7900 cog-r3908 RunID: 20140703-1433-7iq5x697 Progress: Thu, 03 Jul 2014 14:33:49-0500 Execution failed: Exception in echo: Arguments: [hi] Host: localhost Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl exception @ swift-int.k, line: 530 Caused by: null Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid Any thoughts on this? Jonathan From hategan at mcs.anl.gov Thu Jul 3 18:09:58 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 3 Jul 2014 16:09:58 -0700 Subject: [Swift-user] Swift 0.95 RC6 on Windows In-Reply-To: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> References: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> Message-ID: <1404428998.22636.2.camel@echo> Hi Jonathan, We need to update the quick start guide. There is a missing bit if you are running on Windows. The details are here: http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_running_on_windows That said, I don't think we routinely test swift on Windows, and I haven't maintained it, but I will give it a shot and see if it needs any fixes. Mihael On Thu, 2014-07-03 at 17:18 -0500, Jonathan Ozik wrote: > Hi all, > > I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven't been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: > > Swift 0.95 RC6 swift-r7900 cog-r3908 > RunID: 20140703-1433-7iq5x697 > Progress: Thu, 03 Jul 2014 14:33:49-0500 > > Execution failed: > Exception in echo: > Arguments: [hi] > Host: localhost > Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl > exception @ swift-int.k, line: 530 > Caused by: null > Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid > Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid > > Any thoughts on this? > > Jonathan > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Thu Jul 3 21:13:35 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 3 Jul 2014 19:13:35 -0700 Subject: [Swift-user] Swift 0.95 RC6 on Windows In-Reply-To: <1404428998.22636.2.camel@echo> References: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> <1404428998.22636.2.camel@echo> Message-ID: <1404440015.24691.4.camel@echo> Hi again, There were two problems: One was compiling swift on Windows (i.e. running "ant dist" in the swift directory on a windows machine).
This is now fixed, but should not affect you if you are using a binary distribution. The second was an out-of-date _swiftwrap.vbs. I fixed this also and committed to SVN. You can either compile 0.95 from sources, or, alternatively, remove the following lines from libexec/_swiftwrap.vbs: expectArg("k") KICKSTART=getOptArg() Things should work after that. Regarding the new config mechanism and sysinfo="INTEL32::WINDOWS", on a cursory look at the code, it does not seem to be supported. You will have to use the old mechanism (i.e. sites.xml) for now. Mihael On Thu, 2014-07-03 at 16:09 -0700, Mihael Hategan wrote: > Hi Jonathan, > > We need to update the quick start guide. There is a missing bit if you > are running on Windows. The details are here: > http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_running_on_windows > > That said, I don't think we routinely test swift on Windows, and I > haven't maintained it, but I will give it a shot and see if it needs any > fixes. > > Mihael > > On Thu, 2014-07-03 at 17:18 -0500, Jonathan Ozik wrote: > > Hi all, > > > > I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven?t been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: > > > > Swift 0.95 RC6 swift-r7900 cog-r3908 > > RunID: 20140703-1433-7iq5x697 > > Progress: Thu, 03 Jul 2014 14:33:49-0500 > > > > Execution failed: > > Exception in echo: > > Arguments: [hi] > > Host: localhost > > Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl > > exception @ swift-int.k, line: 530 > > Caused by: null > > Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid > > Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid > > > > Any thoughts on this? > > > > Jonathan > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > From jozik at uchicago.edu Thu Jul 3 21:35:10 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 3 Jul 2014 21:35:10 -0500 Subject: [Swift-user] Swift 0.95 RC6 on Windows In-Reply-To: <1404440015.24691.4.camel@echo> References: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> <1404428998.22636.2.camel@echo> <1404440015.24691.4.camel@echo> Message-ID: Thanks Mihael! I'll pass this on. Do you happen to remember if both the new and old configuration files can be used together? I mean including both -properties and -sites command line arguments. Jonathan > On Jul 3, 2014, at 9:13 PM, Mihael Hategan wrote: > > Hi again, > > There were two problems: > > One was compiling swift on Windows (i.e. running "ant dist" in the swift > directory on a windows machine). This is now fixed, but should not > affect you if you are using a binary distribution. > > The second was an out-of-date _swiftwrap.vbs. I fixed this also and > committed to SVN. You can either compile 0.95 from sources, or, > alternatively, remove the following lines from libexec/_swiftwrap.vbs: > > expectArg("k") > KICKSTART=getOptArg() > > Things should work after that. > > Regarding the new config mechanism and sysinfo="INTEL32::WINDOWS", on a > cursory look at the code, it does not seem to be supported. 
You will > have to use the old mechanism (i.e. sites.xml) for now. > > Mihael > > >> On Thu, 2014-07-03 at 16:09 -0700, Mihael Hategan wrote: >> Hi Jonathan, >> >> We need to update the quick start guide. There is a missing bit if you >> are running on Windows. The details are here: >> http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_running_on_windows >> >> That said, I don't think we routinely test swift on Windows, and I >> haven't maintained it, but I will give it a shot and see if it needs any >> fixes. >> >> Mihael >> >>> On Thu, 2014-07-03 at 17:18 -0500, Jonathan Ozik wrote: >>> Hi all, >>> >>> I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven?t been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: >>> >>> Swift 0.95 RC6 swift-r7900 cog-r3908 >>> RunID: 20140703-1433-7iq5x697 >>> Progress: Thu, 03 Jul 2014 14:33:49-0500 >>> >>> Execution failed: >>> Exception in echo: >>> Arguments: [hi] >>> Host: localhost >>> Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl >>> exception @ swift-int.k, line: 530 >>> Caused by: null >>> Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid >>> Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid >>> >>> Any thoughts on this? >>> >>> Jonathan >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > From hategan at mcs.anl.gov Thu Jul 3 22:46:53 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 3 Jul 2014 20:46:53 -0700 Subject: [Swift-user] Swift 0.95 RC6 on Windows In-Reply-To: References: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> <1404428998.22636.2.camel@echo> <1404440015.24691.4.camel@echo> Message-ID: <1404445613.25475.3.camel@echo> On Thu, 2014-07-03 at 21:35 -0500, Jonathan Ozik wrote: > Thanks Mihael! > I'll pass this on. Do you happen to remember if both the new and old configuration files can be used together? I mean including both -properties and -sites command line arguments. Probably not. The new mechanism generates a sites file, and you can only feed swift one sites file. In a pinch, you could hack bin/swiftrun which generates the sites file from the properties file. Line 75 might be of particular interest. Although we will most likely have a fix for this issue soon. 
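To make the distinction concrete, the two invocation styles being discussed look roughly like this (the option names below follow the ones mentioned in this thread and may not match your release exactly; check swift -help, e.g. older releases spell the first one -sites.file):

  # old mechanism: pass an explicit sites file
  swift -sites sites.xml hello.swift

  # new mechanism: describe the site in a single properties file
  swift -properties swift.properties hello.swift

As noted above, the two cannot currently be combined, since the properties mechanism generates its own sites file internally and swift only accepts one sites file.
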
Mihael From iraicu at cs.iit.edu Thu Jul 10 21:10:27 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Thu, 10 Jul 2014 21:10:27 -0500 Subject: [Swift-user] CFP: 7th IEEE Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS) 2014 @ IEEE/ACM SC14 Message-ID: <53BF4793.7030304@cs.iit.edu> Call for Papers --------------------------------------------------------------------------------------- The 7th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS) 2014 http://datasys.cs.iit.edu/events/MTAGS14/ --------------------------------------------------------------------------------------- November 16th, 2014 New Orleans, Louisiana, USA Co-located with with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC14) In cooperation with ACM SIGHPC ======================================================================================= The 7th workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of the workshop encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions on all theoretical, simulations, and systems topics related to MTC, but we give special consideration to papers addressing petascale to exascale challenges. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (pending approval). The workshop will be co-located with the IEEE/ACM Supercomputing 2014 Conference in New Orleans on November 17th, 2014. For more information, please see http://datasys.cs.iit.edu/events/MTAGS14/. For more information on past workshops, please see MTAGS13, MTAGS12, MTAGS11, MTAGS10, MTAGS09, and MTAGS08. We also ran a Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS) which appeared in June 2011; the proceedings can be found online at http://www.computer.org/portal/web/csdl/abs/trans/td/2011/06/ttd201106toc.htm. We, the workshop organizers, also published a highly relevant paper that defines Many-Task Computing which was published in MTAGS08, titled ?Many-Task Computing for Grids and Supercomputers?; we encourage potential authors to read this paper, and to clearly articulate in your paper submissions how your papers are related to Many-Task Computing. Topics --------------------------------------------------------------------------------------- We invite the submission of original work that is related to the topics below. The papers should be 6 pages, including all figures and references. We aim to cover topics related to Many-Task Computing on each of the three major distributed systems paradigms, Cloud Computing, Grid Computing and Supercomputing. 
Topics of interest include: * Compute Resource Management * Scheduling * Job execution frameworks * Local resource manager extensions * Performance evaluation of resource managers in use on large scale systems * Dynamic resource provisioning * Techniques to manage many-core resources and/or GPUs * Challenges and opportunities in running many-task workloads on HPC systems * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Storage architectures and implementations * Distributed file systems * Parallel file systems * Distributed meta-data management * Content distribution systems for large data * Data caching frameworks and techniques * Data management within and across data centers * Data-aware scheduling * Data-intensive computing applications * Eventual-consistency storage usage and management * Programming models and tools * Map-reduce and its generalizations * Many-task computing middleware and applications * Parallel programming frameworks * Ensemble MPI techniques and frameworks * Service-oriented science applications * Large-Scale Workflow Systems * Workflow system performance and scalability analysis * Scalability of workflow systems * Workflow infrastructure and e-Science middleware * Programming Paradigms and Models * Large-Scale Many-Task Applications * High-throughput computing (HTC) applications * Data-intensive applications * Quasi-supercomputing applications, deployments, and experiences * Performance Evaluation * Performance evaluation * Real systems * Simulations * Reliability of large systems Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 6 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per ACM 8.5 x 11 manuscript guidelines; document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. The final 6 page papers in PDF format must be submitted online at https://cmt.research.microsoft.com/MTAGS2014/ before the deadline of August 25th, 2014 at 11:59PM PST (note that an abstract must be submitted 1 week prior to the deadline, on August 18th, 2014). Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (in cooperation with SIGHPC). Notifications of the paper decisions will be sent out by September 22nd, 2014. Accepted workshop papers will be eligible for additional post-conference publication as journal articles in the IEEE Transaction on Cloud Computing, Special Issue on Many-Task Computing in the Cloud (papers will be due in February 2015, file:///C:/Users/iraicu/Documents/Webs/DataSys/events/TCC-MTC15/index.html). Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please see http://datasys.cs.iit.edu/events/MTAGS14/. 
Important Dates --------------------------------------------------------------------------------------- * Abstract Due: August 18th, 2014 * Papers Due: August 25th, 2014 * Notification of Acceptance: September 22nd, 2014 * Camera Ready Papers Due: October 6th, 2014 * Workshop Date: November 16th, 2014 Committee Members --------------------------------------------------------------------------------------- Workshop Chairs * Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory * Justin Wozniak, University of Chicago & Argonne National Laboratory * Ian Foster, University of Chicago & Argonne National Laboratory * Yong Zhao, University of Electronic Science and Technology of China Steering Committee * David Abramson, Monash University, Australia * Jack Dongarra, University of Tennessee, USA * Geoffrey Fox, Indiana University, USA * Manish Parashar, Rutgers University, USA * Marc Snir, University of Illinois at Urbana Champaign, USA * Xian-He Sun, Illinois Institute of Technology, USA * Weimin Zheng, Tsinghua University, China Technical Committee * Hasan Abbasi, Oak Ridge National Labs, USA * Tim Armstrong, University of Chicago, USA * Roger Barga, Microsoft, USA * Rajkumar Buyya University of Melbourne, Australia * Kyle Chard, University of Chicago, USA * Evangelinos Constantinos, Massachusetts Institute of Technology, USA * Catalin Dumitrescu, Fermi National Labs, USA * Haryadi Gunawi, University of Chicago, USA * Indranil Gupta, University of Illinois at Urbana Champaign, USA * Alexandru Iosup, Delft University of Technology, Netherlands * Florin Isaila, Universidad Carlos III de Madrid, Spain & Argonne National Laboratory, USA * Kamil Iskra, Argonne National Laboratory, USA * Daniel S. Katz, University of Chicago, USA * Jik-Soo Kim, Kristi, Korea * Scott A. Klasky, Oak Ridge National Labs, USA * Mike Lang, Los Alamos National Laboratory, USA * Tonglin Li, Illinois Institute of Technology, USA * Chris Moretti, Princeton University, USA * David O'Hallaron, Carnegie Mellon University, USA * Marlon Pierce, Indiana University, USA * Judy Qiu, Indiana University, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA * Matei Ripeanu, University of British Columbia, Canada * Wei Tang, Argonne National Laboratory, USA * Edward Walker, Whitworth University, USA * Ke Wang, Illinois Institute of Technology, USA * Matthew Woitaszek, Walmart Labs, USA * Rich Wolski, University of California, Santa Barbara, USA * Zhifeng Yun, University of Houston, USA * Ziming Zheng, University of Chicago, USA -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= From iraicu at cs.iit.edu Sat Jul 12 09:20:36 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Sat, 12 Jul 2014 09:20:36 -0500 Subject: [Swift-user] Call for Papers: IEEE Transactions on Cloud Computing - Special Issue on Many-Task Computing in the Cloud Message-ID: <53C14434.3050806@cs.iit.edu> Call for Papers --------------------------------------------------------------------------------------- IEEE Transaction on Cloud Computing Special Issue on Many-Task Computing in the Cloud http://datasys.cs.iit.edu/events/TCC-MTC15/ ======================================================================================= The Special Issue on Many-Task Computing (MTC) in the Cloud will provide the scientific community a dedicated forum, within the prestigious IEEE Transactions on Cloud Computing journal, for presenting new research, development, and deployment efforts of loosely coupled large scale applications on Cloud Computing infrastructure. MTC, the theme of this special issue, encompasses loosely coupled applications, which are generally composed of many tasks to achieve some larger application goal. This special issue will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file-system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions in theoretical, simulations, and systems topics with special consideration to papers addressing the intersection of petascale/exascale challenges with large-scale cloud computing. We seek submission of papers that present new, original and innovative ideas for the "first" time in TCC (Transactions on Cloud Computing). That means, submission of "extended versions" of already published works (e.g., conference/workshop papers) is not encouraged unless they contain significant number of "new and original" ideas/contributions along with more than 49% brand "new" material. For more information on past workshops and special issues on Many-Task Computing, see http://datasys.cs.iit.edu/events/MTAGS/index.html. We ran a Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS) which appeared in June 2011; the proceedings can be found online at http://www.computer.org/portal/web/csdl/abs/trans/td/2011/06/ttd201106toc.htm. 
The special issue editors also published a highly relevant paper that defines Many-Task Computing which was published in the inaugural MTAGS08 workshop, titled "Many-Task Computing for Grids and Supercomputers"; we encourage potential authors to read this paper, and to clearly articulate in your paper submissions how your papers are related to Many-Task Computing. For more information on this special issue, please see http://datasys.cs.iit.edu/events/TCC-MTC15/. Topics --------------------------------------------------------------------------------------- We seek submission of papers that present new, original and innovative ideas for the "first" time in TCC (Transactions on Cloud Computing). That means, submission of "extended versions" of already published works (e.g., conference/workshop papers) will only be encouraged if they contain significant number of "new and original" ideas/contributions along with more than 49% brand "new" material. TCC expects submissions to be complete in all respects including author names, affiliation, bios etc. Manuscript should be 14 double column pages (all regular paper page limits include references and author biographies). We aim to cover topics related to Many-Task Computing and Cloud Computing. Topics of interest include: * Compute Resource Management * Scheduling * Job execution frameworks * Local resource manager extensions * Performance evaluation of resource managers in use on large scale systems * Dynamic resource provisioning * Techniques to manage many-core resources and/or GPUs * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Storage architectures and implementations * Distributed file systems * Parallel file systems * Distributed meta-data management * Content distribution systems for large data * Data caching frameworks and techniques * Data management within and across data centers * Data-aware scheduling * Data-intensive computing applications * Eventual-consistency storage usage and management * Programming models and tools * Map-reduce and its generalizations * Many-task computing middleware and applications * Parallel programming frameworks * Ensemble MPI techniques and frameworks * Service-oriented science applications * Large-Scale Workflow Systems * Workflow system performance and scalability analysis * Scalability of workflow systems * Workflow infrastructure and e-Science middleware * Programming Paradigms and Models * Large-Scale Many-Task Applications * High-throughput computing (HTC) applications * Data-intensive applications * Quasi-supercomputing applications, deployments, and experiences * Performance Evaluation * Performance evaluation * Real systems * Simulations * Reliability of large systems Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit unpublished and original work to the IEEE Transactions on Cloud Computing (TCC), Special Issue on Many-Task Computing in the Cloud. If the paper is extended from an initial work, the submission must contain at least 50% new material that can be qualified as ?brand? new ideas and results. The paper must be in the IEEE TCC format, namely 14 double-column pages or 30 single-column pages (Note: All regular paper page limits include references and author biographies). Please note that the double-column format will translate more readily into the final publication format. A double-column page is defined as a 7.875??10.75? 
page with 10-point type, 12-point vertical spacing, and 0.5 inch margins. A single-column page is defined as an 8.5??11? page with 12-point type and 24-point vertical spacing, containing approximately 250 words. All of the margins should be one inch (top, bottom, right and left). These length limits are taking into account reasonably-sized figures and references. Papers must be submitted using the submission system: https://mc.manuscriptcentral.com/tcc-cs, by selecting the special issue option ?SI-MTC?. For more information, please see http://datasys.cs.iit.edu/events/TCC-MTC15/. Important Dates --------------------------------------------------------------------------------------- * Abstract Due: February 2nd, 2015 * Papers Due: February 9th, 2015 * First round decisions: May 18th, 2015 * Major Revisions if needed: July 18th, 2015 * Final decisions: August 18th, 2015 * Publication Date: Fall 2015 (may vary depending on production queue) Guest Editors --------------------------------------------------------------------------------------- * Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory * Justin Wozniak, University of Chicago & Argonne National Laboratory * Ian Foster, University of Chicago & Argonne National Laboratory * Yong Zhao, University of Electronic Science and Technology of China -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= From iraicu at cs.iit.edu Tue Jul 15 09:19:04 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Tue, 15 Jul 2014 09:19:04 -0500 Subject: [Swift-user] Call for Papers: IEEE Transactions on Cloud Computing - Special Issue on Scientific Cloud Computing (deadline Jul 31, 2014) Message-ID: <53C53858.9000904@cs.iit.edu> Dear colleagues, This is just a friendly reminder about the upcoming deadline (July 31st, 2014) for the special issue on Scientific Cloud Computing. ------------------------------------------------------------------------------- Call for Papers IEEE Transactions on Cloud Computing Special Issue on Scientific Cloud Computing http://datasys.cs.iit.edu/events/ScienceCloud2014-TCC/ ------------------------------------------------------------------------------- IMPORTANT DATES Paper Submissions Due: July 31, 2014 First Round Decision: September 30,2014 Major Revisions Due (if necessary): October 31, 2014 Final Decision: December 1, 2014 Journal Publication: TBD ------------------------------------------------------------------------------- OVERVIEW Computational and Data-Driven Sciences have become the third and fourth pillar of scientific discovery in addition to experimental and theoretical sciences. 
Scientific Computing has already begun to change how science is done, enabling scientific breakthroughs through new kinds of experiments that would have been impossible only a decade ago. It is the key to solving "grand challenges" in many domains and providing breakthroughs in new knowledge, and it comes in many shapes and forms: high-performance computing (HPC) which is heavily focused on compute-intensive applications; high-throughput computing (HTC) which focuses on using many computing resources over long periods of time to accomplish its computational tasks; many-task computing (MTC) which aims to bridge the gap between HPC and HTC by focusing on using many resources over short periods of time; and data-intensive computing which is heavily focused on data distribution, data-parallel execution, and harnessing data locality by scheduling of computations close to the data. Today's "Big Data" trend is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. Not surprisingly, it becomes increasingly difficult to design and operate large scale systems capable of addressing these grand challenges. This journal Special Issue on Scientific Cloud Computing in the IEEE Transaction on Cloud Computing will provide the scientific community a dedicated forum for discussing new research, development, and deployment efforts in running these kinds of scientific computing workloads on Cloud Computing infrastructures. This special issue will focus on the use of cloud-based technologies to meet new compute-intensive and data-intensive scientific challenges that are not well served by the current supercomputers, grids and HPC clusters. The special issue will aim to address questions such as: What architectural changes to the current cloud frameworks (hardware, operating systems, networking and/or programming models) are needed to support science? Dynamic information derived from remote instruments and coupled simulation, and sensor ensembles that stream data for real-time analysis are important emerging techniques in scientific and cyber-physical engineering systems. How can cloud technologies enable and adapt to these new scientific approaches dealing with dynamism? How are scientists using clouds? Are there scientific HPC/HTC/MTC workloads that are suitable candidates to take advantage of emerging cloud computing resources with high efficiency? Commercial public clouds provide easy access to cloud infrastructure for scientists. What are the gaps in commercial cloud offerings and how can they be adapted for running existing and novel eScience applications? What benefits exist by adopting the cloud model, over clusters, grids, or supercomputers? What factors are limiting clouds use or would make them more usable/efficient? ------------------------------------------------------------------------------- TOPICS The topics of interest are, but not limited to, the application of Cloud in scientific applications: ? Scientific application cases studies on Clouds ? Performance evaluation of Cloud technologies ? Fault tolerance and reliability in cloud systems ? Data-intensive workloads and tools on Clouds ? Programming models such as Map-Reduce ? Storage cloud architectures ? I/O and Data management in the Cloud ? Workflow and resource management in the Cloud ? NoSQL databases for scientific applications ? Data streaming and dynamic applications on Clouds ? Dynamic resource provisioning ? 
Many-Task Computing in the Cloud ? Application of cloud concepts in HPC environments ? Virtualized High performance parallel file systems ? Virtualized high performance I/O networks ? Virtualization and its Impact on Applications ? Distributed Operating Systems ? Many-core computing and accelerators in the Cloud ? Cloud security ------------------------------------------------------------------------------- SUBMISSION INSTRUCTIONS Authors are invited to submit papers with unpublished, original work to the IEEE Transactions on Cloud Computing, Special Issue on Scientific Cloud Computing. If the paper is extended from a workshop or conference paper, it must contain at least 50% new material with "brand" new ideas and results. The papers should not be longer than 14 double column pages in the IEEE TCC format. Papers should be submitted directly to TCC at https://mc.manuscriptcentral.com/tcc-cs, and "SI-ScienceCloud" should be selected. ------------------------------------------------------------------------------- ORGANIZERS ? Kate Keahey, University of Chicago & Argonne National Laboratory, USA ? Ioan Raicu, Illinois Institute of Technology & Argonne National Lab., USA ? Kyle Chard, University of Chicago & Argonne National Laboratory, USA ? Bogdan Nicolae, IBM Research, Ireland ------------------------------------------------------------------------------- CONTACT Email:sciencecloud2014-tcc-editors at datasys.cs.iit.edu Website:http://datasys.cs.iit.edu/events/ScienceCloud2014-TCC/ ---------------------- -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.iit.edu Wed Jul 23 16:23:56 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Wed, 23 Jul 2014 16:23:56 -0500 Subject: [Swift-user] CFP: 1st International Workshop on Collaborative methodologies to Accelerate Scientific Knowledge discovery in big data (CASK) 2014 @ IEEE BigData 2014 Message-ID: <53D027EC.70108@cs.iit.edu> Call for Papers ------------------------------------------------------------------------------- 1st International Workshop on Collaborative methodologies to Accelerate Scientific Knowledge discovery in big data (CASK) 2014 Oct 27-30, 2014, Washington DC, US In conjunction with: 2014 IEEE International Conference on Big Data (IEEE BigData 2014) http://bigscientificdata.org/cask14/ =============================================================================== Introduction ------------------------------------------------------------------------------- Big Data has become an increasingly important part of life, and has become a common buzzword used to describe many aspects of data intensive computing. One of the unique aspects that we see to this new age of science is in terms of new methods to collaborate and share ideas, data, and services. Service Oriented Architectures have become commonplace in Enterprise computing, but its role in scientific data has been often underplayed. Commonly scientists will write software for themselves, and occasionally share their programs with their colleagues. As computing moves to new levels of performance, by using accelerators and many cores, one must rethink how scientific codes are produced and move to new frameworks which promote collaboration. In this workshop we want researchers to discuss techniques, infrastructure, science drivers, and new ways to promote this new way of computing for scientific applications. We will want to address fundamental issues in workflows on and off large-scale high performance systems, clouds, IPads, and mobile devices. Our overall view is that we can accelerate the scientific knowledge discovery process by embracing new technologies where researchers can share codes, workflows, data, and ultimately knowledge. Topics of interest in this workshop will span from methods to share all aspects of code, data, and workflows. We will investigate the topic of how do you share Big Data, when data gets to extreme sizes. What new services need to be developed in order to promote Big Data for science and engineering aspects? How can we get researchers across the globe to develop shareable code and participate in the greater science community? Just like math was a common language that researchers could share and everyone understand, what are the new pieces of software which must be developed in order to ensure collaboration across the end to end lifecycle of scientific data? The workshop will provide a venue to show what has worked across different communities, and how to bring collaboration to new scientific communities. The workshop will bring together DOE and NSF researchers along with researchers in the enterprise to present papers, along with give invited talks. We will also conclude with a panel with many leading experts in the field. We will also feature one panel which will include experts in scientific and enterprise Big Data. 
Topics Topics of interest include, but are not limited to: ------------------------------------------------------------------------------- * Data at Rest (Storage) * Data in Motion (Data Streaming) * Storage Systems (Database, file systems) * Resource Management * Query and Search * Acquisition, Integrating, Cleaning and Best Practice * Privacy * Provenance * Algorithms * Analytics * Visualization * Near Real-time Decision Making * Data Fusion * Workflows * Programming Models (e.g. MapReduce, MPI, etc.) Important Dates ------------------------------------------------------------------------------- * Papers Due: September 1st, 2014 * Notification of Acceptance: September 20th, 2014 * Camera Ready Papers Due: October 5th, 2014 Paper Submission ------------------------------------------------------------------------------- Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 6 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter size (8.5 x 11) paper. The official language of the meeting is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Papers conforming to the above guidelines can be submitted through the CASK 2014 paper submission system (https://cmt.research.microsoft.com/CASK2014). Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding length limit, or not appropriately structured may also not be considered. Authors may contact the conference PC Chair for more information. The proceedings will be published through the IEEE Computer Society Press, USA and will be made online through the IEEE Digital Library. Selected papers from CASK 2014 will be invited to extend and submit to the Special Issue on Many-Task Computing in the Cloud in the IEEE Transaction on Cloud Computing (http://datasys.cs.iit.edu/events/TCC-MTC15/CFP_TCC-MTC15.pdf). Chairs and Committees ------------------------------------------------------------------------------- Workshop Co-Chairs: * Chen Jin (Palantir Technology) * Ioan Raicu (Illinois Institute of Technology) * Scott Klasky (Oak Ridge National Laboratory) Program Co-Chairs: * Raju Vatsavai (North Carolina State Univ. & Oak Ridge National Lab.) * Judy Qiu (Indiana University) * George Ostrouchov (Oak Ridge National Laboratory) * Tahsin Kurc (Stony Brook University) * Daniel S. 
Katz (University of Chicago) * Bogdan Nicolae (IBM Research) * Doug Thain (University of Notre Dame) * Josh Wills (Cloudera) * Zhengzhang Chen (NEC Labs) * Kunpeng Zhang (University of Illinois at Chicago) -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.iit.edu Fri Jul 25 11:38:58 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 25 Jul 2014 11:38:58 -0500 Subject: [Swift-user] CFP: International Symposium on Big Data Computing (BDC) 2014 Message-ID: <53D28822.1020605@cs.iit.edu> Call for Papers IEEE/ACM International Symposium on Big Data Computing (BDC) 2014 December 8-11, 2014, London, UK http://www.cloudbus.org/bdc2014 In conjunction with: 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014) Sponsored by: IEEE Computer Society and ACM (Association for Computing Machinery) Introduction =============================================================================== Rapid advances in digital sensors, networks, storage, and computation along with their availability at low cost is leading to the creation of huge collections of data -- dubbed as Big Data. This data has the potential for enabling new insights that can change the way business, science, and governments deliver services to their consumers and can impact society as a whole. This has led to the emergence of the Big Data Computing paradigm focusing on sensing, collection, storage, management and analysis of data from variety of sources to enable new value and insights. To realize the full potential of Big Data Computing, we need to address several challenges and develop suitable conceptual and technological solutions for dealing them. These include life-cycle management of data, large-scale storage, flexible processing infrastructure, data modelling, scalable machine learning and data analysis algorithms, techniques for sampling and making trade-off between data processing time and accuracy, and dealing with privacy and ethical issues involved in data sensing, storage, processing, and actions. The IEEE/ACM International Symposium on Big Data Computing (BDC) 2014 -- held in conjunction with 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014), December 8-11, 2014, London, UK, aims at bringing together international researchers, developers, policy makers, and users and to provide an international forum to present leading research activities, technical solutions, and results on a broad range of topics related to Big Data Computing paradigms, platforms and their applications. 
The conference features keynotes, technical presentations, posters, workshops, tutorials, as well as competitions featuring live demonstrations. Topics =============================================================================== Topics of interest include, but are not limited to: I. Big Data Science * Analytics * Algorithms for Big Data * Energy-efficient Algorithms * Big Data Search * Big Data Acquisition, Integration, Cleaning, and Best Practices * Visualization of Big Data II. Big Data Infrastructures and Platforms * Programming Systems * Cyber-Infrastructure * Performance evaluation * Fault tolerance and reliability * I/O and Data management * Storage Systems (including file systems, NoSQL, and RDBMS) * Resource management * Many-Task Computing * Many-core computing and accelerators III. Big Data Security and Policy * Management Policies * Data Privacy * Data Security * Big Data Archival and Preservation * Big Data Provenance IV. Big Data Applications * Scientific application cases studies on Cloud infrastructure * Big Data Applications at Scale * Experience Papers with Big Data Application Deployments * Data streaming applications * Big Data in Social Networks * Healthcare Applications * Enterprise Applications IMPORTANT DATES =============================================================================== * Papers Due: September 15th, 2014 * Notification of Acceptance: October 15th, 2014 * Camera Ready Papers Due: October 31st, 2014 PAPER SUBMISSION =============================================================================== Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 10 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter size (8.5 x 11) paper. The official language of the meeting is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Papers conforming to the above guidelines can be submitted through the BDC 2014 paper submission system (https://www.easychair.org/conferences/?conf=bdc2014). Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding length limit, or not appropriately structured may also not be considered. Authors may contact the conference PC Chair for more information. The proceedings will be published through the IEEE Computer Society Press, USA and will be made online through the IEEE Digital Library. 
Selected papers from BDC 2014 will be invited to extend and submit to the Special Issue on Many-Task Computing in the Cloud in the IEEE Transaction on Cloud Computing (http://datasys.cs.iit.edu/events/TCC-MTC15/CFP_TCC-MTC15.pdf) CHAIRS & COMMITTEES =============================================================================== General Co-Chairs: * Rajkumar Buyya, University of Melbourne, Australia * Divyakant Agrawal, University of California at Santa Barbara, USA Program Co-Chairs: * Ioan Raicu, Illinois Institute of Technology and Argonne National Lab., USA * Manish Parashar, Rutgers, The State University of New Jersey, USA Area Track Co-Chairs: * Big Data Science o TBA * Data Infrastructures and Platforms o Amy Apon, Clemson University, USA o Jiannong Cao, Hong Kong Polytechnic University * Big Data Security and Policy o Bogdan Carbunar, Florida International University * Big Data Applications o Dennis Gannon, Microsoft Research, USA Cyber Chair * Amir Vahid, University of Melbourne, Australia Publicity Chairs * Carlos Westphall, Federal University of Santa Catarina, Brazil * Ching-Hsien Hsu, Chung Hua Univ., Taiwan & Tianjin Univ. of Technology, China * Rong Ge, Marquette University, USA Organizing Chair: * Ashiq Anjum, University of Derby, UK -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From jozik at uchicago.edu Tue Jul 29 20:56:28 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Tue, 29 Jul 2014 20:56:28 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost Message-ID: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> Hi all, I'm getting spurious errors in the jobs that I'm running on Blues. The stdout includes exceptions like: exception @ swift-int.k, line: 511 Caused by: Block task failed: Connection to worker lost java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) These seem to occur at different parts of the submitted jobs. Let me know if there's a log file that you'd like to look at.
In earlier attempts I was getting these warnings followed by broken pipe errors: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages Apparently that's a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): Area: hotspot/gc Synopsis: Crashes due to failure to allocate large pages. On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways: * Before the crash happens one or more lines similar to this will have been printed to the log: os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages * If a hs_err file is generated it will contain a line similar to this: Large page allocation failures have occurred 3 times The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary. See 8007074 (not public). So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence. Jonathan From wilde at anl.gov Thu Jul 31 09:13:02 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 31 Jul 2014 09:13:02 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> Message-ID: <53DA4EEE.5010800@anl.gov> Some discussion and diagnosis of this incident has taken place off list. In a quick scan of the worker logs, I don't spot an obvious error that would cause workers to exit. Hopefully others on the Swift team can check those as well. Jonathan, do you have stdout/err files from the PBS scheduler on blues, in your runNNN log dirs? If so, can you point us to them? Thanks, - Mike On 7/29/14, 8:56 PM, Jonathan Ozik wrote: > Hi all, > > I'm getting spurious errors in the jobs that I'm running on Blues. The stdout includes exceptions like: > exception @ swift-int.k, line: 511 > Caused by: Block task failed: Connection to worker lost > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:65) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > > These seem to occur at different parts of the submitted jobs. Let me know if there's a log file that you'd like to look at.
> > In earlier attempts I was getting these warnings followed by broken pipe errors: > Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages > > Apparently that?s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > Area: hotspot/gc > Synopsis: Crashes due to failure to allocate large pages. > > On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways: > > ? Before the crash happens one or more lines similar to this will have been printed to the log: > os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; > error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages > ? If a hs_err file is generated it will contain a line similar to this: > Large page allocation failures have occurred 3 times > The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary. > > See 8007074 (not public). > > So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence. > > Jonathan > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From wilde at anl.gov Thu Jul 31 09:18:08 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 31 Jul 2014 09:18:08 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DA4EEE.5010800@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53DA4EEE.5010800@anl.gov> Message-ID: <53DA5020.7050402@anl.gov> I see this from PBS in your home dir: blues$ cat 583937.bmgt1.lcrc.anl.gov.ER Use of uninitialized value $s in concatenation (.) or string at /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. Use of uninitialized value $s in concatenation (.) or string at /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. blues$ That looks to me like a Swift bug in worker.pl We'll look into this angle. Also I'm curious why these files are not going into your run dir (but perhaps thats because youre running an older trunk release, not 0.95? Or, thats a separate 0.95 bug). - Mike On 7/31/14, 9:13 AM, Michael Wilde wrote: > Some discussion and diagnosis of this incident has taken place off list. > > In a quick scan of the worker logs, I don't spot an obvious error that > would cause workers to exit. > Hopefully others on the Swift team can check those as well. > > Jonathan, do you have stdout/err files from the PBS scheduler on blues, > in your runNNN log dirs? > > If so, can you point us to them? > > Thanks, > > - Mike > > On 7/29/14, 8:56 PM, Jonathan Ozik wrote: >> Hi all, >> >> I?m getting spurious errors in the jobs that I?m running on Blues. 
The stdout includes exceptions like: >> exception @ swift-int.k, line: 511 >> Caused by: Block task failed: Connection to worker lost >> java.io.IOException: Broken pipe >> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >> >> These seem to occur at different parts of the submitted jobs. Let me know if there?s a log file that you?d like to look at. >> >> In earlier attempts I was getting these warnings followed by broken pipe errors: >> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages >> >> Apparently that?s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >> Area: hotspot/gc >> Synopsis: Crashes due to failure to allocate large pages. >> >> On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways: >> >> ? Before the crash happens one or more lines similar to this will have been printed to the log: >> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; >> error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages >> ? If a hs_err file is generated it will contain a line similar to this: >> Large page allocation failures have occurred 3 times >> The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary. >> >> See 8007074 (not public). >> >> So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence. >> >> Jonathan >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From hategan at mcs.anl.gov Thu Jul 31 12:09:15 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 10:09:15 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> Message-ID: <1406826555.22289.4.camel@echo> Hi Jonathan, I can't see anything obvious in the worker logs, but they are pretty large. Can you also post the swift log from this run? It would make it easier to focus on the right time frame. Mihael On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > Hi all, > > I?m attaching the stdout and the worker logs below. > > Thanks for looking at these! 
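For anyone following the request above, a minimal way to bundle the relevant files for a single run; the names are illustrative (Swift writes a per-run log such as run014.log alongside the runNNN directory, and the worker logs land wherever workerlogdirectory points):

    # Package the Swift run log together with the coaster worker logs
    # so both can be posted for the same time frame.
    tar czf run014-debug.tar.gz run014.log ~/worker-*.log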
> > Jonathan > > > On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > wrote: > > > Woops, sorry about that. It?s running now and the logs are being > generated. Once the run is done I?ll send you log files. > > > > Thanks! > > > > Jonathan > > > > On Jul 31, 2014, at 12:12 AM, Mihael Hategan > wrote: > > > >> Right. This isn't your fault. We should, though, probably talk > about > >> addressing the issue. > >> > >> Mihael > >> > >> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>> Mihael, thanks for spotting that. I added the comments to > highlight the > >>> changes in email. > >>> > >>> -Yadu > >>> > >>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>> Hi Jonathan, > >>>> > >>>> I suspect that the site config is considering the comment to be > part of > >>>> the value of the workerLogLevel property. We could confirm this > if you > >>>> send us the swift log from this particular run. > >>>> > >>>> To fix it, you could try to remove everything after DEBUG > (including all > >>>> horizontal white space). In other words: > >>>> > >>>> ... > >>>> workerloglevel=DEBUG > >>>> workerlogdirectory=/home/$USER/ > >>>> ... > >>>> > >>>> Mihael > >>>> > >>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>> Hi Yadu, > >>>>> > >>>>> > >>>>> I?m getting errors indicating that DEBUG is an invalid worker > logging > >>>>> level. I?m attaching the stdout below. Let me know if I?m doing > >>>>> something silly. > >>>>> > >>>>> > >>>>> Jonathan > >>>>> > >>>>> > >>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > > >>>>> wrote: > >>>>> > >>>>>> Hi Jonathan, > >>>>>> > >>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not > see > >>>>>> anything unusual. > >>>>>> > >>>>>> From your logs, it looks like workers are failing, so getting > worker > >>>>>> logs would help. > >>>>>> Could you try running on Blues with the following > swift.properties > >>>>>> and get us the worker*logs that would show up in the > >>>>>> workerlogdirectory ? > >>>>>> > >>>>>> site=blues > >>>>>> > >>>>>> site.blues { > >>>>>> jobManager=pbs > >>>>>> jobQueue=shared > >>>>>> maxJobs=4 > >>>>>> jobGranularity=1 > >>>>>> maxNodesPerJob=1 > >>>>>> tasksPerWorker=16 > >>>>>> taskThrottle=64 > >>>>>> initialScore=10000 > >>>>>> jobWalltime=00:48:00 > >>>>>> taskWalltime=00:40:00 > >>>>>> workerloglevel=DEBUG # > Adding > >>>>>> debug for workers > >>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>> directory on SFS > >>>>>> workdir=$RUNDIRECTORY > >>>>>> filesystem=local > >>>>>> } > >>>>>> > >>>>>> Thanks, > >>>>>> Yadu > >>>>>> > >>>>>> > >>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>> > >>>>>>> Hi Mike, > >>>>>>> > >>>>>>> > >>>>>>> Sorry, I figured there was some busy-ness involved! > >>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > didn?t > >>>>>>> get the same issue. That is, the model run completed > successfully. > >>>>>>> For the Blues run, I used a trunk distribution (as of May 29, > >>>>>>> 2014). I?m including one of the log files below. I?m also > >>>>>>> including the swift.properties file that was used for the > blues > >>>>>>> runs below. > >>>>>>> > >>>>>>> > >>>>>>> Thank you! > >>>>>>> > >>>>>>> > >>>>>>> Jonathan > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > wrote: > >>>>>>> > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > >>>>>>>> > >>>>>>>> I or one of the team will answer soon, on swift-user. 
> >>>>>>>> > >>>>>>>> (But the first question is: which Swift release, and can you > >>>>>>>> point us to, or send, the full log file?) > >>>>>>>> > >>>>>>>> Thanks and regards, > >>>>>>>> > >>>>>>>> - Mike > >>>>>>>> > >>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>> > >>>>>>>>> Hi Mike, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I didn?t get a response yet so just wanted to make sure that > >>>>>>>>> the message came across. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> Begin forwarded message: > >>>>>>>>> > >>>>>>>>>> From: Jonathan Ozik > >>>>>>>>>> > >>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>> > >>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>> > >>>>>>>>>> To: Mihael Hategan , > >>>>>>>>>> "swift-user at ci.uchicago.edu" > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Hi all, > >>>>>>>>>> > >>>>>>>>>> I?m getting spurious errors in the jobs that I?m running on > >>>>>>>>>> Blues. The stdout includes exceptions like: > >>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>> at > >>>>>>>>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>> at > >>>>>>>>>> > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>> at > >>>>>>>>>> > org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>> at > >>>>>>>>>> > org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>> > >>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>> jobs. Let me know if there?s a log file that you?d like to > >>>>>>>>>> look at. > >>>>>>>>>> > >>>>>>>>>> In earlier attempts I was getting these warnings followed > by > >>>>>>>>>> broken pipe errors: > >>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > 0) > >>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>> > >>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 as > >>>>>>>>>> described here > >>>>>>>>>> > (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>> Area: hotspot/gc > >>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >>>>>>>>>> > >>>>>>>>>> On Linux, failures when allocating large pages can lead to > >>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue > >>>>>>>>>> can be recognized in two ways: > >>>>>>>>>> > >>>>>>>>>> ? Before the crash happens one or more lines similar to > this > >>>>>>>>>> will have been printed to the log: > >>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > 0) > >>>>>>>>>> failed; > >>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot allocate > >>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>> ? 
If a hs_err file is generated it will contain a line > >>>>>>>>>> similar to this: > >>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>> > >>>>>>>>>> See 8007074 (not public). > >>>>>>>>>> > >>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations > >>>>>>>>>> of Java code that I was responsible for. That seemed to get > >>>>>>>>>> rid of the warning and the crashes for a while, but perhaps > >>>>>>>>>> that was just a coincidence. > >>>>>>>>>> > >>>>>>>>>> Jonathan > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Swift-user mailing list > >>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> -- > >>>>>>>> Michael Wilde > >>>>>>>> Mathematics and Computer Science Computation > Institute > >>>>>>>> Argonne National Laboratory The University of > Chicago > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > >> > > > > From hategan at mcs.anl.gov Thu Jul 31 12:13:34 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 10:13:34 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DA5020.7050402@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53DA4EEE.5010800@anl.gov> <53DA5020.7050402@anl.gov> Message-ID: <1406826814.22632.1.camel@echo> On Thu, 2014-07-31 at 09:18 -0500, Michael Wilde wrote: > I see this from PBS in your home dir: > > blues$ cat 583937.bmgt1.lcrc.anl.gov.ER > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > blues$ > > That looks to me like a Swift bug in worker.pl Line 2220 of worker.pl: wlog DEBUG, "$JOBID Got output from child ($s). Closing pipe.\n"; So I don't think this is the problem (or much of a problem in general), although it could be confusing so it should be fixed. Mihael From jozik at uchicago.edu Thu Jul 31 12:34:57 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 31 Jul 2014 12:34:57 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <1406826555.22289.4.camel@echo> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> Message-ID: <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> Sure thing, it?s attached below. Jonathan -------------- next part -------------- A non-text attachment was scrubbed... Name: run014.log Type: application/octet-stream Size: 2887616 bytes Desc: not available URL: -------------- next part -------------- On Jul 31, 2014, at 12:09 PM, Mihael Hategan wrote: > Hi Jonathan, > > I can't see anything obvious in the worker logs, but they are pretty > large. Can you also post the swift log from this run? It would make it > easier to focus on the right time frame. 
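As a quick sanity check on the warning discussed above, it is easy to count how often it actually fires in the PBS stderr file quoted earlier (the job id is the one from that example; run this from wherever PBS wrote the .ER file):

    # The warning comes from worker.pl's debug logging, so occurrences here
    # do not, by themselves, indicate that the worker crashed.
    grep -c 'uninitialized value' 583937.bmgt1.lcrc.anl.gov.ER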
> > Mihael > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: >> Hi all, >> >> I?m attaching the stdout and the worker logs below. >> >> Thanks for looking at these! >> >> Jonathan >> >> >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik >> wrote: >> >>> Woops, sorry about that. It?s running now and the logs are being >> generated. Once the run is done I?ll send you log files. >>> >>> Thanks! >>> >>> Jonathan >>> >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan >> wrote: >>> >>>> Right. This isn't your fault. We should, though, probably talk >> about >>>> addressing the issue. >>>> >>>> Mihael >>>> >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: >>>>> Mihael, thanks for spotting that. I added the comments to >> highlight the >>>>> changes in email. >>>>> >>>>> -Yadu >>>>> >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: >>>>>> Hi Jonathan, >>>>>> >>>>>> I suspect that the site config is considering the comment to be >> part of >>>>>> the value of the workerLogLevel property. We could confirm this >> if you >>>>>> send us the swift log from this particular run. >>>>>> >>>>>> To fix it, you could try to remove everything after DEBUG >> (including all >>>>>> horizontal white space). In other words: >>>>>> >>>>>> ... >>>>>> workerloglevel=DEBUG >>>>>> workerlogdirectory=/home/$USER/ >>>>>> ... >>>>>> >>>>>> Mihael >>>>>> >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: >>>>>>> Hi Yadu, >>>>>>> >>>>>>> >>>>>>> I?m getting errors indicating that DEBUG is an invalid worker >> logging >>>>>>> level. I?m attaching the stdout below. Let me know if I?m doing >>>>>>> something silly. >>>>>>> >>>>>>> >>>>>>> Jonathan >>>>>>> >>>>>>> >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji >> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Jonathan, >>>>>>>> >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not >> see >>>>>>>> anything unusual. >>>>>>>> >>>>>>>> From your logs, it looks like workers are failing, so getting >> worker >>>>>>>> logs would help. >>>>>>>> Could you try running on Blues with the following >> swift.properties >>>>>>>> and get us the worker*logs that would show up in the >>>>>>>> workerlogdirectory ? >>>>>>>> >>>>>>>> site=blues >>>>>>>> >>>>>>>> site.blues { >>>>>>>> jobManager=pbs >>>>>>>> jobQueue=shared >>>>>>>> maxJobs=4 >>>>>>>> jobGranularity=1 >>>>>>>> maxNodesPerJob=1 >>>>>>>> tasksPerWorker=16 >>>>>>>> taskThrottle=64 >>>>>>>> initialScore=10000 >>>>>>>> jobWalltime=00:48:00 >>>>>>>> taskWalltime=00:40:00 >>>>>>>> workerloglevel=DEBUG # >> Adding >>>>>>>> debug for workers >>>>>>>> workerlogdirectory=/home/$USER/ # Logging >>>>>>>> directory on SFS >>>>>>>> workdir=$RUNDIRECTORY >>>>>>>> filesystem=local >>>>>>>> } >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Yadu >>>>>>>> >>>>>>>> >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: >>>>>>>> >>>>>>>>> Hi Mike, >>>>>>>>> >>>>>>>>> >>>>>>>>> Sorry, I figured there was some busy-ness involved! >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I >> didn?t >>>>>>>>> get the same issue. That is, the model run completed >> successfully. >>>>>>>>> For the Blues run, I used a trunk distribution (as of May 29, >>>>>>>>> 2014). I?m including one of the log files below. I?m also >>>>>>>>> including the swift.properties file that was used for the >> blues >>>>>>>>> runs below. >>>>>>>>> >>>>>>>>> >>>>>>>>> Thank you! 
>>>>>>>>> >>>>>>>>> >>>>>>>>> Jonathan >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde >> wrote: >>>>>>>>> >>>>>>>>>> Hi Jonathan, >>>>>>>>>> >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! >>>>>>>>>> >>>>>>>>>> I or one of the team will answer soon, on swift-user. >>>>>>>>>> >>>>>>>>>> (But the first question is: which Swift release, and can you >>>>>>>>>> point us to, or send, the full log file?) >>>>>>>>>> >>>>>>>>>> Thanks and regards, >>>>>>>>>> >>>>>>>>>> - Mike >>>>>>>>>> >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: >>>>>>>>>> >>>>>>>>>>> Hi Mike, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I didn?t get a response yet so just wanted to make sure that >>>>>>>>>>> the message came across. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Jonathan >>>>>>>>>>> >>>>>>>>>>> Begin forwarded message: >>>>>>>>>>> >>>>>>>>>>>> From: Jonathan Ozik >>>>>>>>>>>> >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>> >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT >>>>>>>>>>>> >>>>>>>>>>>> To: Mihael Hategan , >>>>>>>>>>>> "swift-user at ci.uchicago.edu" >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Hi all, >>>>>>>>>>>> >>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running on >>>>>>>>>>>> Blues. The stdout includes exceptions like: >>>>>>>>>>>> exception @ swift-int.k, line: 511 >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>> java.io.IOException: Broken pipe >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >>>>>>>>>>>> at >>>>>>>>>>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >>>>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >>>>>>>>>>>> at >>>>>>>>>>>> >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >>>>>>>>>>>> at >>>>>>>>>>>> >> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >>>>>>>>>>>> at >>>>>>>>>>>> >> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >>>>>>>>>>>> >>>>>>>>>>>> These seem to occur at different parts of the submitted >>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like to >>>>>>>>>>>> look at. >>>>>>>>>>>> >>>>>>>>>>>> In earlier attempts I was getting these warnings followed >> by >>>>>>>>>>>> broken pipe errors: >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, >> 0) >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot >>>>>>>>>>>> allocate large pages, falling back to regular pages >>>>>>>>>>>> >>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 as >>>>>>>>>>>> described here >>>>>>>>>>>> >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >>>>>>>>>>>> Area: hotspot/gc >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. >>>>>>>>>>>> >>>>>>>>>>>> On Linux, failures when allocating large pages can lead to >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue >>>>>>>>>>>> can be recognized in two ways: >>>>>>>>>>>> >>>>>>>>>>>> ? 
Before the crash happens one or more lines similar to >> this >>>>>>>>>>>> will have been printed to the log: >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, >> 0) >>>>>>>>>>>> failed; >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot allocate >>>>>>>>>>>> large pages, falling back to regular pages >>>>>>>>>>>> ? If a hs_err file is generated it will contain a line >>>>>>>>>>>> similar to this: >>>>>>>>>>>> Large page allocation failures have occurred 3 times >>>>>>>>>>>> The problem can be avoided by running with large page >>>>>>>>>>>> support turned off, for example by passing the >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. >>>>>>>>>>>> >>>>>>>>>>>> See 8007074 (not public). >>>>>>>>>>>> >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations >>>>>>>>>>>> of Java code that I was responsible for. That seemed to get >>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps >>>>>>>>>>>> that was just a coincidence. >>>>>>>>>>>> >>>>>>>>>>>> Jonathan >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Michael Wilde >>>>>>>>>> Mathematics and Computer Science Computation >> Institute >>>>>>>>>> Argonne National Laboratory The University of >> Chicago >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> >>> >> >> > > From yadudoc1729 at gmail.com Thu Jul 31 13:02:18 2014 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Thu, 31 Jul 2014 13:02:18 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DA5020.7050402@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53DA4EEE.5010800@anl.gov> <53DA5020.7050402@anl.gov> Message-ID: Hi Mike, I checked Jonathan's folders and it looks like the submit scripts and the PBS submit, submit.stdout and submit.stderr files correctly were written under the runNNN/scripts folder. His latest run was using Swift-0.95-RC6 which failed with the logs that you saw. The are also PBS*submit.stderr files which report the same "uninitialized value $s in concatenation" error. -Yadu On Thu, Jul 31, 2014 at 9:18 AM, Michael Wilde wrote: > I see this from PBS in your home dir: > > blues$ cat 583937.bmgt1.lcrc.anl.gov.ER > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > blues$ > > That looks to me like a Swift bug in worker.pl > > We'll look into this angle. > > Also I'm curious why these files are not going into your run dir (but > perhaps thats because youre running an older trunk release, not 0.95? > Or, thats a separate 0.95 bug). > > - Mike > > On 7/31/14, 9:13 AM, Michael Wilde wrote: > > Some discussion and diagnosis of this incident has taken place off list. > > > > In a quick scan of the worker logs, I don't spot an obvious error that > > would cause workers to exit. > > Hopefully others on the Swift team can check those as well. > > > > Jonathan, do you have stdout/err files from the PBS scheduler on blues, > > in your runNNN log dirs? > > > > If so, can you point us to them? 
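To make the locations mentioned above concrete, the per-run PBS artifacts can be listed straight from the run directory (the run number is illustrative; the PBS*submit* names follow what is described in this thread):

    # Coaster block submit scripts plus the scheduler's stdout/stderr for them.
    ls run014/scripts/
    cat run014/scripts/PBS*submit.stderr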
> > > > Thanks, > > > > - Mike > > > > On 7/29/14, 8:56 PM, Jonathan Ozik wrote: > >> Hi all, > >> > >> I?m getting spurious errors in the jobs that I?m running on Blues. The > stdout includes exceptions like: > >> exception @ swift-int.k, line: 511 > >> Caused by: Block task failed: Connection to worker lost > >> java.io.IOException: Broken pipe > >> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >> at > org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >> at > org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >> > >> These seem to occur at different parts of the submitted jobs. Let me > know if there?s a log file that you?d like to look at. > >> > >> In earlier attempts I was getting these warnings followed by broken > pipe errors: > >> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; > error='Cannot allocate memory' (errno=12); Cannot allocate large pages, > falling back to regular pages > >> > >> Apparently that?s a known precursor of crashes on Java 7 as described > here ( > http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >> Area: hotspot/gc > >> Synopsis: Crashes due to failure to allocate large pages. > >> > >> On Linux, failures when allocating large pages can lead to crashes. > When running JDK 7u51 or later versions, the issue can be recognized in two > ways: > >> > >> ? Before the crash happens one or more lines similar to this will > have been printed to the log: > >> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; > >> error='Cannot allocate memory' (errno=12); Cannot allocate large pages, > falling back to regular pages > >> ? If a hs_err file is generated it will contain a line similar to > this: > >> Large page allocation failures have occurred 3 times > >> The problem can be avoided by running with large page support turned > off, for example by passing the "-XX:-UseLargePages" option to the java > binary. > >> > >> See 8007074 (not public). > >> > >> So I added the -XX:-UseLargePages option in the invocations of Java > code that I was responsible for. That seemed to get rid of the warning and > the crashes for a while, but perhaps that was just a coincidence. > >> > >> Jonathan > >> > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Yadu Nand B -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Jul 31 13:18:28 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 11:18:28 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> Message-ID: <1406830708.23317.3.camel@echo> Ok, so the workers die while the jobs are running and not much else is happening. My money is on the apps eating up all RAM and the kernel killing the worker. The question is how we check whether this is true or not. Ideas? Yadu, can you do me a favor and package all the PBS output files from this run? Jonathan, can you see if you get the same errors with tasksPerWorker=8? Mihael On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > Sure thing, it?s attached below. > > Jonathan > > > On Jul 31, 2014, at 12:09 PM, Mihael Hategan > wrote: > > > Hi Jonathan, > > > > I can't see anything obvious in the worker logs, but they are pretty > > large. Can you also post the swift log from this run? It would make > it > > easier to focus on the right time frame. > > > > Mihael > > > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >> Hi all, > >> > >> I?m attaching the stdout and the worker logs below. > >> > >> Thanks for looking at these! > >> > >> Jonathan > >> > >> > >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > >> wrote: > >> > >>> Woops, sorry about that. It?s running now and the logs are being > >> generated. Once the run is done I?ll send you log files. > >>> > >>> Thanks! > >>> > >>> Jonathan > >>> > >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > >> wrote: > >>> > >>>> Right. This isn't your fault. We should, though, probably talk > >> about > >>>> addressing the issue. > >>>> > >>>> Mihael > >>>> > >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>>>> Mihael, thanks for spotting that. I added the comments to > >> highlight the > >>>>> changes in email. > >>>>> > >>>>> -Yadu > >>>>> > >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>>>> Hi Jonathan, > >>>>>> > >>>>>> I suspect that the site config is considering the comment to be > >> part of > >>>>>> the value of the workerLogLevel property. We could confirm this > >> if you > >>>>>> send us the swift log from this particular run. > >>>>>> > >>>>>> To fix it, you could try to remove everything after DEBUG > >> (including all > >>>>>> horizontal white space). In other words: > >>>>>> > >>>>>> ... > >>>>>> workerloglevel=DEBUG > >>>>>> workerlogdirectory=/home/$USER/ > >>>>>> ... > >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>>>> Hi Yadu, > >>>>>>> > >>>>>>> > >>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > >> logging > >>>>>>> level. I?m attaching the stdout below. Let me know if I?m > doing > >>>>>>> something silly. > >>>>>>> > >>>>>>> > >>>>>>> Jonathan > >>>>>>> > >>>>>>> > >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > not > >> see > >>>>>>>> anything unusual. 
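One concrete way to answer the "how do we check" question above, for anyone who can get a shell onto the compute node while (or shortly after) a block runs; these are standard Linux commands, nothing Swift-specific:

    # If the kernel OOM killer terminated worker.pl or one of the app tasks,
    # it leaves a trace in the kernel log; free shows the remaining headroom.
    dmesg | grep -iE 'out of memory|killed process'
    free -m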
> >>>>>>>> > >>>>>>>> From your logs, it looks like workers are failing, so getting > >> worker > >>>>>>>> logs would help. > >>>>>>>> Could you try running on Blues with the following > >> swift.properties > >>>>>>>> and get us the worker*logs that would show up in the > >>>>>>>> workerlogdirectory ? > >>>>>>>> > >>>>>>>> site=blues > >>>>>>>> > >>>>>>>> site.blues { > >>>>>>>> jobManager=pbs > >>>>>>>> jobQueue=shared > >>>>>>>> maxJobs=4 > >>>>>>>> jobGranularity=1 > >>>>>>>> maxNodesPerJob=1 > >>>>>>>> tasksPerWorker=16 > >>>>>>>> taskThrottle=64 > >>>>>>>> initialScore=10000 > >>>>>>>> jobWalltime=00:48:00 > >>>>>>>> taskWalltime=00:40:00 > >>>>>>>> workerloglevel=DEBUG # > >> Adding > >>>>>>>> debug for workers > >>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>>>> directory on SFS > >>>>>>>> workdir=$RUNDIRECTORY > >>>>>>>> filesystem=local > >>>>>>>> } > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Yadu > >>>>>>>> > >>>>>>>> > >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>>>> > >>>>>>>>> Hi Mike, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Sorry, I figured there was some busy-ness involved! > >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >> didn?t > >>>>>>>>> get the same issue. That is, the model run completed > >> successfully. > >>>>>>>>> For the Blues run, I used a trunk distribution (as of May > 29, > >>>>>>>>> 2014). I?m including one of the log files below. I?m also > >>>>>>>>> including the swift.properties file that was used for the > >> blues > >>>>>>>>> runs below. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Thank you! > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > >> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Jonathan, > >>>>>>>>>> > >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > >>>>>>>>>> > >>>>>>>>>> I or one of the team will answer soon, on swift-user. > >>>>>>>>>> > >>>>>>>>>> (But the first question is: which Swift release, and can > you > >>>>>>>>>> point us to, or send, the full log file?) > >>>>>>>>>> > >>>>>>>>>> Thanks and regards, > >>>>>>>>>> > >>>>>>>>>> - Mike > >>>>>>>>>> > >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Mike, > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > that > >>>>>>>>>>> the message came across. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Jonathan > >>>>>>>>>>> > >>>>>>>>>>> Begin forwarded message: > >>>>>>>>>>> > >>>>>>>>>>>> From: Jonathan Ozik > >>>>>>>>>>>> > >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>> > >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>>>> > >>>>>>>>>>>> To: Mihael Hategan , > >>>>>>>>>>>> "swift-user at ci.uchicago.edu" > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Hi all, > >>>>>>>>>>>> > >>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > on > >>>>>>>>>>>> Blues. 
The stdout includes exceptions like: > >>>>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>>>> at > >>>>>>>>>>>> > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>>>> at > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>>>> at > >>>>>>>>>>>> > >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>>>> at > >>>>>>>>>>>> > >> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>>>> at > >>>>>>>>>>>> > >> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>>>> > >>>>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > to > >>>>>>>>>>>> look at. > >>>>>>>>>>>> > >>>>>>>>>>>> In earlier attempts I was getting these warnings followed > >> by > >>>>>>>>>>>> broken pipe errors: > >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > >> 0) > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >>>>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>>>> > >>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > as > >>>>>>>>>>>> described here > >>>>>>>>>>>> > >> > (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>>>> Area: hotspot/gc > >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >>>>>>>>>>>> > >>>>>>>>>>>> On Linux, failures when allocating large pages can lead > to > >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > issue > >>>>>>>>>>>> can be recognized in two ways: > >>>>>>>>>>>> > >>>>>>>>>>>> ? Before the crash happens one or more lines similar to > >> this > >>>>>>>>>>>> will have been printed to the log: > >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > >> 0) > >>>>>>>>>>>> failed; > >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > allocate > >>>>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > >>>>>>>>>>>> similar to this: > >>>>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>>>> > >>>>>>>>>>>> See 8007074 (not public). > >>>>>>>>>>>> > >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > invocations > >>>>>>>>>>>> of Java code that I was responsible for. That seemed to > get > >>>>>>>>>>>> rid of the warning and the crashes for a while, but > perhaps > >>>>>>>>>>>> that was just a coincidence. 
> >>>>>>>>>>>> > >>>>>>>>>>>> Jonathan > >>>>>>>>>>>> > >>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>>>> > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> Michael Wilde > >>>>>>>>>> Mathematics and Computer Science Computation > >> Institute > >>>>>>>>>> Argonne National Laboratory The University of > >> Chicago > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >> > >> > > > > > > From yadudoc1729 at gmail.com Thu Jul 31 13:26:02 2014 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Thu, 31 Jul 2014 13:26:02 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <1406830708.23317.3.camel@echo> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: ?Here's a link to the scripts folder tarred up. http://users.rcc.uchicago.edu/~yadunand/pbs_ozik_blues.tar.gz A couple files couldn't be copied due to permission issues. -Yadu? On Thu, Jul 31, 2014 at 1:18 PM, Mihael Hategan wrote: > Ok, so the workers die while the jobs are running and not much else is > happening. > My money is on the apps eating up all RAM and the kernel killing the > worker. > > The question is how we check whether this is true or not. Ideas? > > Yadu, can you do me a favor and package all the PBS output files from > this run? > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > Mihael > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > > Sure thing, it?s attached below. > > > > Jonathan > > > > > > On Jul 31, 2014, at 12:09 PM, Mihael Hategan > > wrote: > > > > > Hi Jonathan, > > > > > > I can't see anything obvious in the worker logs, but they are pretty > > > large. Can you also post the swift log from this run? It would make > > it > > > easier to focus on the right time frame. > > > > > > Mihael > > > > > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > > >> Hi all, > > >> > > >> I?m attaching the stdout and the worker logs below. > > >> > > >> Thanks for looking at these! > > >> > > >> Jonathan > > >> > > >> > > >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > > >> wrote: > > >> > > >>> Woops, sorry about that. It?s running now and the logs are being > > >> generated. Once the run is done I?ll send you log files. > > >>> > > >>> Thanks! > > >>> > > >>> Jonathan > > >>> > > >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > > >> wrote: > > >>> > > >>>> Right. This isn't your fault. We should, though, probably talk > > >> about > > >>>> addressing the issue. > > >>>> > > >>>> Mihael > > >>>> > > >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > > >>>>> Mihael, thanks for spotting that. I added the comments to > > >> highlight the > > >>>>> changes in email. > > >>>>> > > >>>>> -Yadu > > >>>>> > > >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > > >>>>>> Hi Jonathan, > > >>>>>> > > >>>>>> I suspect that the site config is considering the comment to be > > >> part of > > >>>>>> the value of the workerLogLevel property. 
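For completeness, the bundle linked above can be fetched and unpacked like this (URL exactly as posted; nothing else assumed):

    curl -O http://users.rcc.uchicago.edu/~yadunand/pbs_ozik_blues.tar.gz
    tar xzf pbs_ozik_blues.tar.gz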
We could confirm this > > >> if you > > >>>>>> send us the swift log from this particular run. > > >>>>>> > > >>>>>> To fix it, you could try to remove everything after DEBUG > > >> (including all > > >>>>>> horizontal white space). In other words: > > >>>>>> > > >>>>>> ... > > >>>>>> workerloglevel=DEBUG > > >>>>>> workerlogdirectory=/home/$USER/ > > >>>>>> ... > > >>>>>> > > >>>>>> Mihael > > >>>>>> > > >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > > >>>>>>> Hi Yadu, > > >>>>>>> > > >>>>>>> > > >>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > > >> logging > > >>>>>>> level. I?m attaching the stdout below. Let me know if I?m > > doing > > >>>>>>> something silly. > > >>>>>>> > > >>>>>>> > > >>>>>>> Jonathan > > >>>>>>> > > >>>>>>> > > >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > > >> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>>> Hi Jonathan, > > >>>>>>>> > > >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > > not > > >> see > > >>>>>>>> anything unusual. > > >>>>>>>> > > >>>>>>>> From your logs, it looks like workers are failing, so getting > > >> worker > > >>>>>>>> logs would help. > > >>>>>>>> Could you try running on Blues with the following > > >> swift.properties > > >>>>>>>> and get us the worker*logs that would show up in the > > >>>>>>>> workerlogdirectory ? > > >>>>>>>> > > >>>>>>>> site=blues > > >>>>>>>> > > >>>>>>>> site.blues { > > >>>>>>>> jobManager=pbs > > >>>>>>>> jobQueue=shared > > >>>>>>>> maxJobs=4 > > >>>>>>>> jobGranularity=1 > > >>>>>>>> maxNodesPerJob=1 > > >>>>>>>> tasksPerWorker=16 > > >>>>>>>> taskThrottle=64 > > >>>>>>>> initialScore=10000 > > >>>>>>>> jobWalltime=00:48:00 > > >>>>>>>> taskWalltime=00:40:00 > > >>>>>>>> workerloglevel=DEBUG # > > >> Adding > > >>>>>>>> debug for workers > > >>>>>>>> workerlogdirectory=/home/$USER/ # Logging > > >>>>>>>> directory on SFS > > >>>>>>>> workdir=$RUNDIRECTORY > > >>>>>>>> filesystem=local > > >>>>>>>> } > > >>>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> Yadu > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > > >>>>>>>> > > >>>>>>>>> Hi Mike, > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Sorry, I figured there was some busy-ness involved! > > >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > > >> didn?t > > >>>>>>>>> get the same issue. That is, the model run completed > > >> successfully. > > >>>>>>>>> For the Blues run, I used a trunk distribution (as of May > > 29, > > >>>>>>>>> 2014). I?m including one of the log files below. I?m also > > >>>>>>>>> including the swift.properties file that was used for the > > >> blues > > >>>>>>>>> runs below. > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Thank you! > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Jonathan > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > > >> wrote: > > >>>>>>>>> > > >>>>>>>>>> Hi Jonathan, > > >>>>>>>>>> > > >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > > >>>>>>>>>> > > >>>>>>>>>> I or one of the team will answer soon, on swift-user. > > >>>>>>>>>> > > >>>>>>>>>> (But the first question is: which Swift release, and can > > you > > >>>>>>>>>> point us to, or send, the full log file?) 
> > >>>>>>>>>> > > >>>>>>>>>> Thanks and regards, > > >>>>>>>>>> > > >>>>>>>>>> - Mike > > >>>>>>>>>> > > >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > > >>>>>>>>>> > > >>>>>>>>>>> Hi Mike, > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > > that > > >>>>>>>>>>> the message came across. > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> Jonathan > > >>>>>>>>>>> > > >>>>>>>>>>> Begin forwarded message: > > >>>>>>>>>>> > > >>>>>>>>>>>> From: Jonathan Ozik > > >>>>>>>>>>>> > > >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > > >>>>>>>>>>>> > > >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > > >>>>>>>>>>>> > > >>>>>>>>>>>> To: Mihael Hategan , > > >>>>>>>>>>>> "swift-user at ci.uchicago.edu" > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> Hi all, > > >>>>>>>>>>>> > > >>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > > on > > >>>>>>>>>>>> Blues. The stdout includes exceptions like: > > >>>>>>>>>>>> exception @ swift-int.k, line: 511 > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > > >>>>>>>>>>>> java.io.IOException: Broken pipe > > >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > > >>>>>>>>>>>> at > > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > > >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > > >>>>>>>>>>>> > > >>>>>>>>>>>> These seem to occur at different parts of the submitted > > >>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > > to > > >>>>>>>>>>>> look at. > > >>>>>>>>>>>> > > >>>>>>>>>>>> In earlier attempts I was getting these warnings followed > > >> by > > >>>>>>>>>>>> broken pipe errors: > > >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > > >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > > >> 0) > > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > > >>>>>>>>>>>> allocate large pages, falling back to regular pages > > >>>>>>>>>>>> > > >>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > > as > > >>>>>>>>>>>> described here > > >>>>>>>>>>>> > > >> > > ( > http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > > >>>>>>>>>>>> Area: hotspot/gc > > >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Linux, failures when allocating large pages can lead > > to > > >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > > issue > > >>>>>>>>>>>> can be recognized in two ways: > > >>>>>>>>>>>> > > >>>>>>>>>>>> ? Before the crash happens one or more lines similar to > > >> this > > >>>>>>>>>>>> will have been printed to the log: > > >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > > >> 0) > > >>>>>>>>>>>> failed; > > >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > > allocate > > >>>>>>>>>>>> large pages, falling back to regular pages > > >>>>>>>>>>>> ? 
If a hs_err file is generated it will contain a line > > >>>>>>>>>>>> similar to this: > > >>>>>>>>>>>> Large page allocation failures have occurred 3 times > > >>>>>>>>>>>> The problem can be avoided by running with large page > > >>>>>>>>>>>> support turned off, for example by passing the > > >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > > >>>>>>>>>>>> > > >>>>>>>>>>>> See 8007074 (not public). > > >>>>>>>>>>>> > > >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > > invocations > > >>>>>>>>>>>> of Java code that I was responsible for. That seemed to > > get > > >>>>>>>>>>>> rid of the warning and the crashes for a while, but > > perhaps > > >>>>>>>>>>>> that was just a coincidence. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Jonathan > > >>>>>>>>>>>> > > >>>>>>>>>>>> _______________________________________________ > > >>>>>>>>>>>> Swift-user mailing list > > >>>>>>>>>>>> Swift-user at ci.uchicago.edu > > >>>>>>>>>>>> > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> -- > > >>>>>>>>>> Michael Wilde > > >>>>>>>>>> Mathematics and Computer Science Computation > > >> Institute > > >>>>>>>>>> Argonne National Laboratory The University of > > >> Chicago > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>>> > > >>> > > >> > > >> > > > > > > > > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Yadu Nand B -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Jul 31 13:36:27 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 11:36:27 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: <1406831787.23534.8.camel@echo> Thanks Yadu! I see nothing interesting in those logs. Again, the absence of any kind* of problem logged by the worker points to some abrupt termination of the process, which is most likely explained by an OOM. Mihael (*) uninitialized variable concatenation warning aside On Thu, 2014-07-31 at 13:26 -0500, Yadu Nand wrote: > ?Here's a link to the scripts folder tarred up. > http://users.rcc.uchicago.edu/~yadunand/pbs_ozik_blues.tar.gz > > A couple files couldn't be copied due to permission issues. > > -Yadu? > > > On Thu, Jul 31, 2014 at 1:18 PM, Mihael Hategan wrote: > > > Ok, so the workers die while the jobs are running and not much else is > > happening. > > My money is on the apps eating up all RAM and the kernel killing the > > worker. > > > > The question is how we check whether this is true or not. Ideas? > > > > Yadu, can you do me a favor and package all the PBS output files from > > this run? > > > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > > > Mihael > > > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > > > Sure thing, it?s attached below. 
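Another way to test the OOM theory stated above is to ask the scheduler what a block job actually consumed. On a Torque/PBS system something along these lines works for a running or recently finished job (job id from earlier in the thread; field names vary by PBS flavour):

    # resources_used.mem / resources_used.vmem for the coaster block job.
    qstat -f 583937.bmgt1.lcrc.anl.gov | grep -i resources_used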
> > > > > > Jonathan > > > > > > > > > On Jul 31, 2014, at 12:09 PM, Mihael Hategan > > > wrote: > > > > > > > Hi Jonathan, > > > > > > > > I can't see anything obvious in the worker logs, but they are pretty > > > > large. Can you also post the swift log from this run? It would make > > > it > > > > easier to focus on the right time frame. > > > > > > > > Mihael > > > > > > > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > > > >> Hi all, > > > >> > > > >> I?m attaching the stdout and the worker logs below. > > > >> > > > >> Thanks for looking at these! > > > >> > > > >> Jonathan > > > >> > > > >> > > > >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > > > >> wrote: > > > >> > > > >>> Woops, sorry about that. It?s running now and the logs are being > > > >> generated. Once the run is done I?ll send you log files. > > > >>> > > > >>> Thanks! > > > >>> > > > >>> Jonathan > > > >>> > > > >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > > > >> wrote: > > > >>> > > > >>>> Right. This isn't your fault. We should, though, probably talk > > > >> about > > > >>>> addressing the issue. > > > >>>> > > > >>>> Mihael > > > >>>> > > > >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > > > >>>>> Mihael, thanks for spotting that. I added the comments to > > > >> highlight the > > > >>>>> changes in email. > > > >>>>> > > > >>>>> -Yadu > > > >>>>> > > > >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > > > >>>>>> Hi Jonathan, > > > >>>>>> > > > >>>>>> I suspect that the site config is considering the comment to be > > > >> part of > > > >>>>>> the value of the workerLogLevel property. We could confirm this > > > >> if you > > > >>>>>> send us the swift log from this particular run. > > > >>>>>> > > > >>>>>> To fix it, you could try to remove everything after DEBUG > > > >> (including all > > > >>>>>> horizontal white space). In other words: > > > >>>>>> > > > >>>>>> ... > > > >>>>>> workerloglevel=DEBUG > > > >>>>>> workerlogdirectory=/home/$USER/ > > > >>>>>> ... > > > >>>>>> > > > >>>>>> Mihael > > > >>>>>> > > > >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > > > >>>>>>> Hi Yadu, > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > > > >> logging > > > >>>>>>> level. I?m attaching the stdout below. Let me know if I?m > > > doing > > > >>>>>>> something silly. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> Jonathan > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > > > >> > > > >>>>>>> wrote: > > > >>>>>>> > > > >>>>>>>> Hi Jonathan, > > > >>>>>>>> > > > >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > > > not > > > >> see > > > >>>>>>>> anything unusual. > > > >>>>>>>> > > > >>>>>>>> From your logs, it looks like workers are failing, so getting > > > >> worker > > > >>>>>>>> logs would help. > > > >>>>>>>> Could you try running on Blues with the following > > > >> swift.properties > > > >>>>>>>> and get us the worker*logs that would show up in the > > > >>>>>>>> workerlogdirectory ? 
> > > >>>>>>>> > > > >>>>>>>> site=blues > > > >>>>>>>> > > > >>>>>>>> site.blues { > > > >>>>>>>> jobManager=pbs > > > >>>>>>>> jobQueue=shared > > > >>>>>>>> maxJobs=4 > > > >>>>>>>> jobGranularity=1 > > > >>>>>>>> maxNodesPerJob=1 > > > >>>>>>>> tasksPerWorker=16 > > > >>>>>>>> taskThrottle=64 > > > >>>>>>>> initialScore=10000 > > > >>>>>>>> jobWalltime=00:48:00 > > > >>>>>>>> taskWalltime=00:40:00 > > > >>>>>>>> workerloglevel=DEBUG # > > > >> Adding > > > >>>>>>>> debug for workers > > > >>>>>>>> workerlogdirectory=/home/$USER/ # Logging > > > >>>>>>>> directory on SFS > > > >>>>>>>> workdir=$RUNDIRECTORY > > > >>>>>>>> filesystem=local > > > >>>>>>>> } > > > >>>>>>>> > > > >>>>>>>> Thanks, > > > >>>>>>>> Yadu > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > > > >>>>>>>> > > > >>>>>>>>> Hi Mike, > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> Sorry, I figured there was some busy-ness involved! > > > >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > > > >> didn?t > > > >>>>>>>>> get the same issue. That is, the model run completed > > > >> successfully. > > > >>>>>>>>> For the Blues run, I used a trunk distribution (as of May > > > 29, > > > >>>>>>>>> 2014). I?m including one of the log files below. I?m also > > > >>>>>>>>> including the swift.properties file that was used for the > > > >> blues > > > >>>>>>>>> runs below. > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> Thank you! > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> Jonathan > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > > > >> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> Hi Jonathan, > > > >>>>>>>>>> > > > >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > > > >>>>>>>>>> > > > >>>>>>>>>> I or one of the team will answer soon, on swift-user. > > > >>>>>>>>>> > > > >>>>>>>>>> (But the first question is: which Swift release, and can > > > you > > > >>>>>>>>>> point us to, or send, the full log file?) > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks and regards, > > > >>>>>>>>>> > > > >>>>>>>>>> - Mike > > > >>>>>>>>>> > > > >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > > > >>>>>>>>>> > > > >>>>>>>>>>> Hi Mike, > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > > > that > > > >>>>>>>>>>> the message came across. > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> Jonathan > > > >>>>>>>>>>> > > > >>>>>>>>>>> Begin forwarded message: > > > >>>>>>>>>>> > > > >>>>>>>>>>>> From: Jonathan Ozik > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> To: Mihael Hategan , > > > >>>>>>>>>>>> "swift-user at ci.uchicago.edu" > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Hi all, > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > > > on > > > >>>>>>>>>>>> Blues. 
The stdout includes exceptions like: > > > >>>>>>>>>>>> exception @ swift-int.k, line: 511 > > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > > > >>>>>>>>>>>> java.io.IOException: Broken pipe > > > >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > > > >>>>>>>>>>>> at > > > >>>>>>>>>>>> > > > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > > > >>>>>>>>>>>> at > > > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > > > >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > > > >>>>>>>>>>>> at > > > >>>>>>>>>>>> > > > >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > > > >>>>>>>>>>>> at > > > >>>>>>>>>>>> > > > >> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > > > >>>>>>>>>>>> at > > > >>>>>>>>>>>> > > > >> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> These seem to occur at different parts of the submitted > > > >>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > > > to > > > >>>>>>>>>>>> look at. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> In earlier attempts I was getting these warnings followed > > > >> by > > > >>>>>>>>>>>> broken pipe errors: > > > >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > > > >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > > > >> 0) > > > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > > > >>>>>>>>>>>> allocate large pages, falling back to regular pages > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > > > as > > > >>>>>>>>>>>> described here > > > >>>>>>>>>>>> > > > >> > > > ( > > http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > > > >>>>>>>>>>>> Area: hotspot/gc > > > >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> On Linux, failures when allocating large pages can lead > > > to > > > >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > > > issue > > > >>>>>>>>>>>> can be recognized in two ways: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> ? Before the crash happens one or more lines similar to > > > >> this > > > >>>>>>>>>>>> will have been printed to the log: > > > >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > > > >> 0) > > > >>>>>>>>>>>> failed; > > > >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > > > allocate > > > >>>>>>>>>>>> large pages, falling back to regular pages > > > >>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > > > >>>>>>>>>>>> similar to this: > > > >>>>>>>>>>>> Large page allocation failures have occurred 3 times > > > >>>>>>>>>>>> The problem can be avoided by running with large page > > > >>>>>>>>>>>> support turned off, for example by passing the > > > >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> See 8007074 (not public). > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > > > invocations > > > >>>>>>>>>>>> of Java code that I was responsible for. That seemed to > > > get > > > >>>>>>>>>>>> rid of the warning and the crashes for a while, but > > > perhaps > > > >>>>>>>>>>>> that was just a coincidence. 
> > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Jonathan > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> _______________________________________________ > > > >>>>>>>>>>>> Swift-user mailing list > > > >>>>>>>>>>>> Swift-user at ci.uchicago.edu > > > >>>>>>>>>>>> > > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> -- > > > >>>>>>>>>> Michael Wilde > > > >>>>>>>>>> Mathematics and Computer Science Computation > > > >> Institute > > > >>>>>>>>>> Argonne National Laboratory The University of > > > >> Chicago > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >>>> > > > >>> > > > >> > > > >> > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > From jozik at uchicago.edu Thu Jul 31 13:42:15 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 31 Jul 2014 13:42:15 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <1406830708.23317.3.camel@echo> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. Jonathan On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: > Ok, so the workers die while the jobs are running and not much else is > happening. > My money is on the apps eating up all RAM and the kernel killing the > worker. > > The question is how we check whether this is true or not. Ideas? > > Yadu, can you do me a favor and package all the PBS output files from > this run? > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > Mihael > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: >> Sure thing, it?s attached below. >> >> Jonathan >> >> >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan >> wrote: >> >>> Hi Jonathan, >>> >>> I can't see anything obvious in the worker logs, but they are pretty >>> large. Can you also post the swift log from this run? It would make >> it >>> easier to focus on the right time frame. >>> >>> Mihael >>> >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: >>>> Hi all, >>>> >>>> I?m attaching the stdout and the worker logs below. >>>> >>>> Thanks for looking at these! >>>> >>>> Jonathan >>>> >>>> >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik >>>> wrote: >>>> >>>>> Woops, sorry about that. It?s running now and the logs are being >>>> generated. Once the run is done I?ll send you log files. >>>>> >>>>> Thanks! >>>>> >>>>> Jonathan >>>>> >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan >>>> wrote: >>>>> >>>>>> Right. This isn't your fault. We should, though, probably talk >>>> about >>>>>> addressing the issue. 
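For reference, the site definition with the offending inline comments removed would read as below. The values are exactly the ones Yadu posted earlier in this thread; nothing else is changed:

site=blues

site.blues {
    jobManager=pbs
    jobQueue=shared
    maxJobs=4
    jobGranularity=1
    maxNodesPerJob=1
    tasksPerWorker=16
    taskThrottle=64
    initialScore=10000
    jobWalltime=00:48:00
    taskWalltime=00:40:00
    workerloglevel=DEBUG
    workerlogdirectory=/home/$USER/
    workdir=$RUNDIRECTORY
    filesystem=local
}

Whether standalone comment lines are supported is not settled in this thread; the safe form is simply to leave comments out of the file.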
>>>>>> >>>>>> Mihael >>>>>> >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: >>>>>>> Mihael, thanks for spotting that. I added the comments to >>>> highlight the >>>>>>> changes in email. >>>>>>> >>>>>>> -Yadu >>>>>>> >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: >>>>>>>> Hi Jonathan, >>>>>>>> >>>>>>>> I suspect that the site config is considering the comment to be >>>> part of >>>>>>>> the value of the workerLogLevel property. We could confirm this >>>> if you >>>>>>>> send us the swift log from this particular run. >>>>>>>> >>>>>>>> To fix it, you could try to remove everything after DEBUG >>>> (including all >>>>>>>> horizontal white space). In other words: >>>>>>>> >>>>>>>> ... >>>>>>>> workerloglevel=DEBUG >>>>>>>> workerlogdirectory=/home/$USER/ >>>>>>>> ... >>>>>>>> >>>>>>>> Mihael >>>>>>>> >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: >>>>>>>>> Hi Yadu, >>>>>>>>> >>>>>>>>> >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker >>>> logging >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m >> doing >>>>>>>>> something silly. >>>>>>>>> >>>>>>>>> >>>>>>>>> Jonathan >>>>>>>>> >>>>>>>>> >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji >>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Jonathan, >>>>>>>>>> >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do >> not >>>> see >>>>>>>>>> anything unusual. >>>>>>>>>> >>>>>>>>>> From your logs, it looks like workers are failing, so getting >>>> worker >>>>>>>>>> logs would help. >>>>>>>>>> Could you try running on Blues with the following >>>> swift.properties >>>>>>>>>> and get us the worker*logs that would show up in the >>>>>>>>>> workerlogdirectory ? >>>>>>>>>> >>>>>>>>>> site=blues >>>>>>>>>> >>>>>>>>>> site.blues { >>>>>>>>>> jobManager=pbs >>>>>>>>>> jobQueue=shared >>>>>>>>>> maxJobs=4 >>>>>>>>>> jobGranularity=1 >>>>>>>>>> maxNodesPerJob=1 >>>>>>>>>> tasksPerWorker=16 >>>>>>>>>> taskThrottle=64 >>>>>>>>>> initialScore=10000 >>>>>>>>>> jobWalltime=00:48:00 >>>>>>>>>> taskWalltime=00:40:00 >>>>>>>>>> workerloglevel=DEBUG # >>>> Adding >>>>>>>>>> debug for workers >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging >>>>>>>>>> directory on SFS >>>>>>>>>> workdir=$RUNDIRECTORY >>>>>>>>>> filesystem=local >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Yadu >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: >>>>>>>>>> >>>>>>>>>>> Hi Mike, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I >>>> didn?t >>>>>>>>>>> get the same issue. That is, the model run completed >>>> successfully. >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May >> 29, >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also >>>>>>>>>>> including the swift.properties file that was used for the >>>> blues >>>>>>>>>>> runs below. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thank you! >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Jonathan >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde >>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Jonathan, >>>>>>>>>>>> >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! >>>>>>>>>>>> >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. >>>>>>>>>>>> >>>>>>>>>>>> (But the first question is: which Swift release, and can >> you >>>>>>>>>>>> point us to, or send, the full log file?) 
>>>>>>>>>>>> >>>>>>>>>>>> Thanks and regards, >>>>>>>>>>>> >>>>>>>>>>>> - Mike >>>>>>>>>>>> >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Mike, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure >> that >>>>>>>>>>>>> the message came across. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Jonathan >>>>>>>>>>>>> >>>>>>>>>>>>> Begin forwarded message: >>>>>>>>>>>>> >>>>>>>>>>>>>> From: Jonathan Ozik >>>>>>>>>>>>>> >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>>>> >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT >>>>>>>>>>>>>> >>>>>>>>>>>>>> To: Mihael Hategan , >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running >> on >>>>>>>>>>>>>> Blues. The stdout includes exceptions like: >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>>>> java.io.IOException: Broken pipe >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >>>>>>>>>>>>>> at >>>>>>>>>>>>>> >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >>>>>>>>>>>>>> at >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >>>>>>>>>>>>>> at >>>>>>>>>>>>>> >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >>>>>>>>>>>>>> at >>>>>>>>>>>>>> >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >>>>>>>>>>>>>> at >>>>>>>>>>>>>> >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >>>>>>>>>>>>>> >>>>>>>>>>>>>> These seem to occur at different parts of the submitted >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like >> to >>>>>>>>>>>>>> look at. >>>>>>>>>>>>>> >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed >>>> by >>>>>>>>>>>>>> broken pipe errors: >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, >>>> 0) >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot >>>>>>>>>>>>>> allocate large pages, falling back to regular pages >>>>>>>>>>>>>> >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 >> as >>>>>>>>>>>>>> described here >>>>>>>>>>>>>> >>>> >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >>>>>>>>>>>>>> Area: hotspot/gc >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead >> to >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the >> issue >>>>>>>>>>>>>> can be recognized in two ways: >>>>>>>>>>>>>> >>>>>>>>>>>>>> ? Before the crash happens one or more lines similar to >>>> this >>>>>>>>>>>>>> will have been printed to the log: >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, >>>> 0) >>>>>>>>>>>>>> failed; >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot >> allocate >>>>>>>>>>>>>> large pages, falling back to regular pages >>>>>>>>>>>>>> ? 
If a hs_err file is generated it will contain a line >>>>>>>>>>>>>> similar to this: >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times >>>>>>>>>>>>>> The problem can be avoided by running with large page >>>>>>>>>>>>>> support turned off, for example by passing the >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. >>>>>>>>>>>>>> >>>>>>>>>>>>>> See 8007074 (not public). >>>>>>>>>>>>>> >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the >> invocations >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to >> get >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but >> perhaps >>>>>>>>>>>>>> that was just a coincidence. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jonathan >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>>>> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Michael Wilde >>>>>>>>>>>> Mathematics and Computer Science Computation >>>> Institute >>>>>>>>>>>> Argonne National Laboratory The University of >>>> Chicago >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>> >>> >> >> > > From davidkelly at uchicago.edu Thu Jul 31 13:55:33 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Thu, 31 Jul 2014 13:55:33 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: > Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on > the sSerial queue rather than the shared queue for the other runs. Just > wanted to note that for comparison. Each of the java processes that are > launched with -Xmx1536m. I believe that Blues advertises each node having > access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at > first glance the memory issue doesn?t seem like it could be an issue. > > Jonathan > > On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: > > > Ok, so the workers die while the jobs are running and not much else is > > happening. > > My money is on the apps eating up all RAM and the kernel killing the > > worker. > > > > The question is how we check whether this is true or not. Ideas? > > > > Yadu, can you do me a favor and package all the PBS output files from > > this run? > > > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > > > Mihael > > > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > >> Sure thing, it?s attached below. > >> > >> Jonathan > >> > >> > >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan > >> wrote: > >> > >>> Hi Jonathan, > >>> > >>> I can't see anything obvious in the worker logs, but they are pretty > >>> large. Can you also post the swift log from this run? 
It would make > >> it > >>> easier to focus on the right time frame. > >>> > >>> Mihael > >>> > >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >>>> Hi all, > >>>> > >>>> I?m attaching the stdout and the worker logs below. > >>>> > >>>> Thanks for looking at these! > >>>> > >>>> Jonathan > >>>> > >>>> > >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > >>>> wrote: > >>>> > >>>>> Woops, sorry about that. It?s running now and the logs are being > >>>> generated. Once the run is done I?ll send you log files. > >>>>> > >>>>> Thanks! > >>>>> > >>>>> Jonathan > >>>>> > >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > >>>> wrote: > >>>>> > >>>>>> Right. This isn't your fault. We should, though, probably talk > >>>> about > >>>>>> addressing the issue. > >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>>>>>> Mihael, thanks for spotting that. I added the comments to > >>>> highlight the > >>>>>>> changes in email. > >>>>>>> > >>>>>>> -Yadu > >>>>>>> > >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> I suspect that the site config is considering the comment to be > >>>> part of > >>>>>>>> the value of the workerLogLevel property. We could confirm this > >>>> if you > >>>>>>>> send us the swift log from this particular run. > >>>>>>>> > >>>>>>>> To fix it, you could try to remove everything after DEBUG > >>>> (including all > >>>>>>>> horizontal white space). In other words: > >>>>>>>> > >>>>>>>> ... > >>>>>>>> workerloglevel=DEBUG > >>>>>>>> workerlogdirectory=/home/$USER/ > >>>>>>>> ... > >>>>>>>> > >>>>>>>> Mihael > >>>>>>>> > >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>>>>>> Hi Yadu, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > >>>> logging > >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m > >> doing > >>>>>>>>> something silly. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Jonathan, > >>>>>>>>>> > >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > >> not > >>>> see > >>>>>>>>>> anything unusual. > >>>>>>>>>> > >>>>>>>>>> From your logs, it looks like workers are failing, so getting > >>>> worker > >>>>>>>>>> logs would help. > >>>>>>>>>> Could you try running on Blues with the following > >>>> swift.properties > >>>>>>>>>> and get us the worker*logs that would show up in the > >>>>>>>>>> workerlogdirectory ? > >>>>>>>>>> > >>>>>>>>>> site=blues > >>>>>>>>>> > >>>>>>>>>> site.blues { > >>>>>>>>>> jobManager=pbs > >>>>>>>>>> jobQueue=shared > >>>>>>>>>> maxJobs=4 > >>>>>>>>>> jobGranularity=1 > >>>>>>>>>> maxNodesPerJob=1 > >>>>>>>>>> tasksPerWorker=16 > >>>>>>>>>> taskThrottle=64 > >>>>>>>>>> initialScore=10000 > >>>>>>>>>> jobWalltime=00:48:00 > >>>>>>>>>> taskWalltime=00:40:00 > >>>>>>>>>> workerloglevel=DEBUG # > >>>> Adding > >>>>>>>>>> debug for workers > >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>>>>>> directory on SFS > >>>>>>>>>> workdir=$RUNDIRECTORY > >>>>>>>>>> filesystem=local > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Yadu > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Mike, > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! 
> >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >>>> didn?t > >>>>>>>>>>> get the same issue. That is, the model run completed > >>>> successfully. > >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May > >> 29, > >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also > >>>>>>>>>>> including the swift.properties file that was used for the > >>>> blues > >>>>>>>>>>> runs below. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Thank you! > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Jonathan > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > >>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Jonathan, > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > >>>>>>>>>>>> > >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. > >>>>>>>>>>>> > >>>>>>>>>>>> (But the first question is: which Swift release, and can > >> you > >>>>>>>>>>>> point us to, or send, the full log file?) > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks and regards, > >>>>>>>>>>>> > >>>>>>>>>>>> - Mike > >>>>>>>>>>>> > >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Mike, > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > >> that > >>>>>>>>>>>>> the message came across. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>> > >>>>>>>>>>>>> Begin forwarded message: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> From: Jonathan Ozik > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> To: Mihael Hategan , > >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > >> on > >>>>>>>>>>>>>> Blues. The stdout includes exceptions like: > >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>>>>>> at > >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > >> to > >>>>>>>>>>>>>> look at. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed > >>>> by > >>>>>>>>>>>>>> broken pipe errors: > >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >>>>>>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > >> as > >>>>>>>>>>>>>> described here > >>>>>>>>>>>>>> > >>>> > >> ( > http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>>>>>> Area: hotspot/gc > >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead > >> to > >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > >> issue > >>>>>>>>>>>>>> can be recognized in two ways: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> ? Before the crash happens one or more lines similar to > >>>> this > >>>>>>>>>>>>>> will have been printed to the log: > >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; > >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > >> allocate > >>>>>>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > >>>>>>>>>>>>>> similar to this: > >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> See 8007074 (not public). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > >> invocations > >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to > >> get > >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but > >> perhaps > >>>>>>>>>>>>>> that was just a coincidence. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>>>>>> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Michael Wilde > >>>>>>>>>>>> Mathematics and Computer Science Computation > >>>> Institute > >>>>>>>>>>>> Argonne National Laboratory The University of > >>>> Chicago > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > >> > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jozik at uchicago.edu Thu Jul 31 14:28:53 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 31 Jul 2014 14:28:53 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: <2FE8C351-A0F2-4868-8488-BE66D753B5FD@uchicago.edu> The tasksPerWorker=8 on sSerial seems to have worked. That?s good because it worked but not so good because it didn?t add any definitive information? As for the large number of threads question, I believe that each Java application isn?t creating any additional threads or, if so, maybe one additional. Let me know if you?d like to see the successful log files (and which ones). Jonathan On Jul 31, 2014, at 1:55 PM, David Kelly wrote: > Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. > > > On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: > Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. > > Jonathan > > On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: > > > Ok, so the workers die while the jobs are running and not much else is > > happening. > > My money is on the apps eating up all RAM and the kernel killing the > > worker. > > > > The question is how we check whether this is true or not. Ideas? > > > > Yadu, can you do me a favor and package all the PBS output files from > > this run? > > > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > > > Mihael > > > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > >> Sure thing, it?s attached below. > >> > >> Jonathan > >> > >> > >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan > >> wrote: > >> > >>> Hi Jonathan, > >>> > >>> I can't see anything obvious in the worker logs, but they are pretty > >>> large. Can you also post the swift log from this run? It would make > >> it > >>> easier to focus on the right time frame. > >>> > >>> Mihael > >>> > >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >>>> Hi all, > >>>> > >>>> I?m attaching the stdout and the worker logs below. > >>>> > >>>> Thanks for looking at these! > >>>> > >>>> Jonathan > >>>> > >>>> > >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > >>>> wrote: > >>>> > >>>>> Woops, sorry about that. It?s running now and the logs are being > >>>> generated. Once the run is done I?ll send you log files. > >>>>> > >>>>> Thanks! > >>>>> > >>>>> Jonathan > >>>>> > >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > >>>> wrote: > >>>>> > >>>>>> Right. This isn't your fault. We should, though, probably talk > >>>> about > >>>>>> addressing the issue. 
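A quick way to probe David's per-node process-limit hypothesis from inside a job is sketched below. These are ordinary Linux commands rather than anything used in this thread, and the actual limits are site-specific, so treat the output as a hint only:

   ulimit -u                             # user process/thread limit in effect on the node
   ps -u $USER -L --no-headers | wc -l   # rough count of threads currently owned by this user

If the second number approaches the first while the Swift workers are busy, processes being killed by the scheduler or kernel becomes a plausible explanation.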
> >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>>>>>> Mihael, thanks for spotting that. I added the comments to > >>>> highlight the > >>>>>>> changes in email. > >>>>>>> > >>>>>>> -Yadu > >>>>>>> > >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> I suspect that the site config is considering the comment to be > >>>> part of > >>>>>>>> the value of the workerLogLevel property. We could confirm this > >>>> if you > >>>>>>>> send us the swift log from this particular run. > >>>>>>>> > >>>>>>>> To fix it, you could try to remove everything after DEBUG > >>>> (including all > >>>>>>>> horizontal white space). In other words: > >>>>>>>> > >>>>>>>> ... > >>>>>>>> workerloglevel=DEBUG > >>>>>>>> workerlogdirectory=/home/$USER/ > >>>>>>>> ... > >>>>>>>> > >>>>>>>> Mihael > >>>>>>>> > >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>>>>>> Hi Yadu, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > >>>> logging > >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m > >> doing > >>>>>>>>> something silly. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Jonathan, > >>>>>>>>>> > >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > >> not > >>>> see > >>>>>>>>>> anything unusual. > >>>>>>>>>> > >>>>>>>>>> From your logs, it looks like workers are failing, so getting > >>>> worker > >>>>>>>>>> logs would help. > >>>>>>>>>> Could you try running on Blues with the following > >>>> swift.properties > >>>>>>>>>> and get us the worker*logs that would show up in the > >>>>>>>>>> workerlogdirectory ? > >>>>>>>>>> > >>>>>>>>>> site=blues > >>>>>>>>>> > >>>>>>>>>> site.blues { > >>>>>>>>>> jobManager=pbs > >>>>>>>>>> jobQueue=shared > >>>>>>>>>> maxJobs=4 > >>>>>>>>>> jobGranularity=1 > >>>>>>>>>> maxNodesPerJob=1 > >>>>>>>>>> tasksPerWorker=16 > >>>>>>>>>> taskThrottle=64 > >>>>>>>>>> initialScore=10000 > >>>>>>>>>> jobWalltime=00:48:00 > >>>>>>>>>> taskWalltime=00:40:00 > >>>>>>>>>> workerloglevel=DEBUG # > >>>> Adding > >>>>>>>>>> debug for workers > >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>>>>>> directory on SFS > >>>>>>>>>> workdir=$RUNDIRECTORY > >>>>>>>>>> filesystem=local > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Yadu > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Mike, > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! > >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >>>> didn?t > >>>>>>>>>>> get the same issue. That is, the model run completed > >>>> successfully. > >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May > >> 29, > >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also > >>>>>>>>>>> including the swift.properties file that was used for the > >>>> blues > >>>>>>>>>>> runs below. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Thank you! > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Jonathan > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > >>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Jonathan, > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! 
> >>>>>>>>>>>> > >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. > >>>>>>>>>>>> > >>>>>>>>>>>> (But the first question is: which Swift release, and can > >> you > >>>>>>>>>>>> point us to, or send, the full log file?) > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks and regards, > >>>>>>>>>>>> > >>>>>>>>>>>> - Mike > >>>>>>>>>>>> > >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Mike, > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > >> that > >>>>>>>>>>>>> the message came across. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>> > >>>>>>>>>>>>> Begin forwarded message: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> From: Jonathan Ozik > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> To: Mihael Hategan , > >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > >> on > >>>>>>>>>>>>>> Blues. The stdout includes exceptions like: > >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>>>>>> at > >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > >> to > >>>>>>>>>>>>>> look at. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed > >>>> by > >>>>>>>>>>>>>> broken pipe errors: > >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >>>>>>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > >> as > >>>>>>>>>>>>>> described here > >>>>>>>>>>>>>> > >>>> > >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>>>>>> Area: hotspot/gc > >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead > >> to > >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > >> issue > >>>>>>>>>>>>>> can be recognized in two ways: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> ? 
Before the crash happens one or more lines similar to > >>>> this > >>>>>>>>>>>>>> will have been printed to the log: > >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; > >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > >> allocate > >>>>>>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > >>>>>>>>>>>>>> similar to this: > >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> See 8007074 (not public). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > >> invocations > >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to > >> get > >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but > >> perhaps > >>>>>>>>>>>>>> that was just a coincidence. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>>>>>> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Michael Wilde > >>>>>>>>>>>> Mathematics and Computer Science Computation > >>>> Institute > >>>>>>>>>>>> Argonne National Laboratory The University of > >>>> Chicago > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > >> > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at anl.gov Thu Jul 31 15:18:53 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 31 Jul 2014 15:18:53 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: <53DAA4AD.3010602@anl.gov> Its odd that no errors like OOM or OOT would be logged to stdout of the PBS job. Jonathan, can you check with the Blues Sysadmins if they have any other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which they ran? Thanks, - Mike On 7/31/14, 1:55 PM, David Kelly wrote: > Is each invocation of the Java app creating a large number of threads? > I ran into an issue like that (on another cluster) where I was hitting > the maximum number of processes per node, and the scheduler ended up > killing my jobs. > > > On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik > wrote: > > Okay, I've launched a new job, with tasksPerWorker=8. This is > running on the sSerial queue rather than the shared queue for the > other runs. Just wanted to note that for comparison. 
Each of the > java processes that are launched with -Xmx1536m. I believe that > Blues advertises each node having access to 64GB > (http://www.lcrc.anl.gov/about/Blues), so at least at first glance > the memory issue doesn't seem like it could be an issue. > > Jonathan > > On Jul 31, 2014, at 1:18 PM, Mihael Hategan > wrote: > > > Ok, so the workers die while the jobs are running and not much > else is > > happening. > > My money is on the apps eating up all RAM and the kernel killing the > > worker. > > > > The question is how we check whether this is true or not. Ideas? > > > > Yadu, can you do me a favor and package all the PBS output files > from > > this run? > > > > Jonathan, can you see if you get the same errors with > tasksPerWorker=8? > > > > Mihael > > > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > >> Sure thing, it's attached below. > >> > >> Jonathan > >> > >> > >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan > > > >> wrote: > >> > >>> Hi Jonathan, > >>> > >>> I can't see anything obvious in the worker logs, but they are > pretty > >>> large. Can you also post the swift log from this run? It would > make > >> it > >>> easier to focus on the right time frame. > >>> > >>> Mihael > >>> > >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >>>> Hi all, > >>>> > >>>> I'm attaching the stdout and the worker logs below. > >>>> > >>>> Thanks for looking at these! > >>>> > >>>> Jonathan > >>>> > >>>> > >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > > > >>>> wrote: > >>>> > >>>>> Woops, sorry about that. It's running now and the logs are being > >>>> generated. Once the run is done I'll send you log files. > >>>>> > >>>>> Thanks! > >>>>> > >>>>> Jonathan > >>>>> > >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > > > >>>> wrote: > >>>>> > >>>>>> Right. This isn't your fault. We should, though, probably talk > >>>> about > >>>>>> addressing the issue. > >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>>>>>> Mihael, thanks for spotting that. I added the comments to > >>>> highlight the > >>>>>>> changes in email. > >>>>>>> > >>>>>>> -Yadu > >>>>>>> > >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> I suspect that the site config is considering the comment > to be > >>>> part of > >>>>>>>> the value of the workerLogLevel property. We could > confirm this > >>>> if you > >>>>>>>> send us the swift log from this particular run. > >>>>>>>> > >>>>>>>> To fix it, you could try to remove everything after DEBUG > >>>> (including all > >>>>>>>> horizontal white space). In other words: > >>>>>>>> > >>>>>>>> ... > >>>>>>>> workerloglevel=DEBUG > >>>>>>>> workerlogdirectory=/home/$USER/ > >>>>>>>> ... > >>>>>>>> > >>>>>>>> Mihael > >>>>>>>> > >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>>>>>> Hi Yadu, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I'm getting errors indicating that DEBUG is an invalid > worker > >>>> logging > >>>>>>>>> level. I'm attaching the stdout below. Let me know if I'm > >> doing > >>>>>>>>> something silly. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >>>> > > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Jonathan, > >>>>>>>>>> > >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > >> not > >>>> see > >>>>>>>>>> anything unusual. 
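Spelling that estimate out (the heap setting and node size are the figures quoted in this thread; the overhead remark is a general JVM observation, not a measurement from Blues):

   16 tasks per worker x 1536 MB max heap = 24576 MB, i.e. about 24 GB of heap ceilings alone

A JVM's resident size is typically larger than its -Xmx (thread stacks, permgen/metaspace, JIT and native buffers all sit outside the heap), and on the shared queue the 64 GB is also shared with whatever else lands on the node, so the real headroom is smaller than the raw numbers suggest.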
> >>>>>>>>>> > >>>>>>>>>> From your logs, it looks like workers are failing, so > getting > >>>> worker > >>>>>>>>>> logs would help. > >>>>>>>>>> Could you try running on Blues with the following > >>>> swift.properties > >>>>>>>>>> and get us the worker*logs that would show up in the > >>>>>>>>>> workerlogdirectory ? > >>>>>>>>>> > >>>>>>>>>> site=blues > >>>>>>>>>> > >>>>>>>>>> site.blues { > >>>>>>>>>> jobManager=pbs > >>>>>>>>>> jobQueue=shared > >>>>>>>>>> maxJobs=4 > >>>>>>>>>> jobGranularity=1 > >>>>>>>>>> maxNodesPerJob=1 > >>>>>>>>>> tasksPerWorker=16 > >>>>>>>>>> taskThrottle=64 > >>>>>>>>>> initialScore=10000 > >>>>>>>>>> jobWalltime=00:48:00 > >>>>>>>>>> taskWalltime=00:40:00 > >>>>>>>>>> workerloglevel=DEBUG # > >>>> Adding > >>>>>>>>>> debug for workers > >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>>>>>> directory on SFS > >>>>>>>>>> workdir=$RUNDIRECTORY > >>>>>>>>>> filesystem=local > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Yadu > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Mike, > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! > >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >>>> didn't > >>>>>>>>>>> get the same issue. That is, the model run completed > >>>> successfully. > >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May > >> 29, > >>>>>>>>>>> 2014). I'm including one of the log files below. I'm also > >>>>>>>>>>> including the swift.properties file that was used for the > >>>> blues > >>>>>>>>>>> runs below. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Thank you! > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Jonathan > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > > > >>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Jonathan, > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the > ping! > >>>>>>>>>>>> > >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. > >>>>>>>>>>>> > >>>>>>>>>>>> (But the first question is: which Swift release, and can > >> you > >>>>>>>>>>>> point us to, or send, the full log file?) > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks and regards, > >>>>>>>>>>>> > >>>>>>>>>>>> - Mike > >>>>>>>>>>>> > >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Mike, > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> I didn't get a response yet so just wanted to make sure > >> that > >>>>>>>>>>>>> the message came across. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>> > >>>>>>>>>>>>> Begin forwarded message: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> From: Jonathan Ozik > > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, > line: 511, > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> To: Mihael Hategan >, > >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu > " > > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'm getting spurious errors in the jobs that I'm > running > >> on > >>>>>>>>>>>>>> Blues. 
The stdout includes exceptions like: > >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>>>>>> at > >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> > org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>>>>>> jobs. Let me know if there's a log file that you'd like > >> to > >>>>>>>>>>>>>> look at. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In earlier attempts I was getting these warnings > followed > >>>> by > >>>>>>>>>>>>>> broken pipe errors: > >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, > 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); > Cannot > >>>>>>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Apparently that's a known precursor of crashes on > Java 7 > >> as > >>>>>>>>>>>>>> described here > >>>>>>>>>>>>>> > >>>> > >> > (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>>>>>> Area: hotspot/gc > >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large > pages. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead > >> to > >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > >> issue > >>>>>>>>>>>>>> can be recognized in two ways: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> . Before the crash happens one or more lines similar to > >>>> this > >>>>>>>>>>>>>> will have been printed to the log: > >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, > 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; > >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > >> allocate > >>>>>>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>>>>>> . If a hs_err file is generated it will contain a line > >>>>>>>>>>>>>> similar to this: > >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> See 8007074 (not public). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > >> invocations > >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to > >> get > >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but > >> perhaps > >>>>>>>>>>>>>> that was just a coincidence. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu > > >>>>>>>>>>>>>> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Michael Wilde > >>>>>>>>>>>> Mathematics and Computer Science Computation > >>>> Institute > >>>>>>>>>>>> Argonne National Laboratory The > University of > >>>> Chicago > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > >> > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From jozik at uchicago.edu Thu Jul 31 16:42:27 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 31 Jul 2014 16:42:27 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DAA4AD.3010602@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> <53DAA4AD.3010602@anl.gov> Message-ID: Mike, all, I?ve done a few more experiments and realized that while I had added the -XX:-UseLargePages command line argument to the main Java invocation, each app also included a post process Java app which I hadn?t added the UseLargePages argument to. Perhaps because this was running at the end of an app, the warnings that I?d seen previously prior to the crashes weren?t making it to stdout before the crashes. Purely speculation. In any case, I did put the tasksPerWorker back to 16 and was able to successfully run all the jobs. I guess one bit of consolation in all this is that if there are any future ?mystery? worker failings, this will be one issue to check. Thank you all for helping with tracking this down. Jonathan On Jul 31, 2014, at 3:18 PM, Michael Wilde wrote: > Its odd that no errors like OOM or OOT would be logged to stdout of the PBS job. > > Jonathan, can you check with the Blues Sysadmins if they have any other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which they ran? > > Thanks, > > - Mike > > On 7/31/14, 1:55 PM, David Kelly wrote: >> Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. >> >> >> On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: >> Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. 
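To make the fix described above concrete: if each app is driven by a small wrapper script, the flag has to be threaded through to every JVM the wrapper starts, not just the first one. The script below is purely illustrative — the jar names, arguments and paths are invented for the sketch; only the JVM options come from this thread:

   #!/bin/bash
   # main model run, with large pages disabled
   java -Xmx1536m -XX:-UseLargePages -jar model.jar "$@"
   # the post-processing step is a separate JVM and needs the flag too
   java -XX:-UseLargePages -jar postprocess.jar output/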
Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. >> >> Jonathan >> >> On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: >> >> > Ok, so the workers die while the jobs are running and not much else is >> > happening. >> > My money is on the apps eating up all RAM and the kernel killing the >> > worker. >> > >> > The question is how we check whether this is true or not. Ideas? >> > >> > Yadu, can you do me a favor and package all the PBS output files from >> > this run? >> > >> > Jonathan, can you see if you get the same errors with tasksPerWorker=8? >> > >> > Mihael >> > >> > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: >> >> Sure thing, it?s attached below. >> >> >> >> Jonathan >> >> >> >> >> >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan >> >> wrote: >> >> >> >>> Hi Jonathan, >> >>> >> >>> I can't see anything obvious in the worker logs, but they are pretty >> >>> large. Can you also post the swift log from this run? It would make >> >> it >> >>> easier to focus on the right time frame. >> >>> >> >>> Mihael >> >>> >> >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: >> >>>> Hi all, >> >>>> >> >>>> I?m attaching the stdout and the worker logs below. >> >>>> >> >>>> Thanks for looking at these! >> >>>> >> >>>> Jonathan >> >>>> >> >>>> >> >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik >> >>>> wrote: >> >>>> >> >>>>> Woops, sorry about that. It?s running now and the logs are being >> >>>> generated. Once the run is done I?ll send you log files. >> >>>>> >> >>>>> Thanks! >> >>>>> >> >>>>> Jonathan >> >>>>> >> >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan >> >>>> wrote: >> >>>>> >> >>>>>> Right. This isn't your fault. We should, though, probably talk >> >>>> about >> >>>>>> addressing the issue. >> >>>>>> >> >>>>>> Mihael >> >>>>>> >> >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: >> >>>>>>> Mihael, thanks for spotting that. I added the comments to >> >>>> highlight the >> >>>>>>> changes in email. >> >>>>>>> >> >>>>>>> -Yadu >> >>>>>>> >> >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: >> >>>>>>>> Hi Jonathan, >> >>>>>>>> >> >>>>>>>> I suspect that the site config is considering the comment to be >> >>>> part of >> >>>>>>>> the value of the workerLogLevel property. We could confirm this >> >>>> if you >> >>>>>>>> send us the swift log from this particular run. >> >>>>>>>> >> >>>>>>>> To fix it, you could try to remove everything after DEBUG >> >>>> (including all >> >>>>>>>> horizontal white space). In other words: >> >>>>>>>> >> >>>>>>>> ... >> >>>>>>>> workerloglevel=DEBUG >> >>>>>>>> workerlogdirectory=/home/$USER/ >> >>>>>>>> ... >> >>>>>>>> >> >>>>>>>> Mihael >> >>>>>>>> >> >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: >> >>>>>>>>> Hi Yadu, >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker >> >>>> logging >> >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m >> >> doing >> >>>>>>>>> something silly. >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Jonathan >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji >> >>>> >> >>>>>>>>> wrote: >> >>>>>>>>> >> >>>>>>>>>> Hi Jonathan, >> >>>>>>>>>> >> >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do >> >> not >> >>>> see >> >>>>>>>>>> anything unusual. 
>> >>>>>>>>>> >> >>>>>>>>>> From your logs, it looks like workers are failing, so getting >> >>>> worker >> >>>>>>>>>> logs would help. >> >>>>>>>>>> Could you try running on Blues with the following >> >>>> swift.properties >> >>>>>>>>>> and get us the worker*logs that would show up in the >> >>>>>>>>>> workerlogdirectory ? >> >>>>>>>>>> >> >>>>>>>>>> site=blues >> >>>>>>>>>> >> >>>>>>>>>> site.blues { >> >>>>>>>>>> jobManager=pbs >> >>>>>>>>>> jobQueue=shared >> >>>>>>>>>> maxJobs=4 >> >>>>>>>>>> jobGranularity=1 >> >>>>>>>>>> maxNodesPerJob=1 >> >>>>>>>>>> tasksPerWorker=16 >> >>>>>>>>>> taskThrottle=64 >> >>>>>>>>>> initialScore=10000 >> >>>>>>>>>> jobWalltime=00:48:00 >> >>>>>>>>>> taskWalltime=00:40:00 >> >>>>>>>>>> workerloglevel=DEBUG # >> >>>> Adding >> >>>>>>>>>> debug for workers >> >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging >> >>>>>>>>>> directory on SFS >> >>>>>>>>>> workdir=$RUNDIRECTORY >> >>>>>>>>>> filesystem=local >> >>>>>>>>>> } >> >>>>>>>>>> >> >>>>>>>>>> Thanks, >> >>>>>>>>>> Yadu >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: >> >>>>>>>>>> >> >>>>>>>>>>> Hi Mike, >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! >> >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I >> >>>> didn?t >> >>>>>>>>>>> get the same issue. That is, the model run completed >> >>>> successfully. >> >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May >> >> 29, >> >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also >> >>>>>>>>>>> including the swift.properties file that was used for the >> >>>> blues >> >>>>>>>>>>> runs below. >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Thank you! >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Jonathan >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde >> >>>> wrote: >> >>>>>>>>>>> >> >>>>>>>>>>>> Hi Jonathan, >> >>>>>>>>>>>> >> >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! >> >>>>>>>>>>>> >> >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. >> >>>>>>>>>>>> >> >>>>>>>>>>>> (But the first question is: which Swift release, and can >> >> you >> >>>>>>>>>>>> point us to, or send, the full log file?) >> >>>>>>>>>>>> >> >>>>>>>>>>>> Thanks and regards, >> >>>>>>>>>>>> >> >>>>>>>>>>>> - Mike >> >>>>>>>>>>>> >> >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: >> >>>>>>>>>>>> >> >>>>>>>>>>>>> Hi Mike, >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure >> >> that >> >>>>>>>>>>>>> the message came across. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Jonathan >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Begin forwarded message: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>>> From: Jonathan Ozik >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, >> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> To: Mihael Hategan , >> >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Hi all, >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running >> >> on >> >>>>>>>>>>>>>> Blues. 
The stdout includes exceptions like: >> >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 >> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >> >>>>>>>>>>>>>> java.io.IOException: Broken pipe >> >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >> >>>>>>>>>>>>>> at >> >>>>>>>>>>>>>> >> >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >> >>>>>>>>>>>>>> at >> >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >> >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >> >>>>>>>>>>>>>> at >> >>>>>>>>>>>>>> >> >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >> >>>>>>>>>>>>>> at >> >>>>>>>>>>>>>> >> >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >> >>>>>>>>>>>>>> at >> >>>>>>>>>>>>>> >> >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> These seem to occur at different parts of the submitted >> >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like >> >> to >> >>>>>>>>>>>>>> look at. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed >> >>>> by >> >>>>>>>>>>>>>> broken pipe errors: >> >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: >> >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, >> >>>> 0) >> >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot >> >>>>>>>>>>>>>> allocate large pages, falling back to regular pages >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 >> >> as >> >>>>>>>>>>>>>> described here >> >>>>>>>>>>>>>> >> >>>> >> >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >> >>>>>>>>>>>>>> Area: hotspot/gc >> >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead >> >> to >> >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the >> >> issue >> >>>>>>>>>>>>>> can be recognized in two ways: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> ? Before the crash happens one or more lines similar to >> >>>> this >> >>>>>>>>>>>>>> will have been printed to the log: >> >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, >> >>>> 0) >> >>>>>>>>>>>>>> failed; >> >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot >> >> allocate >> >>>>>>>>>>>>>> large pages, falling back to regular pages >> >>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line >> >>>>>>>>>>>>>> similar to this: >> >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times >> >>>>>>>>>>>>>> The problem can be avoided by running with large page >> >>>>>>>>>>>>>> support turned off, for example by passing the >> >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> See 8007074 (not public). >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the >> >> invocations >> >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to >> >> get >> >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but >> >> perhaps >> >>>>>>>>>>>>>> that was just a coincidence. 
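A minimal sketch of that workaround as it is eventually applied later in the thread: the flag has to go on every java command line the app issues, including any post-processing step, not just the main invocation. The jar names and the post-process heap size below are hypothetical placeholders, not taken from the actual model:

    # inside the app's wrapper script: pass -XX:-UseLargePages to *every* JVM it starts
    java -XX:-UseLargePages -Xmx1536m -jar model.jar "$@"          # main run (hypothetical jar name)
    java -XX:-UseLargePages -Xmx512m  -jar postprocess.jar output/ # post-processing step (hypothetical)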
>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Jonathan >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> _______________________________________________ >> >>>>>>>>>>>>>> Swift-user mailing list >> >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu >> >>>>>>>>>>>>>> >> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>> -- >> >>>>>>>>>>>> Michael Wilde >> >>>>>>>>>>>> Mathematics and Computer Science Computation >> >>>> Institute >> >>>>>>>>>>>> Argonne National Laboratory The University of >> >>>> Chicago >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>>> >> >>> >> >>> >> >> >> >> >> > >> > >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >> >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Jul 31 17:11:44 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 15:11:44 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> <53DAA4AD.3010602@anl.gov> Message-ID: <1406844704.25609.4.camel@echo> The worker is still being killed though, and there is little feedback to the user. If possible, we should fix that. Do you have an idea what is happening? Is it a process similar to an OOM where the kernel just decides to kill some process? Mihael On Thu, 2014-07-31 at 16:42 -0500, Jonathan Ozik wrote: > Mike, all, > > I?ve done a few more experiments and realized that while I had added the -XX:-UseLargePages command line argument to the main Java invocation, each app also included a post process Java app which I hadn?t added the UseLargePages argument to. Perhaps because this was running at the end of an app, the warnings that I?d seen previously prior to the crashes weren?t making it to stdout before the crashes. Purely speculation. > In any case, I did put the tasksPerWorker back to 16 and was able to successfully run all the jobs. I guess one bit of consolation in all this is that if there are any future ?mystery? worker failings, this will be one issue to check. > > Thank you all for helping with tracking this down. > > Jonathan > > On Jul 31, 2014, at 3:18 PM, Michael Wilde wrote: > > > Its odd that no errors like OOM or OOT would be logged to stdout of the PBS job. > > > > Jonathan, can you check with the Blues Sysadmins if they have any other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which they ran? 
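One way to check the OOM and process-limit hypotheses raised above, either run by the sysadmins or from an interactive job on one of the affected nodes. Log locations vary by distribution, so the paths below are typical guesses rather than anything confirmed for Blues:

    # evidence of the kernel OOM killer having fired
    dmesg | grep -iE 'out of memory|oom-killer|killed process'
    grep -i oom /var/log/messages 2>/dev/null | tail
    # per-node process/thread limits, in case workers are being killed for
    # exceeding them rather than for memory
    ulimit -u                      # max user processes allowed
    ps -L -u "$USER" | wc -l       # rough count of current processes/threads (includes header line)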
> > > > Thanks, > > > > - Mike > > > > On 7/31/14, 1:55 PM, David Kelly wrote: > >> Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. > >> > >> > >> On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: > >> Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. > >> > >> Jonathan > >> > >> On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: > >> > >> > Ok, so the workers die while the jobs are running and not much else is > >> > happening. > >> > My money is on the apps eating up all RAM and the kernel killing the > >> > worker. > >> > > >> > The question is how we check whether this is true or not. Ideas? > >> > > >> > Yadu, can you do me a favor and package all the PBS output files from > >> > this run? > >> > > >> > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > >> > > >> > Mihael > >> > > >> > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > >> >> Sure thing, it?s attached below. > >> >> > >> >> Jonathan > >> >> > >> >> > >> >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan > >> >> wrote: > >> >> > >> >>> Hi Jonathan, > >> >>> > >> >>> I can't see anything obvious in the worker logs, but they are pretty > >> >>> large. Can you also post the swift log from this run? It would make > >> >> it > >> >>> easier to focus on the right time frame. > >> >>> > >> >>> Mihael > >> >>> > >> >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >> >>>> Hi all, > >> >>>> > >> >>>> I?m attaching the stdout and the worker logs below. > >> >>>> > >> >>>> Thanks for looking at these! > >> >>>> > >> >>>> Jonathan > >> >>>> > >> >>>> > >> >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > >> >>>> wrote: > >> >>>> > >> >>>>> Woops, sorry about that. It?s running now and the logs are being > >> >>>> generated. Once the run is done I?ll send you log files. > >> >>>>> > >> >>>>> Thanks! > >> >>>>> > >> >>>>> Jonathan > >> >>>>> > >> >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > >> >>>> wrote: > >> >>>>> > >> >>>>>> Right. This isn't your fault. We should, though, probably talk > >> >>>> about > >> >>>>>> addressing the issue. > >> >>>>>> > >> >>>>>> Mihael > >> >>>>>> > >> >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >> >>>>>>> Mihael, thanks for spotting that. I added the comments to > >> >>>> highlight the > >> >>>>>>> changes in email. > >> >>>>>>> > >> >>>>>>> -Yadu > >> >>>>>>> > >> >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >> >>>>>>>> Hi Jonathan, > >> >>>>>>>> > >> >>>>>>>> I suspect that the site config is considering the comment to be > >> >>>> part of > >> >>>>>>>> the value of the workerLogLevel property. We could confirm this > >> >>>> if you > >> >>>>>>>> send us the swift log from this particular run. > >> >>>>>>>> > >> >>>>>>>> To fix it, you could try to remove everything after DEBUG > >> >>>> (including all > >> >>>>>>>> horizontal white space). In other words: > >> >>>>>>>> > >> >>>>>>>> ... 
> >> >>>>>>>> workerloglevel=DEBUG > >> >>>>>>>> workerlogdirectory=/home/$USER/ > >> >>>>>>>> ... > >> >>>>>>>> > >> >>>>>>>> Mihael > >> >>>>>>>> > >> >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >> >>>>>>>>> Hi Yadu, > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > >> >>>> logging > >> >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m > >> >> doing > >> >>>>>>>>> something silly. > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> Jonathan > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >> >>>> > >> >>>>>>>>> wrote: > >> >>>>>>>>> > >> >>>>>>>>>> Hi Jonathan, > >> >>>>>>>>>> > >> >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > >> >> not > >> >>>> see > >> >>>>>>>>>> anything unusual. > >> >>>>>>>>>> > >> >>>>>>>>>> From your logs, it looks like workers are failing, so getting > >> >>>> worker > >> >>>>>>>>>> logs would help. > >> >>>>>>>>>> Could you try running on Blues with the following > >> >>>> swift.properties > >> >>>>>>>>>> and get us the worker*logs that would show up in the > >> >>>>>>>>>> workerlogdirectory ? > >> >>>>>>>>>> > >> >>>>>>>>>> site=blues > >> >>>>>>>>>> > >> >>>>>>>>>> site.blues { > >> >>>>>>>>>> jobManager=pbs > >> >>>>>>>>>> jobQueue=shared > >> >>>>>>>>>> maxJobs=4 > >> >>>>>>>>>> jobGranularity=1 > >> >>>>>>>>>> maxNodesPerJob=1 > >> >>>>>>>>>> tasksPerWorker=16 > >> >>>>>>>>>> taskThrottle=64 > >> >>>>>>>>>> initialScore=10000 > >> >>>>>>>>>> jobWalltime=00:48:00 > >> >>>>>>>>>> taskWalltime=00:40:00 > >> >>>>>>>>>> workerloglevel=DEBUG # > >> >>>> Adding > >> >>>>>>>>>> debug for workers > >> >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >> >>>>>>>>>> directory on SFS > >> >>>>>>>>>> workdir=$RUNDIRECTORY > >> >>>>>>>>>> filesystem=local > >> >>>>>>>>>> } > >> >>>>>>>>>> > >> >>>>>>>>>> Thanks, > >> >>>>>>>>>> Yadu > >> >>>>>>>>>> > >> >>>>>>>>>> > >> >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >> >>>>>>>>>> > >> >>>>>>>>>>> Hi Mike, > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! > >> >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >> >>>> didn?t > >> >>>>>>>>>>> get the same issue. That is, the model run completed > >> >>>> successfully. > >> >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May > >> >> 29, > >> >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also > >> >>>>>>>>>>> including the swift.properties file that was used for the > >> >>>> blues > >> >>>>>>>>>>> runs below. > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> Thank you! > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> Jonathan > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > >> >>>> wrote: > >> >>>>>>>>>>> > >> >>>>>>>>>>>> Hi Jonathan, > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> (But the first question is: which Swift release, and can > >> >> you > >> >>>>>>>>>>>> point us to, or send, the full log file?) 
> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> Thanks and regards, > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> - Mike > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >> >>>>>>>>>>>> > >> >>>>>>>>>>>>> Hi Mike, > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > >> >> that > >> >>>>>>>>>>>>> the message came across. > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> Jonathan > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> Begin forwarded message: > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>>> From: Jonathan Ozik > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> To: Mihael Hategan , > >> >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Hi all, > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > >> >> on > >> >>>>>>>>>>>>>> Blues. The stdout includes exceptions like: > >> >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 > >> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >> >>>>>>>>>>>>>> java.io.IOException: Broken pipe > >> >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >> >>>>>>>>>>>>>> at > >> >>>>>>>>>>>>>> > >> >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >> >>>>>>>>>>>>>> at > >> >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >> >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >> >>>>>>>>>>>>>> at > >> >>>>>>>>>>>>>> > >> >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >> >>>>>>>>>>>>>> at > >> >>>>>>>>>>>>>> > >> >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >> >>>>>>>>>>>>>> at > >> >>>>>>>>>>>>>> > >> >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> These seem to occur at different parts of the submitted > >> >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > >> >> to > >> >>>>>>>>>>>>>> look at. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed > >> >>>> by > >> >>>>>>>>>>>>>> broken pipe errors: > >> >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >> >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > >> >>>> 0) > >> >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >> >>>>>>>>>>>>>> allocate large pages, falling back to regular pages > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > >> >> as > >> >>>>>>>>>>>>>> described here > >> >>>>>>>>>>>>>> > >> >>>> > >> >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >> >>>>>>>>>>>>>> Area: hotspot/gc > >> >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead > >> >> to > >> >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > >> >> issue > >> >>>>>>>>>>>>>> can be recognized in two ways: > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> ? 
Before the crash happens one or more lines similar to > >> >>>> this > >> >>>>>>>>>>>>>> will have been printed to the log: > >> >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > >> >>>> 0) > >> >>>>>>>>>>>>>> failed; > >> >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > >> >> allocate > >> >>>>>>>>>>>>>> large pages, falling back to regular pages > >> >>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > >> >>>>>>>>>>>>>> similar to this: > >> >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times > >> >>>>>>>>>>>>>> The problem can be avoided by running with large page > >> >>>>>>>>>>>>>> support turned off, for example by passing the > >> >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> See 8007074 (not public). > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > >> >> invocations > >> >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to > >> >> get > >> >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but > >> >> perhaps > >> >>>>>>>>>>>>>> that was just a coincidence. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Jonathan > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> _______________________________________________ > >> >>>>>>>>>>>>>> Swift-user mailing list > >> >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu > >> >>>>>>>>>>>>>> > >> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>> -- > >> >>>>>>>>>>>> Michael Wilde > >> >>>>>>>>>>>> Mathematics and Computer Science Computation > >> >>>> Institute > >> >>>>>>>>>>>> Argonne National Laboratory The University of > >> >>>> Chicago > >> >>>>>>>>>>> > >> >>>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>> > >> >>>>>>> > >> >>>>>> > >> >>>>>> > >> >>>>> > >> >>>> > >> >>>> > >> >>> > >> >>> > >> >> > >> >> > >> > > >> > > >> > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >> > >> > >> > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > -- > > Michael Wilde > > Mathematics and Computer Science Computation Institute > > Argonne National Laboratory The University of Chicago > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From wilde at anl.gov Thu Jul 31 17:14:42 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 31 Jul 2014 17:14:42 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <1406844704.25609.4.camel@echo> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> <53DAA4AD.3010602@anl.gov> 
<1406844704.25609.4.camel@echo> Message-ID: <53DABFD2.4030602@anl.gov> Im also wondering - did these jobs run more child processes that they should have, under a shared-queue policy? I thought the shared queue on blues (and fusion) allowed multiple single-processor jobs to run per node, one processor per job? Or, are we stumbling back into an old bug which ran N^2 tasks per node instead of N tasks per node? - MIke On 7/31/14, 5:11 PM, Mihael Hategan wrote: > The worker is still being killed though, and there is little feedback to > the user. If possible, we should fix that. > > Do you have an idea what is happening? Is it a process similar to an OOM > where the kernel just decides to kill some process? > > Mihael > > On Thu, 2014-07-31 at 16:42 -0500, Jonathan Ozik wrote: >> Mike, all, >> >> I?ve done a few more experiments and realized that while I had added the -XX:-UseLargePages command line argument to the main Java invocation, each app also included a post process Java app which I hadn?t added the UseLargePages argument to. Perhaps because this was running at the end of an app, the warnings that I?d seen previously prior to the crashes weren?t making it to stdout before the crashes. Purely speculation. >> In any case, I did put the tasksPerWorker back to 16 and was able to successfully run all the jobs. I guess one bit of consolation in all this is that if there are any future ?mystery? worker failings, this will be one issue to check. >> >> Thank you all for helping with tracking this down. >> >> Jonathan >> >> On Jul 31, 2014, at 3:18 PM, Michael Wilde wrote: >> >>> Its odd that no errors like OOM or OOT would be logged to stdout of the PBS job. >>> >>> Jonathan, can you check with the Blues Sysadmins if they have any other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which they ran? >>> >>> Thanks, >>> >>> - Mike >>> >>> On 7/31/14, 1:55 PM, David Kelly wrote: >>>> Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. >>>> >>>> >>>> On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: >>>> Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. >>>> >>>> Jonathan >>>> >>>> On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: >>>> >>>>> Ok, so the workers die while the jobs are running and not much else is >>>>> happening. >>>>> My money is on the apps eating up all RAM and the kernel killing the >>>>> worker. >>>>> >>>>> The question is how we check whether this is true or not. Ideas? >>>>> >>>>> Yadu, can you do me a favor and package all the PBS output files from >>>>> this run? >>>>> >>>>> Jonathan, can you see if you get the same errors with tasksPerWorker=8? >>>>> >>>>> Mihael >>>>> >>>>> On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: >>>>>> Sure thing, it?s attached below. >>>>>> >>>>>> Jonathan >>>>>> >>>>>> >>>>>> On Jul 31, 2014, at 12:09 PM, Mihael Hategan >>>>>> wrote: >>>>>> >>>>>>> Hi Jonathan, >>>>>>> >>>>>>> I can't see anything obvious in the worker logs, but they are pretty >>>>>>> large. 
Can you also post the swift log from this run? It would make >>>>>> it >>>>>>> easier to focus on the right time frame. >>>>>>> >>>>>>> Mihael >>>>>>> >>>>>>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: >>>>>>>> Hi all, >>>>>>>> >>>>>>>> I?m attaching the stdout and the worker logs below. >>>>>>>> >>>>>>>> Thanks for looking at these! >>>>>>>> >>>>>>>> Jonathan >>>>>>>> >>>>>>>> >>>>>>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Woops, sorry about that. It?s running now and the logs are being >>>>>>>> generated. Once the run is done I?ll send you log files. >>>>>>>>> Thanks! >>>>>>>>> >>>>>>>>> Jonathan >>>>>>>>> >>>>>>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan >>>>>>>> wrote: >>>>>>>>>> Right. This isn't your fault. We should, though, probably talk >>>>>>>> about >>>>>>>>>> addressing the issue. >>>>>>>>>> >>>>>>>>>> Mihael >>>>>>>>>> >>>>>>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: >>>>>>>>>>> Mihael, thanks for spotting that. I added the comments to >>>>>>>> highlight the >>>>>>>>>>> changes in email. >>>>>>>>>>> >>>>>>>>>>> -Yadu >>>>>>>>>>> >>>>>>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: >>>>>>>>>>>> Hi Jonathan, >>>>>>>>>>>> >>>>>>>>>>>> I suspect that the site config is considering the comment to be >>>>>>>> part of >>>>>>>>>>>> the value of the workerLogLevel property. We could confirm this >>>>>>>> if you >>>>>>>>>>>> send us the swift log from this particular run. >>>>>>>>>>>> >>>>>>>>>>>> To fix it, you could try to remove everything after DEBUG >>>>>>>> (including all >>>>>>>>>>>> horizontal white space). In other words: >>>>>>>>>>>> >>>>>>>>>>>> ... >>>>>>>>>>>> workerloglevel=DEBUG >>>>>>>>>>>> workerlogdirectory=/home/$USER/ >>>>>>>>>>>> ... >>>>>>>>>>>> >>>>>>>>>>>> Mihael >>>>>>>>>>>> >>>>>>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: >>>>>>>>>>>>> Hi Yadu, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker >>>>>>>> logging >>>>>>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m >>>>>> doing >>>>>>>>>>>>> something silly. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Jonathan >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji >>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Jonathan, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do >>>>>> not >>>>>>>> see >>>>>>>>>>>>>> anything unusual. >>>>>>>>>>>>>> >>>>>>>>>>>>>> From your logs, it looks like workers are failing, so getting >>>>>>>> worker >>>>>>>>>>>>>> logs would help. >>>>>>>>>>>>>> Could you try running on Blues with the following >>>>>>>> swift.properties >>>>>>>>>>>>>> and get us the worker*logs that would show up in the >>>>>>>>>>>>>> workerlogdirectory ? 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> site=blues >>>>>>>>>>>>>> >>>>>>>>>>>>>> site.blues { >>>>>>>>>>>>>> jobManager=pbs >>>>>>>>>>>>>> jobQueue=shared >>>>>>>>>>>>>> maxJobs=4 >>>>>>>>>>>>>> jobGranularity=1 >>>>>>>>>>>>>> maxNodesPerJob=1 >>>>>>>>>>>>>> tasksPerWorker=16 >>>>>>>>>>>>>> taskThrottle=64 >>>>>>>>>>>>>> initialScore=10000 >>>>>>>>>>>>>> jobWalltime=00:48:00 >>>>>>>>>>>>>> taskWalltime=00:40:00 >>>>>>>>>>>>>> workerloglevel=DEBUG # >>>>>>>> Adding >>>>>>>>>>>>>> debug for workers >>>>>>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging >>>>>>>>>>>>>> directory on SFS >>>>>>>>>>>>>> workdir=$RUNDIRECTORY >>>>>>>>>>>>>> filesystem=local >>>>>>>>>>>>>> } >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Yadu >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Mike, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sorry, I figured there was some busy-ness involved! >>>>>>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I >>>>>>>> didn?t >>>>>>>>>>>>>>> get the same issue. That is, the model run completed >>>>>>>> successfully. >>>>>>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May >>>>>> 29, >>>>>>>>>>>>>>> 2014). I?m including one of the log files below. I?m also >>>>>>>>>>>>>>> including the swift.properties file that was used for the >>>>>>>> blues >>>>>>>>>>>>>>> runs below. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thank you! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jonathan >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde >>>>>>>> wrote: >>>>>>>>>>>>>>>> Hi Jonathan, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I or one of the team will answer soon, on swift-user. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> (But the first question is: which Swift release, and can >>>>>> you >>>>>>>>>>>>>>>> point us to, or send, the full log file?) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks and regards, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Mike >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Mike, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure >>>>>> that >>>>>>>>>>>>>>>>> the message came across. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Jonathan >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Begin forwarded message: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> From: Jonathan Ozik >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, >>>>>>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> To: Mihael Hategan , >>>>>>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running >>>>>> on >>>>>>>>>>>>>>>>>> Blues. 
The stdout includes exceptions like: >>>>>>>>>>>>>>>>>> exception @ swift-int.k, line: 511 >>>>>>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>>>>>>>> java.io.IOException: Broken pipe >>>>>>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> >>>>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >>>>>>>>>>>>>>>>>> at >>>>>> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >>>>>>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> >>>>>>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> >>>>>>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> >>>>>>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >>>>>>>>>>>>>>>>>> These seem to occur at different parts of the submitted >>>>>>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like >>>>>> to >>>>>>>>>>>>>>>>>> look at. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed >>>>>>>> by >>>>>>>>>>>>>>>>>> broken pipe errors: >>>>>>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: >>>>>>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, >>>>>>>> 0) >>>>>>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot >>>>>>>>>>>>>>>>>> allocate large pages, falling back to regular pages >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 >>>>>> as >>>>>>>>>>>>>>>>>> described here >>>>>>>>>>>>>>>>>> >>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >>>>>>>>>>>>>>>>>> Area: hotspot/gc >>>>>>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead >>>>>> to >>>>>>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the >>>>>> issue >>>>>>>>>>>>>>>>>> can be recognized in two ways: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ? Before the crash happens one or more lines similar to >>>>>>>> this >>>>>>>>>>>>>>>>>> will have been printed to the log: >>>>>>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, >>>>>>>> 0) >>>>>>>>>>>>>>>>>> failed; >>>>>>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot >>>>>> allocate >>>>>>>>>>>>>>>>>> large pages, falling back to regular pages >>>>>>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line >>>>>>>>>>>>>>>>>> similar to this: >>>>>>>>>>>>>>>>>> Large page allocation failures have occurred 3 times >>>>>>>>>>>>>>>>>> The problem can be avoided by running with large page >>>>>>>>>>>>>>>>>> support turned off, for example by passing the >>>>>>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> See 8007074 (not public). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the >>>>>> invocations >>>>>>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to >>>>>> get >>>>>>>>>>>>>>>>>> rid of the warning and the crashes for a while, but >>>>>> perhaps >>>>>>>>>>>>>>>>>> that was just a coincidence. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Jonathan >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>>>>>>>> >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Michael Wilde >>>>>>>>>>>>>>>> Mathematics and Computer Science Computation >>>>>>>> Institute >>>>>>>>>>>>>>>> Argonne National Laboratory The University of >>>>>>>> Chicago >>>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>> >>>> >>>> >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>> -- >>> Michael Wilde >>> Mathematics and Computer Science Computation Institute >>> Argonne National Laboratory The University of Chicago >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From jozik at anl.gov Thu Jul 3 14:36:02 2014 From: jozik at anl.gov (Ozik, Jonathan) Date: Thu, 03 Jul 2014 19:36:02 -0000 Subject: [Swift-user] Swift 0.95 RC6 on Windows Message-ID: <97A406BA-6E0C-4F2E-AEAA-F9F0D5E9DCDE@anl.gov> Hi all, I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven?t been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: Swift 0.95 RC6 swift-r7900 cog-r3908 RunID: 20140703-1433-7iq5x697 Progress: Thu, 03 Jul 2014 14:33:49-0500 Execution failed: Exception in echo: Arguments: [hi] Host: localhost Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl exception @ swift-int.k, line: 530 Caused by: null Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid Any thoughts on this? Jonathan From jozik at anl.gov Thu Jul 31 10:41:39 2014 From: jozik at anl.gov (Ozik, Jonathan) Date: Thu, 31 Jul 2014 15:41:39 -0000 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DA5020.7050402@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53DA4EEE.5010800@anl.gov> <53DA5020.7050402@anl.gov> Message-ID: <85241D2A-C5EC-4538-9D74-9E8439F60C0C@anl.gov> Thank you Mike. Regarding the location of the ER files, to reduce variables the last few runs were done with 0.95-RC6. Jonathan On Jul 31, 2014, at 9:18 AM, Michael Wilde wrote: > I see this from PBS in your home dir: > > blues$ cat 583937.bmgt1.lcrc.anl.gov.ER > Use of uninitialized value $s in concatenation (.) 
or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > blues$ > > That looks to me like a Swift bug in worker.pl > > We'll look into this angle. > > Also I'm curious why these files are not going into your run dir (but > perhaps thats because youre running an older trunk release, not 0.95? > Or, thats a separate 0.95 bug). > > - Mike > > On 7/31/14, 9:13 AM, Michael Wilde wrote: >> Some discussion and diagnosis of this incident has taken place off list. >> >> In a quick scan of the worker logs, I don't spot an obvious error that >> would cause workers to exit. >> Hopefully others on the Swift team can check those as well. >> >> Jonathan, do you have stdout/err files from the PBS scheduler on blues, >> in your runNNN log dirs? >> >> If so, can you point us to them? >> >> Thanks, >> >> - Mike >> >> On 7/29/14, 8:56 PM, Jonathan Ozik wrote: >>> Hi all, >>> >>> I?m getting spurious errors in the jobs that I?m running on Blues. The stdout includes exceptions like: >>> exception @ swift-int.k, line: 511 >>> Caused by: Block task failed: Connection to worker lost >>> java.io.IOException: Broken pipe >>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >>> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >>> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >>> >>> These seem to occur at different parts of the submitted jobs. Let me know if there?s a log file that you?d like to look at. >>> >>> In earlier attempts I was getting these warnings followed by broken pipe errors: >>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages >>> >>> Apparently that?s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >>> Area: hotspot/gc >>> Synopsis: Crashes due to failure to allocate large pages. >>> >>> On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways: >>> >>> ? Before the crash happens one or more lines similar to this will have been printed to the log: >>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; >>> error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages >>> ? If a hs_err file is generated it will contain a line similar to this: >>> Large page allocation failures have occurred 3 times >>> The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary. >>> >>> See 8007074 (not public). >>> >>> So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence. 
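Since the useful evidence in this case ended up scattered across the PBS .ER/.OU files in the home directory and the coaster worker logs, a small sketch for sweeping them for the symptoms mentioned in this thread. The file globs are guesses based on the names quoted above; adjust them to the actual log locations:

    # look for large-page failures, broken pipes, OOM messages and the
    # worker.pl "uninitialized value" warning in one pass
    for f in "$HOME"/*.ER "$HOME"/*.OU "$HOME"/worker-*.log; do
        grep -Hn -iE 'cannot allocate|large pages|broken pipe|out of memory|uninitialized value' "$f" 2>/dev/null
    done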
>>> >>> Jonathan >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user