From jozik at uchicago.edu Thu Jul 3 17:18:21 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 3 Jul 2014 17:18:21 -0500 Subject: [Swift-user] Swift 0.95 RC6 on Windows Message-ID: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> Hi all, I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven't been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: Swift 0.95 RC6 swift-r7900 cog-r3908 RunID: 20140703-1433-7iq5x697 Progress: Thu, 03 Jul 2014 14:33:49-0500 Execution failed: Exception in echo: Arguments: [hi] Host: localhost Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl exception @ swift-int.k, line: 530 Caused by: null Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid Any thoughts on this? Jonathan From hategan at mcs.anl.gov Thu Jul 3 18:09:58 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 3 Jul 2014 16:09:58 -0700 Subject: [Swift-user] Swift 0.95 RC6 on Windows In-Reply-To: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> References: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> Message-ID: <1404428998.22636.2.camel@echo> Hi Jonathan, We need to update the quick start guide. There is a missing bit if you are running on Windows. The details are here: http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_running_on_windows That said, I don't think we routinely test swift on Windows, and I haven't maintained it, but I will give it a shot and see if it needs any fixes. Mihael On Thu, 2014-07-03 at 17:18 -0500, Jonathan Ozik wrote: > Hi all, > > I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven't been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: > > Swift 0.95 RC6 swift-r7900 cog-r3908 > RunID: 20140703-1433-7iq5x697 > Progress: Thu, 03 Jul 2014 14:33:49-0500 > > Execution failed: > Exception in echo: > Arguments: [hi] > Host: localhost > Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl > exception @ swift-int.k, line: 530 > Caused by: null > Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid > Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid > > Any thoughts on this? > > Jonathan > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Thu Jul 3 21:13:35 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 3 Jul 2014 19:13:35 -0700 Subject: [Swift-user] Swift 0.95 RC6 on Windows In-Reply-To: <1404428998.22636.2.camel@echo> References: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> <1404428998.22636.2.camel@echo> Message-ID: <1404440015.24691.4.camel@echo> Hi again, There were two problems: One was compiling swift on Windows (i.e. running "ant dist" in the swift directory on a windows machine).
This is now fixed, but should not affect you if you are using a binary distribution. The second was an out-of-date _swiftwrap.vbs. I fixed this also and committed to SVN. You can either compile 0.95 from sources, or, alternatively, remove the following lines from libexec/_swiftwrap.vbs: expectArg("k") KICKSTART=getOptArg() Things should work after that. Regarding the new config mechanism and sysinfo="INTEL32::WINDOWS", on a cursory look at the code, it does not seem to be supported. You will have to use the old mechanism (i.e. sites.xml) for now. Mihael On Thu, 2014-07-03 at 16:09 -0700, Mihael Hategan wrote: > Hi Jonathan, > > We need to update the quick start guide. There is a missing bit if you > are running on Windows. The details are here: > http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_running_on_windows > > That said, I don't think we routinely test swift on Windows, and I > haven't maintained it, but I will give it a shot and see if it needs any > fixes. > > Mihael > > On Thu, 2014-07-03 at 17:18 -0500, Jonathan Ozik wrote: > > Hi all, > > > > I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven?t been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: > > > > Swift 0.95 RC6 swift-r7900 cog-r3908 > > RunID: 20140703-1433-7iq5x697 > > Progress: Thu, 03 Jul 2014 14:33:49-0500 > > > > Execution failed: > > Exception in echo: > > Arguments: [hi] > > Host: localhost > > Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl > > exception @ swift-int.k, line: 530 > > Caused by: null > > Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid > > Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid > > > > Any thoughts on this? > > > > Jonathan > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > From jozik at uchicago.edu Thu Jul 3 21:35:10 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 3 Jul 2014 21:35:10 -0500 Subject: [Swift-user] Swift 0.95 RC6 on Windows In-Reply-To: <1404440015.24691.4.camel@echo> References: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> <1404428998.22636.2.camel@echo> <1404440015.24691.4.camel@echo> Message-ID: Thanks Mihael! I'll pass this on. Do you happen to remember if both the new and old configuration files can be used together? I mean including both -properties and -sites command line arguments. Jonathan > On Jul 3, 2014, at 9:13 PM, Mihael Hategan wrote: > > Hi again, > > There were two problems: > > One was compiling swift on Windows (i.e. running "ant dist" in the swift > directory on a windows machine). This is now fixed, but should not > affect you if you are using a binary distribution. > > The second was an out-of-date _swiftwrap.vbs. I fixed this also and > committed to SVN. You can either compile 0.95 from sources, or, > alternatively, remove the following lines from libexec/_swiftwrap.vbs: > > expectArg("k") > KICKSTART=getOptArg() > > Things should work after that. > > Regarding the new config mechanism and sysinfo="INTEL32::WINDOWS", on a > cursory look at the code, it does not seem to be supported. 
You will > have to use the old mechanism (i.e. sites.xml) for now. > > Mihael > > >> On Thu, 2014-07-03 at 16:09 -0700, Mihael Hategan wrote: >> Hi Jonathan, >> >> We need to update the quick start guide. There is a missing bit if you >> are running on Windows. The details are here: >> http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_running_on_windows >> >> That said, I don't think we routinely test swift on Windows, and I >> haven't maintained it, but I will give it a shot and see if it needs any >> fixes. >> >> Mihael >> >>> On Thu, 2014-07-03 at 17:18 -0500, Jonathan Ozik wrote: >>> Hi all, >>> >>> I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven?t been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: >>> >>> Swift 0.95 RC6 swift-r7900 cog-r3908 >>> RunID: 20140703-1433-7iq5x697 >>> Progress: Thu, 03 Jul 2014 14:33:49-0500 >>> >>> Execution failed: >>> Exception in echo: >>> Arguments: [hi] >>> Host: localhost >>> Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl >>> exception @ swift-int.k, line: 530 >>> Caused by: null >>> Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid >>> Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid >>> >>> Any thoughts on this? >>> >>> Jonathan >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > From hategan at mcs.anl.gov Thu Jul 3 22:46:53 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 3 Jul 2014 20:46:53 -0700 Subject: [Swift-user] Swift 0.95 RC6 on Windows In-Reply-To: References: <98581826-0010-422C-B4F5-9CAB8A53B1B3@uchicago.edu> <1404428998.22636.2.camel@echo> <1404440015.24691.4.camel@echo> Message-ID: <1404445613.25475.3.camel@echo> On Thu, 2014-07-03 at 21:35 -0500, Jonathan Ozik wrote: > Thanks Mihael! > I'll pass this on. Do you happen to remember if both the new and old configuration files can be used together? I mean including both -properties and -sites command line arguments. Probably not. The new mechanism generates a sites file, and you can only feed swift one sites file. In a pinch, you could hack bin/swiftrun which generates the sites file from the properties file. Line 75 might be of particular interest. Although we will most likely have a fix for this issue soon. 
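To make the distinction concrete, the two invocation styles being discussed look roughly like this (the option names below follow the ones mentioned in this thread and may not match your release exactly; check swift -help, e.g. older releases spell the first one -sites.file):

  # old mechanism: pass an explicit sites file
  swift -sites sites.xml hello.swift

  # new mechanism: describe the site in a single properties file
  swift -properties swift.properties hello.swift

As noted above, the two cannot currently be combined, since the properties mechanism generates its own sites file internally and swift only accepts one sites file.
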
Mihael From iraicu at cs.iit.edu Thu Jul 10 21:10:27 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Thu, 10 Jul 2014 21:10:27 -0500 Subject: [Swift-user] CFP: 7th IEEE Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS) 2014 @ IEEE/ACM SC14 Message-ID: <53BF4793.7030304@cs.iit.edu> Call for Papers --------------------------------------------------------------------------------------- The 7th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS) 2014 http://datasys.cs.iit.edu/events/MTAGS14/ --------------------------------------------------------------------------------------- November 16th, 2014 New Orleans, Louisiana, USA Co-located with with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC14) In cooperation with ACM SIGHPC ======================================================================================= The 7th workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of the workshop encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions on all theoretical, simulations, and systems topics related to MTC, but we give special consideration to papers addressing petascale to exascale challenges. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (pending approval). The workshop will be co-located with the IEEE/ACM Supercomputing 2014 Conference in New Orleans on November 17th, 2014. For more information, please see http://datasys.cs.iit.edu/events/MTAGS14/. For more information on past workshops, please see MTAGS13, MTAGS12, MTAGS11, MTAGS10, MTAGS09, and MTAGS08. We also ran a Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS) which appeared in June 2011; the proceedings can be found online at http://www.computer.org/portal/web/csdl/abs/trans/td/2011/06/ttd201106toc.htm. We, the workshop organizers, also published a highly relevant paper that defines Many-Task Computing which was published in MTAGS08, titled ?Many-Task Computing for Grids and Supercomputers?; we encourage potential authors to read this paper, and to clearly articulate in your paper submissions how your papers are related to Many-Task Computing. Topics --------------------------------------------------------------------------------------- We invite the submission of original work that is related to the topics below. The papers should be 6 pages, including all figures and references. We aim to cover topics related to Many-Task Computing on each of the three major distributed systems paradigms, Cloud Computing, Grid Computing and Supercomputing. 
Topics of interest include: * Compute Resource Management * Scheduling * Job execution frameworks * Local resource manager extensions * Performance evaluation of resource managers in use on large scale systems * Dynamic resource provisioning * Techniques to manage many-core resources and/or GPUs * Challenges and opportunities in running many-task workloads on HPC systems * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Storage architectures and implementations * Distributed file systems * Parallel file systems * Distributed meta-data management * Content distribution systems for large data * Data caching frameworks and techniques * Data management within and across data centers * Data-aware scheduling * Data-intensive computing applications * Eventual-consistency storage usage and management * Programming models and tools * Map-reduce and its generalizations * Many-task computing middleware and applications * Parallel programming frameworks * Ensemble MPI techniques and frameworks * Service-oriented science applications * Large-Scale Workflow Systems * Workflow system performance and scalability analysis * Scalability of workflow systems * Workflow infrastructure and e-Science middleware * Programming Paradigms and Models * Large-Scale Many-Task Applications * High-throughput computing (HTC) applications * Data-intensive applications * Quasi-supercomputing applications, deployments, and experiences * Performance Evaluation * Performance evaluation * Real systems * Simulations * Reliability of large systems Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 6 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per ACM 8.5 x 11 manuscript guidelines; document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. The final 6 page papers in PDF format must be submitted online at https://cmt.research.microsoft.com/MTAGS2014/ before the deadline of August 25th, 2014 at 11:59PM PST (note that an abstract must be submitted 1 week prior to the deadline, on August 18th, 2014). Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (in cooperation with SIGHPC). Notifications of the paper decisions will be sent out by September 22nd, 2014. Accepted workshop papers will be eligible for additional post-conference publication as journal articles in the IEEE Transaction on Cloud Computing, Special Issue on Many-Task Computing in the Cloud (papers will be due in February 2015, file:///C:/Users/iraicu/Documents/Webs/DataSys/events/TCC-MTC15/index.html). Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please see http://datasys.cs.iit.edu/events/MTAGS14/. 
Important Dates --------------------------------------------------------------------------------------- * Abstract Due: August 18th, 2014 * Papers Due: August 25th, 2014 * Notification of Acceptance: September 22nd, 2014 * Camera Ready Papers Due: October 6th, 2014 * Workshop Date: November 16th, 2014 Committee Members --------------------------------------------------------------------------------------- Workshop Chairs * Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory * Justin Wozniak, University of Chicago & Argonne National Laboratory * Ian Foster, University of Chicago & Argonne National Laboratory * Yong Zhao, University of Electronic Science and Technology of China Steering Committee * David Abramson, Monash University, Australia * Jack Dongarra, University of Tennessee, USA * Geoffrey Fox, Indiana University, USA * Manish Parashar, Rutgers University, USA * Marc Snir, University of Illinois at Urbana Champaign, USA * Xian-He Sun, Illinois Institute of Technology, USA * Weimin Zheng, Tsinghua University, China Technical Committee * Hasan Abbasi, Oak Ridge National Labs, USA * Tim Armstrong, University of Chicago, USA * Roger Barga, Microsoft, USA * Rajkumar Buyya University of Melbourne, Australia * Kyle Chard, University of Chicago, USA * Evangelinos Constantinos, Massachusetts Institute of Technology, USA * Catalin Dumitrescu, Fermi National Labs, USA * Haryadi Gunawi, University of Chicago, USA * Indranil Gupta, University of Illinois at Urbana Champaign, USA * Alexandru Iosup, Delft University of Technology, Netherlands * Florin Isaila, Universidad Carlos III de Madrid, Spain & Argonne National Laboratory, USA * Kamil Iskra, Argonne National Laboratory, USA * Daniel S. Katz, University of Chicago, USA * Jik-Soo Kim, Kristi, Korea * Scott A. Klasky, Oak Ridge National Labs, USA * Mike Lang, Los Alamos National Laboratory, USA * Tonglin Li, Illinois Institute of Technology, USA * Chris Moretti, Princeton University, USA * David O'Hallaron, Carnegie Mellon University, USA * Marlon Pierce, Indiana University, USA * Judy Qiu, Indiana University, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA * Matei Ripeanu, University of British Columbia, Canada * Wei Tang, Argonne National Laboratory, USA * Edward Walker, Whitworth University, USA * Ke Wang, Illinois Institute of Technology, USA * Matthew Woitaszek, Walmart Labs, USA * Rich Wolski, University of California, Santa Barbara, USA * Zhifeng Yun, University of Houston, USA * Ziming Zheng, University of Chicago, USA -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= From iraicu at cs.iit.edu Sat Jul 12 09:20:36 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Sat, 12 Jul 2014 09:20:36 -0500 Subject: [Swift-user] Call for Papers: IEEE Transactions on Cloud Computing - Special Issue on Many-Task Computing in the Cloud Message-ID: <53C14434.3050806@cs.iit.edu> Call for Papers --------------------------------------------------------------------------------------- IEEE Transaction on Cloud Computing Special Issue on Many-Task Computing in the Cloud http://datasys.cs.iit.edu/events/TCC-MTC15/ ======================================================================================= The Special Issue on Many-Task Computing (MTC) in the Cloud will provide the scientific community a dedicated forum, within the prestigious IEEE Transactions on Cloud Computing journal, for presenting new research, development, and deployment efforts of loosely coupled large scale applications on Cloud Computing infrastructure. MTC, the theme of this special issue, encompasses loosely coupled applications, which are generally composed of many tasks to achieve some larger application goal. This special issue will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file-system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions in theoretical, simulations, and systems topics with special consideration to papers addressing the intersection of petascale/exascale challenges with large-scale cloud computing. We seek submission of papers that present new, original and innovative ideas for the "first" time in TCC (Transactions on Cloud Computing). That means, submission of "extended versions" of already published works (e.g., conference/workshop papers) is not encouraged unless they contain significant number of "new and original" ideas/contributions along with more than 49% brand "new" material. For more information on past workshops and special issues on Many-Task Computing, see http://datasys.cs.iit.edu/events/MTAGS/index.html. We ran a Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS) which appeared in June 2011; the proceedings can be found online at http://www.computer.org/portal/web/csdl/abs/trans/td/2011/06/ttd201106toc.htm. 
The special issue editors also published a highly relevant paper that defines Many-Task Computing which was published in the inaugural MTAGS08 workshop, titled "Many-Task Computing for Grids and Supercomputers"; we encourage potential authors to read this paper, and to clearly articulate in your paper submissions how your papers are related to Many-Task Computing. For more information on this special issue, please see http://datasys.cs.iit.edu/events/TCC-MTC15/. Topics --------------------------------------------------------------------------------------- We seek submission of papers that present new, original and innovative ideas for the "first" time in TCC (Transactions on Cloud Computing). That means, submission of "extended versions" of already published works (e.g., conference/workshop papers) will only be encouraged if they contain significant number of "new and original" ideas/contributions along with more than 49% brand "new" material. TCC expects submissions to be complete in all respects including author names, affiliation, bios etc. Manuscript should be 14 double column pages (all regular paper page limits include references and author biographies). We aim to cover topics related to Many-Task Computing and Cloud Computing. Topics of interest include: * Compute Resource Management * Scheduling * Job execution frameworks * Local resource manager extensions * Performance evaluation of resource managers in use on large scale systems * Dynamic resource provisioning * Techniques to manage many-core resources and/or GPUs * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Storage architectures and implementations * Distributed file systems * Parallel file systems * Distributed meta-data management * Content distribution systems for large data * Data caching frameworks and techniques * Data management within and across data centers * Data-aware scheduling * Data-intensive computing applications * Eventual-consistency storage usage and management * Programming models and tools * Map-reduce and its generalizations * Many-task computing middleware and applications * Parallel programming frameworks * Ensemble MPI techniques and frameworks * Service-oriented science applications * Large-Scale Workflow Systems * Workflow system performance and scalability analysis * Scalability of workflow systems * Workflow infrastructure and e-Science middleware * Programming Paradigms and Models * Large-Scale Many-Task Applications * High-throughput computing (HTC) applications * Data-intensive applications * Quasi-supercomputing applications, deployments, and experiences * Performance Evaluation * Performance evaluation * Real systems * Simulations * Reliability of large systems Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit unpublished and original work to the IEEE Transactions on Cloud Computing (TCC), Special Issue on Many-Task Computing in the Cloud. If the paper is extended from an initial work, the submission must contain at least 50% new material that can be qualified as ?brand? new ideas and results. The paper must be in the IEEE TCC format, namely 14 double-column pages or 30 single-column pages (Note: All regular paper page limits include references and author biographies). Please note that the double-column format will translate more readily into the final publication format. A double-column page is defined as a 7.875??10.75? 
page with 10-point type, 12-point vertical spacing, and 0.5 inch margins. A single-column page is defined as an 8.5??11? page with 12-point type and 24-point vertical spacing, containing approximately 250 words. All of the margins should be one inch (top, bottom, right and left). These length limits are taking into account reasonably-sized figures and references. Papers must be submitted using the submission system: https://mc.manuscriptcentral.com/tcc-cs, by selecting the special issue option ?SI-MTC?. For more information, please see http://datasys.cs.iit.edu/events/TCC-MTC15/. Important Dates --------------------------------------------------------------------------------------- * Abstract Due: February 2nd, 2015 * Papers Due: February 9th, 2015 * First round decisions: May 18th, 2015 * Major Revisions if needed: July 18th, 2015 * Final decisions: August 18th, 2015 * Publication Date: Fall 2015 (may vary depending on production queue) Guest Editors --------------------------------------------------------------------------------------- * Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory * Justin Wozniak, University of Chicago & Argonne National Laboratory * Ian Foster, University of Chicago & Argonne National Laboratory * Yong Zhao, University of Electronic Science and Technology of China -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= From iraicu at cs.iit.edu Tue Jul 15 09:19:04 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Tue, 15 Jul 2014 09:19:04 -0500 Subject: [Swift-user] Call for Papers: IEEE Transactions on Cloud Computing - Special Issue on Scientific Cloud Computing (deadline Jul 31, 2014) Message-ID: <53C53858.9000904@cs.iit.edu> Dear colleagues, This is just a friendly reminder about the upcoming deadline (July 31st, 2014) for the special issue on Scientific Cloud Computing. ------------------------------------------------------------------------------- Call for Papers IEEE Transactions on Cloud Computing Special Issue on Scientific Cloud Computing http://datasys.cs.iit.edu/events/ScienceCloud2014-TCC/ ------------------------------------------------------------------------------- IMPORTANT DATES Paper Submissions Due: July 31, 2014 First Round Decision: September 30,2014 Major Revisions Due (if necessary): October 31, 2014 Final Decision: December 1, 2014 Journal Publication: TBD ------------------------------------------------------------------------------- OVERVIEW Computational and Data-Driven Sciences have become the third and fourth pillar of scientific discovery in addition to experimental and theoretical sciences. 
Scientific Computing has already begun to change how science is done, enabling scientific breakthroughs through new kinds of experiments that would have been impossible only a decade ago. It is the key to solving "grand challenges" in many domains and providing breakthroughs in new knowledge, and it comes in many shapes and forms: high-performance computing (HPC) which is heavily focused on compute-intensive applications; high-throughput computing (HTC) which focuses on using many computing resources over long periods of time to accomplish its computational tasks; many-task computing (MTC) which aims to bridge the gap between HPC and HTC by focusing on using many resources over short periods of time; and data-intensive computing which is heavily focused on data distribution, data-parallel execution, and harnessing data locality by scheduling of computations close to the data. Today's "Big Data" trend is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. Not surprisingly, it becomes increasingly difficult to design and operate large scale systems capable of addressing these grand challenges. This journal Special Issue on Scientific Cloud Computing in the IEEE Transaction on Cloud Computing will provide the scientific community a dedicated forum for discussing new research, development, and deployment efforts in running these kinds of scientific computing workloads on Cloud Computing infrastructures. This special issue will focus on the use of cloud-based technologies to meet new compute-intensive and data-intensive scientific challenges that are not well served by the current supercomputers, grids and HPC clusters. The special issue will aim to address questions such as: What architectural changes to the current cloud frameworks (hardware, operating systems, networking and/or programming models) are needed to support science? Dynamic information derived from remote instruments and coupled simulation, and sensor ensembles that stream data for real-time analysis are important emerging techniques in scientific and cyber-physical engineering systems. How can cloud technologies enable and adapt to these new scientific approaches dealing with dynamism? How are scientists using clouds? Are there scientific HPC/HTC/MTC workloads that are suitable candidates to take advantage of emerging cloud computing resources with high efficiency? Commercial public clouds provide easy access to cloud infrastructure for scientists. What are the gaps in commercial cloud offerings and how can they be adapted for running existing and novel eScience applications? What benefits exist by adopting the cloud model, over clusters, grids, or supercomputers? What factors are limiting clouds use or would make them more usable/efficient? ------------------------------------------------------------------------------- TOPICS The topics of interest are, but not limited to, the application of Cloud in scientific applications: ? Scientific application cases studies on Clouds ? Performance evaluation of Cloud technologies ? Fault tolerance and reliability in cloud systems ? Data-intensive workloads and tools on Clouds ? Programming models such as Map-Reduce ? Storage cloud architectures ? I/O and Data management in the Cloud ? Workflow and resource management in the Cloud ? NoSQL databases for scientific applications ? Data streaming and dynamic applications on Clouds ? Dynamic resource provisioning ? 
Many-Task Computing in the Cloud ? Application of cloud concepts in HPC environments ? Virtualized High performance parallel file systems ? Virtualized high performance I/O networks ? Virtualization and its Impact on Applications ? Distributed Operating Systems ? Many-core computing and accelerators in the Cloud ? Cloud security ------------------------------------------------------------------------------- SUBMISSION INSTRUCTIONS Authors are invited to submit papers with unpublished, original work to the IEEE Transactions on Cloud Computing, Special Issue on Scientific Cloud Computing. If the paper is extended from a workshop or conference paper, it must contain at least 50% new material with "brand" new ideas and results. The papers should not be longer than 14 double column pages in the IEEE TCC format. Papers should be submitted directly to TCC at https://mc.manuscriptcentral.com/tcc-cs, and "SI-ScienceCloud" should be selected. ------------------------------------------------------------------------------- ORGANIZERS ? Kate Keahey, University of Chicago & Argonne National Laboratory, USA ? Ioan Raicu, Illinois Institute of Technology & Argonne National Lab., USA ? Kyle Chard, University of Chicago & Argonne National Laboratory, USA ? Bogdan Nicolae, IBM Research, Ireland ------------------------------------------------------------------------------- CONTACT Email:sciencecloud2014-tcc-editors at datasys.cs.iit.edu Website:http://datasys.cs.iit.edu/events/ScienceCloud2014-TCC/ ---------------------- -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.iit.edu Wed Jul 23 16:23:56 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Wed, 23 Jul 2014 16:23:56 -0500 Subject: [Swift-user] CFP: 1st International Workshop on Collaborative methodologies to Accelerate Scientific Knowledge discovery in big data (CASK) 2014 @ IEEE BigData 2014 Message-ID: <53D027EC.70108@cs.iit.edu> Call for Papers ------------------------------------------------------------------------------- 1st International Workshop on Collaborative methodologies to Accelerate Scientific Knowledge discovery in big data (CASK) 2014 Oct 27-30, 2014, Washington DC, US In conjunction with: 2014 IEEE International Conference on Big Data (IEEE BigData 2014) http://bigscientificdata.org/cask14/ =============================================================================== Introduction ------------------------------------------------------------------------------- Big Data has become an increasingly important part of life, and has become a common buzzword used to describe many aspects of data intensive computing. One of the unique aspects that we see to this new age of science is in terms of new methods to collaborate and share ideas, data, and services. Service Oriented Architectures have become commonplace in Enterprise computing, but its role in scientific data has been often underplayed. Commonly scientists will write software for themselves, and occasionally share their programs with their colleagues. As computing moves to new levels of performance, by using accelerators and many cores, one must rethink how scientific codes are produced and move to new frameworks which promote collaboration. In this workshop we want researchers to discuss techniques, infrastructure, science drivers, and new ways to promote this new way of computing for scientific applications. We will want to address fundamental issues in workflows on and off large-scale high performance systems, clouds, IPads, and mobile devices. Our overall view is that we can accelerate the scientific knowledge discovery process by embracing new technologies where researchers can share codes, workflows, data, and ultimately knowledge. Topics of interest in this workshop will span from methods to share all aspects of code, data, and workflows. We will investigate the topic of how do you share Big Data, when data gets to extreme sizes. What new services need to be developed in order to promote Big Data for science and engineering aspects? How can we get researchers across the globe to develop shareable code and participate in the greater science community? Just like math was a common language that researchers could share and everyone understand, what are the new pieces of software which must be developed in order to ensure collaboration across the end to end lifecycle of scientific data? The workshop will provide a venue to show what has worked across different communities, and how to bring collaboration to new scientific communities. The workshop will bring together DOE and NSF researchers along with researchers in the enterprise to present papers, along with give invited talks. We will also conclude with a panel with many leading experts in the field. We will also feature one panel which will include experts in scientific and enterprise Big Data. 
Topics Topics of interest include, but are not limited to: ------------------------------------------------------------------------------- * Data at Rest (Storage) * Data in Motion (Data Streaming) * Storage Systems (Database, file systems) * Resource Management * Query and Search * Acquisition, Integrating, Cleaning and Best Practice * Privacy * Provenance * Algorithms * Analytics * Visualization * Near Real-time Decision Making * Data Fusion * Workflows * Programming Models (e.g. MapReduce, MPI, etc.) Important Dates ------------------------------------------------------------------------------- * Papers Due: September 1st, 2014 * Notification of Acceptance: September 20th, 2014 * Camera Ready Papers Due: October 5th, 2014 Paper Submission ------------------------------------------------------------------------------- Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 6 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter size (8.5 x 11) paper. The official language of the meeting is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Papers conforming to the above guidelines can be submitted through the CASK 2014 paper submission system (https://cmt.research.microsoft.com/CASK2014). Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding length limit, or not appropriately structured may also not be considered. Authors may contact the conference PC Chair for more information. The proceedings will be published through the IEEE Computer Society Press, USA and will be made online through the IEEE Digital Library. Selected papers from CASK 2014 will be invited to extend and submit to the Special Issue on Many-Task Computing in the Cloud in the IEEE Transaction on Cloud Computing (http://datasys.cs.iit.edu/events/TCC-MTC15/CFP_TCC-MTC15.pdf). Chairs and Committees ------------------------------------------------------------------------------- Workshop Co-Chairs: * Chen Jin (Palantir Technology) * Ioan Raicu (Illinois Institute of Technology) * Scott Klasky (Oak Ridge National Laboratory) Program Co-Chairs: * Raju Vatsavai (North Carolina State Univ. & Oak Ridge National Lab.) * Judy Qiu (Indiana University) * George Ostrouchov (Oak Ridge National Laboratory) * Tahsin Kurc (Stony Brook University) * Daniel S. 
Katz (University of Chicago) * Bogdan Nicolae (IBM Research) * Doug Thain (University of Notre Dame) * Josh Wills (Cloudera) * Zhengzhang Chen (NEC Labs) * Kunpeng Zhang (University of Illinois at Chicago) -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.iit.edu Fri Jul 25 11:38:58 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 25 Jul 2014 11:38:58 -0500 Subject: [Swift-user] CFP: International Symposium on Big Data Computing (BDC) 2014 Message-ID: <53D28822.1020605@cs.iit.edu> Call for Papers IEEE/ACM International Symposium on Big Data Computing (BDC) 2014 December 8-11, 2014, London, UK http://www.cloudbus.org/bdc2014 In conjunction with: 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014) Sponsored by: IEEE Computer Society and ACM (Association for Computing Machinery) Introduction =============================================================================== Rapid advances in digital sensors, networks, storage, and computation along with their availability at low cost is leading to the creation of huge collections of data -- dubbed as Big Data. This data has the potential for enabling new insights that can change the way business, science, and governments deliver services to their consumers and can impact society as a whole. This has led to the emergence of the Big Data Computing paradigm focusing on sensing, collection, storage, management and analysis of data from variety of sources to enable new value and insights. To realize the full potential of Big Data Computing, we need to address several challenges and develop suitable conceptual and technological solutions for dealing them. These include life-cycle management of data, large-scale storage, flexible processing infrastructure, data modelling, scalable machine learning and data analysis algorithms, techniques for sampling and making trade-off between data processing time and accuracy, and dealing with privacy and ethical issues involved in data sensing, storage, processing, and actions. The IEEE/ACM International Symposium on Big Data Computing (BDC) 2014 -- held in conjunction with 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014), December 8-11, 2014, London, UK, aims at bringing together international researchers, developers, policy makers, and users and to provide an international forum to present leading research activities, technical solutions, and results on a broad range of topics related to Big Data Computing paradigms, platforms and their applications. 
The conference features keynotes, technical presentations, posters, workshops, tutorials, as well as competitions featuring live demonstrations. Topics =============================================================================== Topics of interest include, but are not limited to: I. Big Data Science * Analytics * Algorithms for Big Data * Energy-efficient Algorithms * Big Data Search * Big Data Acquisition, Integration, Cleaning, and Best Practices * Visualization of Big Data II. Big Data Infrastructures and Platforms * Programming Systems * Cyber-Infrastructure * Performance evaluation * Fault tolerance and reliability * I/O and Data management * Storage Systems (including file systems, NoSQL, and RDBMS) * Resource management * Many-Task Computing * Many-core computing and accelerators III. Big Data Security and Policy * Management Policies * Data Privacy * Data Security * Big Data Archival and Preservation * Big Data Provenance IV. Big Data Applications * Scientific application cases studies on Cloud infrastructure * Big Data Applications at Scale * Experience Papers with Big Data Application Deployments * Data streaming applications * Big Data in Social Networks * Healthcare Applications * Enterprise Applications IMPORTANT DATES =============================================================================== * Papers Due: September 15th, 2014 * Notification of Acceptance: October 15th, 2014 * Camera Ready Papers Due: October 31st, 2014 PAPER SUBMISSION =============================================================================== Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 10 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter size (8.5 x 11) paper. The official language of the meeting is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Papers conforming to the above guidelines can be submitted through the BDC 2014 paper submission system (https://www.easychair.org/conferences/?conf=bdc2014). Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding length limit, or not appropriately structured may also not be considered. Authors may contact the conference PC Chair for more information. The proceedings will be published through the IEEE Computer Society Press, USA and will be made online through the IEEE Digital Library. 
Selected papers from BDC 2014 will be invited to extend and submit to the Special Issue on Many-Task Computing in the Cloud in the IEEE Transaction on Cloud Computing (http://datasys.cs.iit.edu/events/TCC-MTC15/CFP_TCC-MTC15.pdf) CHAIRS & COMMITTEES =============================================================================== General Co-Chairs: * Rajkumar Buyya, University of Melbourne, Australia * Divyakant Agrawal, University of California at Santa Barbara, USA Program Co-Chairs: * Ioan Raicu, Illinois Institute of Technology and Argonne National Lab., USA * Manish Parashar, Rutgers, The State University of New Jersey, USA Area Track Co-Chairs: * Big Data Science o TBA * Data Infrastructures and Platforms o Amy Apon, Clemson University, USA o Jiannong Cao, Hong Kong Polytechnic University * Big Data Security and Policy o Bogdan Carbunar, Florida International University * Big Data Applications o Dennis Gannon, Microsoft Research, USA Cyber Chair * Amir Vahid, University of Melbourne, Australia Publicity Chairs * Carlos Westphall, Federal University of Santa Catarina, Brazil * Ching-Hsien Hsu, Chung Hua Univ., Taiwan & Tianjin Univ. of Technology, China * Rong Ge, Marquette University, USA Organizing Chair: * Ashiq Anjum, University of Derby, UK -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From jozik at uchicago.edu Tue Jul 29 20:56:28 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Tue, 29 Jul 2014 20:56:28 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost Message-ID: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> Hi all, I'm getting spurious errors in the jobs that I'm running on Blues. The stdout includes exceptions like: exception @ swift-int.k, line: 511 Caused by: Block task failed: Connection to worker lost java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) These seem to occur at different parts of the submitted jobs. Let me know if there's a log file that you'd like to look at.
In earlier attempts I was getting these warnings followed by broken pipe errors: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages Apparently that's a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): Area: hotspot/gc Synopsis: Crashes due to failure to allocate large pages. On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways: * Before the crash happens one or more lines similar to this will have been printed to the log: os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages * If a hs_err file is generated it will contain a line similar to this: Large page allocation failures have occurred 3 times The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary. See 8007074 (not public). So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence. Jonathan From wilde at anl.gov Thu Jul 31 09:13:02 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 31 Jul 2014 09:13:02 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> Message-ID: <53DA4EEE.5010800@anl.gov> Some discussion and diagnosis of this incident has taken place off list. In a quick scan of the worker logs, I don't spot an obvious error that would cause workers to exit. Hopefully others on the Swift team can check those as well. Jonathan, do you have stdout/err files from the PBS scheduler on blues, in your runNNN log dirs? If so, can you point us to them? Thanks, - Mike On 7/29/14, 8:56 PM, Jonathan Ozik wrote: > Hi all, > > I'm getting spurious errors in the jobs that I'm running on Blues. The stdout includes exceptions like: > exception @ swift-int.k, line: 511 > Caused by: Block task failed: Connection to worker lost > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:65) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > > These seem to occur at different parts of the submitted jobs. Let me know if there's a log file that you'd like to look at.
> > In earlier attempts I was getting these warnings followed by broken pipe errors: > Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages > > Apparently that?s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > Area: hotspot/gc > Synopsis: Crashes due to failure to allocate large pages. > > On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways: > > ? Before the crash happens one or more lines similar to this will have been printed to the log: > os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; > error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages > ? If a hs_err file is generated it will contain a line similar to this: > Large page allocation failures have occurred 3 times > The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary. > > See 8007074 (not public). > > So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence. > > Jonathan > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From wilde at anl.gov Thu Jul 31 09:18:08 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 31 Jul 2014 09:18:08 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DA4EEE.5010800@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53DA4EEE.5010800@anl.gov> Message-ID: <53DA5020.7050402@anl.gov> I see this from PBS in your home dir: blues$ cat 583937.bmgt1.lcrc.anl.gov.ER Use of uninitialized value $s in concatenation (.) or string at /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. Use of uninitialized value $s in concatenation (.) or string at /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. blues$ That looks to me like a Swift bug in worker.pl We'll look into this angle. Also I'm curious why these files are not going into your run dir (but perhaps thats because youre running an older trunk release, not 0.95? Or, thats a separate 0.95 bug). - Mike On 7/31/14, 9:13 AM, Michael Wilde wrote: > Some discussion and diagnosis of this incident has taken place off list. > > In a quick scan of the worker logs, I don't spot an obvious error that > would cause workers to exit. > Hopefully others on the Swift team can check those as well. > > Jonathan, do you have stdout/err files from the PBS scheduler on blues, > in your runNNN log dirs? > > If so, can you point us to them? > > Thanks, > > - Mike > > On 7/29/14, 8:56 PM, Jonathan Ozik wrote: >> Hi all, >> >> I?m getting spurious errors in the jobs that I?m running on Blues. 
The stdout includes exceptions like: >> exception @ swift-int.k, line: 511 >> Caused by: Block task failed: Connection to worker lost >> java.io.IOException: Broken pipe >> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >> >> These seem to occur at different parts of the submitted jobs. Let me know if there?s a log file that you?d like to look at. >> >> In earlier attempts I was getting these warnings followed by broken pipe errors: >> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages >> >> Apparently that?s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >> Area: hotspot/gc >> Synopsis: Crashes due to failure to allocate large pages. >> >> On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways: >> >> ? Before the crash happens one or more lines similar to this will have been printed to the log: >> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; >> error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages >> ? If a hs_err file is generated it will contain a line similar to this: >> Large page allocation failures have occurred 3 times >> The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary. >> >> See 8007074 (not public). >> >> So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence. >> >> Jonathan >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From hategan at mcs.anl.gov Thu Jul 31 12:09:15 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 10:09:15 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> Message-ID: <1406826555.22289.4.camel@echo> Hi Jonathan, I can't see anything obvious in the worker logs, but they are pretty large. Can you also post the swift log from this run? It would make it easier to focus on the right time frame. Mihael On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > Hi all, > > I?m attaching the stdout and the worker logs below. > > Thanks for looking at these! 
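For anyone following the request above, a minimal way to bundle the relevant files for a single run; the names are illustrative (Swift writes a per-run log such as run014.log alongside the runNNN directory, and the worker logs land wherever workerlogdirectory points):

    # Package the Swift run log together with the coaster worker logs
    # so both can be posted for the same time frame.
    tar czf run014-debug.tar.gz run014.log ~/worker-*.log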
> > Jonathan > > > On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > wrote: > > > Woops, sorry about that. It?s running now and the logs are being > generated. Once the run is done I?ll send you log files. > > > > Thanks! > > > > Jonathan > > > > On Jul 31, 2014, at 12:12 AM, Mihael Hategan > wrote: > > > >> Right. This isn't your fault. We should, though, probably talk > about > >> addressing the issue. > >> > >> Mihael > >> > >> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>> Mihael, thanks for spotting that. I added the comments to > highlight the > >>> changes in email. > >>> > >>> -Yadu > >>> > >>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>> Hi Jonathan, > >>>> > >>>> I suspect that the site config is considering the comment to be > part of > >>>> the value of the workerLogLevel property. We could confirm this > if you > >>>> send us the swift log from this particular run. > >>>> > >>>> To fix it, you could try to remove everything after DEBUG > (including all > >>>> horizontal white space). In other words: > >>>> > >>>> ... > >>>> workerloglevel=DEBUG > >>>> workerlogdirectory=/home/$USER/ > >>>> ... > >>>> > >>>> Mihael > >>>> > >>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>> Hi Yadu, > >>>>> > >>>>> > >>>>> I?m getting errors indicating that DEBUG is an invalid worker > logging > >>>>> level. I?m attaching the stdout below. Let me know if I?m doing > >>>>> something silly. > >>>>> > >>>>> > >>>>> Jonathan > >>>>> > >>>>> > >>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > > >>>>> wrote: > >>>>> > >>>>>> Hi Jonathan, > >>>>>> > >>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not > see > >>>>>> anything unusual. > >>>>>> > >>>>>> From your logs, it looks like workers are failing, so getting > worker > >>>>>> logs would help. > >>>>>> Could you try running on Blues with the following > swift.properties > >>>>>> and get us the worker*logs that would show up in the > >>>>>> workerlogdirectory ? > >>>>>> > >>>>>> site=blues > >>>>>> > >>>>>> site.blues { > >>>>>> jobManager=pbs > >>>>>> jobQueue=shared > >>>>>> maxJobs=4 > >>>>>> jobGranularity=1 > >>>>>> maxNodesPerJob=1 > >>>>>> tasksPerWorker=16 > >>>>>> taskThrottle=64 > >>>>>> initialScore=10000 > >>>>>> jobWalltime=00:48:00 > >>>>>> taskWalltime=00:40:00 > >>>>>> workerloglevel=DEBUG # > Adding > >>>>>> debug for workers > >>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>> directory on SFS > >>>>>> workdir=$RUNDIRECTORY > >>>>>> filesystem=local > >>>>>> } > >>>>>> > >>>>>> Thanks, > >>>>>> Yadu > >>>>>> > >>>>>> > >>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>> > >>>>>>> Hi Mike, > >>>>>>> > >>>>>>> > >>>>>>> Sorry, I figured there was some busy-ness involved! > >>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > didn?t > >>>>>>> get the same issue. That is, the model run completed > successfully. > >>>>>>> For the Blues run, I used a trunk distribution (as of May 29, > >>>>>>> 2014). I?m including one of the log files below. I?m also > >>>>>>> including the swift.properties file that was used for the > blues > >>>>>>> runs below. > >>>>>>> > >>>>>>> > >>>>>>> Thank you! > >>>>>>> > >>>>>>> > >>>>>>> Jonathan > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > wrote: > >>>>>>> > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > >>>>>>>> > >>>>>>>> I or one of the team will answer soon, on swift-user. 
> >>>>>>>> > >>>>>>>> (But the first question is: which Swift release, and can you > >>>>>>>> point us to, or send, the full log file?) > >>>>>>>> > >>>>>>>> Thanks and regards, > >>>>>>>> > >>>>>>>> - Mike > >>>>>>>> > >>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>> > >>>>>>>>> Hi Mike, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I didn?t get a response yet so just wanted to make sure that > >>>>>>>>> the message came across. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> Begin forwarded message: > >>>>>>>>> > >>>>>>>>>> From: Jonathan Ozik > >>>>>>>>>> > >>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>> > >>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>> > >>>>>>>>>> To: Mihael Hategan , > >>>>>>>>>> "swift-user at ci.uchicago.edu" > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Hi all, > >>>>>>>>>> > >>>>>>>>>> I?m getting spurious errors in the jobs that I?m running on > >>>>>>>>>> Blues. The stdout includes exceptions like: > >>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>> at > >>>>>>>>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>> at > >>>>>>>>>> > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>> at > >>>>>>>>>> > org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>> at > >>>>>>>>>> > org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>> > >>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>> jobs. Let me know if there?s a log file that you?d like to > >>>>>>>>>> look at. > >>>>>>>>>> > >>>>>>>>>> In earlier attempts I was getting these warnings followed > by > >>>>>>>>>> broken pipe errors: > >>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > 0) > >>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>> > >>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 as > >>>>>>>>>> described here > >>>>>>>>>> > (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>> Area: hotspot/gc > >>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >>>>>>>>>> > >>>>>>>>>> On Linux, failures when allocating large pages can lead to > >>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue > >>>>>>>>>> can be recognized in two ways: > >>>>>>>>>> > >>>>>>>>>> ? Before the crash happens one or more lines similar to > this > >>>>>>>>>> will have been printed to the log: > >>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > 0) > >>>>>>>>>> failed; > >>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot allocate > >>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>> ? 
If a hs_err file is generated it will contain a line > >>>>>>>>>> similar to this: > >>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>> > >>>>>>>>>> See 8007074 (not public). > >>>>>>>>>> > >>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations > >>>>>>>>>> of Java code that I was responsible for. That seemed to get > >>>>>>>>>> rid of the warning and the crashes for a while, but perhaps > >>>>>>>>>> that was just a coincidence. > >>>>>>>>>> > >>>>>>>>>> Jonathan > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Swift-user mailing list > >>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> -- > >>>>>>>> Michael Wilde > >>>>>>>> Mathematics and Computer Science Computation > Institute > >>>>>>>> Argonne National Laboratory The University of > Chicago > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > >> > > > > From hategan at mcs.anl.gov Thu Jul 31 12:13:34 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 10:13:34 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DA5020.7050402@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53DA4EEE.5010800@anl.gov> <53DA5020.7050402@anl.gov> Message-ID: <1406826814.22632.1.camel@echo> On Thu, 2014-07-31 at 09:18 -0500, Michael Wilde wrote: > I see this from PBS in your home dir: > > blues$ cat 583937.bmgt1.lcrc.anl.gov.ER > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > blues$ > > That looks to me like a Swift bug in worker.pl Line 2220 of worker.pl: wlog DEBUG, "$JOBID Got output from child ($s). Closing pipe.\n"; So I don't think this is the problem (or much of a problem in general), although it could be confusing so it should be fixed. Mihael From jozik at uchicago.edu Thu Jul 31 12:34:57 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 31 Jul 2014 12:34:57 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <1406826555.22289.4.camel@echo> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> Message-ID: <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> Sure thing, it?s attached below. Jonathan -------------- next part -------------- A non-text attachment was scrubbed... Name: run014.log Type: application/octet-stream Size: 2887616 bytes Desc: not available URL: -------------- next part -------------- On Jul 31, 2014, at 12:09 PM, Mihael Hategan wrote: > Hi Jonathan, > > I can't see anything obvious in the worker logs, but they are pretty > large. Can you also post the swift log from this run? It would make it > easier to focus on the right time frame. 
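As a quick sanity check on the warning discussed above, it is easy to count how often it actually fires in the PBS stderr file quoted earlier (the job id is the one from that example; run this from wherever PBS wrote the .ER file):

    # The warning comes from worker.pl's debug logging, so occurrences here
    # do not, by themselves, indicate that the worker crashed.
    grep -c 'uninitialized value' 583937.bmgt1.lcrc.anl.gov.ER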
> > Mihael > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: >> Hi all, >> >> I?m attaching the stdout and the worker logs below. >> >> Thanks for looking at these! >> >> Jonathan >> >> >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik >> wrote: >> >>> Woops, sorry about that. It?s running now and the logs are being >> generated. Once the run is done I?ll send you log files. >>> >>> Thanks! >>> >>> Jonathan >>> >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan >> wrote: >>> >>>> Right. This isn't your fault. We should, though, probably talk >> about >>>> addressing the issue. >>>> >>>> Mihael >>>> >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: >>>>> Mihael, thanks for spotting that. I added the comments to >> highlight the >>>>> changes in email. >>>>> >>>>> -Yadu >>>>> >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: >>>>>> Hi Jonathan, >>>>>> >>>>>> I suspect that the site config is considering the comment to be >> part of >>>>>> the value of the workerLogLevel property. We could confirm this >> if you >>>>>> send us the swift log from this particular run. >>>>>> >>>>>> To fix it, you could try to remove everything after DEBUG >> (including all >>>>>> horizontal white space). In other words: >>>>>> >>>>>> ... >>>>>> workerloglevel=DEBUG >>>>>> workerlogdirectory=/home/$USER/ >>>>>> ... >>>>>> >>>>>> Mihael >>>>>> >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: >>>>>>> Hi Yadu, >>>>>>> >>>>>>> >>>>>>> I?m getting errors indicating that DEBUG is an invalid worker >> logging >>>>>>> level. I?m attaching the stdout below. Let me know if I?m doing >>>>>>> something silly. >>>>>>> >>>>>>> >>>>>>> Jonathan >>>>>>> >>>>>>> >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji >> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Jonathan, >>>>>>>> >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not >> see >>>>>>>> anything unusual. >>>>>>>> >>>>>>>> From your logs, it looks like workers are failing, so getting >> worker >>>>>>>> logs would help. >>>>>>>> Could you try running on Blues with the following >> swift.properties >>>>>>>> and get us the worker*logs that would show up in the >>>>>>>> workerlogdirectory ? >>>>>>>> >>>>>>>> site=blues >>>>>>>> >>>>>>>> site.blues { >>>>>>>> jobManager=pbs >>>>>>>> jobQueue=shared >>>>>>>> maxJobs=4 >>>>>>>> jobGranularity=1 >>>>>>>> maxNodesPerJob=1 >>>>>>>> tasksPerWorker=16 >>>>>>>> taskThrottle=64 >>>>>>>> initialScore=10000 >>>>>>>> jobWalltime=00:48:00 >>>>>>>> taskWalltime=00:40:00 >>>>>>>> workerloglevel=DEBUG # >> Adding >>>>>>>> debug for workers >>>>>>>> workerlogdirectory=/home/$USER/ # Logging >>>>>>>> directory on SFS >>>>>>>> workdir=$RUNDIRECTORY >>>>>>>> filesystem=local >>>>>>>> } >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Yadu >>>>>>>> >>>>>>>> >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: >>>>>>>> >>>>>>>>> Hi Mike, >>>>>>>>> >>>>>>>>> >>>>>>>>> Sorry, I figured there was some busy-ness involved! >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I >> didn?t >>>>>>>>> get the same issue. That is, the model run completed >> successfully. >>>>>>>>> For the Blues run, I used a trunk distribution (as of May 29, >>>>>>>>> 2014). I?m including one of the log files below. I?m also >>>>>>>>> including the swift.properties file that was used for the >> blues >>>>>>>>> runs below. >>>>>>>>> >>>>>>>>> >>>>>>>>> Thank you! 
>>>>>>>>> >>>>>>>>> >>>>>>>>> Jonathan >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde >> wrote: >>>>>>>>> >>>>>>>>>> Hi Jonathan, >>>>>>>>>> >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! >>>>>>>>>> >>>>>>>>>> I or one of the team will answer soon, on swift-user. >>>>>>>>>> >>>>>>>>>> (But the first question is: which Swift release, and can you >>>>>>>>>> point us to, or send, the full log file?) >>>>>>>>>> >>>>>>>>>> Thanks and regards, >>>>>>>>>> >>>>>>>>>> - Mike >>>>>>>>>> >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: >>>>>>>>>> >>>>>>>>>>> Hi Mike, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I didn?t get a response yet so just wanted to make sure that >>>>>>>>>>> the message came across. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Jonathan >>>>>>>>>>> >>>>>>>>>>> Begin forwarded message: >>>>>>>>>>> >>>>>>>>>>>> From: Jonathan Ozik >>>>>>>>>>>> >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>> >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT >>>>>>>>>>>> >>>>>>>>>>>> To: Mihael Hategan , >>>>>>>>>>>> "swift-user at ci.uchicago.edu" >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Hi all, >>>>>>>>>>>> >>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running on >>>>>>>>>>>> Blues. The stdout includes exceptions like: >>>>>>>>>>>> exception @ swift-int.k, line: 511 >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>> java.io.IOException: Broken pipe >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >>>>>>>>>>>> at >>>>>>>>>>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >>>>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >>>>>>>>>>>> at >>>>>>>>>>>> >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >>>>>>>>>>>> at >>>>>>>>>>>> >> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >>>>>>>>>>>> at >>>>>>>>>>>> >> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >>>>>>>>>>>> >>>>>>>>>>>> These seem to occur at different parts of the submitted >>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like to >>>>>>>>>>>> look at. >>>>>>>>>>>> >>>>>>>>>>>> In earlier attempts I was getting these warnings followed >> by >>>>>>>>>>>> broken pipe errors: >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, >> 0) >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot >>>>>>>>>>>> allocate large pages, falling back to regular pages >>>>>>>>>>>> >>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 as >>>>>>>>>>>> described here >>>>>>>>>>>> >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >>>>>>>>>>>> Area: hotspot/gc >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. >>>>>>>>>>>> >>>>>>>>>>>> On Linux, failures when allocating large pages can lead to >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue >>>>>>>>>>>> can be recognized in two ways: >>>>>>>>>>>> >>>>>>>>>>>> ? 
Before the crash happens one or more lines similar to >> this >>>>>>>>>>>> will have been printed to the log: >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, >> 0) >>>>>>>>>>>> failed; >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot allocate >>>>>>>>>>>> large pages, falling back to regular pages >>>>>>>>>>>> ? If a hs_err file is generated it will contain a line >>>>>>>>>>>> similar to this: >>>>>>>>>>>> Large page allocation failures have occurred 3 times >>>>>>>>>>>> The problem can be avoided by running with large page >>>>>>>>>>>> support turned off, for example by passing the >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. >>>>>>>>>>>> >>>>>>>>>>>> See 8007074 (not public). >>>>>>>>>>>> >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations >>>>>>>>>>>> of Java code that I was responsible for. That seemed to get >>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps >>>>>>>>>>>> that was just a coincidence. >>>>>>>>>>>> >>>>>>>>>>>> Jonathan >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Michael Wilde >>>>>>>>>> Mathematics and Computer Science Computation >> Institute >>>>>>>>>> Argonne National Laboratory The University of >> Chicago >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> >>> >> >> > > From yadudoc1729 at gmail.com Thu Jul 31 13:02:18 2014 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Thu, 31 Jul 2014 13:02:18 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DA5020.7050402@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53DA4EEE.5010800@anl.gov> <53DA5020.7050402@anl.gov> Message-ID: Hi Mike, I checked Jonathan's folders and it looks like the submit scripts and the PBS submit, submit.stdout and submit.stderr files correctly were written under the runNNN/scripts folder. His latest run was using Swift-0.95-RC6 which failed with the logs that you saw. The are also PBS*submit.stderr files which report the same "uninitialized value $s in concatenation" error. -Yadu On Thu, Jul 31, 2014 at 9:18 AM, Michael Wilde wrote: > I see this from PBS in your home dir: > > blues$ cat 583937.bmgt1.lcrc.anl.gov.ER > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > blues$ > > That looks to me like a Swift bug in worker.pl > > We'll look into this angle. > > Also I'm curious why these files are not going into your run dir (but > perhaps thats because youre running an older trunk release, not 0.95? > Or, thats a separate 0.95 bug). > > - Mike > > On 7/31/14, 9:13 AM, Michael Wilde wrote: > > Some discussion and diagnosis of this incident has taken place off list. > > > > In a quick scan of the worker logs, I don't spot an obvious error that > > would cause workers to exit. > > Hopefully others on the Swift team can check those as well. > > > > Jonathan, do you have stdout/err files from the PBS scheduler on blues, > > in your runNNN log dirs? > > > > If so, can you point us to them? 
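To make the locations mentioned above concrete, the per-run PBS artifacts can be listed straight from the run directory (the run number is illustrative; the PBS*submit* names follow what is described in this thread):

    # Coaster block submit scripts plus the scheduler's stdout/stderr for them.
    ls run014/scripts/
    cat run014/scripts/PBS*submit.stderr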
> > > > Thanks, > > > > - Mike > > > > On 7/29/14, 8:56 PM, Jonathan Ozik wrote: > >> Hi all, > >> > >> I?m getting spurious errors in the jobs that I?m running on Blues. The > stdout includes exceptions like: > >> exception @ swift-int.k, line: 511 > >> Caused by: Block task failed: Connection to worker lost > >> java.io.IOException: Broken pipe > >> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >> at > org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >> at > org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >> > >> These seem to occur at different parts of the submitted jobs. Let me > know if there?s a log file that you?d like to look at. > >> > >> In earlier attempts I was getting these warnings followed by broken > pipe errors: > >> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; > error='Cannot allocate memory' (errno=12); Cannot allocate large pages, > falling back to regular pages > >> > >> Apparently that?s a known precursor of crashes on Java 7 as described > here ( > http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >> Area: hotspot/gc > >> Synopsis: Crashes due to failure to allocate large pages. > >> > >> On Linux, failures when allocating large pages can lead to crashes. > When running JDK 7u51 or later versions, the issue can be recognized in two > ways: > >> > >> ? Before the crash happens one or more lines similar to this will > have been printed to the log: > >> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; > >> error='Cannot allocate memory' (errno=12); Cannot allocate large pages, > falling back to regular pages > >> ? If a hs_err file is generated it will contain a line similar to > this: > >> Large page allocation failures have occurred 3 times > >> The problem can be avoided by running with large page support turned > off, for example by passing the "-XX:-UseLargePages" option to the java > binary. > >> > >> See 8007074 (not public). > >> > >> So I added the -XX:-UseLargePages option in the invocations of Java > code that I was responsible for. That seemed to get rid of the warning and > the crashes for a while, but perhaps that was just a coincidence. > >> > >> Jonathan > >> > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Yadu Nand B -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Jul 31 13:18:28 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 11:18:28 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> Message-ID: <1406830708.23317.3.camel@echo> Ok, so the workers die while the jobs are running and not much else is happening. My money is on the apps eating up all RAM and the kernel killing the worker. The question is how we check whether this is true or not. Ideas? Yadu, can you do me a favor and package all the PBS output files from this run? Jonathan, can you see if you get the same errors with tasksPerWorker=8? Mihael On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > Sure thing, it?s attached below. > > Jonathan > > > On Jul 31, 2014, at 12:09 PM, Mihael Hategan > wrote: > > > Hi Jonathan, > > > > I can't see anything obvious in the worker logs, but they are pretty > > large. Can you also post the swift log from this run? It would make > it > > easier to focus on the right time frame. > > > > Mihael > > > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >> Hi all, > >> > >> I?m attaching the stdout and the worker logs below. > >> > >> Thanks for looking at these! > >> > >> Jonathan > >> > >> > >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > >> wrote: > >> > >>> Woops, sorry about that. It?s running now and the logs are being > >> generated. Once the run is done I?ll send you log files. > >>> > >>> Thanks! > >>> > >>> Jonathan > >>> > >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > >> wrote: > >>> > >>>> Right. This isn't your fault. We should, though, probably talk > >> about > >>>> addressing the issue. > >>>> > >>>> Mihael > >>>> > >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>>>> Mihael, thanks for spotting that. I added the comments to > >> highlight the > >>>>> changes in email. > >>>>> > >>>>> -Yadu > >>>>> > >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>>>> Hi Jonathan, > >>>>>> > >>>>>> I suspect that the site config is considering the comment to be > >> part of > >>>>>> the value of the workerLogLevel property. We could confirm this > >> if you > >>>>>> send us the swift log from this particular run. > >>>>>> > >>>>>> To fix it, you could try to remove everything after DEBUG > >> (including all > >>>>>> horizontal white space). In other words: > >>>>>> > >>>>>> ... > >>>>>> workerloglevel=DEBUG > >>>>>> workerlogdirectory=/home/$USER/ > >>>>>> ... > >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>>>> Hi Yadu, > >>>>>>> > >>>>>>> > >>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > >> logging > >>>>>>> level. I?m attaching the stdout below. Let me know if I?m > doing > >>>>>>> something silly. > >>>>>>> > >>>>>>> > >>>>>>> Jonathan > >>>>>>> > >>>>>>> > >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > not > >> see > >>>>>>>> anything unusual. 
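One concrete way to answer the "how do we check" question above, for anyone who can get a shell onto the compute node while (or shortly after) a block runs; these are standard Linux commands, nothing Swift-specific:

    # If the kernel OOM killer terminated worker.pl or one of the app tasks,
    # it leaves a trace in the kernel log; free shows the remaining headroom.
    dmesg | grep -iE 'out of memory|killed process'
    free -m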
> >>>>>>>> > >>>>>>>> From your logs, it looks like workers are failing, so getting > >> worker > >>>>>>>> logs would help. > >>>>>>>> Could you try running on Blues with the following > >> swift.properties > >>>>>>>> and get us the worker*logs that would show up in the > >>>>>>>> workerlogdirectory ? > >>>>>>>> > >>>>>>>> site=blues > >>>>>>>> > >>>>>>>> site.blues { > >>>>>>>> jobManager=pbs > >>>>>>>> jobQueue=shared > >>>>>>>> maxJobs=4 > >>>>>>>> jobGranularity=1 > >>>>>>>> maxNodesPerJob=1 > >>>>>>>> tasksPerWorker=16 > >>>>>>>> taskThrottle=64 > >>>>>>>> initialScore=10000 > >>>>>>>> jobWalltime=00:48:00 > >>>>>>>> taskWalltime=00:40:00 > >>>>>>>> workerloglevel=DEBUG # > >> Adding > >>>>>>>> debug for workers > >>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>>>> directory on SFS > >>>>>>>> workdir=$RUNDIRECTORY > >>>>>>>> filesystem=local > >>>>>>>> } > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Yadu > >>>>>>>> > >>>>>>>> > >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>>>> > >>>>>>>>> Hi Mike, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Sorry, I figured there was some busy-ness involved! > >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >> didn?t > >>>>>>>>> get the same issue. That is, the model run completed > >> successfully. > >>>>>>>>> For the Blues run, I used a trunk distribution (as of May > 29, > >>>>>>>>> 2014). I?m including one of the log files below. I?m also > >>>>>>>>> including the swift.properties file that was used for the > >> blues > >>>>>>>>> runs below. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Thank you! > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > >> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Jonathan, > >>>>>>>>>> > >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > >>>>>>>>>> > >>>>>>>>>> I or one of the team will answer soon, on swift-user. > >>>>>>>>>> > >>>>>>>>>> (But the first question is: which Swift release, and can > you > >>>>>>>>>> point us to, or send, the full log file?) > >>>>>>>>>> > >>>>>>>>>> Thanks and regards, > >>>>>>>>>> > >>>>>>>>>> - Mike > >>>>>>>>>> > >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Mike, > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > that > >>>>>>>>>>> the message came across. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Jonathan > >>>>>>>>>>> > >>>>>>>>>>> Begin forwarded message: > >>>>>>>>>>> > >>>>>>>>>>>> From: Jonathan Ozik > >>>>>>>>>>>> > >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>> > >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>>>> > >>>>>>>>>>>> To: Mihael Hategan , > >>>>>>>>>>>> "swift-user at ci.uchicago.edu" > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Hi all, > >>>>>>>>>>>> > >>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > on > >>>>>>>>>>>> Blues. 
The stdout includes exceptions like: > >>>>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>>>> at > >>>>>>>>>>>> > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>>>> at > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>>>> at > >>>>>>>>>>>> > >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>>>> at > >>>>>>>>>>>> > >> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>>>> at > >>>>>>>>>>>> > >> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>>>> > >>>>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > to > >>>>>>>>>>>> look at. > >>>>>>>>>>>> > >>>>>>>>>>>> In earlier attempts I was getting these warnings followed > >> by > >>>>>>>>>>>> broken pipe errors: > >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > >> 0) > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >>>>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>>>> > >>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > as > >>>>>>>>>>>> described here > >>>>>>>>>>>> > >> > (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>>>> Area: hotspot/gc > >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >>>>>>>>>>>> > >>>>>>>>>>>> On Linux, failures when allocating large pages can lead > to > >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > issue > >>>>>>>>>>>> can be recognized in two ways: > >>>>>>>>>>>> > >>>>>>>>>>>> ? Before the crash happens one or more lines similar to > >> this > >>>>>>>>>>>> will have been printed to the log: > >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > >> 0) > >>>>>>>>>>>> failed; > >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > allocate > >>>>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > >>>>>>>>>>>> similar to this: > >>>>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>>>> > >>>>>>>>>>>> See 8007074 (not public). > >>>>>>>>>>>> > >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > invocations > >>>>>>>>>>>> of Java code that I was responsible for. That seemed to > get > >>>>>>>>>>>> rid of the warning and the crashes for a while, but > perhaps > >>>>>>>>>>>> that was just a coincidence. 
> >>>>>>>>>>>> > >>>>>>>>>>>> Jonathan > >>>>>>>>>>>> > >>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>>>> > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> Michael Wilde > >>>>>>>>>> Mathematics and Computer Science Computation > >> Institute > >>>>>>>>>> Argonne National Laboratory The University of > >> Chicago > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >> > >> > > > > > > From yadudoc1729 at gmail.com Thu Jul 31 13:26:02 2014 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Thu, 31 Jul 2014 13:26:02 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <1406830708.23317.3.camel@echo> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: ?Here's a link to the scripts folder tarred up. http://users.rcc.uchicago.edu/~yadunand/pbs_ozik_blues.tar.gz A couple files couldn't be copied due to permission issues. -Yadu? On Thu, Jul 31, 2014 at 1:18 PM, Mihael Hategan wrote: > Ok, so the workers die while the jobs are running and not much else is > happening. > My money is on the apps eating up all RAM and the kernel killing the > worker. > > The question is how we check whether this is true or not. Ideas? > > Yadu, can you do me a favor and package all the PBS output files from > this run? > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > Mihael > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > > Sure thing, it?s attached below. > > > > Jonathan > > > > > > On Jul 31, 2014, at 12:09 PM, Mihael Hategan > > wrote: > > > > > Hi Jonathan, > > > > > > I can't see anything obvious in the worker logs, but they are pretty > > > large. Can you also post the swift log from this run? It would make > > it > > > easier to focus on the right time frame. > > > > > > Mihael > > > > > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > > >> Hi all, > > >> > > >> I?m attaching the stdout and the worker logs below. > > >> > > >> Thanks for looking at these! > > >> > > >> Jonathan > > >> > > >> > > >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > > >> wrote: > > >> > > >>> Woops, sorry about that. It?s running now and the logs are being > > >> generated. Once the run is done I?ll send you log files. > > >>> > > >>> Thanks! > > >>> > > >>> Jonathan > > >>> > > >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > > >> wrote: > > >>> > > >>>> Right. This isn't your fault. We should, though, probably talk > > >> about > > >>>> addressing the issue. > > >>>> > > >>>> Mihael > > >>>> > > >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > > >>>>> Mihael, thanks for spotting that. I added the comments to > > >> highlight the > > >>>>> changes in email. > > >>>>> > > >>>>> -Yadu > > >>>>> > > >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > > >>>>>> Hi Jonathan, > > >>>>>> > > >>>>>> I suspect that the site config is considering the comment to be > > >> part of > > >>>>>> the value of the workerLogLevel property. 
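For completeness, the bundle linked above can be fetched and unpacked like this (URL exactly as posted; nothing else assumed):

    curl -O http://users.rcc.uchicago.edu/~yadunand/pbs_ozik_blues.tar.gz
    tar xzf pbs_ozik_blues.tar.gz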
We could confirm this > > >> if you > > >>>>>> send us the swift log from this particular run. > > >>>>>> > > >>>>>> To fix it, you could try to remove everything after DEBUG > > >> (including all > > >>>>>> horizontal white space). In other words: > > >>>>>> > > >>>>>> ... > > >>>>>> workerloglevel=DEBUG > > >>>>>> workerlogdirectory=/home/$USER/ > > >>>>>> ... > > >>>>>> > > >>>>>> Mihael > > >>>>>> > > >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > > >>>>>>> Hi Yadu, > > >>>>>>> > > >>>>>>> > > >>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > > >> logging > > >>>>>>> level. I?m attaching the stdout below. Let me know if I?m > > doing > > >>>>>>> something silly. > > >>>>>>> > > >>>>>>> > > >>>>>>> Jonathan > > >>>>>>> > > >>>>>>> > > >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > > >> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>>> Hi Jonathan, > > >>>>>>>> > > >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > > not > > >> see > > >>>>>>>> anything unusual. > > >>>>>>>> > > >>>>>>>> From your logs, it looks like workers are failing, so getting > > >> worker > > >>>>>>>> logs would help. > > >>>>>>>> Could you try running on Blues with the following > > >> swift.properties > > >>>>>>>> and get us the worker*logs that would show up in the > > >>>>>>>> workerlogdirectory ? > > >>>>>>>> > > >>>>>>>> site=blues > > >>>>>>>> > > >>>>>>>> site.blues { > > >>>>>>>> jobManager=pbs > > >>>>>>>> jobQueue=shared > > >>>>>>>> maxJobs=4 > > >>>>>>>> jobGranularity=1 > > >>>>>>>> maxNodesPerJob=1 > > >>>>>>>> tasksPerWorker=16 > > >>>>>>>> taskThrottle=64 > > >>>>>>>> initialScore=10000 > > >>>>>>>> jobWalltime=00:48:00 > > >>>>>>>> taskWalltime=00:40:00 > > >>>>>>>> workerloglevel=DEBUG # > > >> Adding > > >>>>>>>> debug for workers > > >>>>>>>> workerlogdirectory=/home/$USER/ # Logging > > >>>>>>>> directory on SFS > > >>>>>>>> workdir=$RUNDIRECTORY > > >>>>>>>> filesystem=local > > >>>>>>>> } > > >>>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> Yadu > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > > >>>>>>>> > > >>>>>>>>> Hi Mike, > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Sorry, I figured there was some busy-ness involved! > > >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > > >> didn?t > > >>>>>>>>> get the same issue. That is, the model run completed > > >> successfully. > > >>>>>>>>> For the Blues run, I used a trunk distribution (as of May > > 29, > > >>>>>>>>> 2014). I?m including one of the log files below. I?m also > > >>>>>>>>> including the swift.properties file that was used for the > > >> blues > > >>>>>>>>> runs below. > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Thank you! > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Jonathan > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > > >> wrote: > > >>>>>>>>> > > >>>>>>>>>> Hi Jonathan, > > >>>>>>>>>> > > >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > > >>>>>>>>>> > > >>>>>>>>>> I or one of the team will answer soon, on swift-user. > > >>>>>>>>>> > > >>>>>>>>>> (But the first question is: which Swift release, and can > > you > > >>>>>>>>>> point us to, or send, the full log file?) 
> > >>>>>>>>>> > > >>>>>>>>>> Thanks and regards, > > >>>>>>>>>> > > >>>>>>>>>> - Mike > > >>>>>>>>>> > > >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > > >>>>>>>>>> > > >>>>>>>>>>> Hi Mike, > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > > that > > >>>>>>>>>>> the message came across. > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> Jonathan > > >>>>>>>>>>> > > >>>>>>>>>>> Begin forwarded message: > > >>>>>>>>>>> > > >>>>>>>>>>>> From: Jonathan Ozik > > >>>>>>>>>>>> > > >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > > >>>>>>>>>>>> > > >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > > >>>>>>>>>>>> > > >>>>>>>>>>>> To: Mihael Hategan , > > >>>>>>>>>>>> "swift-user at ci.uchicago.edu" > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> Hi all, > > >>>>>>>>>>>> > > >>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > > on > > >>>>>>>>>>>> Blues. The stdout includes exceptions like: > > >>>>>>>>>>>> exception @ swift-int.k, line: 511 > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > > >>>>>>>>>>>> java.io.IOException: Broken pipe > > >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > > >>>>>>>>>>>> at > > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > > >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > > >>>>>>>>>>>> > > >>>>>>>>>>>> These seem to occur at different parts of the submitted > > >>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > > to > > >>>>>>>>>>>> look at. > > >>>>>>>>>>>> > > >>>>>>>>>>>> In earlier attempts I was getting these warnings followed > > >> by > > >>>>>>>>>>>> broken pipe errors: > > >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > > >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > > >> 0) > > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > > >>>>>>>>>>>> allocate large pages, falling back to regular pages > > >>>>>>>>>>>> > > >>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > > as > > >>>>>>>>>>>> described here > > >>>>>>>>>>>> > > >> > > ( > http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > > >>>>>>>>>>>> Area: hotspot/gc > > >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Linux, failures when allocating large pages can lead > > to > > >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > > issue > > >>>>>>>>>>>> can be recognized in two ways: > > >>>>>>>>>>>> > > >>>>>>>>>>>> ? Before the crash happens one or more lines similar to > > >> this > > >>>>>>>>>>>> will have been printed to the log: > > >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > > >> 0) > > >>>>>>>>>>>> failed; > > >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > > allocate > > >>>>>>>>>>>> large pages, falling back to regular pages > > >>>>>>>>>>>> ? 
If a hs_err file is generated it will contain a line > > >>>>>>>>>>>> similar to this: > > >>>>>>>>>>>> Large page allocation failures have occurred 3 times > > >>>>>>>>>>>> The problem can be avoided by running with large page > > >>>>>>>>>>>> support turned off, for example by passing the > > >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > > >>>>>>>>>>>> > > >>>>>>>>>>>> See 8007074 (not public). > > >>>>>>>>>>>> > > >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > > invocations > > >>>>>>>>>>>> of Java code that I was responsible for. That seemed to > > get > > >>>>>>>>>>>> rid of the warning and the crashes for a while, but > > perhaps > > >>>>>>>>>>>> that was just a coincidence. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Jonathan > > >>>>>>>>>>>> > > >>>>>>>>>>>> _______________________________________________ > > >>>>>>>>>>>> Swift-user mailing list > > >>>>>>>>>>>> Swift-user at ci.uchicago.edu > > >>>>>>>>>>>> > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> -- > > >>>>>>>>>> Michael Wilde > > >>>>>>>>>> Mathematics and Computer Science Computation > > >> Institute > > >>>>>>>>>> Argonne National Laboratory The University of > > >> Chicago > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>>> > > >>> > > >> > > >> > > > > > > > > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Yadu Nand B -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Jul 31 13:36:27 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 11:36:27 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: <1406831787.23534.8.camel@echo> Thanks Yadu! I see nothing interesting in those logs. Again, the absence of any kind* of problem logged by the worker points to some abrupt termination of the process, which is most likely explained by an OOM. Mihael (*) uninitialized variable concatenation warning aside On Thu, 2014-07-31 at 13:26 -0500, Yadu Nand wrote: > ?Here's a link to the scripts folder tarred up. > http://users.rcc.uchicago.edu/~yadunand/pbs_ozik_blues.tar.gz > > A couple files couldn't be copied due to permission issues. > > -Yadu? > > > On Thu, Jul 31, 2014 at 1:18 PM, Mihael Hategan wrote: > > > Ok, so the workers die while the jobs are running and not much else is > > happening. > > My money is on the apps eating up all RAM and the kernel killing the > > worker. > > > > The question is how we check whether this is true or not. Ideas? > > > > Yadu, can you do me a favor and package all the PBS output files from > > this run? > > > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > > > Mihael > > > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > > > Sure thing, it?s attached below. 
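Another way to test the OOM theory stated above is to ask the scheduler what a block job actually consumed. On a Torque/PBS system something along these lines works for a running or recently finished job (job id from earlier in the thread; field names vary by PBS flavour):

    # resources_used.mem / resources_used.vmem for the coaster block job.
    qstat -f 583937.bmgt1.lcrc.anl.gov | grep -i resources_used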
> > > > > > Jonathan > > > > > > > > > On Jul 31, 2014, at 12:09 PM, Mihael Hategan > > > wrote: > > > > > > > Hi Jonathan, > > > > > > > > I can't see anything obvious in the worker logs, but they are pretty > > > > large. Can you also post the swift log from this run? It would make > > > it > > > > easier to focus on the right time frame. > > > > > > > > Mihael > > > > > > > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > > > >> Hi all, > > > >> > > > >> I?m attaching the stdout and the worker logs below. > > > >> > > > >> Thanks for looking at these! > > > >> > > > >> Jonathan > > > >> > > > >> > > > >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > > > >> wrote: > > > >> > > > >>> Woops, sorry about that. It?s running now and the logs are being > > > >> generated. Once the run is done I?ll send you log files. > > > >>> > > > >>> Thanks! > > > >>> > > > >>> Jonathan > > > >>> > > > >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > > > >> wrote: > > > >>> > > > >>>> Right. This isn't your fault. We should, though, probably talk > > > >> about > > > >>>> addressing the issue. > > > >>>> > > > >>>> Mihael > > > >>>> > > > >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > > > >>>>> Mihael, thanks for spotting that. I added the comments to > > > >> highlight the > > > >>>>> changes in email. > > > >>>>> > > > >>>>> -Yadu > > > >>>>> > > > >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > > > >>>>>> Hi Jonathan, > > > >>>>>> > > > >>>>>> I suspect that the site config is considering the comment to be > > > >> part of > > > >>>>>> the value of the workerLogLevel property. We could confirm this > > > >> if you > > > >>>>>> send us the swift log from this particular run. > > > >>>>>> > > > >>>>>> To fix it, you could try to remove everything after DEBUG > > > >> (including all > > > >>>>>> horizontal white space). In other words: > > > >>>>>> > > > >>>>>> ... > > > >>>>>> workerloglevel=DEBUG > > > >>>>>> workerlogdirectory=/home/$USER/ > > > >>>>>> ... > > > >>>>>> > > > >>>>>> Mihael > > > >>>>>> > > > >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > > > >>>>>>> Hi Yadu, > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > > > >> logging > > > >>>>>>> level. I?m attaching the stdout below. Let me know if I?m > > > doing > > > >>>>>>> something silly. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> Jonathan > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > > > >> > > > >>>>>>> wrote: > > > >>>>>>> > > > >>>>>>>> Hi Jonathan, > > > >>>>>>>> > > > >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > > > not > > > >> see > > > >>>>>>>> anything unusual. > > > >>>>>>>> > > > >>>>>>>> From your logs, it looks like workers are failing, so getting > > > >> worker > > > >>>>>>>> logs would help. > > > >>>>>>>> Could you try running on Blues with the following > > > >> swift.properties > > > >>>>>>>> and get us the worker*logs that would show up in the > > > >>>>>>>> workerlogdirectory ? 
> > > >>>>>>>> > > > >>>>>>>> site=blues > > > >>>>>>>> > > > >>>>>>>> site.blues { > > > >>>>>>>> jobManager=pbs > > > >>>>>>>> jobQueue=shared > > > >>>>>>>> maxJobs=4 > > > >>>>>>>> jobGranularity=1 > > > >>>>>>>> maxNodesPerJob=1 > > > >>>>>>>> tasksPerWorker=16 > > > >>>>>>>> taskThrottle=64 > > > >>>>>>>> initialScore=10000 > > > >>>>>>>> jobWalltime=00:48:00 > > > >>>>>>>> taskWalltime=00:40:00 > > > >>>>>>>> workerloglevel=DEBUG # > > > >> Adding > > > >>>>>>>> debug for workers > > > >>>>>>>> workerlogdirectory=/home/$USER/ # Logging > > > >>>>>>>> directory on SFS > > > >>>>>>>> workdir=$RUNDIRECTORY > > > >>>>>>>> filesystem=local > > > >>>>>>>> } > > > >>>>>>>> > > > >>>>>>>> Thanks, > > > >>>>>>>> Yadu > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > > > >>>>>>>> > > > >>>>>>>>> Hi Mike, > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> Sorry, I figured there was some busy-ness involved! > > > >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > > > >> didn?t > > > >>>>>>>>> get the same issue. That is, the model run completed > > > >> successfully. > > > >>>>>>>>> For the Blues run, I used a trunk distribution (as of May > > > 29, > > > >>>>>>>>> 2014). I?m including one of the log files below. I?m also > > > >>>>>>>>> including the swift.properties file that was used for the > > > >> blues > > > >>>>>>>>> runs below. > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> Thank you! > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> Jonathan > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > > > >> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> Hi Jonathan, > > > >>>>>>>>>> > > > >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > > > >>>>>>>>>> > > > >>>>>>>>>> I or one of the team will answer soon, on swift-user. > > > >>>>>>>>>> > > > >>>>>>>>>> (But the first question is: which Swift release, and can > > > you > > > >>>>>>>>>> point us to, or send, the full log file?) > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks and regards, > > > >>>>>>>>>> > > > >>>>>>>>>> - Mike > > > >>>>>>>>>> > > > >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > > > >>>>>>>>>> > > > >>>>>>>>>>> Hi Mike, > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > > > that > > > >>>>>>>>>>> the message came across. > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> Jonathan > > > >>>>>>>>>>> > > > >>>>>>>>>>> Begin forwarded message: > > > >>>>>>>>>>> > > > >>>>>>>>>>>> From: Jonathan Ozik > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> To: Mihael Hategan , > > > >>>>>>>>>>>> "swift-user at ci.uchicago.edu" > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Hi all, > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > > > on > > > >>>>>>>>>>>> Blues. 
The stdout includes exceptions like: > > > >>>>>>>>>>>> exception @ swift-int.k, line: 511 > > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > > > >>>>>>>>>>>> java.io.IOException: Broken pipe > > > >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > > > >>>>>>>>>>>> at > > > >>>>>>>>>>>> > > > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > > > >>>>>>>>>>>> at > > > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > > > >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > > > >>>>>>>>>>>> at > > > >>>>>>>>>>>> > > > >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > > > >>>>>>>>>>>> at > > > >>>>>>>>>>>> > > > >> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > > > >>>>>>>>>>>> at > > > >>>>>>>>>>>> > > > >> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> These seem to occur at different parts of the submitted > > > >>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > > > to > > > >>>>>>>>>>>> look at. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> In earlier attempts I was getting these warnings followed > > > >> by > > > >>>>>>>>>>>> broken pipe errors: > > > >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > > > >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > > > >> 0) > > > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > > > >>>>>>>>>>>> allocate large pages, falling back to regular pages > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > > > as > > > >>>>>>>>>>>> described here > > > >>>>>>>>>>>> > > > >> > > > ( > > http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > > > >>>>>>>>>>>> Area: hotspot/gc > > > >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> On Linux, failures when allocating large pages can lead > > > to > > > >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > > > issue > > > >>>>>>>>>>>> can be recognized in two ways: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> ? Before the crash happens one or more lines similar to > > > >> this > > > >>>>>>>>>>>> will have been printed to the log: > > > >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > > > >> 0) > > > >>>>>>>>>>>> failed; > > > >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > > > allocate > > > >>>>>>>>>>>> large pages, falling back to regular pages > > > >>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > > > >>>>>>>>>>>> similar to this: > > > >>>>>>>>>>>> Large page allocation failures have occurred 3 times > > > >>>>>>>>>>>> The problem can be avoided by running with large page > > > >>>>>>>>>>>> support turned off, for example by passing the > > > >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> See 8007074 (not public). > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > > > invocations > > > >>>>>>>>>>>> of Java code that I was responsible for. That seemed to > > > get > > > >>>>>>>>>>>> rid of the warning and the crashes for a while, but > > > perhaps > > > >>>>>>>>>>>> that was just a coincidence. 
> > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Jonathan > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> _______________________________________________ > > > >>>>>>>>>>>> Swift-user mailing list > > > >>>>>>>>>>>> Swift-user at ci.uchicago.edu > > > >>>>>>>>>>>> > > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> -- > > > >>>>>>>>>> Michael Wilde > > > >>>>>>>>>> Mathematics and Computer Science Computation > > > >> Institute > > > >>>>>>>>>> Argonne National Laboratory The University of > > > >> Chicago > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >>>> > > > >>> > > > >> > > > >> > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > From jozik at uchicago.edu Thu Jul 31 13:42:15 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 31 Jul 2014 13:42:15 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <1406830708.23317.3.camel@echo> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. Jonathan On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: > Ok, so the workers die while the jobs are running and not much else is > happening. > My money is on the apps eating up all RAM and the kernel killing the > worker. > > The question is how we check whether this is true or not. Ideas? > > Yadu, can you do me a favor and package all the PBS output files from > this run? > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > Mihael > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: >> Sure thing, it?s attached below. >> >> Jonathan >> >> >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan >> wrote: >> >>> Hi Jonathan, >>> >>> I can't see anything obvious in the worker logs, but they are pretty >>> large. Can you also post the swift log from this run? It would make >> it >>> easier to focus on the right time frame. >>> >>> Mihael >>> >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: >>>> Hi all, >>>> >>>> I?m attaching the stdout and the worker logs below. >>>> >>>> Thanks for looking at these! >>>> >>>> Jonathan >>>> >>>> >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik >>>> wrote: >>>> >>>>> Woops, sorry about that. It?s running now and the logs are being >>>> generated. Once the run is done I?ll send you log files. >>>>> >>>>> Thanks! >>>>> >>>>> Jonathan >>>>> >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan >>>> wrote: >>>>> >>>>>> Right. This isn't your fault. We should, though, probably talk >>>> about >>>>>> addressing the issue. 
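For reference, the site definition with the offending inline comments removed would read as below. The values are exactly the ones Yadu posted earlier in this thread; nothing else is changed:

site=blues

site.blues {
    jobManager=pbs
    jobQueue=shared
    maxJobs=4
    jobGranularity=1
    maxNodesPerJob=1
    tasksPerWorker=16
    taskThrottle=64
    initialScore=10000
    jobWalltime=00:48:00
    taskWalltime=00:40:00
    workerloglevel=DEBUG
    workerlogdirectory=/home/$USER/
    workdir=$RUNDIRECTORY
    filesystem=local
}

Whether standalone comment lines are supported is not settled in this thread; the safe form is simply to leave comments out of the file.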
>>>>>> >>>>>> Mihael >>>>>> >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: >>>>>>> Mihael, thanks for spotting that. I added the comments to >>>> highlight the >>>>>>> changes in email. >>>>>>> >>>>>>> -Yadu >>>>>>> >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: >>>>>>>> Hi Jonathan, >>>>>>>> >>>>>>>> I suspect that the site config is considering the comment to be >>>> part of >>>>>>>> the value of the workerLogLevel property. We could confirm this >>>> if you >>>>>>>> send us the swift log from this particular run. >>>>>>>> >>>>>>>> To fix it, you could try to remove everything after DEBUG >>>> (including all >>>>>>>> horizontal white space). In other words: >>>>>>>> >>>>>>>> ... >>>>>>>> workerloglevel=DEBUG >>>>>>>> workerlogdirectory=/home/$USER/ >>>>>>>> ... >>>>>>>> >>>>>>>> Mihael >>>>>>>> >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: >>>>>>>>> Hi Yadu, >>>>>>>>> >>>>>>>>> >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker >>>> logging >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m >> doing >>>>>>>>> something silly. >>>>>>>>> >>>>>>>>> >>>>>>>>> Jonathan >>>>>>>>> >>>>>>>>> >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji >>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Jonathan, >>>>>>>>>> >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do >> not >>>> see >>>>>>>>>> anything unusual. >>>>>>>>>> >>>>>>>>>> From your logs, it looks like workers are failing, so getting >>>> worker >>>>>>>>>> logs would help. >>>>>>>>>> Could you try running on Blues with the following >>>> swift.properties >>>>>>>>>> and get us the worker*logs that would show up in the >>>>>>>>>> workerlogdirectory ? >>>>>>>>>> >>>>>>>>>> site=blues >>>>>>>>>> >>>>>>>>>> site.blues { >>>>>>>>>> jobManager=pbs >>>>>>>>>> jobQueue=shared >>>>>>>>>> maxJobs=4 >>>>>>>>>> jobGranularity=1 >>>>>>>>>> maxNodesPerJob=1 >>>>>>>>>> tasksPerWorker=16 >>>>>>>>>> taskThrottle=64 >>>>>>>>>> initialScore=10000 >>>>>>>>>> jobWalltime=00:48:00 >>>>>>>>>> taskWalltime=00:40:00 >>>>>>>>>> workerloglevel=DEBUG # >>>> Adding >>>>>>>>>> debug for workers >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging >>>>>>>>>> directory on SFS >>>>>>>>>> workdir=$RUNDIRECTORY >>>>>>>>>> filesystem=local >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Yadu >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: >>>>>>>>>> >>>>>>>>>>> Hi Mike, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I >>>> didn?t >>>>>>>>>>> get the same issue. That is, the model run completed >>>> successfully. >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May >> 29, >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also >>>>>>>>>>> including the swift.properties file that was used for the >>>> blues >>>>>>>>>>> runs below. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thank you! >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Jonathan >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde >>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Jonathan, >>>>>>>>>>>> >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! >>>>>>>>>>>> >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. >>>>>>>>>>>> >>>>>>>>>>>> (But the first question is: which Swift release, and can >> you >>>>>>>>>>>> point us to, or send, the full log file?) 
>>>>>>>>>>>> >>>>>>>>>>>> Thanks and regards, >>>>>>>>>>>> >>>>>>>>>>>> - Mike >>>>>>>>>>>> >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Mike, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure >> that >>>>>>>>>>>>> the message came across. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Jonathan >>>>>>>>>>>>> >>>>>>>>>>>>> Begin forwarded message: >>>>>>>>>>>>> >>>>>>>>>>>>>> From: Jonathan Ozik >>>>>>>>>>>>>> >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>>>> >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT >>>>>>>>>>>>>> >>>>>>>>>>>>>> To: Mihael Hategan , >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running >> on >>>>>>>>>>>>>> Blues. The stdout includes exceptions like: >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>>>> java.io.IOException: Broken pipe >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >>>>>>>>>>>>>> at >>>>>>>>>>>>>> >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >>>>>>>>>>>>>> at >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >>>>>>>>>>>>>> at >>>>>>>>>>>>>> >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >>>>>>>>>>>>>> at >>>>>>>>>>>>>> >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >>>>>>>>>>>>>> at >>>>>>>>>>>>>> >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >>>>>>>>>>>>>> >>>>>>>>>>>>>> These seem to occur at different parts of the submitted >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like >> to >>>>>>>>>>>>>> look at. >>>>>>>>>>>>>> >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed >>>> by >>>>>>>>>>>>>> broken pipe errors: >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, >>>> 0) >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot >>>>>>>>>>>>>> allocate large pages, falling back to regular pages >>>>>>>>>>>>>> >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 >> as >>>>>>>>>>>>>> described here >>>>>>>>>>>>>> >>>> >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >>>>>>>>>>>>>> Area: hotspot/gc >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead >> to >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the >> issue >>>>>>>>>>>>>> can be recognized in two ways: >>>>>>>>>>>>>> >>>>>>>>>>>>>> ? Before the crash happens one or more lines similar to >>>> this >>>>>>>>>>>>>> will have been printed to the log: >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, >>>> 0) >>>>>>>>>>>>>> failed; >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot >> allocate >>>>>>>>>>>>>> large pages, falling back to regular pages >>>>>>>>>>>>>> ? 
If a hs_err file is generated it will contain a line >>>>>>>>>>>>>> similar to this: >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times >>>>>>>>>>>>>> The problem can be avoided by running with large page >>>>>>>>>>>>>> support turned off, for example by passing the >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. >>>>>>>>>>>>>> >>>>>>>>>>>>>> See 8007074 (not public). >>>>>>>>>>>>>> >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the >> invocations >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to >> get >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but >> perhaps >>>>>>>>>>>>>> that was just a coincidence. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jonathan >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>>>> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Michael Wilde >>>>>>>>>>>> Mathematics and Computer Science Computation >>>> Institute >>>>>>>>>>>> Argonne National Laboratory The University of >>>> Chicago >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>> >>> >> >> > > From davidkelly at uchicago.edu Thu Jul 31 13:55:33 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Thu, 31 Jul 2014 13:55:33 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: > Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on > the sSerial queue rather than the shared queue for the other runs. Just > wanted to note that for comparison. Each of the java processes that are > launched with -Xmx1536m. I believe that Blues advertises each node having > access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at > first glance the memory issue doesn?t seem like it could be an issue. > > Jonathan > > On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: > > > Ok, so the workers die while the jobs are running and not much else is > > happening. > > My money is on the apps eating up all RAM and the kernel killing the > > worker. > > > > The question is how we check whether this is true or not. Ideas? > > > > Yadu, can you do me a favor and package all the PBS output files from > > this run? > > > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > > > Mihael > > > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > >> Sure thing, it?s attached below. > >> > >> Jonathan > >> > >> > >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan > >> wrote: > >> > >>> Hi Jonathan, > >>> > >>> I can't see anything obvious in the worker logs, but they are pretty > >>> large. Can you also post the swift log from this run? 
It would make > >> it > >>> easier to focus on the right time frame. > >>> > >>> Mihael > >>> > >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >>>> Hi all, > >>>> > >>>> I?m attaching the stdout and the worker logs below. > >>>> > >>>> Thanks for looking at these! > >>>> > >>>> Jonathan > >>>> > >>>> > >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > >>>> wrote: > >>>> > >>>>> Woops, sorry about that. It?s running now and the logs are being > >>>> generated. Once the run is done I?ll send you log files. > >>>>> > >>>>> Thanks! > >>>>> > >>>>> Jonathan > >>>>> > >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > >>>> wrote: > >>>>> > >>>>>> Right. This isn't your fault. We should, though, probably talk > >>>> about > >>>>>> addressing the issue. > >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>>>>>> Mihael, thanks for spotting that. I added the comments to > >>>> highlight the > >>>>>>> changes in email. > >>>>>>> > >>>>>>> -Yadu > >>>>>>> > >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> I suspect that the site config is considering the comment to be > >>>> part of > >>>>>>>> the value of the workerLogLevel property. We could confirm this > >>>> if you > >>>>>>>> send us the swift log from this particular run. > >>>>>>>> > >>>>>>>> To fix it, you could try to remove everything after DEBUG > >>>> (including all > >>>>>>>> horizontal white space). In other words: > >>>>>>>> > >>>>>>>> ... > >>>>>>>> workerloglevel=DEBUG > >>>>>>>> workerlogdirectory=/home/$USER/ > >>>>>>>> ... > >>>>>>>> > >>>>>>>> Mihael > >>>>>>>> > >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>>>>>> Hi Yadu, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > >>>> logging > >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m > >> doing > >>>>>>>>> something silly. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Jonathan, > >>>>>>>>>> > >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > >> not > >>>> see > >>>>>>>>>> anything unusual. > >>>>>>>>>> > >>>>>>>>>> From your logs, it looks like workers are failing, so getting > >>>> worker > >>>>>>>>>> logs would help. > >>>>>>>>>> Could you try running on Blues with the following > >>>> swift.properties > >>>>>>>>>> and get us the worker*logs that would show up in the > >>>>>>>>>> workerlogdirectory ? > >>>>>>>>>> > >>>>>>>>>> site=blues > >>>>>>>>>> > >>>>>>>>>> site.blues { > >>>>>>>>>> jobManager=pbs > >>>>>>>>>> jobQueue=shared > >>>>>>>>>> maxJobs=4 > >>>>>>>>>> jobGranularity=1 > >>>>>>>>>> maxNodesPerJob=1 > >>>>>>>>>> tasksPerWorker=16 > >>>>>>>>>> taskThrottle=64 > >>>>>>>>>> initialScore=10000 > >>>>>>>>>> jobWalltime=00:48:00 > >>>>>>>>>> taskWalltime=00:40:00 > >>>>>>>>>> workerloglevel=DEBUG # > >>>> Adding > >>>>>>>>>> debug for workers > >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>>>>>> directory on SFS > >>>>>>>>>> workdir=$RUNDIRECTORY > >>>>>>>>>> filesystem=local > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Yadu > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Mike, > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! 
> >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >>>> didn?t > >>>>>>>>>>> get the same issue. That is, the model run completed > >>>> successfully. > >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May > >> 29, > >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also > >>>>>>>>>>> including the swift.properties file that was used for the > >>>> blues > >>>>>>>>>>> runs below. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Thank you! > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Jonathan > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > >>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Jonathan, > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > >>>>>>>>>>>> > >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. > >>>>>>>>>>>> > >>>>>>>>>>>> (But the first question is: which Swift release, and can > >> you > >>>>>>>>>>>> point us to, or send, the full log file?) > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks and regards, > >>>>>>>>>>>> > >>>>>>>>>>>> - Mike > >>>>>>>>>>>> > >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Mike, > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > >> that > >>>>>>>>>>>>> the message came across. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>> > >>>>>>>>>>>>> Begin forwarded message: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> From: Jonathan Ozik > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> To: Mihael Hategan , > >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > >> on > >>>>>>>>>>>>>> Blues. The stdout includes exceptions like: > >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>>>>>> at > >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > >> to > >>>>>>>>>>>>>> look at. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed > >>>> by > >>>>>>>>>>>>>> broken pipe errors: > >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >>>>>>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > >> as > >>>>>>>>>>>>>> described here > >>>>>>>>>>>>>> > >>>> > >> ( > http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>>>>>> Area: hotspot/gc > >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead > >> to > >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > >> issue > >>>>>>>>>>>>>> can be recognized in two ways: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> ? Before the crash happens one or more lines similar to > >>>> this > >>>>>>>>>>>>>> will have been printed to the log: > >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; > >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > >> allocate > >>>>>>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > >>>>>>>>>>>>>> similar to this: > >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> See 8007074 (not public). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > >> invocations > >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to > >> get > >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but > >> perhaps > >>>>>>>>>>>>>> that was just a coincidence. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>>>>>> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Michael Wilde > >>>>>>>>>>>> Mathematics and Computer Science Computation > >>>> Institute > >>>>>>>>>>>> Argonne National Laboratory The University of > >>>> Chicago > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > >> > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jozik at uchicago.edu Thu Jul 31 14:28:53 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 31 Jul 2014 14:28:53 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: <2FE8C351-A0F2-4868-8488-BE66D753B5FD@uchicago.edu> The tasksPerWorker=8 on sSerial seems to have worked. That?s good because it worked but not so good because it didn?t add any definitive information? As for the large number of threads question, I believe that each Java application isn?t creating any additional threads or, if so, maybe one additional. Let me know if you?d like to see the successful log files (and which ones). Jonathan On Jul 31, 2014, at 1:55 PM, David Kelly wrote: > Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. > > > On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: > Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. > > Jonathan > > On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: > > > Ok, so the workers die while the jobs are running and not much else is > > happening. > > My money is on the apps eating up all RAM and the kernel killing the > > worker. > > > > The question is how we check whether this is true or not. Ideas? > > > > Yadu, can you do me a favor and package all the PBS output files from > > this run? > > > > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > > > > Mihael > > > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > >> Sure thing, it?s attached below. > >> > >> Jonathan > >> > >> > >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan > >> wrote: > >> > >>> Hi Jonathan, > >>> > >>> I can't see anything obvious in the worker logs, but they are pretty > >>> large. Can you also post the swift log from this run? It would make > >> it > >>> easier to focus on the right time frame. > >>> > >>> Mihael > >>> > >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >>>> Hi all, > >>>> > >>>> I?m attaching the stdout and the worker logs below. > >>>> > >>>> Thanks for looking at these! > >>>> > >>>> Jonathan > >>>> > >>>> > >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > >>>> wrote: > >>>> > >>>>> Woops, sorry about that. It?s running now and the logs are being > >>>> generated. Once the run is done I?ll send you log files. > >>>>> > >>>>> Thanks! > >>>>> > >>>>> Jonathan > >>>>> > >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > >>>> wrote: > >>>>> > >>>>>> Right. This isn't your fault. We should, though, probably talk > >>>> about > >>>>>> addressing the issue. 
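A quick way to probe David's per-node process-limit hypothesis from inside a job is sketched below. These are ordinary Linux commands rather than anything used in this thread, and the actual limits are site-specific, so treat the output as a hint only:

   ulimit -u                             # user process/thread limit in effect on the node
   ps -u $USER -L --no-headers | wc -l   # rough count of threads currently owned by this user

If the second number approaches the first while the Swift workers are busy, processes being killed by the scheduler or kernel becomes a plausible explanation.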
> >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>>>>>> Mihael, thanks for spotting that. I added the comments to > >>>> highlight the > >>>>>>> changes in email. > >>>>>>> > >>>>>>> -Yadu > >>>>>>> > >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> I suspect that the site config is considering the comment to be > >>>> part of > >>>>>>>> the value of the workerLogLevel property. We could confirm this > >>>> if you > >>>>>>>> send us the swift log from this particular run. > >>>>>>>> > >>>>>>>> To fix it, you could try to remove everything after DEBUG > >>>> (including all > >>>>>>>> horizontal white space). In other words: > >>>>>>>> > >>>>>>>> ... > >>>>>>>> workerloglevel=DEBUG > >>>>>>>> workerlogdirectory=/home/$USER/ > >>>>>>>> ... > >>>>>>>> > >>>>>>>> Mihael > >>>>>>>> > >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>>>>>> Hi Yadu, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > >>>> logging > >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m > >> doing > >>>>>>>>> something silly. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Jonathan, > >>>>>>>>>> > >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > >> not > >>>> see > >>>>>>>>>> anything unusual. > >>>>>>>>>> > >>>>>>>>>> From your logs, it looks like workers are failing, so getting > >>>> worker > >>>>>>>>>> logs would help. > >>>>>>>>>> Could you try running on Blues with the following > >>>> swift.properties > >>>>>>>>>> and get us the worker*logs that would show up in the > >>>>>>>>>> workerlogdirectory ? > >>>>>>>>>> > >>>>>>>>>> site=blues > >>>>>>>>>> > >>>>>>>>>> site.blues { > >>>>>>>>>> jobManager=pbs > >>>>>>>>>> jobQueue=shared > >>>>>>>>>> maxJobs=4 > >>>>>>>>>> jobGranularity=1 > >>>>>>>>>> maxNodesPerJob=1 > >>>>>>>>>> tasksPerWorker=16 > >>>>>>>>>> taskThrottle=64 > >>>>>>>>>> initialScore=10000 > >>>>>>>>>> jobWalltime=00:48:00 > >>>>>>>>>> taskWalltime=00:40:00 > >>>>>>>>>> workerloglevel=DEBUG # > >>>> Adding > >>>>>>>>>> debug for workers > >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>>>>>> directory on SFS > >>>>>>>>>> workdir=$RUNDIRECTORY > >>>>>>>>>> filesystem=local > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Yadu > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Mike, > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! > >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >>>> didn?t > >>>>>>>>>>> get the same issue. That is, the model run completed > >>>> successfully. > >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May > >> 29, > >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also > >>>>>>>>>>> including the swift.properties file that was used for the > >>>> blues > >>>>>>>>>>> runs below. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Thank you! > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Jonathan > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > >>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Jonathan, > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! 
> >>>>>>>>>>>> > >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. > >>>>>>>>>>>> > >>>>>>>>>>>> (But the first question is: which Swift release, and can > >> you > >>>>>>>>>>>> point us to, or send, the full log file?) > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks and regards, > >>>>>>>>>>>> > >>>>>>>>>>>> - Mike > >>>>>>>>>>>> > >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Mike, > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > >> that > >>>>>>>>>>>>> the message came across. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>> > >>>>>>>>>>>>> Begin forwarded message: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> From: Jonathan Ozik > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> To: Mihael Hategan , > >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > >> on > >>>>>>>>>>>>>> Blues. The stdout includes exceptions like: > >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>>>>>> at > >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > >> to > >>>>>>>>>>>>>> look at. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed > >>>> by > >>>>>>>>>>>>>> broken pipe errors: > >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >>>>>>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > >> as > >>>>>>>>>>>>>> described here > >>>>>>>>>>>>>> > >>>> > >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>>>>>> Area: hotspot/gc > >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead > >> to > >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > >> issue > >>>>>>>>>>>>>> can be recognized in two ways: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> ? 
Before the crash happens one or more lines similar to > >>>> this > >>>>>>>>>>>>>> will have been printed to the log: > >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; > >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > >> allocate > >>>>>>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > >>>>>>>>>>>>>> similar to this: > >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> See 8007074 (not public). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > >> invocations > >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to > >> get > >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but > >> perhaps > >>>>>>>>>>>>>> that was just a coincidence. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>>>>>> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Michael Wilde > >>>>>>>>>>>> Mathematics and Computer Science Computation > >>>> Institute > >>>>>>>>>>>> Argonne National Laboratory The University of > >>>> Chicago > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > >> > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at anl.gov Thu Jul 31 15:18:53 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 31 Jul 2014 15:18:53 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> Message-ID: <53DAA4AD.3010602@anl.gov> Its odd that no errors like OOM or OOT would be logged to stdout of the PBS job. Jonathan, can you check with the Blues Sysadmins if they have any other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which they ran? Thanks, - Mike On 7/31/14, 1:55 PM, David Kelly wrote: > Is each invocation of the Java app creating a large number of threads? > I ran into an issue like that (on another cluster) where I was hitting > the maximum number of processes per node, and the scheduler ended up > killing my jobs. > > > On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik > wrote: > > Okay, I've launched a new job, with tasksPerWorker=8. This is > running on the sSerial queue rather than the shared queue for the > other runs. Just wanted to note that for comparison. 
Each of the > java processes that are launched with -Xmx1536m. I believe that > Blues advertises each node having access to 64GB > (http://www.lcrc.anl.gov/about/Blues), so at least at first glance > the memory issue doesn't seem like it could be an issue. > > Jonathan > > On Jul 31, 2014, at 1:18 PM, Mihael Hategan > wrote: > > > Ok, so the workers die while the jobs are running and not much > else is > > happening. > > My money is on the apps eating up all RAM and the kernel killing the > > worker. > > > > The question is how we check whether this is true or not. Ideas? > > > > Yadu, can you do me a favor and package all the PBS output files > from > > this run? > > > > Jonathan, can you see if you get the same errors with > tasksPerWorker=8? > > > > Mihael > > > > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > >> Sure thing, it's attached below. > >> > >> Jonathan > >> > >> > >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan > > > >> wrote: > >> > >>> Hi Jonathan, > >>> > >>> I can't see anything obvious in the worker logs, but they are > pretty > >>> large. Can you also post the swift log from this run? It would > make > >> it > >>> easier to focus on the right time frame. > >>> > >>> Mihael > >>> > >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >>>> Hi all, > >>>> > >>>> I'm attaching the stdout and the worker logs below. > >>>> > >>>> Thanks for looking at these! > >>>> > >>>> Jonathan > >>>> > >>>> > >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > > > >>>> wrote: > >>>> > >>>>> Woops, sorry about that. It's running now and the logs are being > >>>> generated. Once the run is done I'll send you log files. > >>>>> > >>>>> Thanks! > >>>>> > >>>>> Jonathan > >>>>> > >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > > > >>>> wrote: > >>>>> > >>>>>> Right. This isn't your fault. We should, though, probably talk > >>>> about > >>>>>> addressing the issue. > >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >>>>>>> Mihael, thanks for spotting that. I added the comments to > >>>> highlight the > >>>>>>> changes in email. > >>>>>>> > >>>>>>> -Yadu > >>>>>>> > >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >>>>>>>> Hi Jonathan, > >>>>>>>> > >>>>>>>> I suspect that the site config is considering the comment > to be > >>>> part of > >>>>>>>> the value of the workerLogLevel property. We could > confirm this > >>>> if you > >>>>>>>> send us the swift log from this particular run. > >>>>>>>> > >>>>>>>> To fix it, you could try to remove everything after DEBUG > >>>> (including all > >>>>>>>> horizontal white space). In other words: > >>>>>>>> > >>>>>>>> ... > >>>>>>>> workerloglevel=DEBUG > >>>>>>>> workerlogdirectory=/home/$USER/ > >>>>>>>> ... > >>>>>>>> > >>>>>>>> Mihael > >>>>>>>> > >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >>>>>>>>> Hi Yadu, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I'm getting errors indicating that DEBUG is an invalid > worker > >>>> logging > >>>>>>>>> level. I'm attaching the stdout below. Let me know if I'm > >> doing > >>>>>>>>> something silly. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Jonathan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >>>> > > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Jonathan, > >>>>>>>>>> > >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > >> not > >>>> see > >>>>>>>>>> anything unusual. 
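Spelling that estimate out (the heap setting and node size are the figures quoted in this thread; the overhead remark is a general JVM observation, not a measurement from Blues):

   16 tasks per worker x 1536 MB max heap = 24576 MB, i.e. about 24 GB of heap ceilings alone

A JVM's resident size is typically larger than its -Xmx (thread stacks, permgen/metaspace, JIT and native buffers all sit outside the heap), and on the shared queue the 64 GB is also shared with whatever else lands on the node, so the real headroom is smaller than the raw numbers suggest.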
> >>>>>>>>>> > >>>>>>>>>> From your logs, it looks like workers are failing, so > getting > >>>> worker > >>>>>>>>>> logs would help. > >>>>>>>>>> Could you try running on Blues with the following > >>>> swift.properties > >>>>>>>>>> and get us the worker*logs that would show up in the > >>>>>>>>>> workerlogdirectory ? > >>>>>>>>>> > >>>>>>>>>> site=blues > >>>>>>>>>> > >>>>>>>>>> site.blues { > >>>>>>>>>> jobManager=pbs > >>>>>>>>>> jobQueue=shared > >>>>>>>>>> maxJobs=4 > >>>>>>>>>> jobGranularity=1 > >>>>>>>>>> maxNodesPerJob=1 > >>>>>>>>>> tasksPerWorker=16 > >>>>>>>>>> taskThrottle=64 > >>>>>>>>>> initialScore=10000 > >>>>>>>>>> jobWalltime=00:48:00 > >>>>>>>>>> taskWalltime=00:40:00 > >>>>>>>>>> workerloglevel=DEBUG # > >>>> Adding > >>>>>>>>>> debug for workers > >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >>>>>>>>>> directory on SFS > >>>>>>>>>> workdir=$RUNDIRECTORY > >>>>>>>>>> filesystem=local > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Yadu > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Mike, > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! > >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >>>> didn't > >>>>>>>>>>> get the same issue. That is, the model run completed > >>>> successfully. > >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May > >> 29, > >>>>>>>>>>> 2014). I'm including one of the log files below. I'm also > >>>>>>>>>>> including the swift.properties file that was used for the > >>>> blues > >>>>>>>>>>> runs below. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Thank you! > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Jonathan > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > > > >>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Jonathan, > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the > ping! > >>>>>>>>>>>> > >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. > >>>>>>>>>>>> > >>>>>>>>>>>> (But the first question is: which Swift release, and can > >> you > >>>>>>>>>>>> point us to, or send, the full log file?) > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks and regards, > >>>>>>>>>>>> > >>>>>>>>>>>> - Mike > >>>>>>>>>>>> > >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Mike, > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> I didn't get a response yet so just wanted to make sure > >> that > >>>>>>>>>>>>> the message came across. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>> > >>>>>>>>>>>>> Begin forwarded message: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> From: Jonathan Ozik > > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, > line: 511, > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> To: Mihael Hategan >, > >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu > " > > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'm getting spurious errors in the jobs that I'm > running > >> on > >>>>>>>>>>>>>> Blues. 
The stdout includes exceptions like: > >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 > >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >>>>>>>>>>>>>> java.io.IOException: Broken pipe > >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >>>>>>>>>>>>>> at > >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> > org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>> > >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> These seem to occur at different parts of the submitted > >>>>>>>>>>>>>> jobs. Let me know if there's a log file that you'd like > >> to > >>>>>>>>>>>>>> look at. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In earlier attempts I was getting these warnings > followed > >>>> by > >>>>>>>>>>>>>> broken pipe errors: > >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, > 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); > Cannot > >>>>>>>>>>>>>> allocate large pages, falling back to regular pages > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Apparently that's a known precursor of crashes on > Java 7 > >> as > >>>>>>>>>>>>>> described here > >>>>>>>>>>>>>> > >>>> > >> > (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >>>>>>>>>>>>>> Area: hotspot/gc > >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large > pages. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead > >> to > >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > >> issue > >>>>>>>>>>>>>> can be recognized in two ways: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> . Before the crash happens one or more lines similar to > >>>> this > >>>>>>>>>>>>>> will have been printed to the log: > >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, > 2097152, > >>>> 0) > >>>>>>>>>>>>>> failed; > >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > >> allocate > >>>>>>>>>>>>>> large pages, falling back to regular pages > >>>>>>>>>>>>>> . If a hs_err file is generated it will contain a line > >>>>>>>>>>>>>> similar to this: > >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times > >>>>>>>>>>>>>> The problem can be avoided by running with large page > >>>>>>>>>>>>>> support turned off, for example by passing the > >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> See 8007074 (not public). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > >> invocations > >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to > >> get > >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but > >> perhaps > >>>>>>>>>>>>>> that was just a coincidence. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jonathan > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu > > >>>>>>>>>>>>>> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Michael Wilde > >>>>>>>>>>>> Mathematics and Computer Science Computation > >>>> Institute > >>>>>>>>>>>> Argonne National Laboratory The > University of > >>>> Chicago > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > >> > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From jozik at uchicago.edu Thu Jul 31 16:42:27 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 31 Jul 2014 16:42:27 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DAA4AD.3010602@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> <53DAA4AD.3010602@anl.gov> Message-ID: Mike, all, I?ve done a few more experiments and realized that while I had added the -XX:-UseLargePages command line argument to the main Java invocation, each app also included a post process Java app which I hadn?t added the UseLargePages argument to. Perhaps because this was running at the end of an app, the warnings that I?d seen previously prior to the crashes weren?t making it to stdout before the crashes. Purely speculation. In any case, I did put the tasksPerWorker back to 16 and was able to successfully run all the jobs. I guess one bit of consolation in all this is that if there are any future ?mystery? worker failings, this will be one issue to check. Thank you all for helping with tracking this down. Jonathan On Jul 31, 2014, at 3:18 PM, Michael Wilde wrote: > Its odd that no errors like OOM or OOT would be logged to stdout of the PBS job. > > Jonathan, can you check with the Blues Sysadmins if they have any other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which they ran? > > Thanks, > > - Mike > > On 7/31/14, 1:55 PM, David Kelly wrote: >> Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. >> >> >> On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: >> Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. 
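To make the fix described above concrete: if each app is driven by a small wrapper script, the flag has to be threaded through to every JVM the wrapper starts, not just the first one. The script below is purely illustrative — the jar names, arguments and paths are invented for the sketch; only the JVM options come from this thread:

   #!/bin/bash
   # main model run, with large pages disabled
   java -Xmx1536m -XX:-UseLargePages -jar model.jar "$@"
   # the post-processing step is a separate JVM and needs the flag too
   java -XX:-UseLargePages -jar postprocess.jar output/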
Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. >> >> Jonathan >> >> On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: >> >> > Ok, so the workers die while the jobs are running and not much else is >> > happening. >> > My money is on the apps eating up all RAM and the kernel killing the >> > worker. >> > >> > The question is how we check whether this is true or not. Ideas? >> > >> > Yadu, can you do me a favor and package all the PBS output files from >> > this run? >> > >> > Jonathan, can you see if you get the same errors with tasksPerWorker=8? >> > >> > Mihael >> > >> > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: >> >> Sure thing, it?s attached below. >> >> >> >> Jonathan >> >> >> >> >> >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan >> >> wrote: >> >> >> >>> Hi Jonathan, >> >>> >> >>> I can't see anything obvious in the worker logs, but they are pretty >> >>> large. Can you also post the swift log from this run? It would make >> >> it >> >>> easier to focus on the right time frame. >> >>> >> >>> Mihael >> >>> >> >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: >> >>>> Hi all, >> >>>> >> >>>> I?m attaching the stdout and the worker logs below. >> >>>> >> >>>> Thanks for looking at these! >> >>>> >> >>>> Jonathan >> >>>> >> >>>> >> >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik >> >>>> wrote: >> >>>> >> >>>>> Woops, sorry about that. It?s running now and the logs are being >> >>>> generated. Once the run is done I?ll send you log files. >> >>>>> >> >>>>> Thanks! >> >>>>> >> >>>>> Jonathan >> >>>>> >> >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan >> >>>> wrote: >> >>>>> >> >>>>>> Right. This isn't your fault. We should, though, probably talk >> >>>> about >> >>>>>> addressing the issue. >> >>>>>> >> >>>>>> Mihael >> >>>>>> >> >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: >> >>>>>>> Mihael, thanks for spotting that. I added the comments to >> >>>> highlight the >> >>>>>>> changes in email. >> >>>>>>> >> >>>>>>> -Yadu >> >>>>>>> >> >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: >> >>>>>>>> Hi Jonathan, >> >>>>>>>> >> >>>>>>>> I suspect that the site config is considering the comment to be >> >>>> part of >> >>>>>>>> the value of the workerLogLevel property. We could confirm this >> >>>> if you >> >>>>>>>> send us the swift log from this particular run. >> >>>>>>>> >> >>>>>>>> To fix it, you could try to remove everything after DEBUG >> >>>> (including all >> >>>>>>>> horizontal white space). In other words: >> >>>>>>>> >> >>>>>>>> ... >> >>>>>>>> workerloglevel=DEBUG >> >>>>>>>> workerlogdirectory=/home/$USER/ >> >>>>>>>> ... >> >>>>>>>> >> >>>>>>>> Mihael >> >>>>>>>> >> >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: >> >>>>>>>>> Hi Yadu, >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker >> >>>> logging >> >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m >> >> doing >> >>>>>>>>> something silly. >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Jonathan >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji >> >>>> >> >>>>>>>>> wrote: >> >>>>>>>>> >> >>>>>>>>>> Hi Jonathan, >> >>>>>>>>>> >> >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do >> >> not >> >>>> see >> >>>>>>>>>> anything unusual. 
>> >>>>>>>>>> >> >>>>>>>>>> From your logs, it looks like workers are failing, so getting >> >>>> worker >> >>>>>>>>>> logs would help. >> >>>>>>>>>> Could you try running on Blues with the following >> >>>> swift.properties >> >>>>>>>>>> and get us the worker*logs that would show up in the >> >>>>>>>>>> workerlogdirectory ? >> >>>>>>>>>> >> >>>>>>>>>> site=blues >> >>>>>>>>>> >> >>>>>>>>>> site.blues { >> >>>>>>>>>> jobManager=pbs >> >>>>>>>>>> jobQueue=shared >> >>>>>>>>>> maxJobs=4 >> >>>>>>>>>> jobGranularity=1 >> >>>>>>>>>> maxNodesPerJob=1 >> >>>>>>>>>> tasksPerWorker=16 >> >>>>>>>>>> taskThrottle=64 >> >>>>>>>>>> initialScore=10000 >> >>>>>>>>>> jobWalltime=00:48:00 >> >>>>>>>>>> taskWalltime=00:40:00 >> >>>>>>>>>> workerloglevel=DEBUG # >> >>>> Adding >> >>>>>>>>>> debug for workers >> >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging >> >>>>>>>>>> directory on SFS >> >>>>>>>>>> workdir=$RUNDIRECTORY >> >>>>>>>>>> filesystem=local >> >>>>>>>>>> } >> >>>>>>>>>> >> >>>>>>>>>> Thanks, >> >>>>>>>>>> Yadu >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: >> >>>>>>>>>> >> >>>>>>>>>>> Hi Mike, >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! >> >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I >> >>>> didn?t >> >>>>>>>>>>> get the same issue. That is, the model run completed >> >>>> successfully. >> >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May >> >> 29, >> >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also >> >>>>>>>>>>> including the swift.properties file that was used for the >> >>>> blues >> >>>>>>>>>>> runs below. >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Thank you! >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Jonathan >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde >> >>>> wrote: >> >>>>>>>>>>> >> >>>>>>>>>>>> Hi Jonathan, >> >>>>>>>>>>>> >> >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! >> >>>>>>>>>>>> >> >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. >> >>>>>>>>>>>> >> >>>>>>>>>>>> (But the first question is: which Swift release, and can >> >> you >> >>>>>>>>>>>> point us to, or send, the full log file?) >> >>>>>>>>>>>> >> >>>>>>>>>>>> Thanks and regards, >> >>>>>>>>>>>> >> >>>>>>>>>>>> - Mike >> >>>>>>>>>>>> >> >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: >> >>>>>>>>>>>> >> >>>>>>>>>>>>> Hi Mike, >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure >> >> that >> >>>>>>>>>>>>> the message came across. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Jonathan >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Begin forwarded message: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>>> From: Jonathan Ozik >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, >> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> To: Mihael Hategan , >> >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Hi all, >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running >> >> on >> >>>>>>>>>>>>>> Blues. 
The stdout includes exceptions like: >> >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 >> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >> >>>>>>>>>>>>>> java.io.IOException: Broken pipe >> >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >> >>>>>>>>>>>>>> at >> >>>>>>>>>>>>>> >> >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >> >>>>>>>>>>>>>> at >> >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >> >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >> >>>>>>>>>>>>>> at >> >>>>>>>>>>>>>> >> >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >> >>>>>>>>>>>>>> at >> >>>>>>>>>>>>>> >> >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >> >>>>>>>>>>>>>> at >> >>>>>>>>>>>>>> >> >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> These seem to occur at different parts of the submitted >> >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like >> >> to >> >>>>>>>>>>>>>> look at. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed >> >>>> by >> >>>>>>>>>>>>>> broken pipe errors: >> >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: >> >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, >> >>>> 0) >> >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot >> >>>>>>>>>>>>>> allocate large pages, falling back to regular pages >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 >> >> as >> >>>>>>>>>>>>>> described here >> >>>>>>>>>>>>>> >> >>>> >> >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >> >>>>>>>>>>>>>> Area: hotspot/gc >> >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead >> >> to >> >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the >> >> issue >> >>>>>>>>>>>>>> can be recognized in two ways: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> ? Before the crash happens one or more lines similar to >> >>>> this >> >>>>>>>>>>>>>> will have been printed to the log: >> >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, >> >>>> 0) >> >>>>>>>>>>>>>> failed; >> >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot >> >> allocate >> >>>>>>>>>>>>>> large pages, falling back to regular pages >> >>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line >> >>>>>>>>>>>>>> similar to this: >> >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times >> >>>>>>>>>>>>>> The problem can be avoided by running with large page >> >>>>>>>>>>>>>> support turned off, for example by passing the >> >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> See 8007074 (not public). >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the >> >> invocations >> >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to >> >> get >> >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but >> >> perhaps >> >>>>>>>>>>>>>> that was just a coincidence. 
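A minimal sketch of that workaround as it is eventually applied later in the thread: the flag has to go on every java command line the app issues, including any post-processing step, not just the main invocation. The jar names and the post-process heap size below are hypothetical placeholders, not taken from the actual model:

    # inside the app's wrapper script: pass -XX:-UseLargePages to *every* JVM it starts
    java -XX:-UseLargePages -Xmx1536m -jar model.jar "$@"          # main run (hypothetical jar name)
    java -XX:-UseLargePages -Xmx512m  -jar postprocess.jar output/ # post-processing step (hypothetical)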
>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Jonathan >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> _______________________________________________ >> >>>>>>>>>>>>>> Swift-user mailing list >> >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu >> >>>>>>>>>>>>>> >> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>> -- >> >>>>>>>>>>>> Michael Wilde >> >>>>>>>>>>>> Mathematics and Computer Science Computation >> >>>> Institute >> >>>>>>>>>>>> Argonne National Laboratory The University of >> >>>> Chicago >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>>> >> >>> >> >>> >> >> >> >> >> > >> > >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >> >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Jul 31 17:11:44 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 31 Jul 2014 15:11:44 -0700 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> <53DAA4AD.3010602@anl.gov> Message-ID: <1406844704.25609.4.camel@echo> The worker is still being killed though, and there is little feedback to the user. If possible, we should fix that. Do you have an idea what is happening? Is it a process similar to an OOM where the kernel just decides to kill some process? Mihael On Thu, 2014-07-31 at 16:42 -0500, Jonathan Ozik wrote: > Mike, all, > > I?ve done a few more experiments and realized that while I had added the -XX:-UseLargePages command line argument to the main Java invocation, each app also included a post process Java app which I hadn?t added the UseLargePages argument to. Perhaps because this was running at the end of an app, the warnings that I?d seen previously prior to the crashes weren?t making it to stdout before the crashes. Purely speculation. > In any case, I did put the tasksPerWorker back to 16 and was able to successfully run all the jobs. I guess one bit of consolation in all this is that if there are any future ?mystery? worker failings, this will be one issue to check. > > Thank you all for helping with tracking this down. > > Jonathan > > On Jul 31, 2014, at 3:18 PM, Michael Wilde wrote: > > > Its odd that no errors like OOM or OOT would be logged to stdout of the PBS job. > > > > Jonathan, can you check with the Blues Sysadmins if they have any other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which they ran? 
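One way to check the OOM and process-limit hypotheses raised above, either run by the sysadmins or from an interactive job on one of the affected nodes. Log locations vary by distribution, so the paths below are typical guesses rather than anything confirmed for Blues:

    # evidence of the kernel OOM killer having fired
    dmesg | grep -iE 'out of memory|oom-killer|killed process'
    grep -i oom /var/log/messages 2>/dev/null | tail
    # per-node process/thread limits, in case workers are being killed for
    # exceeding them rather than for memory
    ulimit -u                      # max user processes allowed
    ps -L -u "$USER" | wc -l       # rough count of current processes/threads (includes header line)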
> > > > Thanks, > > > > - Mike > > > > On 7/31/14, 1:55 PM, David Kelly wrote: > >> Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. > >> > >> > >> On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: > >> Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. > >> > >> Jonathan > >> > >> On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: > >> > >> > Ok, so the workers die while the jobs are running and not much else is > >> > happening. > >> > My money is on the apps eating up all RAM and the kernel killing the > >> > worker. > >> > > >> > The question is how we check whether this is true or not. Ideas? > >> > > >> > Yadu, can you do me a favor and package all the PBS output files from > >> > this run? > >> > > >> > Jonathan, can you see if you get the same errors with tasksPerWorker=8? > >> > > >> > Mihael > >> > > >> > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: > >> >> Sure thing, it?s attached below. > >> >> > >> >> Jonathan > >> >> > >> >> > >> >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan > >> >> wrote: > >> >> > >> >>> Hi Jonathan, > >> >>> > >> >>> I can't see anything obvious in the worker logs, but they are pretty > >> >>> large. Can you also post the swift log from this run? It would make > >> >> it > >> >>> easier to focus on the right time frame. > >> >>> > >> >>> Mihael > >> >>> > >> >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: > >> >>>> Hi all, > >> >>>> > >> >>>> I?m attaching the stdout and the worker logs below. > >> >>>> > >> >>>> Thanks for looking at these! > >> >>>> > >> >>>> Jonathan > >> >>>> > >> >>>> > >> >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik > >> >>>> wrote: > >> >>>> > >> >>>>> Woops, sorry about that. It?s running now and the logs are being > >> >>>> generated. Once the run is done I?ll send you log files. > >> >>>>> > >> >>>>> Thanks! > >> >>>>> > >> >>>>> Jonathan > >> >>>>> > >> >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan > >> >>>> wrote: > >> >>>>> > >> >>>>>> Right. This isn't your fault. We should, though, probably talk > >> >>>> about > >> >>>>>> addressing the issue. > >> >>>>>> > >> >>>>>> Mihael > >> >>>>>> > >> >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: > >> >>>>>>> Mihael, thanks for spotting that. I added the comments to > >> >>>> highlight the > >> >>>>>>> changes in email. > >> >>>>>>> > >> >>>>>>> -Yadu > >> >>>>>>> > >> >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: > >> >>>>>>>> Hi Jonathan, > >> >>>>>>>> > >> >>>>>>>> I suspect that the site config is considering the comment to be > >> >>>> part of > >> >>>>>>>> the value of the workerLogLevel property. We could confirm this > >> >>>> if you > >> >>>>>>>> send us the swift log from this particular run. > >> >>>>>>>> > >> >>>>>>>> To fix it, you could try to remove everything after DEBUG > >> >>>> (including all > >> >>>>>>>> horizontal white space). In other words: > >> >>>>>>>> > >> >>>>>>>> ... 
> >> >>>>>>>> workerloglevel=DEBUG > >> >>>>>>>> workerlogdirectory=/home/$USER/ > >> >>>>>>>> ... > >> >>>>>>>> > >> >>>>>>>> Mihael > >> >>>>>>>> > >> >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: > >> >>>>>>>>> Hi Yadu, > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker > >> >>>> logging > >> >>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m > >> >> doing > >> >>>>>>>>> something silly. > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> Jonathan > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji > >> >>>> > >> >>>>>>>>> wrote: > >> >>>>>>>>> > >> >>>>>>>>>> Hi Jonathan, > >> >>>>>>>>>> > >> >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do > >> >> not > >> >>>> see > >> >>>>>>>>>> anything unusual. > >> >>>>>>>>>> > >> >>>>>>>>>> From your logs, it looks like workers are failing, so getting > >> >>>> worker > >> >>>>>>>>>> logs would help. > >> >>>>>>>>>> Could you try running on Blues with the following > >> >>>> swift.properties > >> >>>>>>>>>> and get us the worker*logs that would show up in the > >> >>>>>>>>>> workerlogdirectory ? > >> >>>>>>>>>> > >> >>>>>>>>>> site=blues > >> >>>>>>>>>> > >> >>>>>>>>>> site.blues { > >> >>>>>>>>>> jobManager=pbs > >> >>>>>>>>>> jobQueue=shared > >> >>>>>>>>>> maxJobs=4 > >> >>>>>>>>>> jobGranularity=1 > >> >>>>>>>>>> maxNodesPerJob=1 > >> >>>>>>>>>> tasksPerWorker=16 > >> >>>>>>>>>> taskThrottle=64 > >> >>>>>>>>>> initialScore=10000 > >> >>>>>>>>>> jobWalltime=00:48:00 > >> >>>>>>>>>> taskWalltime=00:40:00 > >> >>>>>>>>>> workerloglevel=DEBUG # > >> >>>> Adding > >> >>>>>>>>>> debug for workers > >> >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging > >> >>>>>>>>>> directory on SFS > >> >>>>>>>>>> workdir=$RUNDIRECTORY > >> >>>>>>>>>> filesystem=local > >> >>>>>>>>>> } > >> >>>>>>>>>> > >> >>>>>>>>>> Thanks, > >> >>>>>>>>>> Yadu > >> >>>>>>>>>> > >> >>>>>>>>>> > >> >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: > >> >>>>>>>>>> > >> >>>>>>>>>>> Hi Mike, > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> Sorry, I figured there was some busy-ness involved! > >> >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I > >> >>>> didn?t > >> >>>>>>>>>>> get the same issue. That is, the model run completed > >> >>>> successfully. > >> >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May > >> >> 29, > >> >>>>>>>>>>> 2014). I?m including one of the log files below. I?m also > >> >>>>>>>>>>> including the swift.properties file that was used for the > >> >>>> blues > >> >>>>>>>>>>> runs below. > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> Thank you! > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> Jonathan > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde > >> >>>> wrote: > >> >>>>>>>>>>> > >> >>>>>>>>>>>> Hi Jonathan, > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> I or one of the team will answer soon, on swift-user. > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> (But the first question is: which Swift release, and can > >> >> you > >> >>>>>>>>>>>> point us to, or send, the full log file?) 
> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> Thanks and regards, > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> - Mike > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: > >> >>>>>>>>>>>> > >> >>>>>>>>>>>>> Hi Mike, > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure > >> >> that > >> >>>>>>>>>>>>> the message came across. > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> Jonathan > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> Begin forwarded message: > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>>> From: Jonathan Ozik > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, > >> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> To: Mihael Hategan , > >> >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Hi all, > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running > >> >> on > >> >>>>>>>>>>>>>> Blues. The stdout includes exceptions like: > >> >>>>>>>>>>>>>> exception @ swift-int.k, line: 511 > >> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost > >> >>>>>>>>>>>>>> java.io.IOException: Broken pipe > >> >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > >> >>>>>>>>>>>>>> at > >> >>>>>>>>>>>>>> > >> >> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > >> >>>>>>>>>>>>>> at > >> >> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > >> >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) > >> >>>>>>>>>>>>>> at > >> >>>>>>>>>>>>>> > >> >>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > >> >>>>>>>>>>>>>> at > >> >>>>>>>>>>>>>> > >> >>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) > >> >>>>>>>>>>>>>> at > >> >>>>>>>>>>>>>> > >> >>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> These seem to occur at different parts of the submitted > >> >>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like > >> >> to > >> >>>>>>>>>>>>>> look at. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed > >> >>>> by > >> >>>>>>>>>>>>>> broken pipe errors: > >> >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: > >> >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, > >> >>>> 0) > >> >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot > >> >>>>>>>>>>>>>> allocate large pages, falling back to regular pages > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 > >> >> as > >> >>>>>>>>>>>>>> described here > >> >>>>>>>>>>>>>> > >> >>>> > >> >> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): > >> >>>>>>>>>>>>>> Area: hotspot/gc > >> >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead > >> >> to > >> >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the > >> >> issue > >> >>>>>>>>>>>>>> can be recognized in two ways: > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> ? 
Before the crash happens one or more lines similar to > >> >>>> this > >> >>>>>>>>>>>>>> will have been printed to the log: > >> >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, > >> >>>> 0) > >> >>>>>>>>>>>>>> failed; > >> >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot > >> >> allocate > >> >>>>>>>>>>>>>> large pages, falling back to regular pages > >> >>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line > >> >>>>>>>>>>>>>> similar to this: > >> >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times > >> >>>>>>>>>>>>>> The problem can be avoided by running with large page > >> >>>>>>>>>>>>>> support turned off, for example by passing the > >> >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> See 8007074 (not public). > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the > >> >> invocations > >> >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to > >> >> get > >> >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but > >> >> perhaps > >> >>>>>>>>>>>>>> that was just a coincidence. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> Jonathan > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> _______________________________________________ > >> >>>>>>>>>>>>>> Swift-user mailing list > >> >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu > >> >>>>>>>>>>>>>> > >> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>> -- > >> >>>>>>>>>>>> Michael Wilde > >> >>>>>>>>>>>> Mathematics and Computer Science Computation > >> >>>> Institute > >> >>>>>>>>>>>> Argonne National Laboratory The University of > >> >>>> Chicago > >> >>>>>>>>>>> > >> >>>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>> > >> >>>>>>> > >> >>>>>> > >> >>>>>> > >> >>>>> > >> >>>> > >> >>>> > >> >>> > >> >>> > >> >> > >> >> > >> > > >> > > >> > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >> > >> > >> > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > -- > > Michael Wilde > > Mathematics and Computer Science Computation Institute > > Argonne National Laboratory The University of Chicago > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From wilde at anl.gov Thu Jul 31 17:14:42 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 31 Jul 2014 17:14:42 -0500 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <1406844704.25609.4.camel@echo> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53D96721.6030004@anl.gov> <53D99769.5050107@uchicago.edu> <1406783097.19120.2.camel@echo> <53D9CFCB.8090803@uchicago.edu> <1406783556.19312.1.camel@echo> <28AB72AB-2650-40E2-B084-9D1E20D03AEB@uchicago.edu> <1406826555.22289.4.camel@echo> <22271280-84FF-4C85-8A1E-5002ED165DAE@uchicago.edu> <1406830708.23317.3.camel@echo> <53DAA4AD.3010602@anl.gov> 
<1406844704.25609.4.camel@echo> Message-ID: <53DABFD2.4030602@anl.gov> Im also wondering - did these jobs run more child processes that they should have, under a shared-queue policy? I thought the shared queue on blues (and fusion) allowed multiple single-processor jobs to run per node, one processor per job? Or, are we stumbling back into an old bug which ran N^2 tasks per node instead of N tasks per node? - MIke On 7/31/14, 5:11 PM, Mihael Hategan wrote: > The worker is still being killed though, and there is little feedback to > the user. If possible, we should fix that. > > Do you have an idea what is happening? Is it a process similar to an OOM > where the kernel just decides to kill some process? > > Mihael > > On Thu, 2014-07-31 at 16:42 -0500, Jonathan Ozik wrote: >> Mike, all, >> >> I?ve done a few more experiments and realized that while I had added the -XX:-UseLargePages command line argument to the main Java invocation, each app also included a post process Java app which I hadn?t added the UseLargePages argument to. Perhaps because this was running at the end of an app, the warnings that I?d seen previously prior to the crashes weren?t making it to stdout before the crashes. Purely speculation. >> In any case, I did put the tasksPerWorker back to 16 and was able to successfully run all the jobs. I guess one bit of consolation in all this is that if there are any future ?mystery? worker failings, this will be one issue to check. >> >> Thank you all for helping with tracking this down. >> >> Jonathan >> >> On Jul 31, 2014, at 3:18 PM, Michael Wilde wrote: >> >>> Its odd that no errors like OOM or OOT would be logged to stdout of the PBS job. >>> >>> Jonathan, can you check with the Blues Sysadmins if they have any other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which they ran? >>> >>> Thanks, >>> >>> - Mike >>> >>> On 7/31/14, 1:55 PM, David Kelly wrote: >>>> Is each invocation of the Java app creating a large number of threads? I ran into an issue like that (on another cluster) where I was hitting the maximum number of processes per node, and the scheduler ended up killing my jobs. >>>> >>>> >>>> On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik wrote: >>>> Okay, I?ve launched a new job, with tasksPerWorker=8. This is running on the sSerial queue rather than the shared queue for the other runs. Just wanted to note that for comparison. Each of the java processes that are launched with -Xmx1536m. I believe that Blues advertises each node having access to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance the memory issue doesn?t seem like it could be an issue. >>>> >>>> Jonathan >>>> >>>> On Jul 31, 2014, at 1:18 PM, Mihael Hategan wrote: >>>> >>>>> Ok, so the workers die while the jobs are running and not much else is >>>>> happening. >>>>> My money is on the apps eating up all RAM and the kernel killing the >>>>> worker. >>>>> >>>>> The question is how we check whether this is true or not. Ideas? >>>>> >>>>> Yadu, can you do me a favor and package all the PBS output files from >>>>> this run? >>>>> >>>>> Jonathan, can you see if you get the same errors with tasksPerWorker=8? >>>>> >>>>> Mihael >>>>> >>>>> On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote: >>>>>> Sure thing, it?s attached below. >>>>>> >>>>>> Jonathan >>>>>> >>>>>> >>>>>> On Jul 31, 2014, at 12:09 PM, Mihael Hategan >>>>>> wrote: >>>>>> >>>>>>> Hi Jonathan, >>>>>>> >>>>>>> I can't see anything obvious in the worker logs, but they are pretty >>>>>>> large. 
Can you also post the swift log from this run? It would make >>>>>> it >>>>>>> easier to focus on the right time frame. >>>>>>> >>>>>>> Mihael >>>>>>> >>>>>>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote: >>>>>>>> Hi all, >>>>>>>> >>>>>>>> I?m attaching the stdout and the worker logs below. >>>>>>>> >>>>>>>> Thanks for looking at these! >>>>>>>> >>>>>>>> Jonathan >>>>>>>> >>>>>>>> >>>>>>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Woops, sorry about that. It?s running now and the logs are being >>>>>>>> generated. Once the run is done I?ll send you log files. >>>>>>>>> Thanks! >>>>>>>>> >>>>>>>>> Jonathan >>>>>>>>> >>>>>>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan >>>>>>>> wrote: >>>>>>>>>> Right. This isn't your fault. We should, though, probably talk >>>>>>>> about >>>>>>>>>> addressing the issue. >>>>>>>>>> >>>>>>>>>> Mihael >>>>>>>>>> >>>>>>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote: >>>>>>>>>>> Mihael, thanks for spotting that. I added the comments to >>>>>>>> highlight the >>>>>>>>>>> changes in email. >>>>>>>>>>> >>>>>>>>>>> -Yadu >>>>>>>>>>> >>>>>>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote: >>>>>>>>>>>> Hi Jonathan, >>>>>>>>>>>> >>>>>>>>>>>> I suspect that the site config is considering the comment to be >>>>>>>> part of >>>>>>>>>>>> the value of the workerLogLevel property. We could confirm this >>>>>>>> if you >>>>>>>>>>>> send us the swift log from this particular run. >>>>>>>>>>>> >>>>>>>>>>>> To fix it, you could try to remove everything after DEBUG >>>>>>>> (including all >>>>>>>>>>>> horizontal white space). In other words: >>>>>>>>>>>> >>>>>>>>>>>> ... >>>>>>>>>>>> workerloglevel=DEBUG >>>>>>>>>>>> workerlogdirectory=/home/$USER/ >>>>>>>>>>>> ... >>>>>>>>>>>> >>>>>>>>>>>> Mihael >>>>>>>>>>>> >>>>>>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote: >>>>>>>>>>>>> Hi Yadu, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I?m getting errors indicating that DEBUG is an invalid worker >>>>>>>> logging >>>>>>>>>>>>> level. I?m attaching the stdout below. Let me know if I?m >>>>>> doing >>>>>>>>>>>>> something silly. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Jonathan >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji >>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Jonathan, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do >>>>>> not >>>>>>>> see >>>>>>>>>>>>>> anything unusual. >>>>>>>>>>>>>> >>>>>>>>>>>>>> From your logs, it looks like workers are failing, so getting >>>>>>>> worker >>>>>>>>>>>>>> logs would help. >>>>>>>>>>>>>> Could you try running on Blues with the following >>>>>>>> swift.properties >>>>>>>>>>>>>> and get us the worker*logs that would show up in the >>>>>>>>>>>>>> workerlogdirectory ? 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> site=blues >>>>>>>>>>>>>> >>>>>>>>>>>>>> site.blues { >>>>>>>>>>>>>> jobManager=pbs >>>>>>>>>>>>>> jobQueue=shared >>>>>>>>>>>>>> maxJobs=4 >>>>>>>>>>>>>> jobGranularity=1 >>>>>>>>>>>>>> maxNodesPerJob=1 >>>>>>>>>>>>>> tasksPerWorker=16 >>>>>>>>>>>>>> taskThrottle=64 >>>>>>>>>>>>>> initialScore=10000 >>>>>>>>>>>>>> jobWalltime=00:48:00 >>>>>>>>>>>>>> taskWalltime=00:40:00 >>>>>>>>>>>>>> workerloglevel=DEBUG # >>>>>>>> Adding >>>>>>>>>>>>>> debug for workers >>>>>>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging >>>>>>>>>>>>>> directory on SFS >>>>>>>>>>>>>> workdir=$RUNDIRECTORY >>>>>>>>>>>>>> filesystem=local >>>>>>>>>>>>>> } >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Yadu >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Mike, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sorry, I figured there was some busy-ness involved! >>>>>>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I >>>>>>>> didn?t >>>>>>>>>>>>>>> get the same issue. That is, the model run completed >>>>>>>> successfully. >>>>>>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May >>>>>> 29, >>>>>>>>>>>>>>> 2014). I?m including one of the log files below. I?m also >>>>>>>>>>>>>>> including the swift.properties file that was used for the >>>>>>>> blues >>>>>>>>>>>>>>> runs below. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thank you! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jonathan >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde >>>>>>>> wrote: >>>>>>>>>>>>>>>> Hi Jonathan, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I or one of the team will answer soon, on swift-user. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> (But the first question is: which Swift release, and can >>>>>> you >>>>>>>>>>>>>>>> point us to, or send, the full log file?) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks and regards, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Mike >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Mike, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I didn?t get a response yet so just wanted to make sure >>>>>> that >>>>>>>>>>>>>>>>> the message came across. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Jonathan >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Begin forwarded message: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> From: Jonathan Ozik >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511, >>>>>>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> To: Mihael Hategan , >>>>>>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I?m getting spurious errors in the jobs that I?m running >>>>>> on >>>>>>>>>>>>>>>>>> Blues. 
The stdout includes exceptions like: >>>>>>>>>>>>>>>>>> exception @ swift-int.k, line: 511 >>>>>>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost >>>>>>>>>>>>>>>>>> java.io.IOException: Broken pipe >>>>>>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> >>>>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >>>>>>>>>>>>>>>>>> at >>>>>> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >>>>>>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> >>>>>>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> >>>>>>>> org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> >>>>>>>> org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >>>>>>>>>>>>>>>>>> These seem to occur at different parts of the submitted >>>>>>>>>>>>>>>>>> jobs. Let me know if there?s a log file that you?d like >>>>>> to >>>>>>>>>>>>>>>>>> look at. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed >>>>>>>> by >>>>>>>>>>>>>>>>>> broken pipe errors: >>>>>>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: >>>>>>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, >>>>>>>> 0) >>>>>>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot >>>>>>>>>>>>>>>>>> allocate large pages, falling back to regular pages >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Apparently that?s a known precursor of crashes on Java 7 >>>>>> as >>>>>>>>>>>>>>>>>> described here >>>>>>>>>>>>>>>>>> >>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >>>>>>>>>>>>>>>>>> Area: hotspot/gc >>>>>>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead >>>>>> to >>>>>>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the >>>>>> issue >>>>>>>>>>>>>>>>>> can be recognized in two ways: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ? Before the crash happens one or more lines similar to >>>>>>>> this >>>>>>>>>>>>>>>>>> will have been printed to the log: >>>>>>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, >>>>>>>> 0) >>>>>>>>>>>>>>>>>> failed; >>>>>>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot >>>>>> allocate >>>>>>>>>>>>>>>>>> large pages, falling back to regular pages >>>>>>>>>>>>>>>>>> ? If a hs_err file is generated it will contain a line >>>>>>>>>>>>>>>>>> similar to this: >>>>>>>>>>>>>>>>>> Large page allocation failures have occurred 3 times >>>>>>>>>>>>>>>>>> The problem can be avoided by running with large page >>>>>>>>>>>>>>>>>> support turned off, for example by passing the >>>>>>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> See 8007074 (not public). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the >>>>>> invocations >>>>>>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to >>>>>> get >>>>>>>>>>>>>>>>>> rid of the warning and the crashes for a while, but >>>>>> perhaps >>>>>>>>>>>>>>>>>> that was just a coincidence. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Jonathan >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>>>>>>>> >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Michael Wilde >>>>>>>>>>>>>>>> Mathematics and Computer Science Computation >>>>>>>> Institute >>>>>>>>>>>>>>>> Argonne National Laboratory The University of >>>>>>>> Chicago >>>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>> >>>> >>>> >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>> -- >>> Michael Wilde >>> Mathematics and Computer Science Computation Institute >>> Argonne National Laboratory The University of Chicago >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From jozik at anl.gov Thu Jul 3 14:36:02 2014 From: jozik at anl.gov (Ozik, Jonathan) Date: Thu, 03 Jul 2014 19:36:02 -0000 Subject: [Swift-user] Swift 0.95 RC6 on Windows Message-ID: <97A406BA-6E0C-4F2E-AEAA-F9F0D5E9DCDE@anl.gov> Hi all, I have two colleagues that are trying to run Swift 0.95 RC6 on Windows using the included swift.bat Windows script. They haven?t been successful yet. Trying the quickstart guide http://swift-lang.org/guides/quickstart.html and the hello.swift script under examples/swift/misc/ they see the following error: Swift 0.95 RC6 swift-r7900 cog-r3908 RunID: 20140703-1433-7iq5x697 Progress: Thu, 03 Jul 2014 14:33:49-0500 Execution failed: Exception in echo: Arguments: [hi] Host: localhost Directory: hello-20140703-1433-7iq5x697/jobs/6/echo-6nydmzsl exception @ swift-int.k, line: 530 Caused by: null Caused by: java.io.IOException: Cannot run program "/bin/bash" (in directory "\var\tmp\hello-20140703-1433-7iq5x697"): CreateProcess error=267, The directory name is invalid Caused by: java.io.IOException: CreateProcess error=267, The directory name is invalid Any thoughts on this? Jonathan From jozik at anl.gov Thu Jul 31 10:41:39 2014 From: jozik at anl.gov (Ozik, Jonathan) Date: Thu, 31 Jul 2014 15:41:39 -0000 Subject: [Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost In-Reply-To: <53DA5020.7050402@anl.gov> References: <7D4846A1-DCBC-481D-AE0F-9EF08F28B54B@uchicago.edu> <53DA4EEE.5010800@anl.gov> <53DA5020.7050402@anl.gov> Message-ID: <85241D2A-C5EC-4538-9D74-9E8439F60C0C@anl.gov> Thank you Mike. Regarding the location of the ER files, to reduce variables the last few runs were done with 0.95-RC6. Jonathan On Jul 31, 2014, at 9:18 AM, Michael Wilde wrote: > I see this from PBS in your home dir: > > blues$ cat 583937.bmgt1.lcrc.anl.gov.ER > Use of uninitialized value $s in concatenation (.) 
or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > Use of uninitialized value $s in concatenation (.) or string at > /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220. > blues$ > > That looks to me like a Swift bug in worker.pl > > We'll look into this angle. > > Also I'm curious why these files are not going into your run dir (but > perhaps thats because youre running an older trunk release, not 0.95? > Or, thats a separate 0.95 bug). > > - Mike > > On 7/31/14, 9:13 AM, Michael Wilde wrote: >> Some discussion and diagnosis of this incident has taken place off list. >> >> In a quick scan of the worker logs, I don't spot an obvious error that >> would cause workers to exit. >> Hopefully others on the Swift team can check those as well. >> >> Jonathan, do you have stdout/err files from the PBS scheduler on blues, >> in your runNNN log dirs? >> >> If so, can you point us to them? >> >> Thanks, >> >> - Mike >> >> On 7/29/14, 8:56 PM, Jonathan Ozik wrote: >>> Hi all, >>> >>> I?m getting spurious errors in the jobs that I?m running on Blues. The stdout includes exceptions like: >>> exception @ swift-int.k, line: 511 >>> Caused by: Block task failed: Connection to worker lost >>> java.io.IOException: Broken pipe >>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) >>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) >>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) >>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) >>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) >>> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168) >>> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133) >>> >>> These seem to occur at different parts of the submitted jobs. Let me know if there?s a log file that you?d like to look at. >>> >>> In earlier attempts I was getting these warnings followed by broken pipe errors: >>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages >>> >>> Apparently that?s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html): >>> Area: hotspot/gc >>> Synopsis: Crashes due to failure to allocate large pages. >>> >>> On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways: >>> >>> ? Before the crash happens one or more lines similar to this will have been printed to the log: >>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; >>> error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages >>> ? If a hs_err file is generated it will contain a line similar to this: >>> Large page allocation failures have occurred 3 times >>> The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary. >>> >>> See 8007074 (not public). >>> >>> So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence. 
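Since the useful evidence in this case ended up scattered across the PBS .ER/.OU files in the home directory and the coaster worker logs, a small sketch for sweeping them for the symptoms mentioned in this thread. The file globs are guesses based on the names quoted above; adjust them to the actual log locations:

    # look for large-page failures, broken pipes, OOM messages and the
    # worker.pl "uninitialized value" warning in one pass
    for f in "$HOME"/*.ER "$HOME"/*.OU "$HOME"/worker-*.log; do
        grep -Hn -iE 'cannot allocate|large pages|broken pipe|out of memory|uninitialized value' "$f" 2>/dev/null
    done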
>>> >>> Jonathan >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user