From wilde at mcs.anl.gov Mon Oct 8 09:40:12 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 8 Oct 2012 09:40:12 -0500
Subject: [Swift-devel] CFP: 2nd IEEE International Workshop on Workflow Models, Systems, Services and Applications in the Cloud (CloudFlow) 2013
In-Reply-To:
References:
Message-ID:

*Second IEEE International Workshop on Workflow Models, Systems, Services and Applications in the Cloud (CloudFlow) 2013*

*To be held in conjunction with the 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2013, Cambridge, Boston, Massachusetts, USA, May 20-24, 2013.*

http://www.cloud-uestc.cn/cloudflow/home.html

*Overview*

Cloud computing is gaining tremendous momentum in both academia and industry, and more and more people are migrating their data and applications into the Cloud. We have observed wide adoption of the MapReduce computing model and the open-source Hadoop system for large-scale distributed data processing, and a variety of ad hoc mashup techniques that weave together Web applications. However, these are only first steps toward managing complex task and data dependencies in the Cloud; more challenging issues remain, such as large parameter-space exploration, data partitioning and distribution, scheduling and optimization, smart reruns, and provenance tracking associated with workflow execution. The Cloud needs structured and mature workflow technologies to handle such issues, and vice versa: the Cloud offers unprecedented scalability to workflow systems and could change the way we perceive and conduct research and experiments. The scale and complexity of the science and data-analytics problems that can be handled can be greatly increased in the Cloud, and the on-demand nature of resource allocation in the Cloud will also help improve resource utilization and the user experience.

As Cloud computing provides a paradigm-shifting, utility-oriented computing model, with an unprecedented datacenter-scale resource pool and an on-demand resource provisioning mechanism, there are many challenges in bringing Cloud and workflows together. We need high-level languages and computing models for large-scale workflow specification; we need to adapt existing workflow architectures to the Cloud and integrate workflow systems with Cloud infrastructure and resources; and we need to leverage Cloud data-storage technologies to efficiently distribute data over a large number of nodes and exploit data locality during computation. We organize the CloudFlow workshop as a venue for the workflow and Cloud communities to define models and paradigms, present their state-of-the-art work, share their thoughts and experiences, and explore new directions in realizing workflows in the Cloud.

*Topics:*

We welcome the submission of original work related to the topics listed below, which include (in the context of the Cloud):

- Models and Languages for Large Scale Workflow Specification
- Workflow Architecture and Framework
- Large Scale Workflow Systems
- Service Workflow
- Workflow Composition and Orchestration
- Workflow Migration into the Cloud
- Workflow Scheduling and Optimization
- Cloud Middleware in Support of Workflow
- Virtualized Environment
- Workflow Applications and Case Studies
- Performance and Scalability Analysis
- Peta-Scale Data Processing
- Event Processing and Messaging
- Real-Time Analytics
- Provenance

*Paper Submission*

Authors are invited to submit papers with unpublished, original work.
The papers should not exceed 10 single-spaced, double-column pages using a 10-point font on 8.5x11-inch pages (IEEE conference style), including figures, tables, and references. Papers should be submitted via the online CMT system, Microsoft's Academic Conference Management Service (https://cmt.research.microsoft.com/CF2013), by midnight January 9th, 2013, Pacific Time. The final format should be PDF. Proceedings of the workshop will be published by the IEEE Digital Library (indexed by EI) and distributed at the conference. Selected papers may be eligible for additional post-conference publication as journal articles or book chapters. Submission implies the willingness of at least one of the authors to register and present the paper.

*Important Dates*

Paper submission: January 9th, 2013
Acceptance notification: February 8th, 2013
Final paper due: February 19th, 2013

*Organization*

Workshop Chairs:
Dr. Yong Zhao, University of Electronic Science and Technology of China, China, yongzh04 at gmail.com
Dr. Cui Lin, California State University, Fresno, USA, clin at csufresno.edu
Dr. Shiyong Lu, Wayne State University, USA, shiyong at wayne.edu

Program Chair:
Dr. Wenhong Tian, University of Electronic Science and Technology of China, China

Publicity Chair:
Dr. Ruini Xue, University of Electronic Science and Technology of China, China

*Steering Committee*

- Daniel S. Katz, University of Chicago, U.S.A.
- Mike Wilde, University of Chicago, U.S.A.
- Ewa Deelman, University of Southern California, U.S.A.
- Tevfik Kosar, University at Buffalo, U.S.A.
- Ilkay Altintas, San Diego Supercomputer Center, U.S.A.
- Ioan Raicu, Illinois Institute of Technology, U.S.A.
- Yogesh Simmhan, University of Southern California, U.S.A.
- Ian Taylor, Cardiff University, U.K.
- Weimin Zheng, Tsinghua University, China
- Hai Jin, Huazhong University of Science and Technology, China
- Wanchun Dou, Nanjing University, China
- Hui Zhang, National Science and Technology Infrastructure, China

*Program Committee*

- Shawn Bowers, Gonzaga University, U.S.A.
- Douglas Thain, University of Notre Dame, U.S.A.
- Ian Gorton, Pacific Northwest National Laboratory, U.S.A.
- Artem Chebotko, University of Texas-Pan American, U.S.A.
- Weisong Shi, Wayne State University, U.S.A.
- Paolo Missier, Newcastle University, U.K.
- Wei Tan, IBM T. J. Watson Research Center, U.S.A.
- Jianwu Wang, San Diego Supercomputer Center, U.S.A.
- Ping Yang, Binghamton University, U.S.A.
- Jian Guo, Harvard University, U.S.A.
- Liqiang Wang, University of Wyoming, U.S.A.
- Paul Groth, VU University Amsterdam, the Netherlands
- Zhiming Zhao, University of Amsterdam, the Netherlands
- Marta Mattoso, Federal University of Rio de Janeiro, Brazil
- Wenhong Tian, University of Electronic Science and Technology of China, China
- Ruini Xue, Tsinghua University, China
- Jian Cao, Shanghai Jiaotong University, China
- Jianxun Liu, Hunan University of Science and Technology, China
- Song Zhang, Chinese Academy of Sciences, China
- Hua Hu, Hangzhou Dianzi University, China

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hategan at mcs.anl.gov Sun Oct 14 22:48:54 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Oct 2012 20:48:54 -0700 Subject: [Swift-devel] [swift-support] Channel multiplexer error In-Reply-To: References: <73B3B09E-B386-4B6D-A9D6-58EEB00B747C@uchicago.edu> <931001436.34144.1349713641059.JavaMail.root@zimbra.anl.gov> <1350000735.8355.0.camel@blabla> Message-ID: <1350272934.15993.5.camel@blabla> I spoke to Mike on the phone on Friday, and we agreed that foreach.max.threads is a bit difficult to use. So I removed the throttling for foreach and added it to app invocations. It might help with memory consumption if you set it to around the number of cpus you have access to. I also did some small optimizations to improve memory use. I can run about 50K jobs with your script on a 32 bit system with 1GB of heap space. I suspect that on 64 bit systems this might require more heap. I also added an option to the swift executable to automatically dump a copy of the heap when an out of memory condition occurs. Hopefully that will help us troubleshoot such problems in the future. This is all in trunk. Mihael On Sat, 2012-10-13 at 04:47 -0500, Kazutaka Takahashi wrote: > Hi All, > > Sorry for a late reply, but I am already at a conference and had not > had a chance to try what Mike proposed. I will try later during the > second half of the conference, if not after the conference ending next > wed. > Taka > > > On Thu, Oct 11, 2012 at 7:12 PM, Mihael Hategan > wrote: > Did you try what Mike proposed? > > Mihael > > On Thu, 2012-10-11 at 17:49 -0500, Kazutaka Takahashi wrote: > > OK... > > The last one died with the same error msg... Please check > the both > > directories... > > > > > taka at login4:/lustre/beagle/GCNet/RG/Athena/a080521_BS_new/Cond2> > > > taka at login4:/lustre/beagle/GCNet/RG/Athena/a080521_BS_new/Cond3> > > > > /lustre/beagle/GCNet/bin/Swift/swift: line 164: 8527 Killed > > java -Xmx5120M > -Djava.endorsed.dirs=/soft/swift/0.93//lib/endorsed > > -DUID=3023 -DGLOBUS_HOSTNAME=login4.beagle.ci.uchicago.edu > > -DCOG_INSTALL_PATH=/soft/swift/0.93/ > -Dswift.home=/soft/swift/0.93/ > > -Duser.home=/lustre/beagle/GCNet > > -Djava.security.egd=file:///dev/urandom -XX:+UseParallelGC > > -XX:ParallelGCThreads=1 > > > -classpath 
/soft/swift/0.93//etc:/soft/swift/0.93//libexec:/soft/swift/0.93//lib/addressing-1.0.jar:/soft/swift/0.93//lib/ant.jar:/soft/swift/0.93//lib/antlr-2.7.5.jar:/soft/swift/0.93//lib/axis.jar:/soft/swift/0.93//lib/axis-url.jar:/soft/swift/0.93//lib/castor-0.9.6.jar:/soft/swift/0.93//lib/coaster-bootstrap.jar:/soft/swift/0.93//lib/cog-abstraction-common-2.4.jar:/soft/swift/0.93//lib/cog-axis.jar:/soft/swift/0.93//lib/cog-grapheditor-0.47.jar:/soft/swift/0.93//lib/cog-jglobus-1.7.0.jar:/soft/swift/0.93//lib/cog-karajan-0.36-dev.jar:/soft/swift/0.93//lib/cog-provider-clref-gt4_0_0.jar:/soft/swift/0.93//lib/cog-provider-coaster-0.3.jar:/soft/swift/0.93//lib/cog-provider-dcache-0.1.jar:/soft/swift/0.93//lib/cog-provider-gt2-2.4.jar:/soft/swift/0.93//lib/cog-provider-gt4_0_0-2.5.jar:/soft/swift/0.93//lib/cog-provider-local-2.2.jar:/soft/swift/0.93//lib/cog-provider-localscheduler-0.4.jar:/soft/swift/0.93//lib/cog-provider-ssh-2.4.jar:/soft/swift/0.93//lib/cog-provider-webdav-2.1.jar:/soft/swift/0.93//lib/cog-resources-1.0.jar:/soft/swift/0.93//lib/cog-swift-svn.jar:/soft/swift/0.93//lib/cog-trap-1.0.jar:/soft/swift/0.93//lib/cog-url.jar:/soft/swift/0.93//lib/cog-util-0.92.jar:/soft/swift/0.93//lib/commonj.jar:/soft/swift/0.93//lib/commons-beanutils.jar:/soft/swift/0.93//lib/commons-collections-3.0.jar:/soft/swift/0.93//lib/commons-digester.jar:/soft/swift/0.93//lib/commons-discovery.jar:/soft/swift/0.93//lib/commons-httpclient.jar:/soft/swift/0.93//lib/commons-logging-1.1.jar:/soft/swift/0.93//lib/concurrent.jar:/soft/swift/0.93//lib/cryptix32.jar:/soft/swift/0.93//lib/cryptix-asn1.jar:/soft/swift/0.93//lib/cryptix.jar:/soft/swift/0.93//lib/globus_delegation_service.jar:/soft/swift/0.93//lib/globus_delegation_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_mds_aggregator_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_rendezvous_service.jar:/soft/swift/0.93//lib/globus_wsrf_rendezvous_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_rft_stubs.jar:/soft/swift/0.93//lib/gram-client.jar:/soft/swift/0.93//lib/gram-stubs.jar:/soft/swift/0.93//lib/gram-utils.jar:/soft/swift/0.93//lib/j2ssh-common-0.2.2.jar:/soft/swift/0.93//lib/j2ssh-core-0.2.2-patch-b.jar:/soft/swift/0.93//lib/jakarta-regexp-1.2.jar:/soft/swift/0.93//lib/jakarta-slide-webdavlib-2.0.jar:/soft/swift/0.93//lib/jaxrpc.jar:/soft/swift/0.93//lib/jce-jdk13-131.jar:/soft/swift/0.93//lib/jgss.jar:/soft/swift/0.93//lib/jline-0.9.94.jar:/soft/swift/0.93//lib/jsr173_1.0_api.jar:/soft/swift/0.93//lib/jug-lgpl-2.0.0.jar:/soft/swift/0.93//lib/junit.jar:/soft/swift/0.93//lib/log4j-1.2.16.jar:/soft/swift/0.93//lib/naming-common.jar:/soft/swift/0.93//lib/naming-factory.jar:/soft/swift/0.93//lib/naming-java.jar:/soft/swift/0.93//lib/naming-resources.jar:/soft/swift/0.93//lib/opensaml.jar:/soft/swift/0.93//lib/puretls.jar:/soft/swift/0.93//lib/resolver.jar:/soft/swift/0.93//lib/saaj.jar:/soft/swift/0.93//lib/stringtemplate.jar:/soft/swift/0.93//lib/vdldefinitions.jar:/soft/swift/0.93//lib/wsdl4j.jar:/soft/swift/0.93//lib/wsrf_core.jar:/soft/swift/0.93//lib/wsrf_core_stubs.jar:/soft/swift/0.93//lib/wsrf_mds_index_stubs.jar:/soft/swift/0.93//lib/wsrf_mds_usefulrp_schema_stubs.jar:/soft/swift/0.93//lib/wsrf_provider_jce.jar:/soft/swift/0.93//lib/wsrf_tools.jar:/soft/swift/0.93//lib/wss4j.jar:/soft/swift/0.93//lib/xalan.jar:/soft/swift/0.93//lib/xbean.jar:/soft/swift/0.93//lib/xbean_xpath.jar:/soft/swift/0.93//lib/xercesImpl.jar:/soft/swift/0.93//lib/xml-apis.jar:/soft/swift/0.93//lib/xmlsec.jar:/soft/swift/0.93//lib/xpp3-1.1.3.4d_b4_min.jar:/soft/swift/0.93//lib/xstream-1.1
.1-patched.jar: org.griphyn.vdl.karajan.Loader '-config' 'demo_realcf.cf' '-sites.file' 'demo_realSites.xml' '-tc.file' 'demo_realtc.tc' 'demo_real.swift' > > > > > > > > > > > -- > What is essential is invisible to the eye > From davidk at ci.uchicago.edu Tue Oct 16 11:04:43 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 16 Oct 2012 11:04:43 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <2090783833.125077.1350402510973.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <555650193.125244.1350403483404.JavaMail.root@zimbra-mb2.anl.gov> Hello, I have noticed that since the foreach.max.threads changes, the DSSAT script is now running out of memory. I have the heap size set to 4 gigabytes. There are 120K items in gridLists. The main foreach loop of the script looks like this: foreach g,i in gridLists { file tar_output ; file part_output ; file in1[] ; // Scenario files file in2[] ; // Weather files file in3[] ; // Common data file in4[] ; // Binaries file in5[] ; // Perl scripts file wrapper ; // RunDSSAT wrapper (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, in5, wrapper); } Is there any way to throttle foreach again, or any other workarounds I could use to avoid this? Thanks, David From hategan at mcs.anl.gov Tue Oct 16 13:44:47 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 11:44:47 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <555650193.125244.1350403483404.JavaMail.root@zimbra-mb2.anl.gov> References: <555650193.125244.1350403483404.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1350413087.31891.0.camel@blabla> What was foreach.max.threads set to before? On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > Hello, > > I have noticed that since the foreach.max.threads changes, the DSSAT script is now running out of memory. I have the heap size set to 4 gigabytes. There are 120K items in gridLists. > The main foreach loop of the script looks like this: > > foreach g,i in gridLists { > file tar_output ; > file part_output ; > > file in1[] ; // Scenario files > file in2[] ; // Weather files > file in3[] ; // Common data > file in4[] ; // Binaries > file in5[] ; // Perl scripts > file wrapper ; // RunDSSAT wrapper > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, in5, wrapper); > } > > Is there any way to throttle foreach again, or any other workarounds I could use to avoid this? > > Thanks, > David > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Tue Oct 16 13:56:53 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 16 Oct 2012 13:56:53 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1350413087.31891.0.camel@blabla> Message-ID: <433307766.126520.1350413813202.JavaMail.root@zimbra-mb2.anl.gov> Previously it was not explicitly set, so I am assuming it would have been 1024. As a test I tried setting it to 520 (the maximum number of available cores), but that did not seem to help. ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Tuesday, October 16, 2012 1:44:47 PM > Subject: Re: [Swift-devel] foreach.max.threads question > What was foreach.max.threads set to before? 
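
In David's loop above, the archive's HTML-to-text conversion appears to have eaten the angle-bracketed mapper expressions; only fragments such as file=@strcat(...) and pattern="*.EXE" survive in the quoted copies later in the thread. For readability, here is a hedged reconstruction of what the loop plausibly looked like. The mapper names (single_file_mapper, filesys_mapper), the app block, the readData call, and the argument and file names common/bindir/perldir/dssat.xfile/RunDSSAT.sh are guesses introduced for illustration, not taken from the original post.

    type file;

    // 'RunDSSAT' must be defined in tc.file; the real command line is not shown in the post.
    app (file tarOut, file partOut) RunDSSAT (file x, file scen[], file wth[],
                                              file common[], file bin[], file scripts[], file wrap) {
        RunDSSAT @filename(wrap) @x;
    }

    // How gridLists is populated is not shown in the post; this is a guess.
    string gridLists[] = readData(@arg("gridlists"));
    file xfile <"dssat.xfile">;

    foreach g, i in gridLists {
        file tar_output  <single_file_mapper; file=@strcat("output/", gridLists[i], "output.tar.gz")>;
        file part_output <single_file_mapper; file=@strcat("parts/", gridLists[i], ".part")>;

        file in1[] <filesys_mapper; location=@strcat(@arg("scenarios"), "/", gridLists[i]), pattern="*">;  // Scenario files
        file in2[] <filesys_mapper; location=@strcat(@arg("weather"), "/", gridLists[i]), pattern="*">;    // Weather files
        file in3[] <filesys_mapper; location=@arg("common"),  pattern="*">;      // Common data (location assumed)
        file in4[] <filesys_mapper; location=@arg("bindir"),  pattern="*.EXE">;  // Binaries (location assumed)
        file in5[] <filesys_mapper; location=@arg("perldir"), pattern="*.pl">;   // Perl scripts (location assumed)
        file wrapper <"RunDSSAT.sh">;                                            // RunDSSAT wrapper (name assumed)

        (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, in5, wrapper);
    }

The point relevant to the memory discussion is that every iteration declares several mapped files and arrays, so without a foreach throttle all 120K iterations build their mappings up front, which is the behavior David and Mihael confirm later in the thread.
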
> > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > Hello, > > > > I have noticed that since the foreach.max.threads changes, the DSSAT > > script is now running out of memory. I have the heap size set to 4 > > gigabytes. There are 120K items in gridLists. > > The main foreach loop of the script looks like this: > > > > foreach g,i in gridLists { > > file tar_output > gridLists[i], "output.tar.gz")>; > > file part_output > gridLists[i], ".part")>; > > > > file in1[] > "/", gridLists[i]), pattern="*">; // Scenario files > > file in2[] > "/", gridLists[i]), pattern="*">; // Weather files > > file in3[] > pattern="*">; // Common data > > file in4[] > pattern="*.EXE">; // Binaries > > file in5[] > pattern="*.pl">; // Perl scripts > > file wrapper ; // > > RunDSSAT wrapper > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, > > in5, wrapper); > > } > > > > Is there any way to throttle foreach again, or any other workarounds > > I could use to avoid this? > > > > Thanks, > > David > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Oct 16 13:56:56 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 16 Oct 2012 13:56:56 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1350413087.31891.0.camel@blabla> Message-ID: <1014451256.48685.1350413816558.JavaMail.root@zimbra.anl.gov> Mihael, I though that when we discussed this last FRiday, the plan was to add the app() throttle but also leave the foreach throttle in place, just in case we needed it. Was there a reason that you needed to remove the foreach throttle to do the app throttle? - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Tuesday, October 16, 2012 1:44:47 PM > Subject: Re: [Swift-devel] foreach.max.threads question > What was foreach.max.threads set to before? > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > Hello, > > > > I have noticed that since the foreach.max.threads changes, the DSSAT > > script is now running out of memory. I have the heap size set to 4 > > gigabytes. There are 120K items in gridLists. > > The main foreach loop of the script looks like this: > > > > foreach g,i in gridLists { > > file tar_output > gridLists[i], "output.tar.gz")>; > > file part_output > gridLists[i], ".part")>; > > > > file in1[] > "/", gridLists[i]), pattern="*">; // Scenario files > > file in2[] > "/", gridLists[i]), pattern="*">; // Weather files > > file in3[] > pattern="*">; // Common data > > file in4[] > pattern="*.EXE">; // Binaries > > file in5[] > pattern="*.pl">; // Perl scripts > > file wrapper ; // > > RunDSSAT wrapper > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, > > in5, wrapper); > > } > > > > Is there any way to throttle foreach again, or any other workarounds > > I could use to avoid this? 
> > > > Thanks, > > David > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Oct 16 14:17:06 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 12:17:06 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <433307766.126520.1350413813202.JavaMail.root@zimbra-mb2.anl.gov> References: <433307766.126520.1350413813202.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1350415026.32618.0.camel@blabla> How many cores do you run this on? On Tue, 2012-10-16 at 13:56 -0500, David Kelly wrote: > Previously it was not explicitly set, so I am assuming it would have been 1024. As a test I tried setting it to 520 (the maximum number of available cores), but that did not seem to help. > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "David Kelly" > > Cc: "swift-devel Devel" > > Sent: Tuesday, October 16, 2012 1:44:47 PM > > Subject: Re: [Swift-devel] foreach.max.threads question > > What was foreach.max.threads set to before? > > > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > > Hello, > > > > > > I have noticed that since the foreach.max.threads changes, the DSSAT > > > script is now running out of memory. I have the heap size set to 4 > > > gigabytes. There are 120K items in gridLists. > > > The main foreach loop of the script looks like this: > > > > > > foreach g,i in gridLists { > > > file tar_output > > gridLists[i], "output.tar.gz")>; > > > file part_output > > gridLists[i], ".part")>; > > > > > > file in1[] > > "/", gridLists[i]), pattern="*">; // Scenario files > > > file in2[] > > "/", gridLists[i]), pattern="*">; // Weather files > > > file in3[] > > pattern="*">; // Common data > > > file in4[] > > pattern="*.EXE">; // Binaries > > > file in5[] > > pattern="*.pl">; // Perl scripts > > > file wrapper ; // > > > RunDSSAT wrapper > > > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, > > > in5, wrapper); > > > } > > > > > > Is there any way to throttle foreach again, or any other workarounds > > > I could use to avoid this? > > > > > > Thanks, > > > David > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Oct 16 14:19:23 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 12:19:23 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1014451256.48685.1350413816558.JavaMail.root@zimbra.anl.gov> References: <1014451256.48685.1350413816558.JavaMail.root@zimbra.anl.gov> Message-ID: <1350415163.32618.2.camel@blabla> On Tue, 2012-10-16 at 13:56 -0500, Michael Wilde wrote: > Mihael, I though that when we discussed this last FRiday, the plan was to add the app() throttle but also leave the foreach throttle in place, just in case we needed it. Was there a reason that you needed to remove the foreach throttle to do the app throttle? Too much complexity. I can add it back if needed, but I'd rather fix the problem here. 
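
For readers trying to reproduce the throttle settings discussed in this thread, this is a minimal sketch of how foreach.max.threads is set, assuming the swift.properties mechanism of this era honors the property in the revision being used. The value 520 is the core count David mentions below; Mihael notes later in the thread that the default at the time was 16384. The property name for the new app-invocation throttle Mihael describes above is not given in the thread, so it is not shown here.

    # swift.properties (excerpt); a sketch, not a verified configuration
    foreach.max.threads=520
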
From davidk at ci.uchicago.edu Tue Oct 16 14:25:25 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 16 Oct 2012 14:25:25 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1350415026.32618.0.camel@blabla> Message-ID: <270799010.126716.1350415525047.JavaMail.root@zimbra-mb2.anl.gov> Swift and coaster-service are running on communicado/bridled and the work is being done on UC3 via persistent coasters. UC3 has 520 cores, 1 core per node. In most cases I will get all 520 cores. ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Tuesday, October 16, 2012 2:17:06 PM > Subject: Re: [Swift-devel] foreach.max.threads question > How many cores do you run this on? > > On Tue, 2012-10-16 at 13:56 -0500, David Kelly wrote: > > Previously it was not explicitly set, so I am assuming it would have > > been 1024. As a test I tried setting it to 520 (the maximum number > > of available cores), but that did not seem to help. > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "David Kelly" > > > Cc: "swift-devel Devel" > > > Sent: Tuesday, October 16, 2012 1:44:47 PM > > > Subject: Re: [Swift-devel] foreach.max.threads question > > > What was foreach.max.threads set to before? > > > > > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > > > Hello, > > > > > > > > I have noticed that since the foreach.max.threads changes, the > > > > DSSAT > > > > script is now running out of memory. I have the heap size set to > > > > 4 > > > > gigabytes. There are 120K items in gridLists. > > > > The main foreach loop of the script looks like this: > > > > > > > > foreach g,i in gridLists { > > > > file tar_output > > > gridLists[i], "output.tar.gz")>; > > > > file part_output > > > gridLists[i], ".part")>; > > > > > > > > file in1[] > > > location=@strcat(@arg("scenarios"), > > > > "/", gridLists[i]), pattern="*">; // Scenario files > > > > file in2[] > > > "/", gridLists[i]), pattern="*">; // Weather files > > > > file in3[] > > > pattern="*">; // Common data > > > > file in4[] > > > pattern="*.EXE">; // Binaries > > > > file in5[] > > > pattern="*.pl">; // Perl scripts > > > > file wrapper ; // > > > > RunDSSAT wrapper > > > > > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, > > > > in4, > > > > in5, wrapper); > > > > } > > > > > > > > Is there any way to throttle foreach again, or any other > > > > workarounds > > > > I could use to avoid this? > > > > > > > > Thanks, > > > > David > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Oct 16 14:44:59 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 12:44:59 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <270799010.126716.1350415525047.JavaMail.root@zimbra-mb2.anl.gov> References: <270799010.126716.1350415525047.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1350416699.32618.3.camel@blabla> The default is 16384. Can you set foreach.max.threads to 520 and try again? On Tue, 2012-10-16 at 14:25 -0500, David Kelly wrote: > Swift and coaster-service are running on communicado/bridled and the work is being done on UC3 via persistent coasters. UC3 has 520 cores, 1 core per node. In most cases I will get all 520 cores. 
> > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "David Kelly" > > Cc: "swift-devel Devel" > > Sent: Tuesday, October 16, 2012 2:17:06 PM > > Subject: Re: [Swift-devel] foreach.max.threads question > > How many cores do you run this on? > > > > On Tue, 2012-10-16 at 13:56 -0500, David Kelly wrote: > > > Previously it was not explicitly set, so I am assuming it would have > > > been 1024. As a test I tried setting it to 520 (the maximum number > > > of available cores), but that did not seem to help. > > > > > > ----- Original Message ----- > > > > From: "Mihael Hategan" > > > > To: "David Kelly" > > > > Cc: "swift-devel Devel" > > > > Sent: Tuesday, October 16, 2012 1:44:47 PM > > > > Subject: Re: [Swift-devel] foreach.max.threads question > > > > What was foreach.max.threads set to before? > > > > > > > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > > > > Hello, > > > > > > > > > > I have noticed that since the foreach.max.threads changes, the > > > > > DSSAT > > > > > script is now running out of memory. I have the heap size set to > > > > > 4 > > > > > gigabytes. There are 120K items in gridLists. > > > > > The main foreach loop of the script looks like this: > > > > > > > > > > foreach g,i in gridLists { > > > > > file tar_output > > > > gridLists[i], "output.tar.gz")>; > > > > > file part_output > > > > gridLists[i], ".part")>; > > > > > > > > > > file in1[] > > > > location=@strcat(@arg("scenarios"), > > > > > "/", gridLists[i]), pattern="*">; // Scenario files > > > > > file in2[] > > > > "/", gridLists[i]), pattern="*">; // Weather files > > > > > file in3[] > > > > pattern="*">; // Common data > > > > > file in4[] > > > > pattern="*.EXE">; // Binaries > > > > > file in5[] > > > > pattern="*.pl">; // Perl scripts > > > > > file wrapper ; // > > > > > RunDSSAT wrapper > > > > > > > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, > > > > > in4, > > > > > in5, wrapper); > > > > > } > > > > > > > > > > Is there any way to throttle foreach again, or any other > > > > > workarounds > > > > > I could use to avoid this? > > > > > > > > > > Thanks, > > > > > David > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Oct 16 14:45:24 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 16 Oct 2012 14:45:24 -0500 (CDT) Subject: [Swift-devel] Fixing Cobalt provider to allow multi-node coaster jobs on Eureka In-Reply-To: <1774919427.48822.1350416438502.JavaMail.root@zimbra.anl.gov> Message-ID: <968451630.48843.1350416724538.JavaMail.root@zimbra.anl.gov> Mihael, Is this a bug that you could fix while waiting for info from David on the DSSAT problems? https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=746 I ask because ParVis people are starting a large set of campaigns on Eureka. At the moment, the cluster is idle, but they can only use 32 of 100 nodes because of the 1-node-per-job limitation of the Cobalt provider. You (or we) can test on Gadzooks if you can add in the multi-node-job code. 
Thanks, - Mike From davidk at ci.uchicago.edu Tue Oct 16 16:19:40 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 16 Oct 2012 16:19:40 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1350416699.32618.3.camel@blabla> Message-ID: <1996203325.127536.1350422380287.JavaMail.root@zimbra-mb2.anl.gov> I set the max threads to 520 and saw the same behavior. The only difference is that I had to do the work on OSG rather than UC3 since all nodes were in use. Is it possible that since there is no foreach throttle, all the files and file arrays needed for each of the 120K tasks are getting set at once? All of the log files, configuration files, and a heap dump are at /scratch/local/davidk/DSSAT/run067 on communicado. Thanks, David ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Tuesday, October 16, 2012 2:44:59 PM > Subject: Re: [Swift-devel] foreach.max.threads question > The default is 16384. Can you set foreach.max.threads to 520 and try > again? > > On Tue, 2012-10-16 at 14:25 -0500, David Kelly wrote: > > Swift and coaster-service are running on communicado/bridled and the > > work is being done on UC3 via persistent coasters. UC3 has 520 > > cores, 1 core per node. In most cases I will get all 520 cores. > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "David Kelly" > > > Cc: "swift-devel Devel" > > > Sent: Tuesday, October 16, 2012 2:17:06 PM > > > Subject: Re: [Swift-devel] foreach.max.threads question > > > How many cores do you run this on? > > > > > > On Tue, 2012-10-16 at 13:56 -0500, David Kelly wrote: > > > > Previously it was not explicitly set, so I am assuming it would > > > > have > > > > been 1024. As a test I tried setting it to 520 (the maximum > > > > number > > > > of available cores), but that did not seem to help. > > > > > > > > ----- Original Message ----- > > > > > From: "Mihael Hategan" > > > > > To: "David Kelly" > > > > > Cc: "swift-devel Devel" > > > > > Sent: Tuesday, October 16, 2012 1:44:47 PM > > > > > Subject: Re: [Swift-devel] foreach.max.threads question > > > > > What was foreach.max.threads set to before? > > > > > > > > > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > > > > > Hello, > > > > > > > > > > > > I have noticed that since the foreach.max.threads changes, > > > > > > the > > > > > > DSSAT > > > > > > script is now running out of memory. I have the heap size > > > > > > set to > > > > > > 4 > > > > > > gigabytes. There are 120K items in gridLists. 
> > > > > > The main foreach loop of the script looks like this: > > > > > > > > > > > > foreach g,i in gridLists { > > > > > > file tar_output > > > > > file=@strcat("output/", > > > > > > gridLists[i], "output.tar.gz")>; > > > > > > file part_output > > > > > file=@strcat("parts/", > > > > > > gridLists[i], ".part")>; > > > > > > > > > > > > file in1[] > > > > > location=@strcat(@arg("scenarios"), > > > > > > "/", gridLists[i]), pattern="*">; // Scenario files > > > > > > file in2[] > > > > > location=@strcat(@arg("weather"), > > > > > > "/", gridLists[i]), pattern="*">; // Weather files > > > > > > file in3[] > > > > > pattern="*">; // Common data > > > > > > file in4[] > > > > > pattern="*.EXE">; // Binaries > > > > > > file in5[] > > > > > pattern="*.pl">; // Perl scripts > > > > > > file wrapper ; // > > > > > > RunDSSAT wrapper > > > > > > > > > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, > > > > > > in3, > > > > > > in4, > > > > > > in5, wrapper); > > > > > > } > > > > > > > > > > > > Is there any way to throttle foreach again, or any other > > > > > > workarounds > > > > > > I could use to avoid this? > > > > > > > > > > > > Thanks, > > > > > > David > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Oct 16 18:41:18 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 16:41:18 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1996203325.127536.1350422380287.JavaMail.root@zimbra-mb2.anl.gov> References: <1996203325.127536.1350422380287.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1350430878.32618.6.camel@blabla> On Tue, 2012-10-16 at 16:19 -0500, David Kelly wrote: > I set the max threads to 520 and saw the same behavior. The only > difference is that I had to do the work on OSG rather than UC3 since > all nodes were in use. Is it possible that since there is no foreach > throttle, all the files and file arrays needed for each of the 120K > tasks are getting set at once? That's correct. I guess that was a bit of wishful thinking on my side. I will revert the change. In the mean-time, please use the revision before the change. Mihael From lpesce at uchicago.edu Wed Oct 17 09:21:40 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Wed, 17 Oct 2012 09:21:40 -0500 Subject: [Swift-devel] [swift-support] Channel multiplexer error In-Reply-To: <1350272934.15993.5.camel@blabla> References: <73B3B09E-B386-4B6D-A9D6-58EEB00B747C@uchicago.edu> <931001436.34144.1349713641059.JavaMail.root@zimbra.anl.gov> <1350000735.8355.0.camel@blabla> <1350272934.15993.5.camel@blabla> Message-ID: OK, I am trying to rerun the scripts i have using the official version of swift on beagle (that should be trunk) and put the value of foreach... to 200 Silly problem: RunID: 20121017-1420-55eqmvlb Execution failed: Could not aquire exclusive lock on log file: /lustre/beagle/GCNet/RG/Oreo/o080522_BS1/demo_real-20121004-0239-gcqgkdrf.0.rlog I had to kill swift with -9 because it did not want to die before. What is the lock file I need to remove? On Oct 14, 2012, at 10:48 PM, Mihael Hategan wrote: > I spoke to Mike on the phone on Friday, and we agreed that > foreach.max.threads is a bit difficult to use. > > So I removed the throttling for foreach and added it to app invocations. 
> It might help with memory consumption if you set it to around the number > of cpus you have access to. > > I also did some small optimizations to improve memory use. I can run > about 50K jobs with your script on a 32 bit system with 1GB of heap > space. I suspect that on 64 bit systems this might require more heap. > > I also added an option to the swift executable to automatically dump a > copy of the heap when an out of memory condition occurs. Hopefully that > will help us troubleshoot such problems in the future. > > This is all in trunk. > > Mihael > > On Sat, 2012-10-13 at 04:47 -0500, Kazutaka Takahashi wrote: >> Hi All, >> >> Sorry for a late reply, but I am already at a conference and had not >> had a chance to try what Mike proposed. I will try later during the >> second half of the conference, if not after the conference ending next >> wed. >> Taka >> >> >> On Thu, Oct 11, 2012 at 7:12 PM, Mihael Hategan >> wrote: >> Did you try what Mike proposed? >> >> Mihael >> >> On Thu, 2012-10-11 at 17:49 -0500, Kazutaka Takahashi wrote: >>> OK... >>> The last one died with the same error msg... Please check >> the both >>> directories... >>> >>> >> taka at login4:/lustre/beagle/GCNet/RG/Athena/a080521_BS_new/Cond2> >>> >> taka at login4:/lustre/beagle/GCNet/RG/Athena/a080521_BS_new/Cond3> >>> >>> /lustre/beagle/GCNet/bin/Swift/swift: line 164: 8527 Killed >>> java -Xmx5120M >> -Djava.endorsed.dirs=/soft/swift/0.93//lib/endorsed >>> -DUID=3023 -DGLOBUS_HOSTNAME=login4.beagle.ci.uchicago.edu >>> -DCOG_INSTALL_PATH=/soft/swift/0.93/ >> -Dswift.home=/soft/swift/0.93/ >>> -Duser.home=/lustre/beagle/GCNet >>> -Djava.security.egd=file:///dev/urandom -XX:+UseParallelGC >>> -XX:ParallelGCThreads=1 >>> >> -classpath /soft/swift/0.93//etc:/soft/swift/0.93//libexec:/soft/swift/0.93//lib/addressing-1.0.jar:/soft/swift/0.93//lib/ant.jar:/soft/swift/0.93//lib/antlr-2.7.5.jar:/soft/swift/0.93//lib/axis.jar:/soft/swift/0.93//lib/axis-url.jar:/soft/swift/0.93//lib/castor-0.9.6.jar:/soft/swift/0.93//lib/coaster-bootstrap.jar:/soft/swift/0.93//lib/cog-abstraction-common-2.4.jar:/soft/swift/0.93//lib/cog-axis.jar:/soft/swift/0.93//lib/cog-grapheditor-0.47.jar:/soft/swift/0.93//lib/cog-jglobus-1.7.0.jar:/soft/swift/0.93//lib/cog-karajan-0.36-dev.jar:/soft/swift/0.93//lib/cog-provider-clref-gt4_0_0.jar:/soft/swift/0.93//lib/cog-provider-coaster-0.3.jar:/soft/swift/0.93//lib/cog-provider-dcache-0.1.jar:/soft/swift/0.93//lib/cog-provider-gt2-2.4.jar:/soft/swift/0.93//lib/cog-provider-gt4_0_0-2.5.jar:/soft/swift/0.93//lib/cog-provider-local-2.2.jar:/soft/swift/0.93//lib/cog-provider-localscheduler-0.4.jar:/soft/swift/0.93//lib/cog-provider-ssh-2.4.jar:/soft/swift/0.93//lib/cog-provider-webdav-2.1.jar:/soft/swift/0.93//lib/cog-resources-1.0.jar:/soft/swift/0.93//lib/cog-swift-svn.jar:/soft/swift/0.93//lib/cog-trap-1.0.jar:/soft/swift/0.93//lib/cog-url.jar:/soft/swift/0.93//lib/cog-util-0.92.jar:/soft/swift/0.93//lib/commonj.jar:/soft/swift/0.93//lib/commons-beanutils.jar:/soft/swift/0.93//lib/commons-collections-3.0.jar:/soft/swift/0.93//lib/commons-digester.jar:/soft/swift/0.93//lib/commons-discovery.jar:/soft/swift/0.93//lib/commons-httpclient.jar:/soft/swift/0.93//lib/commons-logging-1.1.jar:/soft/swift/0.93//lib/concurrent.jar:/soft/swift/0.93//lib/cryptix32.jar:/soft/swift/0.93//lib/cryptix-asn1.jar:/soft/swift/0.93//lib/cryptix.jar:/soft/swift/0.93//lib/globus_delegation_service.jar:/soft/swift/0.93//lib/globus_delegation_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_mds_aggregator_stubs.jar:/soft/swift/0.93/
/lib/globus_wsrf_rendezvous_service.jar:/soft/swift/0.93//lib/globus_wsrf_rendezvous_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_rft_stubs.jar:/soft/swift/0.93//lib/gram-client.jar:/soft/swift/0.93//lib/gram-stubs.jar:/soft/swift/0.93//lib/gram-utils.jar:/soft/swift/0.93//lib/j2ssh-common-0.2.2.jar:/soft/swift/0.93//lib/j2ssh-core-0.2.2-patch-b.jar:/soft/swift/0.93//lib/jakarta-regexp-1.2.jar:/soft/swift/0.93//lib/jakarta-slide-webdavlib-2.0.jar:/soft/swift/0.93//lib/jaxrpc.jar:/soft/swift/0.93//lib/jce-jdk13-131.jar:/soft/swift/0.93//lib/jgss.jar:/soft/swift/0.93//lib/jline-0.9.94.jar:/soft/swift/0.93//lib/jsr173_1.0_api.jar:/soft/swift/0.93//lib/jug-lgpl-2.0.0.jar:/soft/swift/0.93//lib/junit.jar:/soft/swift/0.93//lib/log4j-1.2.16.jar:/soft/swift/0.93//lib/naming-common.jar:/soft/swift/0.93//lib/naming-factory.jar:/soft/swift/0.93//lib/naming-java.jar:/soft/swift/0.93//lib/naming-resources.jar:/soft/swift/0.93//lib/opensaml.jar:/soft/swift/0.93//lib/puretls.jar:/soft/swift/0.93//lib/resolver.jar:/soft/swift/0.93//lib/saaj.jar:/soft/swift/0.93//lib/stringtemplate.jar:/soft/swift/0.93//lib/vdldefinitions.jar:/soft/swift/0.93//lib/wsdl4j.jar:/soft/swift/0.93//lib/wsrf_core.jar:/soft/swift/0.93//lib/wsrf_core_stubs.jar:/soft/swift/0.93//lib/wsrf_mds_index_stubs.jar:/soft/swift/0.93//lib/wsrf_mds_usefulrp_schema_stubs.jar:/soft/swift/0.93//lib/wsrf_provider_jce.jar:/soft/swift/0.93//lib/wsrf_tools.jar:/soft/swift/0.93//lib/wss4j.jar:/soft/swift/0.93//lib/xalan.jar:/soft/swift/0.93//lib/xbean.jar:/soft/swift/0.93//lib/xbean_xpath.jar:/soft/swift/0.93//lib/xercesImpl.jar:/soft/swift/0.93//lib/xml-apis.jar:/soft/swift/0.93//lib/xmlsec.jar:/soft/swift/0.93//lib/xpp3-1.1.3.4d_b4_min.jar:/soft/swift/0.93//lib/xstream-1.1.1-patched.jar: org.griphyn.vdl.karajan.Loader '-config' 'demo_realcf.cf' '-sites.file' 'demo_realSites.xml' '-tc.file' 'demo_realtc.tc' 'demo_real.swift' >>> >>> >> >> >> >> >> >> >> -- >> What is essential is invisible to the eye >> > > From tim.g.armstrong at gmail.com Thu Oct 18 10:20:06 2012 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 18 Oct 2012 10:20:06 -0500 Subject: [Swift-devel] Minor question about semantics Message-ID: I'm just looking at app function semantics and had a few questions about how some things worked that weren't clear from the user manual. There were a few things I didn't quite get and wanted to make sure I wasn't missing something. Is there any difference between @x and @filename(x) in an app command line? E.g. app (binaryfile bf) myproc (int i, string s="foo") { myapp i s @filename(bf); } versus app (binaryfile bf) myproc (int i, string s="foo") { myapp i s @bf; } Also, what is the intended behaviour if you omit the @ in front of the variable name? Is this valid? E.g. app (binaryfile bf) myproc (int i, string s="foo") { myapp i s bf; } Cheers, Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Oct 18 13:58:10 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 18 Oct 2012 13:58:10 -0500 (CDT) Subject: [Swift-devel] Minor question about semantics In-Reply-To: Message-ID: <728906983.52895.1350586690342.JavaMail.root@zimbra.anl.gov> Tim, here's what I think the answers are, but others should verify what I say here. > Is there any difference between @x and @filename(x) in an app command > line? E.g. 
> > app (binaryfile bf) myproc (int i, string s="foo") { > myapp i s @filename(bf); > } > > versus > > app (binaryfile bf) myproc ( int i, string s= "foo" ) { > myapp i s @ bf; > } These should behave identically, except that I don't know if lexically one can leave a space between @ and the variable (i.e. I think it needs to be @bf vs @ bf). I assume you weren't asking about the space. The place where I believe that @bf and @filename(bf) behave differently is that the shorthand @bf is not accepted syntactically everywhere. I thought we have an open ticket on this but I cant locate it. We need to test, but I think @f doesn't work in ordinary expressions, where you would expect it to be equivalent to @filename(f). I.e. it only works on an app() command line template. I think that @f should always be identical to @filename(f) in all cases. If we deprecate the use of @ as a prefix for intrinsic functions (as I agree we should) then we perhaps want to retain the use of @f as a syntactic shorthand for filename(f). In swift/t, for a transition period, can we just ignore the @ in all expressions of the form @fname(args)? Perhaps with a deprecation warning? > Also, what is the intended behaviour if you omit the @ in front of the > variable name? Is this valid? E.g. > > app (binaryfile bf) myproc ( int i, string s= "foo" ) { > myapp i s bf; > } As I recall this inserts a string representation of the file object on the command line, as if one is doing a tracef() on the file object instead of its mapped filename string. That seems a somewhat useful behavior for debugging but is seldom what you want to running an app() command. - Mike > > Cheers, > Tim > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Oct 18 14:07:34 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Oct 2012 12:07:34 -0700 Subject: [Swift-devel] Minor question about semantics In-Reply-To: <728906983.52895.1350586690342.JavaMail.root@zimbra.anl.gov> References: <728906983.52895.1350586690342.JavaMail.root@zimbra.anl.gov> Message-ID: <1350587254.2303.3.camel@blabla> I'll add that at least at one point we discussed the fact that the symbol "@" as a mechanism to distinguish between an app/compound and a built-in function is not really necessary (and a bit distasteful), as well as limiting (one may want to be able to have user-defined functions that can be invoked as part of the app command line). Mihael On Thu, 2012-10-18 at 13:58 -0500, Michael Wilde wrote: > Tim, here's what I think the answers are, but others should verify what I say here. > > > Is there any difference between @x and @filename(x) in an app command > > line? E.g. > > > > app (binaryfile bf) myproc (int i, string s="foo") { > > myapp i s @filename(bf); > > } > > > > versus > > > > app (binaryfile bf) myproc ( int i, string s= "foo" ) { > > myapp i s @ bf; > > } > > These should behave identically, except that I don't know if lexically one can leave a space between @ and the variable (i.e. I think it needs to be @bf vs @ bf). I assume you weren't asking about the space. > > The place where I believe that @bf and @filename(bf) behave differently is that the shorthand @bf is not accepted syntactically everywhere. 
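
To make the cases in this thread concrete, here is a small sketch of the spellings being discussed, assuming an echo executable registered in tc.file; the file names are illustrative, and the behavior stated in the comments is Mike's and Mihael's description above rather than something verified independently.

    type file;

    app (file o) echoName (file bf) {
        // @bf and @filename(bf) are both intended to expand to the mapped
        // filename of bf on an app command line; per the discussion above,
        // the @bf shorthand is not accepted in every ordinary expression.
        // A bare "bf" (no @) would instead insert a string rendering of the
        // file object, which is rarely what an app invocation wants.
        echo @bf @filename(bf) stdout=@filename(o);
    }

    file data <"input.dat">;   // illustrative mappings
    file out  <"names.txt">;
    out = echoName(data);
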
I thought we have an open ticket on this but I cant locate it. > > We need to test, but I think @f doesn't work in ordinary expressions, where you would expect it to be equivalent to @filename(f). I.e. it only works on an app() command line template. > > I think that @f should always be identical to @filename(f) in all cases. If we deprecate the use of @ as a prefix for intrinsic functions (as I agree we should) then we perhaps want to retain the use of @f as a syntactic shorthand for filename(f). > > In swift/t, for a transition period, can we just ignore the @ in all expressions of the form @fname(args)? Perhaps with a deprecation warning? > > > Also, what is the intended behaviour if you omit the @ in front of the > > variable name? Is this valid? E.g. > > > > app (binaryfile bf) myproc ( int i, string s= "foo" ) { > > myapp i s bf; > > } > > As I recall this inserts a string representation of the file object on the command line, as if one is doing a tracef() on the file object instead of its mapped filename string. > > That seems a somewhat useful behavior for debugging but is seldom what you want to running an app() command. > > - Mike > > > > > Cheers, > > Tim > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Fri Oct 19 15:10:01 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 19 Oct 2012 15:10:01 -0500 (CDT) Subject: [Swift-devel] coaster error on Midway Message-ID: <827474706.55305.1350677401813.JavaMail.root@zimbra.anl.gov> Im getting the error below on Midway (running the swift from module load swift) Is this a know issue, with a known fix? I'll try latest trunk next. This happened both without and with provider staging. I will post bug with log files if it recurs with latest trunk. - Mike Swift trunk swift-r5939 cog-r3472 RunID: 20121019-2005-yptzr8l6 Progress: time: Fri, 19 Oct 2012 20:05:54 +0000 Progress: time: Fri, 19 Oct 2012 20:05:56 +0000 Stage in:1 Submitted:99 Progress: time: Fri, 19 Oct 2012 20:05:57 +0000 Stage in:33 Submitted:67 Progress: time: Fri, 19 Oct 2012 20:05:58 +0000 Stage in:56 Submitted:36 Active:8 Progress: time: Fri, 19 Oct 2012 20:05:59 +0000 Stage in:65 Submitted:3 Active:32 Progress: time: Fri, 19 Oct 2012 20:06:00 +0000 Stage in:51 Active:49 Progress: time: Fri, 19 Oct 2012 20:06:08 +0000 Active:99 Stage out:1 Progress: time: Fri, 19 Oct 2012 20:06:24 +0000 Active:96 Finished successfully:4 Progress: time: Fri, 19 Oct 2012 20:06:28 +0000 Active:95 Stage out:1 Finished successfully:4 Progress: time: Fri, 19 Oct 2012 20:06:29 +0000 Active:74 Stage out:12 Finished successfully:14 Progress: time: Fri, 19 Oct 2012 20:06:30 +0000 Active:52 Stage out:3 Finished successfully:45 Progress: time: Fri, 19 Oct 2012 20:06:31 +0000 Active:25 Stage out:24 Finished successfully:51 Exception caught while processing reply java.lang.IllegalArgumentException: Wrong data size: 4. 
Data was @].z at org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) at org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) at org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) Exception caught while processing reply java.lang.IllegalArgumentException: Wrong data size: 4. Data was @].z at org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) at org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) at org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) Exception caught while processing reply java.lang.IllegalArgumentException: Wrong data size: 4. Data was @].z at org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) at org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) at org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) Progress: time: Fri, 19 Oct 2012 20:06:33 +0000 Stage out:10 Finished successfully:90 Final status: Fri, 19 Oct 2012 20:06:33 +0000 Finished successfully:100 mid$ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Fri Oct 19 17:59:56 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Fri, 19 Oct 2012 17:59:56 -0500 (CDT) Subject: [Swift-devel] coaster error on Midway In-Reply-To: <827474706.55305.1350677401813.JavaMail.root@zimbra.anl.gov> Message-ID: <241664185.145376.1350687596439.JavaMail.root@zimbra-mb2.anl.gov> I remember something like this happening in one of the versions of trunk - I believe it has been fixed in the latest version. ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" , "Mihael Hategan" > Cc: "swift-devel" > Sent: Friday, October 19, 2012 3:10:01 PM > Subject: coaster error on Midway > Im getting the error below on Midway (running the swift from module > load swift) > > Is this a know issue, with a known fix? > > I'll try latest trunk next. 
> > This happened both without and with provider staging. > > I will post bug with log files if it recurs with latest trunk. > > - Mike > > Swift trunk swift-r5939 cog-r3472 > > RunID: 20121019-2005-yptzr8l6 > Progress: time: Fri, 19 Oct 2012 20:05:54 +0000 > Progress: time: Fri, 19 Oct 2012 20:05:56 +0000 Stage in:1 > Submitted:99 > Progress: time: Fri, 19 Oct 2012 20:05:57 +0000 Stage in:33 > Submitted:67 > Progress: time: Fri, 19 Oct 2012 20:05:58 +0000 Stage in:56 > Submitted:36 Active:8 > Progress: time: Fri, 19 Oct 2012 20:05:59 +0000 Stage in:65 > Submitted:3 Active:32 > Progress: time: Fri, 19 Oct 2012 20:06:00 +0000 Stage in:51 Active:49 > Progress: time: Fri, 19 Oct 2012 20:06:08 +0000 Active:99 Stage out:1 > Progress: time: Fri, 19 Oct 2012 20:06:24 +0000 Active:96 Finished > successfully:4 > Progress: time: Fri, 19 Oct 2012 20:06:28 +0000 Active:95 Stage out:1 > Finished successfully:4 > Progress: time: Fri, 19 Oct 2012 20:06:29 +0000 Active:74 Stage out:12 > Finished successfully:14 > Progress: time: Fri, 19 Oct 2012 20:06:30 +0000 Active:52 Stage out:3 > Finished successfully:45 > Progress: time: Fri, 19 Oct 2012 20:06:31 +0000 Active:25 Stage out:24 > Finished successfully:51 > Exception caught while processing reply > java.lang.IllegalArgumentException: Wrong data size: 4. Data was @].z > at > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > at > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > at > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > Exception caught while processing reply > java.lang.IllegalArgumentException: Wrong data size: 4. Data was @].z > at > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > at > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > at > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > Exception caught while processing reply > java.lang.IllegalArgumentException: Wrong data size: 4. 
Data was @].z > at > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > at > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > at > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > Progress: time: Fri, 19 Oct 2012 20:06:33 +0000 Stage out:10 Finished > successfully:90 > Final status: Fri, 19 Oct 2012 20:06:33 +0000 Finished > successfully:100 > mid$ > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory From wilde at mcs.anl.gov Fri Oct 19 18:04:49 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 19 Oct 2012 18:04:49 -0500 (CDT) Subject: [Swift-devel] coaster error on Midway In-Reply-To: <241664185.145376.1350687596439.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1077353399.55624.1350687889533.JavaMail.root@zimbra.anl.gov> Indeed, when I used latest trunk, the error has not recurred, and Ive sone several 1000-job runs of the EpiSnp app. But that means that the trunk version in Midway's default swift module has this bug. Lets talk on Monday about how to make and test an 0.94 release, which I would define simply as "a trustable trunk snapshot", and then to get that out to all systems and users. - Mike ----- Original Message ----- > From: "David Kelly" > To: "Michael Wilde" > Cc: "swift-devel" , "Mihael Hategan" > Sent: Friday, October 19, 2012 5:59:56 PM > Subject: Re: coaster error on Midway > I remember something like this happening in one of the versions of > trunk - I believe it has been fixed in the latest version. > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "David Kelly" , "Mihael Hategan" > > > > Cc: "swift-devel" > > Sent: Friday, October 19, 2012 3:10:01 PM > > Subject: coaster error on Midway > > Im getting the error below on Midway (running the swift from module > > load swift) > > > > Is this a know issue, with a known fix? > > > > I'll try latest trunk next. > > > > This happened both without and with provider staging. > > > > I will post bug with log files if it recurs with latest trunk. 
> > > > - Mike > > > > Swift trunk swift-r5939 cog-r3472 > > > > RunID: 20121019-2005-yptzr8l6 > > Progress: time: Fri, 19 Oct 2012 20:05:54 +0000 > > Progress: time: Fri, 19 Oct 2012 20:05:56 +0000 Stage in:1 > > Submitted:99 > > Progress: time: Fri, 19 Oct 2012 20:05:57 +0000 Stage in:33 > > Submitted:67 > > Progress: time: Fri, 19 Oct 2012 20:05:58 +0000 Stage in:56 > > Submitted:36 Active:8 > > Progress: time: Fri, 19 Oct 2012 20:05:59 +0000 Stage in:65 > > Submitted:3 Active:32 > > Progress: time: Fri, 19 Oct 2012 20:06:00 +0000 Stage in:51 > > Active:49 > > Progress: time: Fri, 19 Oct 2012 20:06:08 +0000 Active:99 Stage > > out:1 > > Progress: time: Fri, 19 Oct 2012 20:06:24 +0000 Active:96 Finished > > successfully:4 > > Progress: time: Fri, 19 Oct 2012 20:06:28 +0000 Active:95 Stage > > out:1 > > Finished successfully:4 > > Progress: time: Fri, 19 Oct 2012 20:06:29 +0000 Active:74 Stage > > out:12 > > Finished successfully:14 > > Progress: time: Fri, 19 Oct 2012 20:06:30 +0000 Active:52 Stage > > out:3 > > Finished successfully:45 > > Progress: time: Fri, 19 Oct 2012 20:06:31 +0000 Active:25 Stage > > out:24 > > Finished successfully:51 > > Exception caught while processing reply > > java.lang.IllegalArgumentException: Wrong data size: 4. Data was > > @].z > > at > > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > > at > > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > > at > > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > > Exception caught while processing reply > > java.lang.IllegalArgumentException: Wrong data size: 4. Data was > > @].z > > at > > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > > at > > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > > at > > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > > Exception caught while processing reply > > java.lang.IllegalArgumentException: Wrong data size: 4. 
Data was > > @].z > > at > > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > > at > > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > > at > > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > > Progress: time: Fri, 19 Oct 2012 20:06:33 +0000 Stage out:10 > > Finished > > successfully:90 > > Final status: Fri, 19 Oct 2012 20:06:33 +0000 Finished > > successfully:100 > > mid$ > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From lpesce at uchicago.edu Mon Oct 22 10:35:06 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 22 Oct 2012 10:35:06 -0500 Subject: [Swift-devel] coaster error on Midway In-Reply-To: <1077353399.55624.1350687889533.JavaMail.root@zimbra.anl.gov> References: <1077353399.55624.1350687889533.JavaMail.root@zimbra.anl.gov> Message-ID: <38F69461-BA4B-4D2C-97C4-4C26932286D2@uchicago.edu> Is there a current trunk module installed on Beagle? It would be handy, since usually our users end up needing to use trunk.... On Oct 19, 2012, at 6:04 PM, Michael Wilde wrote: > Indeed, when I used latest trunk, the error has not recurred, and Ive sone several 1000-job runs of the EpiSnp app. > > But that means that the trunk version in Midway's default swift module has this bug. > > Lets talk on Monday about how to make and test an 0.94 release, which I would define simply as "a trustable trunk snapshot", and then to get that out to all systems and users. > > - Mike > > > > > ----- Original Message ----- >> From: "David Kelly" >> To: "Michael Wilde" >> Cc: "swift-devel" , "Mihael Hategan" >> Sent: Friday, October 19, 2012 5:59:56 PM >> Subject: Re: coaster error on Midway >> I remember something like this happening in one of the versions of >> trunk - I believe it has been fixed in the latest version. >> >> ----- Original Message ----- >>> From: "Michael Wilde" >>> To: "David Kelly" , "Mihael Hategan" >>> >>> Cc: "swift-devel" >>> Sent: Friday, October 19, 2012 3:10:01 PM >>> Subject: coaster error on Midway >>> Im getting the error below on Midway (running the swift from module >>> load swift) >>> >>> Is this a know issue, with a known fix? >>> >>> I'll try latest trunk next. >>> >>> This happened both without and with provider staging. >>> >>> I will post bug with log files if it recurs with latest trunk. 
>>> >>> - Mike >>> >>> Swift trunk swift-r5939 cog-r3472 >>> >>> RunID: 20121019-2005-yptzr8l6 >>> Progress: time: Fri, 19 Oct 2012 20:05:54 +0000 >>> Progress: time: Fri, 19 Oct 2012 20:05:56 +0000 Stage in:1 >>> Submitted:99 >>> Progress: time: Fri, 19 Oct 2012 20:05:57 +0000 Stage in:33 >>> Submitted:67 >>> Progress: time: Fri, 19 Oct 2012 20:05:58 +0000 Stage in:56 >>> Submitted:36 Active:8 >>> Progress: time: Fri, 19 Oct 2012 20:05:59 +0000 Stage in:65 >>> Submitted:3 Active:32 >>> Progress: time: Fri, 19 Oct 2012 20:06:00 +0000 Stage in:51 >>> Active:49 >>> Progress: time: Fri, 19 Oct 2012 20:06:08 +0000 Active:99 Stage >>> out:1 >>> Progress: time: Fri, 19 Oct 2012 20:06:24 +0000 Active:96 Finished >>> successfully:4 >>> Progress: time: Fri, 19 Oct 2012 20:06:28 +0000 Active:95 Stage >>> out:1 >>> Finished successfully:4 >>> Progress: time: Fri, 19 Oct 2012 20:06:29 +0000 Active:74 Stage >>> out:12 >>> Finished successfully:14 >>> Progress: time: Fri, 19 Oct 2012 20:06:30 +0000 Active:52 Stage >>> out:3 >>> Finished successfully:45 >>> Progress: time: Fri, 19 Oct 2012 20:06:31 +0000 Active:25 Stage >>> out:24 >>> Finished successfully:51 >>> Exception caught while processing reply >>> java.lang.IllegalArgumentException: Wrong data size: 4. Data was >>> @].z >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) >>> at >>> org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) >>> Exception caught while processing reply >>> java.lang.IllegalArgumentException: Wrong data size: 4. Data was >>> @].z >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) >>> at >>> org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) >>> Exception caught while processing reply >>> java.lang.IllegalArgumentException: Wrong data size: 4. 
Data was >>> @].z >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) >>> at >>> org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) >>> Progress: time: Fri, 19 Oct 2012 20:06:33 +0000 Stage out:10 >>> Finished >>> successfully:90 >>> Final status: Fri, 19 Oct 2012 20:06:33 +0000 Finished >>> successfully:100 >>> mid$ >>> >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Tue Oct 23 00:28:39 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 23 Oct 2012 00:28:39 -0500 (CDT) Subject: [Swift-devel] Swift 0.94 release planning In-Reply-To: <1192836350.154803.1350968155597.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <40830745.154854.1350970119511.JavaMail.root@zimbra-mb2.anl.gov> Hello all, I just wanted to let you all know that I've created branches to prepare for an eventual 0.94 release. The SVN paths are: https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.10/src/cog https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.94 I will be working to add more tests to the suite, and to make sure that any known issues are documented in bugzilla. Regards, David From ketancmaheshwari at gmail.com Tue Oct 23 10:34:25 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Oct 2012 11:34:25 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider Message-ID: Hi, I am trying to run an experiment on a 32-core machine with the hope of running 8, 16, 24 and 32 jobs in parallel. I am trying to control these numbers of parallel jobs by setting the Karajan jobthrottle values in sites.xml to 0.07, 0.15, and so on. However, it seems that the values are not corresponding to what I see in the Swift progress text. Initially, when I set jobthrottle to 0.07, only 2 jobs started in parallel. Then I added the line setting "Initialscore" value to 10000, which improved the jobs to 5. After this a 10-fold increase in "initialscore" did not improve the jobs count. Furthermore, a new batch of 5 jobs get started only when *all* jobs from the old batch are over as opposed to a continuous supply of jobs from "site selection" to "stage out" state which happens in the case of coaster and other providers. The behavior is same in Swift 0.93.1 and latest trunk. Thank you for any clues on how to set the expected number of parallel jobs to these values. Please find attached one such log of this run. 
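For reference, the throttle settings being varied here live in the sites.xml pool definition; a minimal local-provider pool expressing them looks roughly like the sketch below. The handle and workdirectory are placeholders, and only the two profile values come from the description above:

  <pool handle="localhost">
    <execution provider="local"/>
    <filesystem provider="local"/>
    <workdirectory>/path/to/swift.workdir</workdirectory>
    <profile namespace="karajan" key="jobThrottle">0.07</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
  </pool>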
Thanks, -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mars-localproviderlog.tgz Type: application/x-gzip Size: 112743 bytes Desc: not available URL: From davidk at ci.uchicago.edu Tue Oct 23 11:47:02 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 23 Oct 2012 11:47:02 -0500 (CDT) Subject: [Swift-devel] Sourceforge down Message-ID: <1681942315.157497.1351010822027.JavaMail.root@zimbra-mb2.anl.gov> Just an FYI - sourceforge has been having issues today and we seem to be unable to access the CoG SVN. https://sourceforge.net/blog/various-sourceforge-services-down/ From wilde at mcs.anl.gov Tue Oct 23 12:23:34 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 12:23:34 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: Message-ID: <601640800.60237.1351013014109.JavaMail.root@zimbra.anl.gov> Hi Ketan, In the log you attached I see this: 0.10 100000 You should leave initialScore constant, and set to a large number, no matter what level of manual throttling you want to specify via sites.xml. We always use 10000 for this value. Don't attempt to vary the initialScore value for manual throttle: just use jobThrottle to set what you want. A jobThrottle value of 0.10 should run 11 jobs in parallel (jobThrottle * 100) + 1 (for historical reasons related to the automatic throttling algorithm). If you are seeing less than that, one common cause is that the ratio of your input staging times to your job run times is so high as to make it impossible for Swift to keep the expected/desired number of jobs in active state at once. I suggest you test the throttle behavior with a simple app script like "catsnsleep" (catsn with an artificial sleep to increase job duration). If your settings (sites + cf) work for that test, then they should work for the real app, within the staging constraints. Using CDM "direct" mode is likely what you want here to eliminate unnecessary staging on a local cluster. In your test, what was this ratio? Can you also post your cf file and the progress log from stdout/stderr? - Mike ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Swift Devel" > Sent: Tuesday, October 23, 2012 10:34:25 AM > Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider > Hi, > > > I am trying to run an experiment on a 32-core machine with the hope of > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > these numbers of parallel jobs by setting the Karajan jobthrottle > values in sites.xml to 0.07, 0.15, and so on. > > > However, it seems that the values are not corresponding to what I see > in the Swift progress text. > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > parallel. Then I added the line setting "Initialscore" value to 10000, > which improved the jobs to 5. After this a 10-fold increase in > "initialscore" did not improve the jobs count. > > > Furthermore, a new batch of 5 jobs get started only when *all* jobs > from the old batch are over as opposed to a continuous supply of jobs > from "site selection" to "stage out" state which happens in the case > of coaster and other providers. > > > The behavior is same in Swift 0.93.1 and latest trunk. > > > > Thank you for any clues on how to set the expected number of parallel > jobs to these values. 
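Working that (jobThrottle * 100) + 1 rule backwards for the core counts targeted in this thread gives, approximately:

  jobThrottle 0.07 -> 8 concurrent jobs
  jobThrottle 0.10 -> 11 concurrent jobs (the example above)
  jobThrottle 0.15 -> 16 concurrent jobs
  jobThrottle 0.23 -> 24 concurrent jobs
  jobThrottle 0.31 -> 32 concurrent jobs

These figures follow only from the formula quoted above; they are targets, not measurements.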
> > > Please find attached one such log of this run. > Thanks, -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Oct 23 13:36:42 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 13:36:42 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: <601640800.60237.1351013014109.JavaMail.root@zimbra.anl.gov> Message-ID: <1140982512.60437.1351017402648.JavaMail.root@zimbra.anl.gov> Ketan, looking further I see that your app has a large number of output files, O(100). Depending on their size, and the speed of the filesystem on which you are testing, that re-inforces my suspicion that low concurrency you are seeing is due to staging IO. If this is a local 32-core host, try running with your input and output data and workdirectory all on a local hard disk (or even /dev/shm if it has sufficient RAM/space). Then try using CDM direct as explained at: http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Ketan Maheshwari" > Cc: "Swift Devel" > Sent: Tuesday, October 23, 2012 12:23:34 PM > Subject: Re: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider > Hi Ketan, > > In the log you attached I see this: > > 0.10 > 100000 > > You should leave initialScore constant, and set to a large number, no > matter what level of manual throttling you want to specify via > sites.xml. We always use 10000 for this value. Don't attempt to vary > the initialScore value for manual throttle: just use jobThrottle to > set what you want. > > A jobThrottle value of 0.10 should run 11 jobs in parallel > (jobThrottle * 100) + 1 (for historical reasons related to the > automatic throttling algorithm). > > If you are seeing less than that, one common cause is that the ratio > of your input staging times to your job run times is so high as to > make it impossible for Swift to keep the expected/desired number of > jobs in active state at once. > > I suggest you test the throttle behavior with a simple app script like > "catsnsleep" (catsn with an artificial sleep to increase job > duration). If your settings (sites + cf) work for that test, then they > should work for the real app, within the staging constraints. Using > CDM "direct" mode is likely what you want here to eliminate > unnecessary staging on a local cluster. > > In your test, what was this ratio? Can you also post your cf file and > the progress log from stdout/stderr? > > - Mike > > ----- Original Message ----- > > From: "Ketan Maheshwari" > > To: "Swift Devel" > > Sent: Tuesday, October 23, 2012 10:34:25 AM > > Subject: [Swift-devel] jobthrottle value does not correspond to > > number of parallel jobs on local provider > > Hi, > > > > > > I am trying to run an experiment on a 32-core machine with the hope > > of > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > > these numbers of parallel jobs by setting the Karajan jobthrottle > > values in sites.xml to 0.07, 0.15, and so on. > > > > > > However, it seems that the values are not corresponding to what I > > see > > in the Swift progress text. 
> > > > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > > parallel. Then I added the line setting "Initialscore" value to > > 10000, > > which improved the jobs to 5. After this a 10-fold increase in > > "initialscore" did not improve the jobs count. > > > > > > Furthermore, a new batch of 5 jobs get started only when *all* jobs > > from the old batch are over as opposed to a continuous supply of > > jobs > > from "site selection" to "stage out" state which happens in the case > > of coaster and other providers. > > > > > > The behavior is same in Swift 0.93.1 and latest trunk. > > > > > > > > Thank you for any clues on how to set the expected number of > > parallel > > jobs to these values. > > > > > > Please find attached one such log of this run. > > Thanks, -- > > Ketan > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From ketancmaheshwari at gmail.com Tue Oct 23 14:02:15 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Oct 2012 15:02:15 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: <1140982512.60437.1351017402648.JavaMail.root@zimbra.anl.gov> References: <601640800.60237.1351013014109.JavaMail.root@zimbra.anl.gov> <1140982512.60437.1351017402648.JavaMail.root@zimbra.anl.gov> Message-ID: Mike, Thank you for your answers. I tried catsnsleep with n=100 and s=10 and indeed the number of parallel jobs corresponded to the jobthrottle value. Surprisingly, when I started the mars application immediately after this, it also started 32 jobs in parallel. However, the run failed with "too many open files" error after a while. Now, I am trying cdm method. Will keep you posted. On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde wrote: > Ketan, looking further I see that your app has a large number of output > files, O(100). Depending on their size, and the speed of the filesystem on > which you are testing, that re-inforces my suspicion that low concurrency > you are seeing is due to staging IO. > > If this is a local 32-core host, try running with your input and output > data and workdirectory all on a local hard disk (or even /dev/shm if it has > sufficient RAM/space). Then try using CDM direct as explained at: > > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > - Mike > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Ketan Maheshwari" > > Cc: "Swift Devel" > > Sent: Tuesday, October 23, 2012 12:23:34 PM > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to > number of parallel jobs on local provider > > Hi Ketan, > > > > In the log you attached I see this: > > > > 0.10 > > 100000 > > > > You should leave initialScore constant, and set to a large number, no > > matter what level of manual throttling you want to specify via > > sites.xml. We always use 10000 for this value. 
Don't attempt to vary > > the initialScore value for manual throttle: just use jobThrottle to > > set what you want. > > > > A jobThrottle value of 0.10 should run 11 jobs in parallel > > (jobThrottle * 100) + 1 (for historical reasons related to the > > automatic throttling algorithm). > > > > If you are seeing less than that, one common cause is that the ratio > > of your input staging times to your job run times is so high as to > > make it impossible for Swift to keep the expected/desired number of > > jobs in active state at once. > > > > I suggest you test the throttle behavior with a simple app script like > > "catsnsleep" (catsn with an artificial sleep to increase job > > duration). If your settings (sites + cf) work for that test, then they > > should work for the real app, within the staging constraints. Using > > CDM "direct" mode is likely what you want here to eliminate > > unnecessary staging on a local cluster. > > > > In your test, what was this ratio? Can you also post your cf file and > > the progress log from stdout/stderr? > > > > - Mike > > > > ----- Original Message ----- > > > From: "Ketan Maheshwari" > > > To: "Swift Devel" > > > Sent: Tuesday, October 23, 2012 10:34:25 AM > > > Subject: [Swift-devel] jobthrottle value does not correspond to > > > number of parallel jobs on local provider > > > Hi, > > > > > > > > > I am trying to run an experiment on a 32-core machine with the hope > > > of > > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > > > these numbers of parallel jobs by setting the Karajan jobthrottle > > > values in sites.xml to 0.07, 0.15, and so on. > > > > > > > > > However, it seems that the values are not corresponding to what I > > > see > > > in the Swift progress text. > > > > > > > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > > > parallel. Then I added the line setting "Initialscore" value to > > > 10000, > > > which improved the jobs to 5. After this a 10-fold increase in > > > "initialscore" did not improve the jobs count. > > > > > > > > > Furthermore, a new batch of 5 jobs get started only when *all* jobs > > > from the old batch are over as opposed to a continuous supply of > > > jobs > > > from "site selection" to "stage out" state which happens in the case > > > of coaster and other providers. > > > > > > > > > The behavior is same in Swift 0.93.1 and latest trunk. > > > > > > > > > > > > Thank you for any clues on how to set the expected number of > > > parallel > > > jobs to these values. > > > > > > > > > Please find attached one such log of this run. > > > Thanks, -- > > > Ketan > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
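A minimal throttle test in the spirit of the catsnsleep suggestion can be sketched as below. The app name is arbitrary and would need a tc.data entry pointing at a one-line shell wrapper that cats its input and then sleeps for ten seconds or so; the input file name is a placeholder:

  type file;

  app (file o) catsleep (file i) {
    catsleep @i stdout=@o;
  }

  file data<"data.txt">;
  file out[];
  foreach j in [1:100] {
    out[j] = catsleep(data);
  }

With every job pinned to roughly the same duration, the Active count on the progress line should settle near (jobThrottle * 100) + 1.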
URL: From ketancmaheshwari at gmail.com Tue Oct 23 14:52:48 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Oct 2012 15:52:48 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: References: <601640800.60237.1351013014109.JavaMail.root@zimbra.anl.gov> <1140982512.60437.1351017402648.JavaMail.root@zimbra.anl.gov> Message-ID: Now trying with cdm. My cdm policy file contains a single line as follows: rule .* DEFAULT / This seems to be working at stage in because I immediately see my jobs starting. However, it fails immediately after with a message: "Execution failed: The following output files were not created by the application:" Followed by a list of outputs. I recall this could happen if absolute pathnames are not provided, so I updated my mappers.sh scripts with absolute pathnames including a double // in the beginning without success. The run log do not show any specific indicators of error other than the above message. I see a bunch of CDM_POLICY CDM_ACTION lines in the wrapper.log in one of the many jobdirs as follows: CDM_POLICY: /home/train07/ketan_mars/swift/result52/mars.ot48 -> DEFAULT / CDM_ACTION: /home/train07/ketan_mars/swift/swift.workdir/mars-20121023-1240-vbptd8i9/jobs/g/mars-gtln0yzk OUTPUT /home/train07/ketan_mars/swift/result52/mars.ot48 DEFAULT / Not sure if something could've gone wrong here. Attaching the log file and one of the job dirs. Regards, Ketan On Tue, Oct 23, 2012 at 3:02 PM, Ketan Maheshwari < ketancmaheshwari at gmail.com> wrote: > Mike, > > Thank you for your answers. > > I tried catsnsleep with n=100 and s=10 and indeed the number of parallel > jobs corresponded to the jobthrottle value. > Surprisingly, when I started the mars application immediately after this, > it also started 32 jobs in parallel. However, the run failed with "too many > open files" error after a while. > > Now, I am trying cdm method. Will keep you posted. > > > On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde wrote: > >> Ketan, looking further I see that your app has a large number of output >> files, O(100). Depending on their size, and the speed of the filesystem on >> which you are testing, that re-inforces my suspicion that low concurrency >> you are seeing is due to staging IO. >> >> If this is a local 32-core host, try running with your input and output >> data and workdirectory all on a local hard disk (or even /dev/shm if it has >> sufficient RAM/space). Then try using CDM direct as explained at: >> >> >> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases >> >> - Mike >> >> ----- Original Message ----- >> > From: "Michael Wilde" >> > To: "Ketan Maheshwari" >> > Cc: "Swift Devel" >> > Sent: Tuesday, October 23, 2012 12:23:34 PM >> > Subject: Re: [Swift-devel] jobthrottle value does not correspond to >> number of parallel jobs on local provider >> > Hi Ketan, >> > >> > In the log you attached I see this: >> > >> > 0.10 >> > 100000 >> > >> > You should leave initialScore constant, and set to a large number, no >> > matter what level of manual throttling you want to specify via >> > sites.xml. We always use 10000 for this value. Don't attempt to vary >> > the initialScore value for manual throttle: just use jobThrottle to >> > set what you want. >> > >> > A jobThrottle value of 0.10 should run 11 jobs in parallel >> > (jobThrottle * 100) + 1 (for historical reasons related to the >> > automatic throttling algorithm). 
>> > >> > If you are seeing less than that, one common cause is that the ratio >> > of your input staging times to your job run times is so high as to >> > make it impossible for Swift to keep the expected/desired number of >> > jobs in active state at once. >> > >> > I suggest you test the throttle behavior with a simple app script like >> > "catsnsleep" (catsn with an artificial sleep to increase job >> > duration). If your settings (sites + cf) work for that test, then they >> > should work for the real app, within the staging constraints. Using >> > CDM "direct" mode is likely what you want here to eliminate >> > unnecessary staging on a local cluster. >> > >> > In your test, what was this ratio? Can you also post your cf file and >> > the progress log from stdout/stderr? >> > >> > - Mike >> > >> > ----- Original Message ----- >> > > From: "Ketan Maheshwari" >> > > To: "Swift Devel" >> > > Sent: Tuesday, October 23, 2012 10:34:25 AM >> > > Subject: [Swift-devel] jobthrottle value does not correspond to >> > > number of parallel jobs on local provider >> > > Hi, >> > > >> > > >> > > I am trying to run an experiment on a 32-core machine with the hope >> > > of >> > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control >> > > these numbers of parallel jobs by setting the Karajan jobthrottle >> > > values in sites.xml to 0.07, 0.15, and so on. >> > > >> > > >> > > However, it seems that the values are not corresponding to what I >> > > see >> > > in the Swift progress text. >> > > >> > > >> > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in >> > > parallel. Then I added the line setting "Initialscore" value to >> > > 10000, >> > > which improved the jobs to 5. After this a 10-fold increase in >> > > "initialscore" did not improve the jobs count. >> > > >> > > >> > > Furthermore, a new batch of 5 jobs get started only when *all* jobs >> > > from the old batch are over as opposed to a continuous supply of >> > > jobs >> > > from "site selection" to "stage out" state which happens in the case >> > > of coaster and other providers. >> > > >> > > >> > > The behavior is same in Swift 0.93.1 and latest trunk. >> > > >> > > >> > > >> > > Thank you for any clues on how to set the expected number of >> > > parallel >> > > jobs to these values. >> > > >> > > >> > > Please find attached one such log of this run. >> > > Thanks, -- >> > > Ketan >> > > >> > > >> > > >> > > _______________________________________________ >> > > Swift-devel mailing list >> > > Swift-devel at ci.uchicago.edu >> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> > >> > -- >> > Michael Wilde >> > Computation Institute, University of Chicago >> > Mathematics and Computer Science Division >> > Argonne National Laboratory >> > >> > _______________________________________________ >> > Swift-devel mailing list >> > Swift-devel at ci.uchicago.edu >> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> > > > -- > Ketan > > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
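The two policy files at issue differ in a single word. A rule of the form

  rule .* DEFAULT /

leaves the matched files on the normal staging path through the work directory, while

  rule .* DIRECT /

has Swift read and write the matched files in place, which is also why DIRECT expects the mapped files to be given absolute paths.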
Name: mars-debug.tgz Type: application/x-gzip Size: 101159 bytes Desc: not available URL: From wilde at mcs.anl.gov Tue Oct 23 15:14:41 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 15:14:41 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: Message-ID: <199523820.60743.1351023281527.JavaMail.root@zimbra.anl.gov> > Now trying with cdm. My cdm policy file contains a single line as > follows: > > > rule .* DEFAULT / Change DEFAULT to DIRECT Look at the example at: http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases sec 20.5.1 and 20.5.2 Its best to test this first with simple "catsn-like" examples before you try your science app with it, to make sure that the direct processing is behaving as you expect. - Mike From ketancmaheshwari at gmail.com Tue Oct 23 15:23:16 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Oct 2012 16:23:16 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: <199523820.60743.1351023281527.JavaMail.root@zimbra.anl.gov> References: <199523820.60743.1351023281527.JavaMail.root@zimbra.anl.gov> Message-ID: I tried CDM DIRECT policy with catsnsleep which gives the following: CDM[DIRECT]: Linking to /home/train07/ketan_mars/swift/data.txt failed! the file exists as mentioned in the path above. When specifying relative path to data.txt, I get /data.txt doesn't exist, presumably because cdm DIRECT assumes paths to be absolute. Log attached. On Tue, Oct 23, 2012 at 4:14 PM, Michael Wilde wrote: > > Now trying with cdm. My cdm policy file contains a single line as > > follows: > > > > > > rule .* DEFAULT / > > Change DEFAULT to DIRECT > > Look at the example at: > > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > sec 20.5.1 and 20.5.2 > > Its best to test this first with simple "catsn-like" examples before you > try your science app with it, to make sure that the direct processing is > behaving as you expect. > > - Mike > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: catsnsleep-20121023-1317-cd9bgebg.log Type: application/octet-stream Size: 22557 bytes Desc: not available URL: From wilde at mcs.anl.gov Tue Oct 23 16:00:31 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 16:00:31 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: Message-ID: <1301166271.60909.1351026031769.JavaMail.root@zimbra.anl.gov> Ketan, I tested this with a recent trunk rev and it seems to work. Can you try to replicate the test below and proceed from there to make your specific app pattern work? This was run on pads. 
- Mike login2$ swift -config cf -cdm.file direct -tc.file tc -sites.file sites.xml catsndirect.swift -n=1 Swift trunk swift-r5920 cog-r3471 (cog modified locally) RunID: 20121023-1554-vese3gh6 Progress: time: Tue, 23 Oct 2012 15:54:47 -0500 Final status: Tue, 23 Oct 2012 15:54:47 -0500 Finished successfully:1 login2$ cat catsndirect.swift type file; app (file o) cat (file i) { cat @i stdout=@o; } file out[]; foreach j in [1:@toint(@arg("n","1"))] { file data<"/tmp/wilde/indir/data.txt">; out[j] = cat(data); } login2$ cat cf wrapperlog.always.transfer=true sitedir.keep=true execution.retries=0 lazy.errors=false status.mode=provider use.provider.staging=false provider.staging.pin.swiftfiles=false use.wrapper.staging=false login2$ cat direct rule .* DIRECT / login2$ cat sites.xml /scratch/local/wilde/pstest/swiftwork login2$ ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Tuesday, October 23, 2012 3:23:16 PM > Subject: Re: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider > I tried CDM DIRECT policy with catsnsleep which gives the following: > > > CDM[DIRECT]: Linking to /home/train07/ketan_mars/swift/data.txt > failed! > > > the file exists as mentioned in the path above. > > > When specifying relative path to data.txt, I get /data.txt doesn't > exist, presumably because cdm DIRECT assumes paths to be absolute. > > > Log attached. > > > On Tue, Oct 23, 2012 at 4:14 PM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > > > Now trying with cdm. My cdm policy file contains a single line as > > follows: > > > > > > rule .* DEFAULT / > > Change DEFAULT to DIRECT > > Look at the example at: > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > sec 20.5.1 and 20.5.2 > > Its best to test this first with simple "catsn-like" examples before > you try your science app with it, to make sure that the direct > processing is behaving as you expect. > > - Mike > > > > > -- > Ketan -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Oct 23 18:14:02 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 18:14:02 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: Message-ID: <1605523736.61205.1351034042528.JavaMail.root@zimbra.anl.gov> I just noticed your mention here of a "too many open files" problem. Can you tell me what "ulimit -n" (max # of open files) reports for your system? Can you alter your app script to return the 100+ files in a tarball instead of individually? What may be happening here is: - if you have low -n limit (eg 1024) and - you are using provider staging, meaning the swift or coaster service jvm will be writing the final output files directly and - you are writing 32 jobs x 100 files files concurrently then -> you will exceed your limit of open files. Just a hypothesis - you'll need to dig deeper and see if you can extend the ulimit for -n. - Mike ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Tuesday, October 23, 2012 2:02:15 PM > Subject: Re: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider > Mike, > > > Thank you for your answers. 
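As a quick check along the lines asked about above (the numbers are illustrative, not taken from any of these runs):

  $ ulimit -n           # soft limit on open files for this shell
  1024
  $ ulimit -Hn          # hard limit
  4096
  $ ulimit -n 4096      # raise the soft limit up to the hard limit

With 32 concurrent jobs each returning on the order of 100 output files through provider staging, a 1024-file limit would be easy to exceed, which is consistent with the hypothesis above.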
> > > I tried catsnsleep with n=100 and s=10 and indeed the number of > parallel jobs corresponded to the jobthrottle value. > Surprisingly, when I started the mars application immediately after > this, it also started 32 jobs in parallel. However, the run failed > with "too many open files" error after a while. > > > Now, I am trying cdm method. Will keep you posted. > > > On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > Ketan, looking further I see that your app has a large number of > output files, O(100). Depending on their size, and the speed of the > filesystem on which you are testing, that re-inforces my suspicion > that low concurrency you are seeing is due to staging IO. > > If this is a local 32-core host, try running with your input and > output data and workdirectory all on a local hard disk (or even > /dev/shm if it has sufficient RAM/space). Then try using CDM direct as > explained at: > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > > - Mike > > ----- Original Message ----- > > > > From: "Michael Wilde" < wilde at mcs.anl.gov > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > Sent: Tuesday, October 23, 2012 12:23:34 PM > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to > > number of parallel jobs on local provider > > Hi Ketan, > > > > In the log you attached I see this: > > > > 0.10 > > 100000 > > > > You should leave initialScore constant, and set to a large number, > > no > > matter what level of manual throttling you want to specify via > > sites.xml. We always use 10000 for this value. Don't attempt to vary > > the initialScore value for manual throttle: just use jobThrottle to > > set what you want. > > > > A jobThrottle value of 0.10 should run 11 jobs in parallel > > (jobThrottle * 100) + 1 (for historical reasons related to the > > automatic throttling algorithm). > > > > If you are seeing less than that, one common cause is that the ratio > > of your input staging times to your job run times is so high as to > > make it impossible for Swift to keep the expected/desired number of > > jobs in active state at once. > > > > I suggest you test the throttle behavior with a simple app script > > like > > "catsnsleep" (catsn with an artificial sleep to increase job > > duration). If your settings (sites + cf) work for that test, then > > they > > should work for the real app, within the staging constraints. Using > > CDM "direct" mode is likely what you want here to eliminate > > unnecessary staging on a local cluster. > > > > In your test, what was this ratio? Can you also post your cf file > > and > > the progress log from stdout/stderr? > > > > - Mike > > > > ----- Original Message ----- > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > > To: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > Sent: Tuesday, October 23, 2012 10:34:25 AM > > > Subject: [Swift-devel] jobthrottle value does not correspond to > > > number of parallel jobs on local provider > > > Hi, > > > > > > > > > I am trying to run an experiment on a 32-core machine with the > > > hope > > > of > > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > > > these numbers of parallel jobs by setting the Karajan jobthrottle > > > values in sites.xml to 0.07, 0.15, and so on. 
> > > > > > > > > However, it seems that the values are not corresponding to what I > > > see > > > in the Swift progress text. > > > > > > > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > > > parallel. Then I added the line setting "Initialscore" value to > > > 10000, > > > which improved the jobs to 5. After this a 10-fold increase in > > > "initialscore" did not improve the jobs count. > > > > > > > > > Furthermore, a new batch of 5 jobs get started only when *all* > > > jobs > > > from the old batch are over as opposed to a continuous supply of > > > jobs > > > from "site selection" to "stage out" state which happens in the > > > case > > > of coaster and other providers. > > > > > > > > > The behavior is same in Swift 0.93.1 and latest trunk. > > > > > > > > > > > > Thank you for any clues on how to set the expected number of > > > parallel > > > jobs to these values. > > > > > > > > > Please find attached one such log of this run. > > > Thanks, -- > > > Ketan > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > > -- > Ketan -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From ketancmaheshwari at gmail.com Wed Oct 24 08:25:47 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Wed, 24 Oct 2012 09:25:47 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: <1605523736.61205.1351034042528.JavaMail.root@zimbra.anl.gov> References: <1605523736.61205.1351034042528.JavaMail.root@zimbra.anl.gov> Message-ID: Hi Mike, Seems it is resolved now. There were multiple issues: In my config file use provider staging was set to true and in sites file staging method was set to file. This was conflicting with cdm link creation because the file with link name was already present. This was resolved by setting the above option to false and removing the staging method line from sites.xml Turns out that Mars only works when the licence file is present in the same dir as data. It does not like licence file symlinked for some reason. So, it had to be excluded from getting cdm'd. I use individual patterns to cdm inputs. In one of the configuration, where I set all my output file mappings to absolute paths in source swift script as well as mappers.sh, I was getting falsely successful jobs: swift did not complain but only blank output files were touch'd (by cdm?). It complained in the end when the files were not found to the last job which accepts them as input. Another issue was with the workdir in my sites.xml. It was a relative path in mine whereas was absolute path in your case. Swift complained with exit status 127 in my case and worked when I provide absolute path. I am not sure if this was trunk or 0.93.1. I will check again. 
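Putting those pieces together, the working combination amounts to provider staging switched off in the Swift configuration file,

  use.provider.staging=false

an absolute path in the <workdirectory> element of sites.xml, and a CDM file whose pattern matches only the data files, for example

  rule .*\.dat DIRECT /

so that anything the application insists on finding as a real local copy, such as the licence file, falls outside the rule and is staged in the normal way. The property name is the one from the cf listing earlier in the thread; the .dat pattern is only an illustration, not the rule actually used here.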
In an earlier issue where I mentioned Swift not starting the number of parallel jobs for local provider corresponding to the jobthrottle value, I observe that indeed this is true for the local provider but does not seem to be true when using coasters *locally*. Consequently, I tried both approaches on a 32-core machine and found that in the case of coaster provider the performance was better compared to the local provider *with* CDM (Although only the inputs were cdm'd: 7M per job). Here are the results for different throttle values (intended to use different number of cpus) with coasters: 8 cores -- 13m 25sec 16 cores -- 12m 40sec 24 cores -- 10m 51sec 32 cores -- 10m 57sec With local provider, some inputs cdm'd: 8 cores -- 15m 8sec 16 cores -- 12m 4sec 24 cores -- 12m 37sec 32 cores -- 11m 39sec It looks like coaster provider does not take the datamovement to jobs ratio into account and in this case it turns out to be faster. I observe that local provider starts with a much less number of jobs and slowly picks up with more jobs and reached the peak intended number almost always after 25% of jobs completes. Regards, Ketan On Tue, Oct 23, 2012 at 7:14 PM, Michael Wilde wrote: > I just noticed your mention here of a "too many open files" problem. > > Can you tell me what "ulimit -n" (max # of open files) reports for your > system? > > Can you alter your app script to return the 100+ files in a tarball > instead of individually? > > What may be happening here is: > > - if you have low -n limit (eg 1024) and > > - you are using provider staging, meaning the swift or coaster service jvm > will be writing the final output files directly and > > - you are writing 32 jobs x 100 files files concurrently then > > -> you will exceed your limit of open files. > > Just a hypothesis - you'll need to dig deeper and see if you can extend > the ulimit for -n. > > - Mike > > ----- Original Message ----- > > From: "Ketan Maheshwari" > > To: "Michael Wilde" > > Cc: "Swift Devel" > > Sent: Tuesday, October 23, 2012 2:02:15 PM > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to > number of parallel jobs on local provider > > Mike, > > > > > > Thank you for your answers. > > > > > > I tried catsnsleep with n=100 and s=10 and indeed the number of > > parallel jobs corresponded to the jobthrottle value. > > Surprisingly, when I started the mars application immediately after > > this, it also started 32 jobs in parallel. However, the run failed > > with "too many open files" error after a while. > > > > > > Now, I am trying cdm method. Will keep you posted. > > > > > > On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde < wilde at mcs.anl.gov > > > wrote: > > > > > > Ketan, looking further I see that your app has a large number of > > output files, O(100). Depending on their size, and the speed of the > > filesystem on which you are testing, that re-inforces my suspicion > > that low concurrency you are seeing is due to staging IO. > > > > If this is a local 32-core host, try running with your input and > > output data and workdirectory all on a local hard disk (or even > > /dev/shm if it has sufficient RAM/space). 
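For comparison, the "coasters locally" runs referred to above are normally expressed with a coaster pool in the same sites.xml, along these lines. This is a sketch only: the handle, worker count, throttle and path are placeholders rather than the settings behind the timings reported here:

  <pool handle="localhost-coasters">
    <execution provider="coaster" jobmanager="local:local"/>
    <profile namespace="globus" key="workersPerNode">32</profile>
    <profile namespace="karajan" key="jobThrottle">0.31</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local"/>
    <workdirectory>/path/to/swift.workdir</workdirectory>
  </pool>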
Then try using CDM direct as > > explained at: > > > > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > > > > > - Mike > > > > ----- Original Message ----- > > > > > > > From: "Michael Wilde" < wilde at mcs.anl.gov > > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > Sent: Tuesday, October 23, 2012 12:23:34 PM > > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to > > > number of parallel jobs on local provider > > > Hi Ketan, > > > > > > In the log you attached I see this: > > > > > > 0.10 > > > 100000 > > > > > > You should leave initialScore constant, and set to a large number, > > > no > > > matter what level of manual throttling you want to specify via > > > sites.xml. We always use 10000 for this value. Don't attempt to vary > > > the initialScore value for manual throttle: just use jobThrottle to > > > set what you want. > > > > > > A jobThrottle value of 0.10 should run 11 jobs in parallel > > > (jobThrottle * 100) + 1 (for historical reasons related to the > > > automatic throttling algorithm). > > > > > > If you are seeing less than that, one common cause is that the ratio > > > of your input staging times to your job run times is so high as to > > > make it impossible for Swift to keep the expected/desired number of > > > jobs in active state at once. > > > > > > I suggest you test the throttle behavior with a simple app script > > > like > > > "catsnsleep" (catsn with an artificial sleep to increase job > > > duration). If your settings (sites + cf) work for that test, then > > > they > > > should work for the real app, within the staging constraints. Using > > > CDM "direct" mode is likely what you want here to eliminate > > > unnecessary staging on a local cluster. > > > > > > In your test, what was this ratio? Can you also post your cf file > > > and > > > the progress log from stdout/stderr? > > > > > > - Mike > > > > > > ----- Original Message ----- > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > > > To: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > > Sent: Tuesday, October 23, 2012 10:34:25 AM > > > > Subject: [Swift-devel] jobthrottle value does not correspond to > > > > number of parallel jobs on local provider > > > > Hi, > > > > > > > > > > > > I am trying to run an experiment on a 32-core machine with the > > > > hope > > > > of > > > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > > > > these numbers of parallel jobs by setting the Karajan jobthrottle > > > > values in sites.xml to 0.07, 0.15, and so on. > > > > > > > > > > > > However, it seems that the values are not corresponding to what I > > > > see > > > > in the Swift progress text. > > > > > > > > > > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > > > > parallel. Then I added the line setting "Initialscore" value to > > > > 10000, > > > > which improved the jobs to 5. After this a 10-fold increase in > > > > "initialscore" did not improve the jobs count. > > > > > > > > > > > > Furthermore, a new batch of 5 jobs get started only when *all* > > > > jobs > > > > from the old batch are over as opposed to a continuous supply of > > > > jobs > > > > from "site selection" to "stage out" state which happens in the > > > > case > > > > of coaster and other providers. > > > > > > > > > > > > The behavior is same in Swift 0.93.1 and latest trunk. 
> > > > > > > > > > > > > > > > Thank you for any clues on how to set the expected number of > > > > parallel > > > > jobs to these values. > > > > > > > > > > > > Please find attached one such log of this run. > > > > Thanks, -- > > > > Ketan > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > > > > > > > > > -- > > Ketan > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From lpesce at uchicago.edu Thu Oct 25 10:05:52 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 10:05:52 -0500 Subject: [Swift-devel] Experiments on Beagle Message-ID: <7E532B0D-81A5-4DBA-A688-4F4A5D61F7C8@uchicago.edu> Hi -- I am running on more than 10,000 cores because there are a good number of users having problems running their jobs, which left the machine for me =) I am doing work for a user, so don't worry too much. I am just writing in case you want to take a look at how the simulations are proceeding, how the memory is used (login5, user lpesce) and how the number of tasks goes up and down as jobs are completed. (For example, I asked for 500 nodes and I am getting only a little over 400, but the machine has available the nodes I asked for, swift is just not taking them as far as I can tell, the number first spiked up to the requested number of nodes --I think-- then winded down) From wilde at mcs.anl.gov Thu Oct 25 10:22:28 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 25 Oct 2012 10:22:28 -0500 (CDT) Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <7E532B0D-81A5-4DBA-A688-4F4A5D61F7C8@uchicago.edu> Message-ID: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Lorenzo, 10K cores sounds great. Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. Can you point us to the run directory where we can watch your log file and see your config files and your script? - Mike ----- Original Message ----- > From: "Lorenzo Pesce" > To: "swift-devel Devel" > Sent: Thursday, October 25, 2012 10:05:52 AM > Subject: [Swift-devel] Experiments on Beagle > Hi -- > I am running on more than 10,000 cores because there are a good number > of users having problems running their jobs, which left the machine > for me =) > > I am doing work for a user, so don't worry too much. > > I am just writing in case you want to take a look at how the > simulations are proceeding, how the memory is used (login5, user > lpesce) and how the number of tasks goes up and down as jobs are > completed. 
> (For example, I asked for 500 nodes and I am getting only a little > over 400, but the machine has available the nodes I asked for, swift > is just not taking them as far as I can tell, the number first spiked > up to the requested number of nodes --I think-- then winded down) > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From lpesce at uchicago.edu Thu Oct 25 10:25:35 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 10:25:35 -0500 Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> References: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Message-ID: <6A52B499-3249-4CBD-A7EF-DB369C3A690D@uchicago.edu> Sorry for the lapse: /lustre/beagle/GCNet/RG/Oreo/o080522_BS1 On Oct 25, 2012, at 10:22 AM, Michael Wilde wrote: > Lorenzo, 10K cores sounds great. > > Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. > > Can you point us to the run directory where we can watch your log file and see your config files and your script? > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "swift-devel Devel" >> Sent: Thursday, October 25, 2012 10:05:52 AM >> Subject: [Swift-devel] Experiments on Beagle >> Hi -- >> I am running on more than 10,000 cores because there are a good number >> of users having problems running their jobs, which left the machine >> for me =) >> >> I am doing work for a user, so don't worry too much. >> >> I am just writing in case you want to take a look at how the >> simulations are proceeding, how the memory is used (login5, user >> lpesce) and how the number of tasks goes up and down as jobs are >> completed. >> (For example, I asked for 500 nodes and I am getting only a little >> over 400, but the machine has available the nodes I asked for, swift >> is just not taking them as far as I can tell, the number first spiked >> up to the requested number of nodes --I think-- then winded down) >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Thu Oct 25 10:50:04 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 10:50:04 -0500 Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> References: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Message-ID: Run started to falter and I killed it. I resent it out capturing the screen to *.screenlog (lots of failed wrappers, it might be the app itself or the system, I don't know). I hope now I am capturing all the info. On Oct 25, 2012, at 10:22 AM, Michael Wilde wrote: > Lorenzo, 10K cores sounds great. > > Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. 
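For the throttle-settings possibility, the knobs that most often cap concurrency live in the Swift configuration (swift.properties, or the cf file passed to swift) and, for coasters, in the slots/nodes profile entries in sites.xml. The snippet below is a generic sketch with illustrative values, not settings read from the run directory above:

    # Throttles that commonly limit how many jobs Swift keeps in flight;
    # raise them if the log shows jobs waiting while workers sit idle.
    # Cap on concurrent foreach iterations:
    foreach.max.threads=1000
    # Job submissions in flight, overall and per host:
    throttle.submit=100
    throttle.host.submit=100
    # Global job throttle; the per-site jobThrottle profile in sites.xml plays the analogous role:
    throttle.score.job.factor=1000
    # Concurrent file transfers and file operations:
    throttle.transfers=16
    throttle.file.operations=16

Comparing values like these against the requested node count is usually faster than waiting for another run to wind down.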
> > Can you point us to the run directory where we can watch your log file and see your config files and your script? > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "swift-devel Devel" >> Sent: Thursday, October 25, 2012 10:05:52 AM >> Subject: [Swift-devel] Experiments on Beagle >> Hi -- >> I am running on more than 10,000 cores because there are a good number >> of users having problems running their jobs, which left the machine >> for me =) >> >> I am doing work for a user, so don't worry too much. >> >> I am just writing in case you want to take a look at how the >> simulations are proceeding, how the memory is used (login5, user >> lpesce) and how the number of tasks goes up and down as jobs are >> completed. >> (For example, I asked for 500 nodes and I am getting only a little >> over 400, but the machine has available the nodes I asked for, swift >> is just not taking them as far as I can tell, the number first spiked >> up to the requested number of nodes --I think-- then winded down) >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Thu Oct 25 11:33:54 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 11:33:54 -0500 Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> References: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Message-ID: <748781B7-1E2A-4B9B-9EF7-DF9250C590AE@uchicago.edu> Running smoothly on 500 nodes exactly (12,000 jobs). So far the problems of explosions hasn't materialized yet. On Oct 25, 2012, at 10:22 AM, Michael Wilde wrote: > Lorenzo, 10K cores sounds great. > > Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. > > Can you point us to the run directory where we can watch your log file and see your config files and your script? > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "swift-devel Devel" >> Sent: Thursday, October 25, 2012 10:05:52 AM >> Subject: [Swift-devel] Experiments on Beagle >> Hi -- >> I am running on more than 10,000 cores because there are a good number >> of users having problems running their jobs, which left the machine >> for me =) >> >> I am doing work for a user, so don't worry too much. >> >> I am just writing in case you want to take a look at how the >> simulations are proceeding, how the memory is used (login5, user >> lpesce) and how the number of tasks goes up and down as jobs are >> completed. 
>> (For example, I asked for 500 nodes and I am getting only a little >> over 400, but the machine has available the nodes I asked for, swift >> is just not taking them as far as I can tell, the number first spiked >> up to the requested number of nodes --I think-- then winded down) >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Thu Oct 25 11:52:17 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 11:52:17 -0500 Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> References: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Message-ID: <4A4AF125-7A97-49AE-B321-DB8B61D3D00C@uchicago.edu> Failures are starting to appear and dominate the results. I am letting it run anyway so you get a chance to look at it live. On Oct 25, 2012, at 10:22 AM, Michael Wilde wrote: > Lorenzo, 10K cores sounds great. > > Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. > > Can you point us to the run directory where we can watch your log file and see your config files and your script? > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "swift-devel Devel" >> Sent: Thursday, October 25, 2012 10:05:52 AM >> Subject: [Swift-devel] Experiments on Beagle >> Hi -- >> I am running on more than 10,000 cores because there are a good number >> of users having problems running their jobs, which left the machine >> for me =) >> >> I am doing work for a user, so don't worry too much. >> >> I am just writing in case you want to take a look at how the >> simulations are proceeding, how the memory is used (login5, user >> lpesce) and how the number of tasks goes up and down as jobs are >> completed. >> (For example, I asked for 500 nodes and I am getting only a little >> over 400, but the machine has available the nodes I asked for, swift >> is just not taking them as far as I can tell, the number first spiked >> up to the requested number of nodes --I think-- then winded down) >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From iraicu at cs.iit.edu Fri Oct 26 12:43:40 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 26 Oct 2012 13:43:40 -0400 Subject: [Swift-devel] Call for Workshops: ACM HPDC 2013 -- deadline extended to November 1, 2012 Message-ID: <508ACBCC.706@cs.iit.edu> Call for Workshops The organizers of the /22nd International ACM Symposium on High-Performance Parallel and Distributed Computing/ (HPDC'13) *call for proposals for workshops* to be held with HPDC'13. The workshops will be held on June 17-18, 2013. Workshops should provide forums for discussion among researchers and practitioners on focused topics or emerging research areas relevant to the HPDC community. 
Organizers may structure workshops as they see fit, including invited talks, panel discussions, presentations of work in progress, fully peer-reviewed papers, or some combination. Workshops could be scheduled for half a day or a full day, depending on interest, space constraints, and organizer preference. Organizers should design workshops for approximately 20-40 participants, to balance impact and effective discussion. *Workshop proposals* must be sent in PDF format to the HPDC'13 Workshops Chair, Abhishek Chandra (Email: chandra AT cs DOT umn DOT edu ) with the subject line *"HPDC 2013 Workshop Proposal"*, and should include: * The name and acronym of the workshop * A description (0.5-1 page) of the theme of the workshop * A description (one paragraph) of the relation between the theme of the workshop and of HPDC * A list of topics of interest * The names and affiliations of the workshop organizers, and if applicable, of a significant portion of the program committee * A description of the expected structure of the workshop (papers, invited talks, panel discussions, etc.) * Data about previous offerings of the workshop (if any), including the attendance, the numbers of papers or presentations submitted and accepted, and the links to the corresponding websites * A publicity plan for attracting submissions and attendees. Please also include expected number of submissions, accepted papers, and attendees that you anticipate for a successful workshop. Due to publication deadlines, workshops must operate within roughly the following timeline: papers due mid February (2-3 weeks after the HPDC deadline), and selected and sent to the publisher by mid April. Important dates: *Workshop Proposals Due: * *November 1, 2012* Notifications: November 7, 2012 Workshop CFPs Online and Distributed: November 25, 2012 -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.iit.edu Fri Oct 26 13:13:51 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 26 Oct 2012 14:13:51 -0400 Subject: [Swift-devel] CFP: 22nd Int. ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'13) Message-ID: <508AD2DF.8060609@cs.iit.edu> **** CALL FOR PAPERS **** The 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'13) New York City, USA - June 17-21, 2013 http://www.hpdc.org/2013 The ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) is the premier annual conference for presenting the latest research on the design, implementation, evaluation, and the use of parallel and distributed systems for high-end computing. In 2013, the 22nd HPDC and affiliated workshops will take place in the heart of iconic New York City from June 17-21. 
**** SUBMISSION DEADLINES **** Abstracts: 14 January 2013 Papers: 21 January 2013 (no extensions) **** HPDC'13 GENERAL CO-CHAIRS **** Manish Parashar, Rutgers University Jon Weissman, University of Minnesota **** HPDC'13 PROGRAM CO-CHAIRS **** Dick Epema, Delft University of Technology Renato Figueiredo, University of Florida **** HPDC'13 WORKSHOPS CHAIR **** Abhishek Chandra, University of Minnesota **** HPDC'13 LOCAL ARRANGEMENTS CHAIR **** Daniele Scarpazza, D.E. Shaw Research **** HPDC'13 SPONSORSHIP CHAIR **** Dean Hildebrand, IBM Almaden **** HPDC'13 PUBLICITY CO-CHAIRS **** Alexandru Iosup, Delft University of Technology, the Netherlands Ioan Raicu, Illinois Institute of Technology, USA Kenjiro Taura, University of Tokyo, Japan Bruno Schulze, National Laboratory for Scientific Computing, Brazil **** SCOPE AND TOPICS **** Submissions are welcomed on high-performance parallel and distributed computing topics including but not limited to: clusters, clouds, grids, data-intensive computing, massively multicore, and global-scale computing systems. New scholarly research showing empirical and reproducible results in architectures, systems, and networks is strongly encouraged, as are experience reports of operational deployments that can provide insights for future research on HPDC applications and systems. All papers will be evaluated for their originality, technical depth and correctness, potential impact, relevance to the conference, and quality of presentation. Research papers must clearly demonstrate research contributions and novelty, while experience reports must clearly describe lessons learned and demonstrate impact. In the context of high-performance parallel and distributed computing, the topics of interest include, but are not limited to: * Systems, networks, and architectures for high-end computing * Massively multicore systems * Resource virtualization * Programming languages and environments * I/O, storage systems, and data management * Resource management and scheduling, including energy-aware techniques * Performance modeling and analysis * Fault tolerance, reliability, and availability * Data-intensive computing * Applications of parallel and distributed computing **** PAPER SUBMISSION GUIDELINES **** Authors are invited to submit technical papers of at most 12 pages in PDF format, including figures and references. Papers should be formatted in the ACM Proceedings Style and submitted via the conference web site. No changes to the margins, spacing, or font sizes as specified by the style file are allowed. Accepted papers will appear in the conference proceedings, and will be incorporated into the ACM Digital Library. A limited number of papers will be accepted as posters. Papers must be self-contained and provide the technical substance required for the program committee to evaluate their contributions. Papers should thoughtfully address all related work, particularly work presented at previous HPDC events. Submitted papers must be original work that has not appeared in and is not under consideration for another conference or a journal. See the ACM Prior Publication Policy for more details.
**** IMPORTANT DATES **** Abstracts Due: 14 January 2013 Papers Due: 21 January 2013 (no extensions) **** Program Committee **** David Abramson, Monash University, Australia Kento Aida, National Institute of Informatics, Japan Gabriel Antoniu INRIA, France Henri Bal, Vrije Universiteit, the Netherlands Adam Barker, University of St Andrews, UK Michela Becchi, University of Missouri - Columbia, USA John Bent, EMC, USA Ali Butt, Virginia Tech, USA Kirk Cameron, Virginia Tech, USA Franck Cappello, INRIA, France and University of Illinois at Urbana-Champaign, USA Henri Casanova, University of Hawaii, USA Abhishek Chandra, University of Minnesota, USA Andrew Chien, University of Chicago and Argonne National Laboratory, USA Paolo Costa, Imperial College London, UK Peter Dinda, Northwestern University, USA Gilles Fedak, INRIA, France Ian Foster, University of Chicago and Argonne National Laboratory, USA Clemens Grelck, University of Amsterdam, the Netherlands Dean Hildebrand, IBM Research, USA Fabrice Huet, INRIA-University of Nice, France Adriana Iamnitchi, University of South Florida, USA Alexandru Iosup, Delft University of Technology, the Netherlands Kate Keahey, Argonne National Laboratory, USA Thilo Kielmann, Vrije Universiteit, the Netherlands Charles Kilian, Purdue University, USA Zhiling Lan, Illinois Institute of Technology, USA John Lange, University of Pittsburgh, USA Barney Maccabe, Oak Ridge National Laboratory, USA Carlos Maltzahn, University of California, Santa Cruz, USA Naoya Maruyama, RIKEN Advanced Institute for Computational Science, Japan Satoshi Matsuoka, Tokyo Institute of Technology, Japan Manish Parashar, Rutgers University, USA Judy Qiu, Indiana University, USA Ioan Raicu, Illinois Institute of Technology, USA Philip Rhodes, University of Mississippi, USA Matei Ripeanu, University of British Columbia, Canada Prasenjit Sarkar, IBM Research, USA Daniele Scarpazza, D.E. Shaw Research, USA Karsten Schwan, Georgia Institute of Technology, USA Martin Swany, Indiana University, USA Michela Taufer, University of Delaware, USA Kenjiro Taura, University of Tokyo, Japan Douglas Thain, University of Notre Dame, USA Cristian Ungureanu, NEC Research, USA Ana Varbanescu, Delft University of Technology, the Netherlands Chuliang Weng, Shanghai Jiao Tong University, China Jon Weissman, University of Minnesota, USA Yongwei Wu, Tsinghua University, China Dongyan Xu, Purdue University, USA Ming Zhao, Florida International University, USA **** Steering Committee **** Henri Bal, Vrije Universiteit, the Netherlands Andrew A. Chien, University of Chicago and Argonne National Laboratory, USA Peter Dinda, Northwestern University, USA Dick Epema, Delft University of Technology, the Netherlands Ian Foster, University of Chicago and Argonne National Laboratory, USA Salim Hariri, University of Arizona, USA Thilo Kielmann, Vrije Universiteit, the Netherlands Dieter Kranzlmueller, Ludwig-Maximilians-Universitaet Muenchen, Germany Arthur "Barney" Maccabe, Oak Ridge National Laboratory, USA Satoshi Matsuoka, Tokyo Institute of Technology, Japan Manish Parashar, Rutgers University, USA Matei Ripeanu, University of British Columbia, Canada Karsten Schwan, Georgia Tech, USA Doug Thain, University of Notre Dame, USA Jon Weissman, University of Minnesota (Chair), USA -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From iraicu at cs.iit.edu Fri Oct 26 13:26:57 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 26 Oct 2012 14:26:57 -0400 Subject: [Swift-devel] CFP: The 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013) Message-ID: <508AD5F1.2020301@cs.iit.edu> **** CALL FOR PAPERS **** The 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013) Delft University of Technology, Delft, the Netherlands May 13-16, 2013 http://www.pds.ewi.tudelft.nl/ccgrid2013 Rapid advances in architectures, networks, and systems and middleware technologies are leading to new concepts in and platforms for computing, ranging from Clusters and Grids to Clouds and Datacenters. CCGrid is a series of very successful conferences, sponsored by the IEEE Computer Society Technical Committee on Scalable Computing (TCSC) and the ACM, with the overarching goal of bringing together international researchers, developers, and users to provide an international forum to present leading research activities and results on a broad range of topics related to these concepts and platforms, and their applications. The conference features keynotes, technical presentations, workshops, tutorials, and posters, as well as the SCALE challenge featuring live demonstrations. In 2013, CCGrid will come to the Netherlands for the first time, and will be held in Delft, a historical, picturesque city that is less than one hour away from Amsterdam-Schiphol airport. The main conference will be held on May 14-16 (Tuesday to Thursday), with tutorials and affiliated workshops taking place on May 13 (Monday). **** IMPORTANT DATES **** Papers Due: 12 November 2012 Author Notifications: 24 January 2013 Final Papers Due: 22 February 2013 **** TOPICS OF INTEREST **** CCGrid 2013 will have a focus on important and immediate issues that are significantly influencing all aspects of cluster, cloud and grid computing. 
Topics of interest include, but are not limited to: * Applications and Experiences: Applications to real and complex problems in science, engineering, business, and society; User studies; Experiences with large-scale deployments, systems, or applications * Architecture: System architectures, design and deployment; Power and cooling; Security and reliability; High availability solutions * Autonomic Computing and Cyberinfrastructure: Self-managed behavior, models and technologies; Autonomic paradigms and systems (control-based, bio-inspired, emergent, etc.); Bio-inspired optimizations and computing * Cloud Computing: Cloud architectures; Software tools and techniques for clouds * Multicore and Accelerator-based Computing: Software and application techniques to utilize multicore architectures and accelerators in clusters, grids, and clouds * Performance Modeling and Evaluation: Performance prediction and modeling; Monitoring and evaluation tools; Analysis of system and application performance; Benchmarks and testbeds * Programming Models, Systems, and Fault-Tolerant Computing: Programming models and environments for cluster, cloud, and grid computing; Fault-tolerant systems, programs and algorithms; Systems software to support efficient computing * Scheduling and Resource Management: Techniques to schedule jobs and resources on cluster, cloud, and grid computing platforms; SLA definition and enforcement **** PAPER SUBMISSION GUIDELINES **** Authors are invited to submit papers electronically in PDF format. Submitted manuscripts should be structured as technical papers and may not exceed 8 letter-size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings. Submissions not conforming to these guidelines may be returned without review. Authors should make sure that their file will print on a printer that uses letter-size (8.5 x 11) paper. The official language of the conference is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the page limit, or not appropriately structured may not be considered. Authors may contact the conference chairs for more information. The proceedings will be published through the IEEE Computer Society Press, USA, and will be made available online through the IEEE Digital Library. **** CALL FOR TUTORIAL AND WORKSHOP PROPOSALS **** Tutorials and workshops affiliated with CCGrid 2013 will be held on May 13 (Monday). For more information on the tutorials and workshops and for the complete Call for Tutorial and Workshop Proposals, please see the conference website. 
**** GENERAL CHAIR **** Dick Epema, Delft University of Technology, the Netherlands **** PROGRAM CHAIR **** Thomas Fahringer, University of Innsbruck, Austria **** PROGRAM VICE-CHAIRS **** Rosa Badia, Barcelona Supercomputing Center, Spain Henri Bal, Vrije Universiteit, the Netherlands Marios Dikaiakos, University of Cyprus, Cyprus Kirk Cameron, Virginia Tech, USA Daniel Katz, University of Chicago & Argonne Nat Lab, USA Kate Keahey, Argonne National Laboratory, USA Martin Schulz, Lawrence Livermore National Laboratory, USA Douglas Thain, University of Notre Dame, USA Cheng-Zhong Xu, Shenzhen Inst. of Advanced Technology, China **** WORKSHOPS CO-CHAIRS **** Shantenu Jha, Rutgers and Louisiana State University, USA Ioan Raicu, Illinois Institute of Technology, USA **** TUTORIALS CHAIR **** Radu Prodan, University of Innsbruck, Austria **** DOCTORAL SYMPOSIUM CO-CHAIRS **** Yogesh Simmhan, University of Southern California, USA Ana Varbanescu, Delft Univ of Technology, the Netherlands **** SUBMISSIONS AND PROCEEDINGS CHAIR **** Pavan Balaji, Argonne National Laboratory, USA **** FINANCE AND REGISTRATION CHAIR **** Alexandru Iosup, Delft Univ of Technology, the Netherlands **** PUBLICITY CHAIRS **** Nazareno Andrade, Federal University of Campina Grande, Brazil Gabriel Antoniu, INRIA, France Bahman Javadi, University of Western Sydney, Australia Ioan Raicu, Illinois Institute of Technology and Argonne National Laboratory, USA Kin Choong Yow, Shenzhen Inst. of Advanced Technology, China **** CYBER CHAIR **** Stephen van der Laan, Delft University of Technology, the Netherlands -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= =================================================================